Online College Math Teacher

Saturday, July 11, 2026

Testing racial predictions with BISG

Regulators expect insurance companies, banks, and others not to discriminate by race, but businesses are not allowed to collect the race of their customers. One solution for testing discrimination in the aggregate is to use U.S. Census data to estimate the probable race of each customer based on the customer’s name and geographic location. (The Census race data is self identified and excludes a multi racial category; I will leave it to the sociologists to discuss issues with these things and whether Hispanic is a race.)

The common algorithm for estimating race from names and geography is BISG (Bayesian Improved Surname Geocoding). As an example, suppose there is a customer Mary Johnson who lives in Essex County, NJ. If we want to predict her race only from her surname, we use nationwide Census probabilities such as P(Black ∣ Johnson) = 0.3441, P(White ∣ Johnson) = 0.5438, etc.

To improve the prediction, we add her county, using geographic weights such as the relative density of each race in Essex compared with nationwide. The resulting probability P(Black | Johnson & Essex) = (Black national prob × Black geographic weight) / ∑_race (race national prob × race geographic weight) = 0.7393.

Finally, we add her first name. Under a Naive Bayes assumption that surname, location, and first name are conditionally independent given race, Bayes’ rule allows us to multiply the components: P(race ∣ surname & geo & first ) ≈ P(race) x P(surname ∣ race) x P(geo ∣ race) x P(first ∣ race). The probability P(Black ∣ Johnson & Mary & Essex) becomes 0.7158.

A summary of these probabilities is:

Model Type	White	Black	Hispanic	Asian	Other
Surname Only	.5438	.3441	.0272	.0074	.0774
Surname+Geo	.1766	.7393	.0244	.0046	.0551
Surname+First+Geo	.2481	.7158	.0042	.0013	.0306

These calculations can be done with the R package wru (Who are you?). You will need a Census API key to download the Census data, available from https://api.census.gov/data/key_ssignup.html .

I was interested in measuring the accuracy of the BISG predictions. It is not easy to find real (not simulated) data linking name and race. However, several states make their voter registration records public. I requested and downloaded Florida voting registration data from Harvard Dataverse ( https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UBIG3F ). I limited my study to a single Florida county, Lee, which had over 500,000 registered voters.

The Florida voting data required considerable data cleaning. Among the amusing data issues is that some Lee County voters registered with a specialty post office box zip code, or a zip code with a typo, or an out-of-state zip code. The first two were mapped to county level, and the third category was dropped.

There were 545,187 registered Lee County voters, distributed by race as follows:

Asian	Black	Hispanic	Other	White	NA
6912	32407	66677	9435	422333	7423

I performed the same BISG algorithm as above, against the Florida voter registration data for Lee County. The overall accuracy is 88%, but this is misleading because the Florida data is highly imbalanced toward the White race. A better accuracy measure is one that is calculated separately for each race, and I chose Balanced Accuracy = (Sensitivity + Specificity)/2. Sensitivity measures how well the algorithm finds people of that specific race, and specificity measures how well the algorithm avoids misclassifying people of other races into that specific race.

My Balanced Accuracy results of comparing BISG race predictions versus Florida self-identified race data is as follows:

White	Black	Hispanic	Asian	Other
.8226	.7179	.9007	.7251	.5041

I think these results are pretty good for Hispanic and White, not as good for Black and Asian, and of course pretty poor for Other. Other includes Native American, Native Hawaiian, and others, and are a small overall percentage of the county population.

The citation for the Florida voter data is: Sood, Gaurav, 2017, "Florida Voter Registration Data (2017 and 2022)", https://doi.org/10.7910/DVN/UBIG3F, Harvard Dataverse, V2.

Here is my R code to predict Mary Johnson.


library(wru)
library(dplyr)

# 1. Create a 1-row dataset for Mary Johnson in Essex County, NJ (FIPS "013")
test_voter <- data.frame(
  surname = "JOHNSON",
  first = "MARY",
  state = "NJ",
  county = "013", # Essex County FIPS
  stringsAsFactors = FALSE
)

# 2. Grab the live Census geographic data for New Jersey
nj_census <- get_census_data(
  key = "xxxx",
  states = "NJ",
  age = FALSE,
  sex = FALSE
)

# 3. Generate the 3 prediction variations
pred_baseline  <- predict_race(voter.file = test_voter, surname.only = TRUE)
pred_geo       <- predict_race(voter.file = test_voter, census.data = nj_census, census.geo = "county")
pred_first_geo <- predict_race(voter.file = test_voter, census.data = nj_census, census.geo = "county", names.to.use = "surname, first")

# 4. Format and clean output to perfectly fit the RStudio layout
combined_predictions <- bind_rows(
  pred_baseline  %>% mutate(model_type = "Surname Only"),
  pred_geo       %>% mutate(model_type = "Surname+Geo"),
  pred_first_geo %>% mutate(model_type = "Surname+First+Geo")
) %>%
  rename(
    White = pred.whi,
    Black = pred.bla,
    Hispan = pred.his,
    Asian = pred.asi,
    Other = pred.oth
  ) %>%
  mutate(across(c(White, Black, Hispan, Asian, Other), ~ round(., 4))) %>%
  select(model_type, surname, first, White, Black, Hispan, Asian, Other)

# 5. Display the output summary
print(combined_predictions, row.names = FALSE)
# End of Mary Johnson prediction
##################################################################################

Here is my R code to measure the accuracy of the BISG predictions against Lee County voters. A considerable portion of the code is for data cleaning of the Florida voter registration data.


library(data.table)
library(dplyr)
library(stringr)
library(caret)
library(wru)

voter_file_path <- "xxxx/LEE_20220621.txt"
Sys.setenv(CENSUS_API_KEY = "xxxx")

raw_lines <- fread(
  file = voter_file_path,
  sep = NULL, # Read the row as a single string to lock layout shifting
  header = FALSE,
  col.names = "raw_row"
)

wru_ready_data <- raw_lines %>%
  mutate(
    # Extract structural components securely out of text blocks
    surname = str_match(raw_row, "\\d{9}\\s+(\\w+)")[,2],
    first = str_match(raw_row, "\\d{9}\\s+\\w+\\s+(\\w+)")[,2],
    zipcode = str_extract(raw_row, "\\b3\\d{4}\\b"),
    raw_race = str_match(raw_row, "\\b([MF])\\s+(\\d)\\b")[,3]
  ) %>%
  # Filter out empty records or text extraction errors
  filter(!is.na(surname) & !is.na(raw_race)) %>%
  # Map Florida's state administrative codes to wru categories
  mutate(
    true_race = case_when(
      raw_race == "5" ~ "white",
      raw_race == "3" ~ "black",
      raw_race == "4" ~ "hispanic",
      raw_race == "2" ~ "asian",
      raw_race %in% c("1", "6", "7") ~ "other",
      TRUE ~ NA_character_ # Automatically standardizes missing responses to NA
    )
  )

# Clean data of invalid zip codes and reroute PO Boxes to residential ZCTAs
wru_ready_data <- wru_ready_data %>%
  # Exclude out-of-state ZIP codes (Georgia 30xxx/31xxx, Alabama 35xxx/36xxx)
  filter(!substr(zipcode, 1, 2) %in% c("30", "31", "35", "36")) %>%
  # Map missing/typos and PO boxes directly to matching valid residential ZCTAs
  mutate(
    zipcode = case_when(
      # Defunct / Typos / Missing data map to Central Lee County Baseline
      zipcode %in% c("fl-NA", "33929", "33932", "33945", "33970") ~ "33901",
     
      # PO Box explicit routing to residential census counterparts
      zipcode == "33902" ~ "33901",  # Fort Myers PO Box -> Fort Myers Residential
      zipcode == "33906" ~ "33907",  # Fort Myers PO Box -> South Fort Myers Residential
      zipcode == "33910" ~ "33904",  # Cape Coral PO Box -> Cape Coral Residential
      zipcode == "33915" ~ "33919",  # Fort Myers PO Box -> Cypress Lake Residential
      zipcode == "33918" ~ "33903",  # N. Fort Myers PO Box -> N. Fort Myers Residential
      zipcode == "33994" ~ "33928",  # Bonita Springs PO Box -> Estero/Bonita Residential
      zipcode == "34106" ~ "34102",  # Naples PO Box -> Naples Residential
      zipcode == "34133" ~ "34135",  # Bonita Springs PO Box -> Bonita Residential
      zipcode == "34136" ~ "34135",  # Bonita Springs PO Box -> Bonita Residential
     
      TRUE ~ zipcode # Keep all other valid residential ZCTAs as they are
    ),
    # Ensure wru recognizes the geographic county boundary for the fallback imputation
    county = "12071" # FIPS code for Lee County, FL
  )

nrow(wru_ready_data)
table(wru_ready_data$true_race, useNA = "always")

lee_county_test <- as.data.table(wru_ready_data) %>%
  filter(!is.na(true_race)) %>%
  mutate(
    state = "fl",
    surname = as.character(surname),
    first = as.character(first),
    zcta = as.character(zipcode) # Required column label mapping for ZCTA use
  )

florida_zcta_layers <- wru::get_census_data(
  key = Sys.getenv("CENSUS_API_KEY"),
  state = "FL",
  age = FALSE,
  sex = FALSE,
  census.geo = "zcta",
  county.list = NULL # Required parameter boundary choice for independent ZCTAs
)

predicted_lee_county <- wru::predict_race(
  voter.file = lee_county_test,
  census.surname = TRUE,
  surname.only = FALSE,
  census.geo = "zcta",
  census.data = florida_zcta_layers,
  impute.missing = TRUE, # Forces wru to fall back to County priors if needed
  skip_bad_geos = TRUE
)

# "53233 (10.1%) individuals' last names were not matched, but ZCTA's baseline geographic racial demographics to calculate the prediction.

evaluation_ready_data <- predicted_lee_county %>%
  rowwise() %>%
  mutate(
    # Isolate which of the five columns held the highest posterior probability string
    max_prob_col = c("white", "black", "hispanic", "asian", "other")[
      which.max(c(pred.whi, pred.bla, pred.his, pred.asi, pred.oth))
    ]
  ) %>%
  ungroup() %>%
  # Convert fields into factor arrays to prevent tracking sequence errors
  mutate(
    true_race = factor(true_race, levels = c("white", "black", "hispanic", "asian", "other")),
    predicted_map = factor(max_prob_col, levels = c("white", "black", "hispanic", "asian", "other"))
  )

# Compute the final metric confusion matrix tracking loop
accuracy_report <- confusionMatrix(
  data = evaluation_ready_data$predicted_map,
  reference = evaluation_ready_data$true_race
)

print(accuracy_report$overall["Accuracy"]) # Overall model precision
print(accuracy_report$byClass[, "Balanced Accuracy"]) # Success broken down per group

End

Saturday, June 27, 2026

RSA Encryption

When I was in high school, students would pass notes to each other in class. I even had an English teacher who was rumored to be having an affair with another teacher, and she would pass notes to him, using a student to deliver the note. Nowadays, students would use their phones and text each other.

I recently learned a small amount about encryption for a discrete math class. Encryption is the process of scrambling readable data into unreadable code using a secret mathematical key. Cryptography is the general subject of securing information by converting it into unreadable code. During World War II, mathematician Alan Turing played a crucial role in deciphering German messages. Also during World War II, Native American Code Talkers created unbreakable code that enemy cryptographers could never decipher. Today encryption is used in protecting usernames, passwords, and personal data from interception, in many areas such as financial transactions.

Cryptography is a vast subject. I will limit this essay to the RSA algorithm which is reasonably simple, although it becomes computationally complex for large files. I am also intentionally using small numbers; real RSA uses enormous numbers that make it hard to hack. In fact, RSA is being phased out starting in 2030 due to the threat of future quantum computers.

Briefly, if you and I agree to use the RSA encryption algorithm, and you have the public key number (n,e) where n is the product of two prime numbers and you have the private key number d, then I can give you an encrypted message and you can decrypt it.

Suppose the public key is (n,e) = (33,3) and the private key is d = 7. I have a message, and I encrypt it. Suppose my encrypted message is 3 32 12 3 11 26 32 16 9 21 15 32 28 1 5 31 24 1 32 8 21 24 5 28 . Prior to encryption, 1 is a, 2 is b, etc., 27 is comma, and 32 is space. If you can decrypt this, you will know a secret of my youth :)

Recall the modulo operator "mod": if x and y are integers, x mod y is the remainder after x/y. For example, 17 mod 12 = 5, because 17/12 = 1 + remainder 5.

The following is a brief summary of RSA encryption with small numbers:

The recipient has a public key (n,e) that is available to the sender, eg (33,3); and a private key, eg d = 7. (The recipient first chooses primes p and q, eg 3 and 11, and computes n = p*q. The recipient computes Totient φ = (p-1)(q-1)=20, and selects an e that is coprime to φ. Then the recipient performs a modular inverse to solve for d in e*d = 1 mod φ, eg 3*d = 1 mod 20; this can be done using the Extended Euclidean Algorithm, although R package numbers does this as d <- modinv(e,φ) . Here d is 7.)

The sender converts the message from letter to plaintext m, eg m=13; and then from m to encrypted ciphertext = (m^e)(mod n), eg c = (13^3)(mod 33) = 19. (The R expression is (m^e) %% n . However, this may fail if numbers get too large. One alternative is modpower(13,3,33) from the numbers package.)

The recipient decrypts c to recover m, m = (c^d)mod n, eg m = (19^7)(mod 33) = 13.

As an example, if the message is “math”, the plaintext m is (13 1 20 8), the encrypted ciphertext c is (19 1 14 17), the decrypted plaintext m is (13 1 20 8), and the decrypted message is (m a t h).

Here is my R code.


library(numbers)

# --- RECIPIENT KEYS ---
p <- 3
q <- 11
n <- p * q
phi <- (p - 1) * (q - 1)
e <- 3
d <- modinv(e, phi)

# --- SENDER SIDE ---
m_text <- "MATH"
m_text <- tolower(m_text)
print(paste("The unencrypted message m is:", m_text))

char_vector <- unlist(strsplit(m_text, split = ""))

# Mapping: letters → 1–26, punctuation → custom codes
special_lookup <- c("," = 27, "." = 28, " " = 32)

m_tokens <- sapply(char_vector, function(char) {
  if (char %in% names(special_lookup)) {
    special_lookup[char]
  } else if (char %in% letters) {
    match(char, letters)
  } else {
    stop(paste("Unsupported character:", char))
  }
})

# Encryption
c <- sapply(m_tokens, function(msg_num) modpower(msg_num, e, n))
print(paste("The encrypted ciphertext c is:", paste(c, collapse = " ")))

# Decryption
m_decrypted <- sapply(c, function(cipher_num) modpower(cipher_num, d, n))

# Reverse mapping
result <- sapply(m_decrypted, function(num) {
  if (num %in% special_lookup) {
    names(special_lookup)[special_lookup == num]
  } else {
    letters[num]
  }
})

print(paste("The decrypted ciphertext integers are:", paste(m_decrypted, collapse = " ")))
print(paste("The decrypted message is:", paste(result, collapse = " ")))

End

Tuesday, March 31, 2026

Transgender Day of Visibility

March 31 is Transgender Day of Visibility. I’m not transgender myself, but I have friends, acquaintances, and family members who are. Chances are you do too, whether you realize it or not.

My understanding is that transgender is an umbrella term for people whose gender identity or expression differs from the sex they were assigned at birth. It reflects a deeply held internal sense of self—something not defined by appearance, clothing, or medical procedures. Being transgender is about identity, not sexual orientation.

What I’ve learned over time is that many transgender people face challenges most of us never have to think about. These include discrimination, gaps in legal protection, denial of medical care, and even physical violence. There are also everyday barriers that rarely make headlines, like difficulty obtaining accurate driver’s licenses or passports—documents most of us take for granted.

I found this resource helpful: Understanding the Transgender Community .

I wish every transgender person could live their life openly, safely, and without being hassled for who they are.

The five color Transgender Pride Flag was designed by Monica Helms in 1999. I made this flag in computer language R. Here is the R code.

########################



library(ggplot2)

# Define the colors in order: Blue, Pink, White, Pink, Blue
trans_colors <- c("#5BCEFA", "#F5A9B8", "#FFFFFF", "#F5A9B8", "#5BCEFA")

# Create a data frame for the 5 stripes
flag_data <- data.frame(
  stripe = factor(1:5),
  height = rep(1, 5)
)

# Plot the flag
ggplot(flag_data, aes(x = 1, y = height, fill = stripe)) +
  geom_bar(stat = "identity", width = 1, color = NA) +
  scale_fill_manual(values = rev(trans_colors)) + # Reverse to stack correctly
  theme_void() + # Remove axes and labels
  theme(legend.position = "none") +
  coord_cartesian(expand = FALSE)

End

Friday, March 13, 2026

Does every finite string of numbers appear within π ?

In honor of Pi Day (March 14), I offer the following: Does every finite string of numbers like your social security number eventually appear somewhere within the decimal expansion of π ?

So said Harold Finch, the computer genius on TV show "Person of Interest" ("You are being watched ... ") in a 2013 episode where he poses as a substitute high school math teacher.

"Pi... keeps on going, forever, without ever repeating. Which means that contained within this string of decimals, is every single other number. Your birthdate, combination to your locker, your social security number, it's all in there, somewhere. And if you convert these decimals into letters, you would have every word that ever existed... all of the world's infinite possibilities rest within this one simple circle."

We know π is irrational; it cannot be expressed as the quotient of two integers. Video

An irrational number has an infinite, non-repeating decimal expansion.

A number is defined as normal if any finite string of numbers like your social security number will eventually appear somewhere in its decimal expansion. (The is a simpler version of a more complicated mathematical definition , but it is sufficient for this discussion.) Mr. Finch is claiming π is a normal number. However, mathematicians have not yet proved this. (I once attempted to add this last statement and the fact that Finch was wrong, to Wikipedia's Person of Interest page, but the Wikipedia editors declined it.)

Although π is defined geometrically as the ratio of a circle's circumference to its diameter, mathematicians calculate long sequences of π using infinite series, rather than physical measurements. Various websites such as angio provide files containing up to 1 million digits of π.

I wrote some R code that would take a number (or a word, converted to a number) and search for where that number could be found within π. I decided the first 100,000 digits of π were enough for me. Of course, the fact that a number can not be found within the first 100,000 digits, does not preclude that it can be found in the next 100,000 digits, or the next 1 million digits. And so on. Infinitely many digits goes on for a long time.

I tested these examples:

"Eggs". e = 5, g = 7, g = 7, s = 19. 57719 appears starting at the 6026 th digit of π.
"0123" appears starting at the 27846 th digit of π.
"01234567" was not found in the first 100k digits. But according to The Pi Search Page it can be found starting at the 112,099,767 digit.

I did not try "0123456789", but another source says it was not found in the first 2 billion digits.

Here is my R code. Happy Pi Day!


# Symbol for pi is \u03c0
library(stringr)
library(readr)

find_in_pi <- function(input_value) {
  # 1. Load the Pi data
  data.raw <- readr::read_file("https://assets.angio.net/100000.txt")
  data.vec <- unlist(str_split(data.raw, pattern = ""))
  data.vec <- data.vec[-c(1, 2)] # remove '3.'
  pi_string <- paste(as.character(data.vec), collapse = "")
  
  # 2. Process Input: Determine if it's a word or a numeric string
  # Convert to character and lowercase for uniformity
  clean_input <- tolower(as.character(input_value))
  
  if (grepl("[a-z]", clean_input)) {
    # It's a word: Convert letters to numbers (1-26)
    word_only <- gsub("[^a-z]", "", clean_input)
    z <- unlist(str_split(word_only, ""))
    q <- match(z, letters)
    search_target <- paste(q, collapse = "")
  } else {
    # It's already a number: Just strip non-digits (like decimals or spaces)
    search_target <- gsub("[^0-9]", "", clean_input)
  }
  
  # 3. Search Pi
  pos <- regexpr(search_target, pi_string)
  
  # 4. Handle Results
  if (pos[1] == -1) {
    return(paste0("The sequence '", search_target, "' was not found in the first 100k digits."))
  } else {
    start <- pos[1]
    end <- start + nchar(search_target) - 1
    return(list(
      found = search_target,
      start_index = start,
      end_index = end,
      index_sequence = seq(from = start, to = end)
    ))
  }
}

# Search for a word
find_in_pi("Eggs")

# Search for a number;  enclose in quotes if a leading zero
find_in_pi("01234")
find_in_pi("01234567")

End

Monday, February 23, 2026

Years having a month with exactly four weeks

A year having a month with exactly four weeks will occur in 2026, 2037, 2043, 2054.

Such a year has February 1 on a Sunday, and is a non-Leap year so February has exactly 28 days.

Here is the R code:


library(lubridate)

date_sequence <- seq(
  as.Date("2026-02-01"),
  as.Date("2056-02-01"),
  by = "years"
)

df <- data.frame(
  Feb_1 = date_sequence,
  day_of_week = weekdays(date_sequence), 
  is_leap = leap_year(date_sequence)
)

df <- subset(df, day_of_week == "Sunday" & is_leap == FALSE)

cat("The following years have a month (Feb) with exactly four weeks:", (year(df$Feb_1)))

# 2026, 2037, 2043, 2054

End

Sunday, February 1, 2026

My other girlfriend, Irmaa. (Not a typo: two a's)

She is lovely, isn't she?

Irmaa sat alone at a corner table in the kind of restaurant where the silverware had weight and the waiters moved like well trained ghosts. A single glass of Sancerre caught the light, throwing a pale shimmer across the white tablecloth. She liked that — the quiet glow, the sense that even photons paused to admire her.

Her simple but elegant hairstyle, with a few intentional strands of silver, framed a face that could shift from stern to playful in a heartbeat. Tonight it hovered somewhere in between. She tapped a manicured finger against the stem of her glass, scanning the room with the calm confidence of someone who knew she brought in more revenue than most small nations.

The maître d’ approached with a deferential smile. “Will anyone be joining you this evening, Ms. Irmaa?”

She returned the smile, subtle and knowing. “Oh, people join me every year,” she said, “but never at the same table.”

So who is Irmaa?

My real significant other girlfriend knows one of my life philosophies is "You can only have one Honey." When I am watching TV or a movie, and a character who already has a partner starts looking for a second partner, I predict bad things will happen. Male or female, straight or gay, it doesn't matter. "You can only have one Honey."

And true to my philosophy, I have, reluctantly, added Irmaa to my life, and bad things have happened. Sigh.

IRMAA stands for Income-Related Monthly Adjustment Amount. It is a surcharge directly deducted from your Social Security check, and added to Medicare Part B and Part D premiums for beneficiaries with higher incomes. IRMAA generates tens of billions of dollars each year. IRMAA premiums are slated to rise 30 percent from 2026 to 2030, according to the 2025 Medicare Trustees Report. In comparison the Social Security cost of living adjustment was set at 2.8 percent for 2026. See IRMAA.

My IRMAA surcharge is quite large relative to my standard Medicare premium.

If you imagine IRMAA as a single individual collecting all those surcharges herself, she’d be pulling in revenue on the scale of tech billionaires. IRMAA as a person would be staggeringly wealthy.

I think of her as my girlfriend named Irmaa. I am one of those who pays for her extravagant lifestyle. She takes and takes, and does not give much back. Of course, she has quite a few other boyfriends. And girlfriends.

Irma (single a) is a quite uncommon name nowadays. Irma is currently #2609 in U.S. births according to The Bump's Data. The World Meteorological Organization retired Irma as an Atlantic hurricane name after its disasterous 2017 hurricane, feeling the future use of the name would be insensitive.

. Meanwhile Irma calmly enjoys her expensive meals, confidently knowing over 5 million Medicare beneficiaries - men and women - support her lifestyle. She's not your girlfriend yet? Just wait a few years.

(Photo from Google Gemini, description of Irmaa by me in consultation with Microsoft Copilot.)

Sunday, January 25, 2026

Do prices feel like they are rising more than the CPI says?

Do prices feel like they are rising more than the CPI (Consumer Price Index) says? Yes.

For 2025 the overall CPI increased by a pretty moderate 2.7% (CPI-U for all urban consumers, December to December). The CPI measures the price change of a basket of goods over eight major categories: food & beverages, housing, apparel, transportation, medical care, recreation, education & communication, and other, over a large number of metropolitan statitical areas (MSA). Each month Bureau of Labor Statistics data collectors collect about 94,000 prices. For some items, prices are adjusted to remove the effect of a change of quality (for example today's Toyota Corolla is a much different car than ten years ago). There are separate indices for over 200 items countrywide, and separate indices for the eight major categories by MSA.

Of course what matters is price changes on the items YOU buy, and especially the items you buy frequently like food. Each individual item has its own reasons for its price changes - egg prices are affected by bird flu, coffee prices are affected by weather and geopolitical conditions in other countries, etc. These factors are beyond our control.

I was interested in comparing food price changes for common frequently purchased items. This is based on CPIs for countrywide data in categories like eggs. Of course I am not measuring YOUR specific egg purchase (such as store brand, organic, medium size, white, 12 count, at CTown in Intercourse, Pennsylvania).

I began with 11 years of December values of ten separate food categories and also the overall CPI-U, seasonally adjusted. The values I used are as follows (although I understand values change as there are changes to seasonal adjustment factors, weightings, etc.)

For comparison purposes I set all eleven indices to a 100 index value at 2015, and I plotted them on a line graph. Each year's value is now relative to 2015.

On a ten-year basis, ground beef, eggs, and lettuce are running higher than the overall CPI. Those three food indices have experienced a lot of volatility. The pandemic year 2022 was the year of the largest increase for many foods. Interestingly, apple prices have gone down!

Another way to look at this is to look at bar graphs with percentage changes. Here are graphs with percentage changes 2015 to 2025, 2024 to 2025, and 2021 to 2022:

For 2025 compared with 2015, I was not aware how much the prices had risen for ground beef, lettuce, and eggs.

For 2025 compared to 2024, the increase in prices for ground beef, eggs, and coffee have been much larger than the 2.7% overall increase in the CPI. This is probably no surprise if you are the household member doing the grocery shopping.

For 2022 versus 2021, the increase in all foods except ground beef exceeded the overall CPI. We probably remember this during the Covid pandemic years.

In the insurance world, the goods and services that insurance pays for have also increased by more than the overall CPI. For example, medical care costs, auto maintenance & repair costs, and building construction & repair costs have consistently outpaced the overall CPI. So don't be surprised to see health insurance, auto insurance, and home insurance prices to continue to increase.

Here is the R code:


library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)

df <- read_excel("C:/Users/Jerry/Desktop/R_files/CPI_10_years.xlsx")
print(df, n = Inf, width = Inf)

# 1. Convert data to 'long' format for easier plotting
df_long <- df %>%
  pivot_longer(cols = -Year, names_to = "Category", values_to = "Index")

# 2. Re-base every category so 2015 = 100
df_normalized <- df_long %>%
  group_by(Category) %>%
  mutate(Index_100 = (Index / Index[Year == 2015]) * 100) %>%
  ungroup()

# 3. Plot line graph "Race to the Top"
ggplot(df_normalized, aes(x = Year, y = Index_100, color = Category)) +
  geom_line(aes(size = Category == "Overall CPI"), show.legend = FALSE) +
  # Use subset to only label the last point (2025)
  geom_text(data = subset(df_normalized, Year == 2025), 
            aes(label = Category), 
            hjust = -0.1,  # Push text to the right of the point
            size = 4, fontface="bold",
            show.legend = FALSE) +
  # Fix X-Axis: No decimals, every year labeled
  scale_x_continuous(breaks = seq(2015, 2025, 1), limits = c(2015, 2027)) +
  # Highlight the Overall CPI in black/thicker line
  scale_size_manual(values = c("TRUE" = 2.0, "FALSE" = 0.8)) +
  scale_color_manual(values = c("Overall CPI" = "black", 
                                "Eggs" = "red", 
                                "Ground Beef" = "darkred",
                                "Apples" = "forestgreen",
                                "Milk" = "blue",
                                "Bread" = "orange",
                                "Coffee" = "brown",
                                "Lettuce" = "lightgreen",
                                "Soup" = "#008080",
                                "Breakfast Cereal" = "purple",
                                "Other Fresh Veg" = "gray")) +
  theme_minimal() +
  theme(
    legend.position = "none",        # Remove legend
    plot.title = element_text(face = "bold", size = 16), 
    axis.title = element_text(face = "bold"), 
    axis.text = element_text(face = "bold"),
    plot.margin = margin(r = 100),   # Add space on the right for labels
    panel.grid.minor = element_blank()
  ) +
  coord_cartesian(clip = 'off') +    # Prevent labels from being cut off
  labs(title = "Cumulative Food Price Changes (2015 = 100)",
       subtitle = "Food categories vs. Overall CPI Baseline",
       y = "Index (2015 = 100)",
       x = "Year")

#4 Plot bar graph with % changes
# filter data for start and end years 

plot_cpi_change <- function(data, start_yr, end_yr) {
  
  # 1. Prepare dynamic column names for the mutation step
  start_col <- paste0("Yr", start_yr)
  end_col <- paste0("Yr", end_yr)
  
  # 2. Data Processing
  df_bar <- data %>% 
    filter(Year %in% c(start_yr, end_yr)) %>%
    pivot_longer(cols = -Year, names_to = "Category", values_to = "Index") %>% 
    pivot_wider(names_from = Year, names_prefix = "Yr", values_from = Index) %>%
    # Use .data[[]] to dynamically reference the columns created by pivot_wider
    mutate(Pct_Change = ((.data[[end_col]] - .data[[start_col]]) / .data[[start_col]]) * 100) %>%
    mutate(Category = reorder(Category, -Pct_Change))
  
  # 3. Visualization
  ggplot(df_bar, aes(x = Category, y = Pct_Change, fill = Category)) +
    geom_bar(stat = "identity", color = "black", size = 0.5) +
    # Logic for text placement based on positive/negative change
    geom_text(aes(label = paste0(round(Pct_Change, 1), "%"),
                  vjust = ifelse(Pct_Change >= 0, -0.5, 1.5)), 
              fontface = "bold", size = 4.5) +
    scale_fill_manual(values = c("Overall CPI" = "black", "Soup" = "#008080", "Eggs" = "red", 
                                 "Ground Beef" = "darkred", "Apples" = "forestgreen", "Milk" = "blue", 
                                 "Bread" = "orange", "Coffee" = "brown", "Lettuce" = "lightgreen", 
                                 "Breakfast Cereal" = "purple", "Other Fresh Veg" = "gray")) +
    theme_minimal() + 
    theme(legend.position = "none", 
          plot.title = element_text(face = "bold", size = 18, hjust = 0.5), 
          axis.title = element_text(face = "bold", size = 14), 
          axis.text.x = element_text(face = "bold", size = 11, angle = 45, hjust = 1), 
          axis.text.y = element_text(face = "bold", size = 11), 
          panel.grid.major.x = element_blank()) +
    coord_cartesian(clip = 'off') +
    labs(title = paste("CPI FOOD PRICE CHANGES (", start_yr, "TO", end_yr, ")"), 
         y = "PERCENT CHANGE (%)", x = "CATEGORY")
}

plot_cpi_change(df, 2015, 2025)
plot_cpi_change(df, 2024, 2025)
plot_cpi_change(df, 2021, 2022)

End

Friday, January 2, 2026

As the red line clearly demonstrates, ...

I have been at lots of business meetings where the speaker would say something like, "As the red line clearly demonstrates, ..."

I would quickly look up at two particular co-workers. We all had that same rolling of the eyes look. When you have color-challenged issues, that red line does not clearly demonstrate anything to us. (And yes, the three of us knew we all had color issues.)

I wrote the R code for this graph. But I used a random number to decide which line is red and which is green, so I don't know which line is which.

I am not color-blind. I am color-challenged. I see some colors, but probably not as many as most people see. I don't distinguish between red and green well. I also don't distinguish between blue and purple well because of the red element of purple, and similarly for other combinations.

In my earliest elementary school years I had a box of 8 Crayola crayons, and I did just fine. My problems began when Crayola started adding more and more colors including combinations like green yellow and yellow green. Here are the original eight colors, which I still identify just fine. (There is no meaning to the heights of the bars.)

One source suggests approximately 1 in 12 men have color issues, but only 1 in 200 women.

Color-challenged people often have coping strategies. I can see when the top traffic light is on, so I will stop my car. But in my heart, I don't think it is red, but more of an orange. Similarly I can see when the bottom traffic light is on, but I think it is more of an off-white than green. As long as no one flips the positions, I am fine. A single flashing light is a probllem though.

When I see an 8-sided stop sign, that's what I consider red. And I recognize Taylor Swift wears what I consider a very red lipstick, but I assume she wears more than one shade and I cannot distinguish among them.

The speaker with the two-line graph could have used the different shapes, or different graph line types dashes and dots, in the talk. Or the speaker could have used color-blind friendly color palettes. If you are creating visualizations for others, please think about these things.

And ladies, that purple lipstick looks very nice on you. Oh, it's not purple ... ?

The R code is as follows:





# Two lines with random colors
x <- seq(1:5)
y <- seq(1:5)
z <- seq(5,1, by = -1)

color_1 <- "#00B050"
color_2 <- "#FF0000" 

r <- runif(1,0,1)
r
if (r < .5){
  col_y <- color_1
  col_z <- color_2
} else{
  col_y <- color_2
  col_z <- color_1
}

plot(x,y, type="b", pch = 18, cex = 2.5, lwd = 2.5, col=col_y, ylab="", xlab="", 
     main="As the red line clearly demonstrates ...", cex.main = 1.5,
     sub="If r<.5 then (green, red), else (red, green)",
     font.sub = 2, cex.sub = 1.5)
par(new=TRUE)
plot(x,z, type="b", pch=20, cex = 2.5, lwd = 2.5, col=col_z, ylab="", xlab="")
legend("topright", legend= c("Increasing", "Decreasing"), pch = c(18, 20),
       col = c(col_y, col_z), cex=1.5, text.font=2)

# Original 8 Crayola Crayons
title <- c("Original 8 Crayola Crayons")
subtitle <- c("According to https://en.wikipedia.org/wiki/History_of_Crayola_crayons")
temp <- c(5,7,6,4,8,5,2,5)  # meaningless
colors <- c("red", "yellow", "blue", "green", "orange", "brown", "violet", "black")
hex <- c("#FF0000", "#FFFF00", "#0000FF", "#008001", "#FF6600", "#964B00", "#6A0DAD", "#000000")
barplot(temp, col=hex, main=title, sub=subtitle, names.arg=colors,
          cex.main = 1.5, font.axis = 2, font.sub = 2, cex.sub = .75 )

End

Tuesday, December 9, 2025

"Does anybody really know what time it is? Does anybody really care?”

“Does anybody really know what time it is? Does anybody really care?” (Chicago, 1970)

At my annual physical exam, the nurse gave me a sheet of paper with a large circle, and she asked me to draw the time 11:10. I understand this is called the Clock Drawing Test, and it is a screening tool to help detect early signs of dementia.

I drew the two numbers 11 and 2 in the correct positions, and 12. I felt no need to draw all the other numbers. Then I drew a little hand (hour hand) approximately on 11, and a big hand (minute hand) on 2.

Of course, only I would then worry about the angle that the two hands make.

There are twelve evenly spaced integers on a conventional clock. At three o’clock, the hands are on the 12 and the 3, which are three numbers apart, and the hands make a 90 degree angle. For every difference of 1 on the clock, the hands are 360 times 1/12 = 30 degrees apart. My first thought was that at 11:10, the hands are also three numbers apart, so I assumed the hands also make a 90 degree angle. The 90 is the sum of 30 degrees between 11 and 12, plus 60 degrees between 12 and 2.

More precisely, at 11:10, the hour hand is somewhat past the 11. It is 10/60 = 1/6 of its way toward 12. So my initial assumption needs to be modified. Now I assume the angle that the two hands make should be 60 degrees (between 12 and 2), plus 30 reduced by a factor of 1/6 (between 11 1/6 and 12), or 60 plus 30 times 5/6, or 85 degrees. I do own a protractor (what do you mean – doesn’t everyone?), but I didn’t bring it with me for my physical exam. So I drew the two hands at something less than a 90 degree angle, although not as exact as I would have liked.

I did not share any of my thinking on this with the nurse, assuming she would not be interested. She may have wondered why it took me so long to draw the two hands.

It turns out there is a general formula. For a time of h hours and m minutes (with h∈{0,1,…,11}): "Angle"= ∣ 30h-5.5m ∣. At 11:10, ∣ 30*11 −5.5*10 ∣ = ∣ 330 − 55 ∣ = 275, and the smaller angle is 360 – 275 = 85.

END

Sunday, November 30, 2025

Auto accidents and drowsiness

I've been thinking about Thanksgiving and auto accidents.

I hope we all had a good Thanksgiving. But we often feel drowsy after the Thanksgiving meal. The feeling of sleepiness may be due to overeating high-carbohydrate and high-fat foods and to alcohol. Turkey's effect on melatonin (the sleep-regulating hormone) is debatable.

I wondered if there are more auto accidents due to drowsiness on Thanksgiving. The only source I could find says Thanksgiving itself is usually safer than a normal weekday because many people are off the road visiting family. The same source says the "crash rate" (accident frequency?) is higher the Sunday after Thanksgiving compared with the previous Sunday.

The AAA quotes a study that suggests about 9.5% of all car crashes involve a drowsy driver. Many drivers may under-estimate thier own degree of fatigue.

My 2023 Hyundai gives me a monthly numerical grade on my driving (yes, really!), including on my lane changes. While my car probably knows too much about me ("I know you were eating that muffin while you were driving ... "), I don't think my car comes with a drowsiness alert system that other cars do have.

Not exactly related to drowsiness, but I've also been thinking about the need to concentrate for long periods of time. For me, as the years go on, I find this more difficult to do. I try to commit to continuing education in my various professional organizations, but it does seem harder to concentrate. I continue to take some online courses, but I would be hard pressed to have to memorize a large quantity of material for a traditional exam.

But like a lot of things, maybe times have changed. I came across a website discussing dietary supplements and preparing for math exams.

End

Saturday, November 22, 2025

Twenty Questions and Decision Trees

Most of us have probably played the game Twenty Questions. The answerer chooses something, and the other players try to guess it by asking yes or no questions. The TV show "What's My Line" is an example of this where the players try to guess a contestant's occupation.

Contrary to my kids' attempts, the strategy should be to ask questions that split the remaining possibilities roughly in half each time.

This seems very similar to a machine learning Decision Tree, although with an interesting distinction.

A decision tree cheats. The decision tree algorithm knows the answer (df$target = 1). The algorithm attempts to find the best feature and split value to separate df$target = 1 from df$target = 0 at each node, but it needs to know the right answer to ask the best questions. This is why, if the game is played say with different US Presidents multiple times, the algorithm may choose different features and split values.

Nevertheless, I thought it would be fun to program a decision tree model with the US Presidents. I found some data on Presidents. I decided some variables had too many values (high cardinality - there were a lot of political party names in the 1800s), so I grouped some values to reduce the number of unique values.

I initially began with a random integer between 1 and 47 to select a President, which selected President Hoover, but I found a different President would create a tree closer to the questions I would have asked if I were a player. So I selected President Reagan to get a more interesting tree.

(I considered selecting President Garfield to be able to ask the question, "Is the president credited with a unique proof of the Pythagorean Theorem?", but I decided that was a little quirky, even for me.)

Here is the resulting tree for President Reagan:

Here is the resulting variable importance plot. Note that the variables are not in the same order as the tree splits. I understand that variable importance is based on the some of the improvements in all nodes where the variable was used as a splitter.

Here is my R code:



library(dplyr)
library(rpart)
library(rpart.plot)
library(ggplot2)
df <- read.csv("prez.csv", header=TRUE)   
# data file available at github:  
prez.csv
set.seed(123)
# r <- sample(1:nrow(df),1)
r <- 40   # deliberate choice to get longer tree
answer <- df$LastName[r]
print(paste("The target president is:", answer))
df$target <- rep(0, nrow(df))
df$target[r] <- 1

# Feature engineering:
df <- df %>%
  # A. Categorical Reduction
  mutate(
    Party = case_when(
        Party %in% c("Democratic") ~ "Democratic", 
        Party %in% c("Republican") ~ "Republican", 
        TRUE ~ "Other"),
    Occupation = case_when(
        Occupation %in% c("Businessman", "Lawyer") ~ Occupation, 
        TRUE ~ "Other"),
    State = case_when(
        State %in% c("New York") ~ "NY", 
        State %in% c("Ohio") ~ "OH", 
        State %in% c("Virginia") ~ "VA", 
        State %in% c("Massachusetts") ~ "MA", 
        State %in% c("Texas") ~ "TX", TRUE ~ "Other"),
    Religion = case_when(
        Religion %in% c("Episcopalian", "Presbyterian", "Unitarian", "Baptist", "Methodist") ~ "Main_Prot", 
        TRUE ~ "Other"),
    
    # B. Year/Century Binning using cut()
    DOB = cut(DOB, breaks = c(-Inf, 1800, 1900, 2000, Inf), 
        labels = c("18th century", "19th century", "20th century", "21st century"), right = FALSE),
    DOD = cut(DOD, breaks = c(-Inf, 1800, 1900, 2000, Inf), 
        labels = c("18th century", "19th century", "20th century", "21st century"), right = FALSE),
    Began = cut(Began, breaks = c(-Inf, 1800, 1900, 2000, Inf), 
        labels = c("18th century", "19th century", "20th century", "21st century"), right = FALSE),
    Ended = cut(Ended, breaks = c(-Inf, 1800, 1900, 2000, Inf), 
        labels = c("18th century", "19th century", "20th century", "21st century"), right = FALSE)
  ) %>%

  # C. Convert all new/existing binary/categorical variables to factor
  mutate_at(vars(Party, State, Occupation, Religion, Assassinated, Military, Terms_GT_1, Pres_During_War, Was_Veep, DOB, DOD, Began, Ended), as.factor)


# selected variables 
formula <- as.formula(target ~ Began + State + Party + Occupation + Pres_During_War)
# Using aggressive control settings to force a maximal, unpruned tree
prez_tree <- rpart(formula, data = df, method = "class",
                   control = rpart.control(cp = 0.001, minsplit = 2, minbucket = 1))
rpart.plot(prez_tree, type = 4, extra = 101, main = "President Twenty Questions")

# check Reagan
df %>% filter(Began == "20th century" & 
              !State %in% c("MA", "NY", "OH", "TX") &
              Party == "Republican" &
              !Occupation %in% c( "Businessman", "Lawyer"))

variable_importance <- prez_tree$variable.importance
importance_df <- data.frame(
  Variable = names(variable_importance),
  Importance = variable_importance
)

importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]

common_theme <- theme(
        legend.position="NULL",
        plot.title = element_text(size=15, face="bold"),
        plot.subtitle = element_text(size=12.5, face="bold"),
        axis.title = element_text(size=15, face="bold"),
        axis.text = element_text(size=15, face="bold"),
        legend.title = element_text(size=15, face="bold"),
        legend.text = element_text(size=15, face="bold"))


ggplot(importance_df, aes(x = factor(Variable, levels = rev(Variable)), y = Importance)) +
  geom_col(aes(fill = Variable)) + 
  coord_flip() +
  labs(title = "20 Questions Variable Importance",
       x = "Variable",
       y = "Mean Decrease Gini") +
  common_theme

# loop through all presidents to see different first split vars
library(purrr)

### 1. Define the Analysis Function ###
# The function is modified to return a data frame row for clarity
get_first_split_row <- function(df, r) {
  # Temporarily set the target for the current president
  df$target <- 0
  df$target[r] <- 1
  tree <- rpart(formula, data = df, method = "class",
                control = rpart.control(cp = 0.001, minsplit = 2, minbucket = 1))
  frame <- tree$frame
  
  # Determine the result
  if (nrow(frame) > 1) {
    first_split_var <- as.character(frame$var[1])
  } else {
    first_split_var <- "No split"
  }
  
  # Return a single-row data frame
  return(data.frame(
    President = df$LastName[r],
    First_Split_Variable = first_split_var
  ))
}

### 2. Run the Analysis and Combine Results ###
# Create a list of row indices to iterate over
indices_to_run <- 1:nrow(df)

# Use map_dfr to apply the function to every index and combine the results 
# into a single data frame (dfr = data frame row bind)
first_split_results_df <- map_dfr(indices_to_run, ~ get_first_split_row(df, .x))

### 3. Display the Table and Original Analysis ###
# Display the resulting table:
print(first_split_results_df)

print(table(first_split_results_df$First_Split_Variable))

End

Thursday, October 23, 2025

Change is good, so don't change my change

One of my former insurance colleagues once said, "When a company makes a change, it's probably not going to benefit you."

Assuming the above photo is legitimate, and of course nowadays who knows, McDonald's says it will be rounding cash change to the nearest five cents. McD is probably not the only one doing it. So if your change has last digit {1, 2, 6, 7} McD will round down and you will lose 1 or 2 cents; if your change has last digit {3, 4, 8, 9} McD will round up and you will gain 1 or 2 cents; if your change has last digit {0, 5} there is no effect of rounding. So in 40% of the last digits you lose, in 40% you gain, and in 20%, there is no effect. The issue is, are the ten last digits 0 through 9 equally likely?

All I need is a nice sample of McD receipts. On cash transactions. Well, that's not going to happen. Let's try a different approach.

Similar to other retail prices, I assume many meal prices end in 99 cents, such as $5.99, to appear less expensive. (I was never asked to do this as an actuary.) Marketers know there is a left-digit effect, where customers focus on the leftmost number, so a price like $5.99 feels significantly cheaper than $6.00. See Psychological pricing. However, even if a majority of meal prices end in 99 cents, I believe the effect of that majority will be neutralized by the following.

I have not been to a McDonald's in years. But I do know there are many different menu items, I have no idea what the frequency is of say a Big Mac versus Chicken McNuggets purchase, I have no idea of the frequency of permutations of all possible menu item purchases, prices vary based on location because franchises set their own pricing, and there is also a sales tax that differs by state and sometimes by city within state. No doubt a McD data analyst has access to all this data, but I don't. So what follows is not an exact analysis, but rather an approach that is intended to be unbiased.

At first I thought Benford's Law might be useful. This law applies to the distribution of first digits, not last digits, and says under certain conditions (such as, when values are distributed across multiple orders of magnitude, which is probably not true for a place like McDonald's), P(1) = 30.1%, P(2) = 17.6%, P(3) = 12.5%, etc. This is interesting, but not useful here. See Benford for more on Benford.

In How to Solve It, George Polya suggests, "If you cannot solve the proposed problem, try to solve first some related problem."

I discovered if I limit the problem to a single meal item, a McDonald's Big Mac Meal (MBMM), I could get an average (well, more of a representative) price by state. I could also apply an average (again, representative) sales tax by state.

For each state, I took the MBMM price, added tax, and applied the McD rounding rule to the last digit. For each state I define Cents as after-tax cents portion of the MBMM price, CentsRounded as the cents portion after the McD rounding rule, and Rounding Difference = CentsRounded - Cents. The unweighted state average Rounding Difference was .04 cents. Not 4 cents, but 4% of a penny. This is positive, indicating a slight gain to McDonald's, but barely greater than zero. This .04 cents is per transaction. Of course, McDonald's does sell a lot of hamburgers.

This study had a lot of limitations, some of which I noted above, so it is certainly not an exhaustive study. But to the extent I did what I could, my friend in the opening paragraph was right: "When a company makes a change, it's probably not going to benefit you."

The R code is as follows:


library(dplyr)

# --- 1. Big Mac Meal Prices by State (approx, USD) ---
meal_prices <- tibble::tribble(
  ~State, ~Price,
  "Alabama", 9.49, "Alaska", 11.59, "Arizona", 12.99, "Arkansas", 8.79,
  "California", 10.69, "Colorado", 9.89, "Connecticut", 9.79, "Delaware", 8.99,
  "Florida", 9.39, "Georgia", 9.49, "Hawaii", 10.99, "Idaho", 9.29,
  "Illinois", 9.59, "Indiana", 8.99, "Iowa", 8.89, "Kansas", 9.09,
  "Kentucky", 7.79, "Louisiana", 9.69, "Maine", 9.19, "Maryland", 9.49,
  "Massachusetts", 9.99, "Michigan", 8.59, "Minnesota", 9.19, "Mississippi", 9.29,
  "Missouri", 8.99, "Montana", 9.09, "Nebraska", 8.59, "Nevada", 9.69,
  "New Hampshire", 8.99, "New Jersey", 9.49, "New Mexico", 9.09, "New York", 9.89,
  "North Carolina", 9.29, "North Dakota", 10.59, "Ohio", 8.89, "Oklahoma", 8.99,
  "Oregon", 10.69, "Pennsylvania", 9.19, "Rhode Island", 9.49, "South Carolina", 9.29,
  "South Dakota", 9.09, "Tennessee", 9.79, "Texas", 9.19, "Utah", 9.39,
  "Vermont", 9.19, "Virginia", 8.99, "Washington", 9.69, "West Virginia", 8.99,
  "Wisconsin", 9.19, "Wyoming", 8.99
)

# --- 2. Combined State + Local Sales Tax Rates (approx, fraction) ---
tax_rates <- tibble::tribble(
  ~State, ~TaxRate,
  "Alabama",0.0944,"Alaska",0.0182,"Arizona",0.0837,"Arkansas",0.0948,
  "California",0.0885,"Colorado",0.0780,"Connecticut",0.0635,"Delaware",0.0000,
  "Florida",0.0700,"Georgia",0.0739,"Hawaii",0.0450,"Idaho",0.0602,
  "Illinois",0.0874,"Indiana",0.0700,"Iowa",0.0689,"Kansas",0.0874,
  "Kentucky",0.0600,"Louisiana",0.1011,"Maine",0.0550,"Maryland",0.0600,
  "Massachusetts",0.0625,"Michigan",0.0600,"Minnesota",0.0749,"Mississippi",0.0707,
  "Missouri",0.0813,"Montana",0.0000,"Nebraska",0.0696,"Nevada",0.0849,
  "New Hampshire",0.0000,"New Jersey",0.0660,"New Mexico",0.0777,"New York",0.0852,
  "North Carolina",0.0698,"North Dakota",0.0696,"Ohio",0.0724,"Oklahoma",0.0908,
  "Oregon",0.0000,"Pennsylvania",0.0634,"Rhode Island",0.0700,"South Carolina",0.0744,
  "South Dakota",0.0640,"Tennessee",0.0961,"Texas",0.0819,"Utah",0.0702,
  "Vermont",0.0636,"Virginia",0.0567,"Washington",0.0947,"West Virginia",0.0648,
  "Wisconsin",0.0572,"Wyoming",0.0556
)

# --- 3. Merge datasets ---
df <- inner_join(meal_prices, tax_rates, by = "State")

# --- 4. Compute totals and apply rounding rule ---
options(dplyr.width = Inf)
df <- df %>%
  mutate(
    # Total in cents, rounded to nearest cent
    Total_cents = round(Price * (1 + TaxRate) * 100),
    Dollars = Total_cents %/% 100,    # whole dollars
    Cents = Total_cents %% 100,       # cents 0-99
    
    # Apply 5-cent rounding rule
    CentsRounded = sapply(Cents, function(x) {
      last_digit <- x %% 10
      if (last_digit %in% c(1,2,6,7)) {
        return(floor(x / 5) * 5)
      } else if (last_digit %in% c(3,4,8,9)) {
        return(ceiling(x / 5) * 5)
      } else {
        return(x)
      }
    }),
    
    # Final total in dollars
    TotalRounded = Dollars + CentsRounded / 100,
    
    # Rounding difference relative to nearest-cent total (in cents)
    RoundingDiff = CentsRounded - Cents
  )
head(df)

# --- 5. Summaries ---
mean_diff <- mean(df$RoundingDiff)   # positive is benefit to company
sd_diff   <- sd(df$RoundingDiff)
avg_abs   <- mean(abs(df$RoundingDiff))

cat("Average rounding difference (¢):", round(mean_diff,3), "\n")
cat("SD of rounding difference (¢):", round(sd_diff,3), "\n")
cat("Average absolute rounding (¢):", round(avg_abs,3), "\n\n")

# Distribution of rounded cents
print(table(df$CentsRounded))

# --- 6. Histogram of rounding differences ---
bin_colors <- c("red", "green", "blue", "yellow", "purple")
hist(df$RoundingDiff,
     breaks = seq(-2.5, 2.5, 0.5),
     col = bin_colors,
     main = "Distribution of Rounding Differences\nPositive is benefit to company",
     xlab = "Rounding Difference (¢)",
     ylab = "Number of States",
     font.lab = 2)

# --- 7. State-by-state table of rounding effects ---
state_table <- as.data.frame(df) %>%
  select(State, Price, TaxRate, Total_cents, TotalRounded, RoundingDiff) %>%
  arrange(desc(RoundingDiff))

print(state_table)

End

Sunday, September 7, 2025

Davey Johnson, former Mets manager, an early proponent of sabermetrics

Davey Johnson, former baseball player and manager, recently passed away. He was notable in the math and data analysis world for being an early proponent of sabermetrics, the use of statistical analysis of baseball to provide insights into player and team performance, long before the 2003 Moneyball book and 2011 movie brought sabermetrics to the public.

Johnson was a math major from Trinity University in San Antonio, TX. In his early twenties while playing for the Baltimore Orioles, Johnson was writing simulations of the Orioles batting order, and he thought the 1969 lineup that was used most often was the sixth worst possible lineup (baseballhall.org/discover-more/news/johnson-davey). He was unable to convince manager Earl Weaver that he should bat second in the lineup.

In 1984, his first year as Mets manager, he used dBase II to compile data on each opposing pitcher and each Mets hitting record against that pitcher. This is routine today, but it was innovative in 1984.

He was particularly interested in the statistic On-Base Percentage (OBP), which measures how often a batter reaches base, and thus creates scoring opportunities. It is more comprehensive than simple Batting Average, and it incorporates the adage, “A walk is (almost) as good as a hit.” OBP equals (Hits + Walks + Hit by Pitch) / (At Bats + Walks + Hit by Pitch + Sacrifice Flies). As a result of his analysis he decided star Mookie Wilson was not the optimal leadoff hitter, and Johnson replaced Wilson at leadoff with Wally Backman.

All the key Mets players improved their OBP in 1984 versus 1983 (except Hernandez who was over .400 in both yeara), but Backman's 1984 OPS of .360 was much higher than Wilson's .308, especially due to Backman's higher number of walks.

Johnson was the Mets manager from 1984 to early 1990, he won at least 90 games in each of his first five seasons, and finished first or second in the NL East all six years. The 1983 Mets had won only 68 games. Part of his success was due to the emergence of Darryl Strawberry and Dwight Gooden, and the acquisitions of Keith Hernandez and Gary Carter. Johnson was known for platooning his players, even his stars. It is debatable how many wins are due to the manager (sportslawblogger.com says at most 5 wins per season), but Johnson's managerial record with the Mets is formidable. His lifteime winning percentage including managerial stints elsewhere is .562.

As a player Johnson was a second baseman from 1965 through 1978, most notably for the Baltimore Orioles and the Atlanta Braves. In Baltimore he was part of a great defensive infield along with Brooks Robinson and Mark Belanger. In Atlanta, as part of a lineup including Hank Aaron and Darrell Evans, he had a season with 43 home runs (42 as second baseman, plus 1 as pinch hitter), tying Rogers Hornsby’s record for most home runs by a second baseman.

Johnson's player statistics are as follows. They are not Hall of Fame calibre, but they are pretty good:

Billy Beane, the subject of the book Moneyball that popularized sabermetrics, played five games for Johnson's 1984 Mets and eight games for Johnson's 1985 Mets. You have to wonder how much Beane was influenced by Johnson.

The R code is as follows:


  
library(Lahman)
library(tidyverse)
library(ggplot2)
data(Batting)
data(People)
data(Teams)

common_theme <- theme(
    legend.position="right",
    plot.title = element_text(size=15, face="bold"),
    plot.subtitle = element_text(size=12.5, face="bold"),
    axis.title = element_text(size=15, face="bold"),
    axis.text = element_text(size=15, face="bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.title = element_text(size=15, face="bold"),
    legend.text = element_text(size=15, face="bold"))

# Mets key hitters 1983, 1984:
stars <- c("Backman", "Brooks", "Hernandez", "Foster", "Strawberry", "Wilson")
df <- Batting %>% 
  filter((yearID == 1983 | yearID == 1984) & teamID == "NYN") %>%
  left_join(People, by = "playerID") %>%
  filter(nameLast %in% stars | (nameLast == "Hernandez" & nameFirst == "Keith")) %>% 
  mutate(
    BA = round(H/AB,3),
    OBP = round((H + BB + HBP) / (AB + BB + HBP + SF),3)
    ) %>%
  select(nameFirst, nameLast, yearID, H, BB, HBP, AB, SF, BA, OBP) %>%
  arrange(yearID, desc(OBP)) 

ggplot(df, aes(x = nameLast, y = OBP, group = factor(yearID), fill = factor(yearID))) +
  geom_bar(stat = "identity", position = "dodge", width = .9) +
  scale_fill_manual(values = c("1983" = "red", "1984" = "blue")) +
  labs(title = "Mets Key Player OBP 1983, 1984",
       x = "Player",
       y = "On Base Percentage",
       fill = "Year") +
  common_theme

# NY Mets number of wins
mets_records <- Teams %>%
  filter(teamID == "NYN", yearID %in% seq(from = 1982, to = 1989, by = 1)) %>% 
  mutate(
    BA = round(H/AB,3),
    OBP = round((H + BB + HBP) / (AB + BB + HBP + SF),3)
    ) %>%
  select(yearID, W, L, H, HR, BA, OBP)
print(mets_records)

mets_records$mgr <- ifelse(mets_records$yearID == 1982 | mets_records$yearID == 1983, "Not Davey", "Davey")
mets_records$mgr <- factor(mets_records$mgr, levels = c("Not Davey", "Davey"))
ggplot(mets_records, aes(x = yearID, y = W, group = factor(mgr), fill = factor(mgr))) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Not Davey" = "red", "Davey" = "blue")) +
  labs(title = "Mets Number of Wins by Year",
       x = "Year",
       y = "Number of Wins") +
       fill = "Manager") +
  common_theme

# find Davey Johnson in People:
People %>% filter(nameFirst == "Davey" & nameLast == "Johnson")
# his People %>% filter(nameFirst == "Davey" & nameLast == "Johnson")
# his playerID is johnsda02
df <- Batting %>% filter(playerID == "johnsda02") %>%
  left_join(People, by = "playerID") %>%
  mutate(
    BA = round(H/AB,3),
    OBP = round((H + BB + HBP) / (AB + BB + HBP + SF),3)
    ) %>%
  select(nameFirst, nameLast, yearID, teamID, H, BB, HBP, AB, SF, HR, BA, OBP)

# Calculate the totals row
totals_row <- df %>%
  summarize(
    nameFirst = "Totals",
    nameLast = "",
    yearID = NA_integer_, # NA is used for columns without a meaningful sum
    H = sum(H),
    BB = sum(BB),
    HBP = sum(HBP),
    AB = sum(AB),
    SF = sum(SF),
    HR = sum(HR),
    BA = round(sum(H) / sum(AB), 3),
    OBP = round(sum(H + BB + HBP) / sum(AB + BB + HBP + SF), 3)
  )
df_with_totals <- bind_rows(df, totals_row)
print(df_with_totals)

End

Saturday, June 21, 2025

It's a linear world - approximately; and the "Rule of 72"

You are probably familiar with the "Rule of 72" in investing: if an investment compounds at annual interest rate i, then the number of years for money to double is approximately 72 / i. For example, if an investment compounds at annual interest rate of 9%, then the investment will double in approximately 72/9 = 8 years. The compound interest formula confirms this: 1 * 1.09 ⁸ = 1.992 (rounded), which is approximately 2.

For fun, ask your banker or investment advisor why the Rule of 72 works.

It works because in the short run, it's a linear world - approximately. A little more mathematical is that, in a small enough neighborhood, any differentiable function is approximately linear.

For the Rule of 72 we are solving for n in the equation 1 * (1 + i/100) ⁿ = 2. We take natural logarithms of both sides:

1 * (1 + i/100) ⁿ = 2
LN (1 + i/100) ⁿ = LN 2
n * LN (1 + i/100) = LN 2
n = LN 2 / LN (1 + i/100)
n ≈ .693 / LN (1 + i/100)
n ≈ .693 / (i/100)
n ≈ 69.3 / i
n ≈ 72 / i

The key step three lines above is that for small values of x, LN (1 + x) is approximately equal to x. One way to see this is that the Taylor series (sorry, not invented by Taylor Swift) expansion around a = 0 is:

LN (1+x) = x - x ² / 2 + x ³ / 3 - x ⁴ / 4 + ... .

For small x, LN (1+x) is approximately equal to x.

In the following plots, the plot on the left shows that the logarithmic function 1 of y = LN (1 + x) is surely different than the linear function 2 of y = LN (1.05) + (1/1.05)*(x - .05). Function 2 is the equation of the tangent line to function 1 at x = .05. (To derive the tangent line, recall from calculus that if y = LN(1 + x), then dy/dx = 1/(1 + x) ). The plot on the right is of the same two functions, but with the x range shrunk to (0.0, 0.1). In this small range the two functions are indistinguishable. In this small neighborhood, the differentiable function y = LN(1 + X) is indistinguishable from its linear tangent.

Of course a "Rule of 69.3", where LN 2 is .693 to three decimal places, would be a better approximation to the compound interest formula than the "Rule of 72", but 72 is close enough and is useful because 72 has so many integer divisors.

In the real world we often assume linearity holds within a small range. For example in cooking a turkey (not that I have ever done this), one website says to allow about 15 minutes per pound (on average!) at 350°F to cook a stuffed turkey. But their time estimates by pound are not linear, so for example their time estimate to cook a 24-pounder is less than twice as long to cook a 12-pounder. Perhaps a better approximation is a power curve y = a * (x ^b) . Presumably there are some physics considerations that are non-linear.

The Newton-Raphson method, x _n+1 = x _n - f (x _n ) / f ^' (x _n ), for finding approximations to roots of a function, is an example of a mathematical use of linearity. N-R iteratively finds the x-intercept of the tangent of the graph of f. A number of R packages have N-R functions.

And before we end the subject of "it's a linear world - approximately", you can certainly measure the straight line distance between two cities, but if you travel by airplane, the airplane is flying the distance between two points on a sphere. Given the latitude and longitude coordinates and the central angle between the two points, the spherical distance can be calculated by the Haversine Formula.

R makes it easy not to settle for linearity. But for those of us who are not doing something where great precision is required, linearity may be just fine,

Here is the R-code to plot the graphs. Note that in R, log(x) is the natural log function. This is also true in Python and Excel.



library(ggplot2)

common_theme <- theme(
     legend.position="right",
     plot.title = element_text(size=15, face="bold"),
     plot.subtitle = element_text(size=12.5, face="bold"),
     axis.title = element_text(size=15, face="bold"),
     axis.text = element_text(size=15, face="bold"),
     axis.title.y = element_text(angle = 0),
     legend.title = element_text(size=15, face="bold"),
     legend.text = element_text(size=15, face="bold"))

a <- 0.0
b <- 1.0
x_values <- seq(a, b, by = 0.01)
func1 <- function(x) {log(1 + x)}
func2 <- function(x) {log(1.05) + (1/1.05)*(x - .05)}
df1 <- data.frame(x = x_values,
           y = log(1 +x_values),
           func = "Function 1")
df2 <- data.frame(x = x_values,
           y = log(1.05) + (1/1.05)*(x_values - .05),
           func = "Function 2")
combined_df <- rbind(df1, df2)

ggplot(combined_df, aes(x = x, y = y, color = func)) +
   geom_line(size = 1.5) +
   scale_color_manual(values = c("Function 1" = "#E41A1C", 
            "Function 2" = "royalblue4")) +
   ylim(c(0, 1)) + 
   labs(title = "Plot of LN and Linear Functions",
              x = "x",
              y = "y",
              color = "Functions") +
  common_theme

ggplot(combined_df, aes(x = x, y = y, color = func)) +
   geom_line(size = 1.5) +
   scale_color_manual(values = c("Function 1" = "#E41A1C", 
            "Function 2" = "royalblue4")) +
   xlim(c(0, 0.1)) +
   ylim(c(0, 0.11)) + 
   labs(title = "Plot of LN Function and its Tangent Line at x = .05",
              x = "x",
              y = "y",
              color = "Functions") +
  common_theme

End

Sunday, June 15, 2025

Hebrew Gematria in R

Gematria is a Greek word for the practice of assigning numerical values to letters. In Hebrew it has been used to interpret Jewish texts, particularly the Bible, by attempting to discover hidden meanings or connections between words and concepts.

One example of gematria is in Genesis 14:14 when Abram takes 318 men to rescue his relative. The gematria values of Abram's servant Eliezer and of the Hebrew word for speaking are both 318. Perhaps there is more to this passage than just the 318 men. See https://stljewishlight.org/arts-entertainment/understanding-hebrew-numerology-and-the-secrets-of-the-torah/#:~:text=One%20famous%20example%20of%20gematria,connects%20to%20the%20heavenly%20universe .

The table of 22 Hebrew letters and their gematria values is the following. I have omitted the five Hebrew letters that are formed differently when they appear as the last letter of a word, but their values are the same as the corresponding non last letter. Gematria does not use the dots and dashes vowel symbols that appear below the Hebrew letter.

Although gematria is not universally accepted by Jews and Jewish scholars, it is common in Jewish culture to make donations in multiples of $18: 18 is the gematria value of the Hebrew word "chai" (חי) which means "life". ח (chet equals 8) + י (yod equals 10).

There is an R package gemmatria (two MM's), but it is not currently available on CRAN due to encoding issues about non-ASCII characters. It is installable from GitHub using the devtools package with the following command: devtools::install_github("benyamindsmith/gemmatria", force=TRUE) . As an example of the package, the R code for the gematria value of yaakov, יעקב , is

library(gemmatria)
get_gemmatria("יעקב")

which gives 182. This is calculated as the sum (yod = 10) + (ayin = 70) + (qof = 100) + (bet = 2).

One online source that will accept a Hebrew word and calculate its gematria value is https://www.torahcalc.com/tools/gematria .

Although the above website and gemmatria package are sufficient for most uses, I thought it would be instructive to build my own gematria calculator in R.

I began by creating a dataframe of the above table. I found the right-to-left nature of entering 27 Hebrew letters, surrounded by quotation marks and separated by commas, too cumbersome. Instead I used their Unicode hex equivalents, such as א is u05D0 ; print("\u05D0") will print alef, א.

Then given a Hebrew word, I split the word into its individual characters, and I used the match command to look up and sum the gematria values. Note that the Hebrew word must not contain vowels, or else there will be no match.

I also added a little exploration of five Biblical descendants of Abraham and the number 26. 26 has a lot of significance in the Bible.

26 is the gematria value of the four letter unspeakable Hebrew name for G-d: Yod + He + Vav + He = 10 + 5 + 6 + 5 = 26.
26 is the number of generations from Adam to Moses.
26 is the number of generations from David to Jesus according to one geneaology (but interpretations differ).
There are many examples of Biblical gematria values of 26.

26 also has some interesting mathematical properties such as it is the sum of a cube's 6 faces, 12 edges, and 8 vertices, but I will leave such mathematical properties for another time.

For five of the Biblical descendants of Abraham:

Gematria of Isaac יצחק is 208 is multiple of 26.
Gematria of Jacob יעקב is 182 is multiple of 26.
Gematria of Joseph יוסף is 156 is multiple of 26.
Gematria of Ishmael ישמעאל is 451 is not multiple of 26.
Gematria of Esau עשו is 376 is not multiple of 26.

I will leave an interpretation of the above to the reader.

There are many other examples of the use of gematria to help explain the Bible. Also, I have limited this to Standard Gematria; there are other forms.

Is gematria clever wordplay and coincidence? Or were the biblical writers trying to tell us something?

The R code is as follows:


# Define Hebrew letters, names, and gematria numbers
df <- data.frame(
  hebrew = c("\u05D0", "\u05D1", "\u05D2", "\u05D3", "\u05D4",
             "\u05D5", "\u05D6", "\u05D7", "\u05D8", "\u05D9",
             "\u05DA", "\u05DB", "\u05DC", "\u05DD", "\u05DE",
             "\u05DF", "\u05E0", "\u05E1", "\u05E2", "\u05E3",
             "\u05E4", "\u05E5", "\u05E6", "\u05E7", "\u05E8",
             "\u05E9", "\u05EA"),
  
  name = c("Alef", "Bet", "Gimel", "Dalet", "He", "Vav", "Zayin", 
           "Chet", "Tet", "Yod", "Final_Kaf", "Kaf", "Lamed", "Final_Mem", 
           "Mem", "Final_Nun", "Nun", "Samech", "Ayin", "Final_Pe", "Pe", 
           "Final_Tsadi", "Tsadi", "Qof", "Resh", "Shin", "Tav"),
  
  number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 20, 30, 40, 40, 
             50, 50, 60, 70, 80, 80, 90, 90, 100, 200, 300, 400)
)

# Define words;  no vowels!
yitzchak <- "יצחק"
yaakov <- "יעקב"
yosef <- "יוסף"
ishmael <- "ישמעאל"
esau <- "עשו"


words <- c(yitzchak, yaakov, yosef, ishmael, esau)

gematria <- function(word) {
   # Split word into characters
   v <- strsplit(word, split = "")[[1]]
   # Calculate Gematria of word using vectorized lookup
   gematria_value <- sum(df$number[match(v, df$hebrew)])
   if (gematria_value %% 26 == 0) divisibility = "is multiple of 26" 
     else divisibility = "is not multiple of 26"
  # Output result
  cat("Gematria of", word, "is", gematria_value, divisibility, "\n")
}

for (word in words) {
  gematria(word)
}

End

Monday, June 2, 2025

LGBTQ+ Pride Month

This is the start of Pride month. I have friends, acquaintances, and family members in most if not all LGBTQ+ categories, and probably you do too although you may not know it. I wish they all could live lives of being who they want to be without being socially or legally hassled. I made this flag in computer language R.

Here is the R code:

 # Draw empty plot
plot(NULL, col = "white", xlim = c(0, 12), ylim = c(0, 8), xlab="", xaxt="n", ylab="", yaxt="n", 
    main="LGBTQ+ Rainbow Flag" ) 

polygon(x = c(3.33, 12, 12, 2), y = c(6.67, 6.67, 8, 8), col = "#e40303")  # coordinates counter-
    clockwise; red                  
polygon(x = c(4.67, 12, 12, 3.33), y = c(5.33, 5.33, 6.67, 6.67), col = "#ff8c00") # orange
polygon(x = c(6, 12, 12, 4.67), y = c(4, 4, 5.33, 5.33), col = "#ffed00") # yellow
polygon(x = c(4.67, 12, 12, 6), y = c(2.67, 2.67, 4, 4), col = "#008026") # green
polygon(x = c(3.33, 12, 12, 4.67), y = c(1.33, 1.33, 2.67, 2.67), col = "#004dff") # blue
polygon(x = c(2, 12, 12, 3.33), y = c(0, 0, 1.33, 1.33), col = "#750787") # purple

polygon(x = c(1, 2, 6, 2, 1, 5), y = c(0, 0, 4, 8, 8,4), col = "#000000") # black
polygon(x = c(0, 1, 5, 1, 0, 4), y = c(0, 0, 4, 8, 8, 4), col = "#613915") # brown
polygon(x = c(0, 0, 4, 0, 0, 3), y = c(1.33, 0, 4, 8, 6.67, 4), col = "#74d7ee") # light blue
polygon(x = c(0, 0, 3, 0, 0, 2), y = c(2.67, 1.33, 4, 6.67, 5.33, 4), col = "#ffafc8") # pink
polygon(x = c(0, 0, 2), y = c(5.33, 2.67, 4), col = "#ffffff") # white

End

How do I count thee? Let me count the ways?

Saturday, July 11, 2026

Saturday, June 27, 2026

Tuesday, March 31, 2026

Friday, March 13, 2026

Monday, February 23, 2026

Sunday, February 1, 2026

So who is Irmaa?

Sunday, January 25, 2026

Friday, January 2, 2026

Tuesday, December 9, 2025

Sunday, November 30, 2025

Saturday, November 22, 2025

Thursday, October 23, 2025

Sunday, September 7, 2025

Saturday, June 21, 2025

Sunday, June 15, 2025

Monday, June 2, 2025