How do I count thee? Let me count the ways?

Sunday, November 30, 2025

Auto accidents and drowsiness

      I've been thinking about Thanksgiving and auto accidents.

      I hope we all had a good Thanksgiving. But we often feel drowsy after the Thanksgiving meal. The feeling of sleepiness may be due to overeating high-carbohydrate and high-fat foods and to alcohol. Turkey's effect on melatonin (the sleep-regulating hormone) is debatable.

      I wondered if there are more auto accidents due to drowsiness on Thanksgiving. The only source I could find says Thanksgiving itself is usually safer than a normal weekday because many people are off the road visiting family. The same source says the "crash rate" (accident frequency?) is higher the Sunday after Thanksgiving compared with the previous Sunday.

      AAA cites a study suggesting that about 9.5% of all car crashes involve a drowsy driver. Many drivers may underestimate their own degree of fatigue.

      My 2023 Hyundai gives me a monthly numerical grade on my driving (yes, really!), including on my lane changes. While my car probably knows too much about me ("I know you were eating that muffin while you were driving ... "), I don't think my car comes with a drowsiness alert system that other cars do have.

      Not exactly related to drowsiness, but I've also been thinking about the need to concentrate for long periods of time. For me, as the years go on, I find this more difficult to do. I try to commit to continuing education in my various professional organizations, but it does seem harder to concentrate. I continue to take some online courses, but I would be hard-pressed to memorize a large quantity of material for a traditional exam.

      But like a lot of things, maybe times have changed. I came across a website discussing dietary supplements and preparing for math exams.

End

Saturday, November 22, 2025

Twenty Questions and Decision Trees

      Most of us have probably played the game Twenty Questions. The answerer chooses something, and the other players try to guess it by asking yes or no questions. The TV show "What's My Line" is an example of this where the players try to guess a contestant's occupation.

      Contrary to my kids' attempts, the strategy should be to ask questions that split the remaining possibilities roughly in half each time.
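
      As a quick back-of-the-envelope check of the halving strategy: twenty perfectly halving yes/no questions can distinguish over a million possibilities, and 47 presidents could in principle be pinned down in six:

2^20                 # 1,048,576 possibilities with twenty halving questions
ceiling(log2(47))    # 6 halving questions suffice for 47 presidents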

      This seems very similar to a machine learning Decision Tree, although with an interesting distinction.

      A decision tree cheats. The decision tree algorithm knows the answer (df$target = 1). The algorithm attempts to find the best feature and split value to separate df$target = 1 from df$target = 0 at each node, but it needs to know the right answer in order to ask the best questions. This is why, if the game is played multiple times with a different President as the target each time, the algorithm may choose different features and split values.

      Nevertheless, I thought it would be fun to program a decision tree model with the US Presidents. I found some data on Presidents. I decided some variables had too many values (high cardinality - there were a lot of political party names in the 1800s), so I grouped some values to reduce the number of unique values.

      I began with a random integer between 1 and 47 to select a President, which selected President Hoover, but I found a different President would create a tree closer to the questions I would have asked if I were a player. So I selected President Reagan to get a more interesting tree.

      (I considered selecting President Garfield to be able to ask the question, "Is the president credited with a unique proof of the Pythagorean Theorem?", but I decided that was a little quirky, even for me.)

      Here is the resulting tree for President Reagan:

      Here is the resulting variable importance plot. Note that the variables are not in the same order as the tree splits. I understand that variable importance is based on the sum of the improvements over all nodes where the variable was used as a splitter.

      Here is my R code:


library(dplyr)
library(rpart)
library(rpart.plot)
library(ggplot2)
df <- read.csv("prez.csv", header=TRUE)
# data file available at github: prez.csv
set.seed(123)
# r <- sample(1:nrow(df),1)
r <- 40   # deliberate choice to get longer tree
answer <- df$LastName[r]
print(paste("The target president is:", answer))
df$target <- rep(0, nrow(df))
df$target[r] <- 1

# Feature engineering:
df <- df %>%
  # A. Categorical Reduction
  mutate(
    Party = case_when(
        Party %in% c("Democratic") ~ "Democratic", 
        Party %in% c("Republican") ~ "Republican", 
        TRUE ~ "Other"),
    Occupation = case_when(
        Occupation %in% c("Businessman", "Lawyer") ~ Occupation, 
        TRUE ~ "Other"),
    State = case_when(
        State %in% c("New York") ~ "NY", 
        State %in% c("Ohio") ~ "OH", 
        State %in% c("Virginia") ~ "VA", 
        State %in% c("Massachusetts") ~ "MA", 
        State %in% c("Texas") ~ "TX", TRUE ~ "Other"),
    Religion = case_when(
        Religion %in% c("Episcopalian", "Presbyterian", "Unitarian", "Baptist", "Methodist") ~ "Main_Prot", 
        TRUE ~ "Other"),
    
    # B. Year/Century Binning using cut()
    DOB = cut(DOB, breaks = c(-Inf, 1800, 1900, 2000, Inf), 
        labels = c("18th century", "19th century", "20th century", "21st century"), right = FALSE),
    DOD = cut(DOD, breaks = c(-Inf, 1800, 1900, 2000, Inf), 
        labels = c("18th century", "19th century", "20th century", "21st century"), right = FALSE),
    Began = cut(Began, breaks = c(-Inf, 1800, 1900, 2000, Inf), 
        labels = c("18th century", "19th century", "20th century", "21st century"), right = FALSE),
    Ended = cut(Ended, breaks = c(-Inf, 1800, 1900, 2000, Inf), 
        labels = c("18th century", "19th century", "20th century", "21st century"), right = FALSE)
  ) %>%

  # C. Convert all new/existing binary/categorical variables to factor
  mutate_at(vars(Party, State, Occupation, Religion, Assassinated, Military, Terms_GT_1, Pres_During_War, Was_Veep, DOB, DOD, Began, Ended), as.factor)


# selected variables 
formula <- as.formula(target ~ Began + State + Party + Occupation + Pres_During_War)
# Using aggressive control settings to force a maximal, unpruned tree
prez_tree <- rpart(formula, data = df, method = "class",
                   control = rpart.control(cp = 0.001, minsplit = 2, minbucket = 1))
rpart.plot(prez_tree, type = 4, extra = 101, main = "President Twenty Questions")

# check Reagan
df %>% filter(Began == "20th century" & 
              !State %in% c("MA", "NY", "OH", "TX") &
              Party == "Republican" &
              !Occupation %in% c( "Businessman", "Lawyer"))

variable_importance <- prez_tree$variable.importance
importance_df <- data.frame(
  Variable = names(variable_importance),
  Importance = variable_importance
)

importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]

common_theme <- theme(
        legend.position="NULL",
        plot.title = element_text(size=15, face="bold"),
        plot.subtitle = element_text(size=12.5, face="bold"),
        axis.title = element_text(size=15, face="bold"),
        axis.text = element_text(size=15, face="bold"),
        legend.title = element_text(size=15, face="bold"),
        legend.text = element_text(size=15, face="bold"))


ggplot(importance_df, aes(x = factor(Variable, levels = rev(Variable)), y = Importance)) +
  geom_col(aes(fill = Variable)) + 
  coord_flip() +
  labs(title = "20 Questions Variable Importance",
       x = "Variable",
       y = "Mean Decrease Gini") +
  common_theme

# loop through all presidents to see different first split vars
library(purrr)

### 1. Define the Analysis Function ###
# The function is modified to return a data frame row for clarity
get_first_split_row <- function(df, r) {
  # Temporarily set the target for the current president
  df$target <- 0
  df$target[r] <- 1
  tree <- rpart(formula, data = df, method = "class",
                control = rpart.control(cp = 0.001, minsplit = 2, minbucket = 1))
  frame <- tree$frame
  
  # Determine the result
  if (nrow(frame) > 1) {
    first_split_var <- as.character(frame$var[1])
  } else {
    first_split_var <- "No split"
  }
  
  # Return a single-row data frame
  return(data.frame(
    President = df$LastName[r],
    First_Split_Variable = first_split_var
  ))
}

### 2. Run the Analysis and Combine Results ###
# Create a list of row indices to iterate over
indices_to_run <- 1:nrow(df)

# Use map_dfr to apply the function to every index and combine the results 
# into a single data frame (dfr = data frame row bind)
first_split_results_df <- map_dfr(indices_to_run, ~ get_first_split_row(df, .x))

### 3. Display the Table and Original Analysis ###
# Display the resulting table:
print(first_split_results_df)

print(table(first_split_results_df$First_Split_Variable))
  
  
End

Thursday, October 23, 2025

Change is good, so don't change my change

      One of my former insurance colleagues once said, "When a company makes a change, it's probably not going to benefit you."

      Assuming the above photo is legitimate (and of course nowadays who knows), McDonald's says it will round cash change to the nearest five cents, and McD is probably not the only one doing it. If your change has last digit {1, 2, 6, 7}, McD will round down and you will lose 1 or 2 cents; if it has last digit {3, 4, 8, 9}, McD will round up and you will gain 1 or 2 cents; if it has last digit {0, 5}, rounding has no effect. So for 40% of possible last digits you lose, for 40% you gain, and for 20% there is no effect. The question is: are the ten last digits 0 through 9 equally likely?
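
      A quick check shows the last-digit rule above is simply "round the cents to the nearest nickel":

last_digit <- 0:9
rounded <- round(last_digit / 5) * 5        # round the cents to the nearest nickel
data.frame(last_digit, rounded, customer_gain = rounded - last_digit)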

      All I need is a nice sample of McD receipts. On cash transactions. Well, that's not going to happen. Let's try a different approach.

      Similar to other retail prices, I assume many meal prices end in 99 cents, such as $5.99, to appear less expensive. (I was never asked to do this as an actuary.) Marketers know there is a left-digit effect, where customers focus on the leftmost number, so a price like $5.99 feels significantly cheaper than $6.00. See Psychological pricing. However, even if a majority of meal prices end in 99 cents, I believe the effect of that majority will be neutralized by the following.

      I have not been to a McDonald's in years. But I do know there are many different menu items; I have no idea of the relative frequency of, say, a Big Mac versus a Chicken McNuggets purchase, let alone the frequency of every possible combination of menu items; prices vary by location because franchises set their own pricing; and there is also a sales tax that differs by state and sometimes by city within a state. No doubt a McD data analyst has access to all this data, but I don't. So what follows is not an exact analysis, but rather an approach that is intended to be unbiased.

      At first I thought Benford's Law might be useful. This law applies to the distribution of first digits, not last digits, and says that under certain conditions (such as when values span multiple orders of magnitude, which is probably not true for a place like McDonald's), P(1) = 30.1%, P(2) = 17.6%, P(3) = 12.5%, etc. This is interesting, but not useful here. See Benford for more on Benford.
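
      For reference, those Benford percentages come from P(d) = log10(1 + 1/d), which one line of R reproduces:

round(100 * log10(1 + 1 / (1:9)), 1)   # 30.1 17.6 12.5 9.7 7.9 6.7 5.8 5.1 4.6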

      In How to Solve It, George Polya suggests, "If you cannot solve the proposed problem, try to solve first some related problem."

      I discovered if I limit the problem to a single meal item, a McDonald's Big Mac Meal (MBMM), I could get an average (well, more of a representative) price by state. I could also apply an average (again, representative) sales tax by state.

      For each state, I took the MBMM price, added tax, and applied the McD rounding rule to the last digit. For each state I define Cents as the after-tax cents portion of the MBMM price, CentsRounded as the cents portion after the McD rounding rule, and Rounding Difference = CentsRounded - Cents. The unweighted state average Rounding Difference was .04 cents. Not 4 cents, but 4% of a penny. This is positive, indicating a slight gain to McDonald's, but barely greater than zero. This .04 cents is per transaction. Of course, McDonald's does sell a lot of hamburgers.

      This study had a lot of limitations, some of which I noted above, so it is certainly not an exhaustive study. But to the extent I did what I could, my friend in the opening paragraph was right: "When a company makes a change, it's probably not going to benefit you."

      The R code is as follows:


library(dplyr)

# --- 1. Big Mac Meal Prices by State (approx, USD) ---
meal_prices <- tibble::tribble(
  ~State, ~Price,
  "Alabama", 9.49, "Alaska", 11.59, "Arizona", 12.99, "Arkansas", 8.79,
  "California", 10.69, "Colorado", 9.89, "Connecticut", 9.79, "Delaware", 8.99,
  "Florida", 9.39, "Georgia", 9.49, "Hawaii", 10.99, "Idaho", 9.29,
  "Illinois", 9.59, "Indiana", 8.99, "Iowa", 8.89, "Kansas", 9.09,
  "Kentucky", 7.79, "Louisiana", 9.69, "Maine", 9.19, "Maryland", 9.49,
  "Massachusetts", 9.99, "Michigan", 8.59, "Minnesota", 9.19, "Mississippi", 9.29,
  "Missouri", 8.99, "Montana", 9.09, "Nebraska", 8.59, "Nevada", 9.69,
  "New Hampshire", 8.99, "New Jersey", 9.49, "New Mexico", 9.09, "New York", 9.89,
  "North Carolina", 9.29, "North Dakota", 10.59, "Ohio", 8.89, "Oklahoma", 8.99,
  "Oregon", 10.69, "Pennsylvania", 9.19, "Rhode Island", 9.49, "South Carolina", 9.29,
  "South Dakota", 9.09, "Tennessee", 9.79, "Texas", 9.19, "Utah", 9.39,
  "Vermont", 9.19, "Virginia", 8.99, "Washington", 9.69, "West Virginia", 8.99,
  "Wisconsin", 9.19, "Wyoming", 8.99
)

# --- 2. Combined State + Local Sales Tax Rates (approx, fraction) ---
tax_rates <- tibble::tribble(
  ~State, ~TaxRate,
  "Alabama",0.0944,"Alaska",0.0182,"Arizona",0.0837,"Arkansas",0.0948,
  "California",0.0885,"Colorado",0.0780,"Connecticut",0.0635,"Delaware",0.0000,
  "Florida",0.0700,"Georgia",0.0739,"Hawaii",0.0450,"Idaho",0.0602,
  "Illinois",0.0874,"Indiana",0.0700,"Iowa",0.0689,"Kansas",0.0874,
  "Kentucky",0.0600,"Louisiana",0.1011,"Maine",0.0550,"Maryland",0.0600,
  "Massachusetts",0.0625,"Michigan",0.0600,"Minnesota",0.0749,"Mississippi",0.0707,
  "Missouri",0.0813,"Montana",0.0000,"Nebraska",0.0696,"Nevada",0.0849,
  "New Hampshire",0.0000,"New Jersey",0.0660,"New Mexico",0.0777,"New York",0.0852,
  "North Carolina",0.0698,"North Dakota",0.0696,"Ohio",0.0724,"Oklahoma",0.0908,
  "Oregon",0.0000,"Pennsylvania",0.0634,"Rhode Island",0.0700,"South Carolina",0.0744,
  "South Dakota",0.0640,"Tennessee",0.0961,"Texas",0.0819,"Utah",0.0702,
  "Vermont",0.0636,"Virginia",0.0567,"Washington",0.0947,"West Virginia",0.0648,
  "Wisconsin",0.0572,"Wyoming",0.0556
)

# --- 3. Merge datasets ---
df <- inner_join(meal_prices, tax_rates, by = "State")

# --- 4. Compute totals and apply rounding rule ---
options(dplyr.width = Inf)
df <- df %>%
  mutate(
    # Total in cents, rounded to nearest cent
    Total_cents = round(Price * (1 + TaxRate) * 100),
    Dollars = Total_cents %/% 100,    # whole dollars
    Cents = Total_cents %% 100,       # cents 0-99
    
    # Apply 5-cent rounding rule
    CentsRounded = sapply(Cents, function(x) {
      last_digit <- x %% 10
      if (last_digit %in% c(1,2,6,7)) {
        return(floor(x / 5) * 5)
      } else if (last_digit %in% c(3,4,8,9)) {
        return(ceiling(x / 5) * 5)
      } else {
        return(x)
      }
    }),
    
    # Final total in dollars
    TotalRounded = Dollars + CentsRounded / 100,
    
    # Rounding difference relative to nearest-cent total (in cents)
    RoundingDiff = CentsRounded - Cents
  )
head(df)

# --- 5. Summaries ---
mean_diff <- mean(df$RoundingDiff)   # positive is benefit to company
sd_diff   <- sd(df$RoundingDiff)
avg_abs   <- mean(abs(df$RoundingDiff))

cat("Average rounding difference (¢):", round(mean_diff,3), "\n")
cat("SD of rounding difference (¢):", round(sd_diff,3), "\n")
cat("Average absolute rounding (¢):", round(avg_abs,3), "\n\n")

# Distribution of rounded cents
print(table(df$CentsRounded))

# --- 6. Histogram of rounding differences ---
bin_colors <- c("red", "green", "blue", "yellow", "purple")
hist(df$RoundingDiff,
     breaks = seq(-2.5, 2.5, 0.5),
     col = bin_colors,
     main = "Distribution of Rounding Differences\nPositive is benefit to company",
     xlab = "Rounding Difference (¢)",
     ylab = "Number of States",
     font.lab = 2)

# --- 7. State-by-state table of rounding effects ---
state_table <- as.data.frame(df) %>%
  select(State, Price, TaxRate, Total_cents, TotalRounded, RoundingDiff) %>%
  arrange(desc(RoundingDiff))

print(state_table)
  
  
End

Sunday, September 7, 2025

Davey Johnson, former Mets manager, an early proponent of sabermetrics

      Davey Johnson, former baseball player and manager, recently passed away. He was notable in the math and data analysis world for being an early proponent of sabermetrics, the use of statistical analysis of baseball to provide insights into player and team performance, long before the 2003 Moneyball book and 2011 movie brought sabermetrics to the public.

      Johnson was a math major from Trinity University in San Antonio, TX. In his early twenties while playing for the Baltimore Orioles, Johnson was writing simulations of the Orioles batting order, and he thought the 1969 lineup that was used most often was the sixth worst possible lineup (baseballhall.org/discover-more/news/johnson-davey). He was unable to convince manager Earl Weaver that he should bat second in the lineup.

      In 1984, his first year as Mets manager, he used dBase II to compile data on each opposing pitcher and each Mets hitting record against that pitcher. This is routine today, but it was innovative in 1984.

      He was particularly interested in the statistic On-Base Percentage (OBP), which measures how often a batter reaches base, and thus creates scoring opportunities. It is more comprehensive than simple Batting Average, and it incorporates the adage, “A walk is (almost) as good as a hit.” OBP equals (Hits + Walks + Hit by Pitch) / (At Bats + Walks + Hit by Pitch + Sacrifice Flies). As a result of his analysis he decided star Mookie Wilson was not the optimal leadoff hitter, and Johnson replaced Wilson at leadoff with Wally Backman.
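
      The OBP definition translates directly into a small R helper. The counts in the example call are made up for illustration, not actual statistics:

obp <- function(H, BB, HBP, AB, SF) {
  round((H + BB + HBP) / (AB + BB + HBP + SF), 3)
}
obp(H = 120, BB = 60, HBP = 2, AB = 450, SF = 4)   # 0.353 with these made-up counts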

      All the key Mets players improved their OBP in 1984 versus 1983 (except Hernandez, who was over .400 in both years), but Backman's 1984 OBP of .360 was much higher than Wilson's .308, largely due to Backman's higher number of walks.

      Johnson was the Mets manager from 1984 to early 1990. He won at least 90 games in each of his first five seasons and finished first or second in the NL East all six years. The 1983 Mets had won only 68 games. Part of his success was due to the emergence of Darryl Strawberry and Dwight Gooden, and the acquisitions of Keith Hernandez and Gary Carter. Johnson was known for platooning his players, even his stars. It is debatable how many wins are due to the manager (sportslawblogger.com says at most 5 wins per season), but Johnson's managerial record with the Mets is formidable. His lifetime winning percentage, including managerial stints elsewhere, is .562.

      As a player Johnson was a second baseman from 1965 through 1978, most notably for the Baltimore Orioles and the Atlanta Braves. In Baltimore he was part of a great defensive infield along with Brooks Robinson and Mark Belanger. In Atlanta, as part of a lineup including Hank Aaron and Darrell Evans, he had a season with 43 home runs (42 as second baseman, plus 1 as pinch hitter), tying Rogers Hornsby’s record for most home runs by a second baseman.

      Johnson's player statistics are as follows. They are not Hall of Fame caliber, but they are pretty good:

      Billy Beane, the subject of the book Moneyball that popularized sabermetrics, played five games for Johnson's 1984 Mets and eight games for Johnson's 1985 Mets. You have to wonder how much Beane was influenced by Johnson.

      The R code is as follows:


  
library(Lahman)
library(tidyverse)
library(ggplot2)
data(Batting)
data(People)
data(Teams)

common_theme <- theme(
    legend.position="right",
    plot.title = element_text(size=15, face="bold"),
    plot.subtitle = element_text(size=12.5, face="bold"),
    axis.title = element_text(size=15, face="bold"),
    axis.text = element_text(size=15, face="bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.title = element_text(size=15, face="bold"),
    legend.text = element_text(size=15, face="bold"))

# Mets key hitters 1983, 1984:
stars <- c("Backman", "Brooks", "Hernandez", "Foster", "Strawberry", "Wilson")
df <- Batting %>% 
  filter((yearID == 1983 | yearID == 1984) & teamID == "NYN") %>%
  left_join(People, by = "playerID") %>%
  filter(nameLast %in% stars | (nameLast == "Hernandez" & nameFirst == "Keith")) %>% 
  mutate(
    BA = round(H/AB,3),
    OBP = round((H + BB + HBP) / (AB + BB + HBP + SF),3)
    ) %>%
  select(nameFirst, nameLast, yearID, H, BB, HBP, AB, SF, BA, OBP) %>%
  arrange(yearID, desc(OBP)) 

ggplot(df, aes(x = nameLast, y = OBP, group = factor(yearID), fill = factor(yearID))) +
  geom_bar(stat = "identity", position = "dodge", width = .9) +
  scale_fill_manual(values = c("1983" = "red", "1984" = "blue")) +
  labs(title = "Mets Key Player OBP 1983, 1984",
       x = "Player",
       y = "On Base Percentage",
       fill = "Year") +
  common_theme

# NY Mets number of wins
mets_records <- Teams %>%
  filter(teamID == "NYN", yearID %in% seq(from = 1982, to = 1989, by = 1)) %>% 
  mutate(
    BA = round(H/AB,3),
    OBP = round((H + BB + HBP) / (AB + BB + HBP + SF),3)
    ) %>%
  select(yearID, W, L, H, HR, BA, OBP)
print(mets_records)

mets_records$mgr <- ifelse(mets_records$yearID == 1982 | mets_records$yearID == 1983, "Not Davey", "Davey")
mets_records$mgr <- factor(mets_records$mgr, levels = c("Not Davey", "Davey"))
ggplot(mets_records, aes(x = yearID, y = W, group = factor(mgr), fill = factor(mgr))) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Not Davey" = "red", "Davey" = "blue")) +
  labs(title = "Mets Number of Wins by Year",
       x = "Year",
       y = "Number of Wins") +
       fill = "Manager") +
  common_theme

# find Davey Johnson in People:
People %>% filter(nameFirst == "Davey" & nameLast == "Johnson")
# his playerID is johnsda02
df <- Batting %>% filter(playerID == "johnsda02") %>%
  left_join(People, by = "playerID") %>%
  mutate(
    BA = round(H/AB,3),
    OBP = round((H + BB + HBP) / (AB + BB + HBP + SF),3)
    ) %>%
  select(nameFirst, nameLast, yearID, teamID, H, BB, HBP, AB, SF, HR, BA, OBP)

# Calculate the totals row
totals_row <- df %>%
  summarize(
    nameFirst = "Totals",
    nameLast = "",
    yearID = NA_integer_, # NA is used for columns without a meaningful sum
    H = sum(H),
    BB = sum(BB),
    HBP = sum(HBP),
    AB = sum(AB),
    SF = sum(SF),
    HR = sum(HR),
    BA = round(sum(H) / sum(AB), 3),
    OBP = round(sum(H + BB + HBP) / sum(AB + BB + HBP + SF), 3)
  )
df_with_totals <- bind_rows(df, totals_row)
print(df_with_totals)


End

Saturday, June 21, 2025

It's a linear world - approximately; and the "Rule of 72"

      You are probably familiar with the "Rule of 72" in investing: if an investment compounds at annual interest rate i (in percent), then the number of years for money to double is approximately 72 / i. For example, if an investment compounds at an annual interest rate of 9%, then the investment will double in approximately 72/9 = 8 years. The compound interest formula confirms this: 1 * 1.09^8 = 1.993 (rounded), which is approximately 2.

      For fun, ask your banker or investment advisor why the Rule of 72 works.

      It works because in the short run, it's a linear world - approximately. A little more mathematical is that, in a small enough neighborhood, any differentiable function is approximately linear.

      For the Rule of 72 we are solving for n in the equation 1 * (1 + i/100)^n = 2. We take natural logarithms of both sides:

1 * (1 + i/100)^n = 2
LN (1 + i/100)^n = LN 2
n * LN (1 + i/100) = LN 2
n = LN 2 / LN (1 + i/100)
n ≈ .693 / LN (1 + i/100)
n ≈ .693 / (i/100)
n ≈ 69.3 / i
n ≈ 72 / i
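
      A quick R check compares the exact doubling time LN 2 / LN (1 + i/100) with the 72 / i approximation for a few rates:

i <- c(3, 6, 9, 12)                          # interest rates in percent
data.frame(i,
           exact  = round(log(2) / log(1 + i/100), 2),
           rule72 = round(72 / i, 2))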

      The key step, going from n ≈ .693 / LN (1 + i/100) to n ≈ .693 / (i/100), is that for small values of x, LN (1 + x) is approximately equal to x. One way to see this is that the Taylor series (sorry, not invented by Taylor Swift) expansion around a = 0 is:

LN (1+x) = x - x^2/2 + x^3/3 - x^4/4 + ... .

For small x, LN (1+x) is approximately equal to x.

      In the following plots, the plot on the left shows that the logarithmic function 1, y = LN (1 + x), is clearly different from the linear function 2, y = LN (1.05) + (1/1.05)*(x - .05). Function 2 is the equation of the tangent line to function 1 at x = .05. (To derive the tangent line, recall from calculus that if y = LN(1 + x), then dy/dx = 1/(1 + x).) The plot on the right shows the same two functions, but with the x range shrunk to (0.0, 0.1). In this small range the two functions are indistinguishable. In this small neighborhood, the differentiable function y = LN(1 + x) is indistinguishable from its linear tangent.

      Of course a "Rule of 69.3", where LN 2 is .693 to three decimal places, would be a better approximation to the compound interest formula than the "Rule of 72", but 72 is close enough and is useful because 72 has so many integer divisors.

      In the real world we often assume linearity holds within a small range. For example, in cooking a turkey (not that I have ever done this), one website says to allow about 15 minutes per pound (on average!) at 350°F to cook a stuffed turkey. But their time estimates by pound are not linear, so for example their time estimate for a 24-pounder is less than twice their estimate for a 12-pounder. Perhaps a better approximation is a power curve y = a * x^b. Presumably there are some physics considerations that are non-linear.

      The Newton-Raphson method, x_(n+1) = x_n - f(x_n) / f'(x_n), for finding approximations to roots of a function, is an example of a mathematical use of linearity. N-R iteratively finds the x-intercept of the tangent of the graph of f. A number of R packages have N-R functions.
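
      Here is a minimal Newton-Raphson sketch in base R (my own illustration, not any particular package's function), finding √2 as a root of f(x) = x^2 - 2:

newton_raphson <- function(f, fprime, x0, tol = 1e-10, max_iter = 100) {
  x <- x0
  for (i in 1:max_iter) {
    x_new <- x - f(x) / fprime(x)    # x-intercept of the tangent line at x
    if (abs(x_new - x) < tol) return(x_new)
    x <- x_new
  }
  x
}
newton_raphson(function(x) x^2 - 2, function(x) 2*x, x0 = 1)   # 1.414214, i.e. √2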

      And before we end the subject of "it's a linear world - approximately", you can certainly measure the straight line distance between two cities, but if you travel by airplane, the airplane is flying the distance between two points on a sphere. Given the latitude and longitude coordinates and the central angle between the two points, the spherical distance can be calculated by the Haversine Formula.
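
      Here is a minimal haversine sketch of my own in R, assuming a mean Earth radius of 6371 km; it gives the great-circle distance, ignoring altitude and the Earth's slight flattening:

haversine_km <- function(lat1, lon1, lat2, lon2, R = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(a))
}
haversine_km(40.71, -74.01, 34.05, -118.24)   # New York to Los Angeles, roughly 3,900 km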

      R makes it easy not to settle for linearity. But for those of us who are not doing something where great precision is required, linearity may be just fine.

      Here is the R code to plot the graphs. Note that in R, log(x) is the natural log function, as is Python's math.log; in Excel the natural log is LN(), while LOG() defaults to base 10.



library(ggplot2)

common_theme <- theme(
     legend.position="right",
     plot.title = element_text(size=15, face="bold"),
     plot.subtitle = element_text(size=12.5, face="bold"),
     axis.title = element_text(size=15, face="bold"),
     axis.text = element_text(size=15, face="bold"),
     axis.title.y = element_text(angle = 0),
     legend.title = element_text(size=15, face="bold"),
     legend.text = element_text(size=15, face="bold"))

a <- 0.0
b <- 1.0
x_values <- seq(a, b, by = 0.01)
func1 <- function(x) {log(1 + x)}
func2 <- function(x) {log(1.05) + (1/1.05)*(x - .05)}
df1 <- data.frame(x = x_values,
           y = log(1 +x_values),
           func = "Function 1")
df2 <- data.frame(x = x_values,
           y = log(1.05) + (1/1.05)*(x_values - .05),
           func = "Function 2")
combined_df <- rbind(df1, df2)

ggplot(combined_df, aes(x = x, y = y, color = func)) +
   geom_line(size = 1.5) +
   scale_color_manual(values = c("Function 1" = "#E41A1C", 
            "Function 2" = "royalblue4")) +
   ylim(c(0, 1)) + 
   labs(title = "Plot of LN and Linear Functions",
              x = "x",
              y = "y",
              color = "Functions") +
  common_theme

ggplot(combined_df, aes(x = x, y = y, color = func)) +
   geom_line(size = 1.5) +
   scale_color_manual(values = c("Function 1" = "#E41A1C", 
            "Function 2" = "royalblue4")) +
   xlim(c(0, 0.1)) +
   ylim(c(0, 0.11)) + 
   labs(title = "Plot of LN Function and its Tangent Line at x = .05",
              x = "x",
              y = "y",
              color = "Functions") +
  common_theme

End

Sunday, June 15, 2025

Hebrew Gematria in R

      Gematria (a term derived from Greek) is the practice of assigning numerical values to letters. In Hebrew it has been used to interpret Jewish texts, particularly the Bible, by attempting to discover hidden meanings or connections between words and concepts.

      One example of gematria is in Genesis 14:14 when Abram takes 318 men to rescue his relative. The gematria values of Abram's servant Eliezer and of the Hebrew word for speaking are both 318. Perhaps there is more to this passage than just the 318 men. See https://stljewishlight.org/arts-entertainment/understanding-hebrew-numerology-and-the-secrets-of-the-torah/#:~:text=One%20famous%20example%20of%20gematria,connects%20to%20the%20heavenly%20universe .

      The table of the 22 Hebrew letters and their gematria values is the following. I have omitted the five Hebrew letters that are written differently when they appear as the last letter of a word, but their values are the same as the corresponding non-final forms. Gematria does not use the dot-and-dash vowel symbols that appear below the Hebrew letters.

      Although gematria is not universally accepted by Jews and Jewish scholars, it is common in Jewish culture to make donations in multiples of $18: 18 is the gematria value of the Hebrew word "chai" (חי) which means "life". ח (chet equals 8) + י (yod equals 10).

      There is an R package gemmatria (two MM's), but it is not currently available on CRAN due to encoding issues about non-ASCII characters. It is installable from GitHub using the devtools package with the following command: devtools::install_github("benyamindsmith/gemmatria", force=TRUE) . As an example of the package, the R code for the gematria value of yaakov, יעקב , is

library(gemmatria)
get_gemmatria("יעקב")

which gives 182. This is calculated as the sum (yod = 10) + (ayin = 70) + (qof = 100) + (bet = 2).

      One online source that will accept a Hebrew word and calculate its gematria value is https://www.torahcalc.com/tools/gematria .

      Although the above website and gemmatria package are sufficient for most uses, I thought it would be instructive to build my own gematria calculator in R.

      I began by creating a dataframe of the above table. I found the right-to-left nature of entering 27 Hebrew letters, surrounded by quotation marks and separated by commas, too cumbersome. Instead I used their Unicode hex equivalents: for example, א is "\u05D0", and print("\u05D0") will print alef, א.

      Then given a Hebrew word, I split the word into its individual characters, and I used the match command to look up and sum the gematria values. Note that the Hebrew word must not contain vowels, or else there will be no match.

      I also added a little exploration of five Biblical descendants of Abraham and the number 26. 26 has a lot of significance in the Bible.

  • 26 is the gematria value of the four letter unspeakable Hebrew name for G-d: Yod + He + Vav + He = 10 + 5 + 6 + 5 = 26.
  • 26 is the number of generations from Adam to Moses.
  • 26 is the number of generations from David to Jesus according to one genealogy (but interpretations differ).
  • There are many examples of Biblical gematria values of 26.

26 also has some interesting mathematical properties, such as being the sum of a cube's 6 faces, 12 edges, and 8 vertices, but I will leave such mathematical properties for another time.

      For five of the Biblical descendants of Abraham:

  • Gematria of Isaac יצחק is 208 is multiple of 26.
  • Gematria of Jacob יעקב is 182 is multiple of 26.
  • Gematria of Joseph יוסף is 156 is multiple of 26.
  • Gematria of Ishmael ישמעאל is 451 is not multiple of 26.
  • Gematria of Esau עשו is 376 is not multiple of 26.

      I will leave an interpretation of the above to the reader.

      There are many other examples of the use of gematria to help explain the Bible. Also, I have limited this to Standard Gematria; there are other forms.

      Is gematria clever wordplay and coincidence? Or were the biblical writers trying to tell us something?

      The R code is as follows:


# Define Hebrew letters, names, and gematria numbers
df <- data.frame(
  hebrew = c("\u05D0", "\u05D1", "\u05D2", "\u05D3", "\u05D4",
             "\u05D5", "\u05D6", "\u05D7", "\u05D8", "\u05D9",
             "\u05DA", "\u05DB", "\u05DC", "\u05DD", "\u05DE",
             "\u05DF", "\u05E0", "\u05E1", "\u05E2", "\u05E3",
             "\u05E4", "\u05E5", "\u05E6", "\u05E7", "\u05E8",
             "\u05E9", "\u05EA"),
  
  name = c("Alef", "Bet", "Gimel", "Dalet", "He", "Vav", "Zayin", 
           "Chet", "Tet", "Yod", "Final_Kaf", "Kaf", "Lamed", "Final_Mem", 
           "Mem", "Final_Nun", "Nun", "Samech", "Ayin", "Final_Pe", "Pe", 
           "Final_Tsadi", "Tsadi", "Qof", "Resh", "Shin", "Tav"),
  
  number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 20, 30, 40, 40, 
             50, 50, 60, 70, 80, 80, 90, 90, 100, 200, 300, 400)
)

# Define words;  no vowels!
yitzchak <- "יצחק"
yaakov <- "יעקב"
yosef <- "יוסף"
ishmael <- "ישמעאל"
esau <- "עשו"


words <- c(yitzchak, yaakov, yosef, ishmael, esau)

gematria <- function(word) {
   # Split word into characters
   v <- strsplit(word, split = "")[[1]]
   # Calculate Gematria of word using vectorized lookup
   gematria_value <- sum(df$number[match(v, df$hebrew)])
   if (gematria_value %% 26 == 0) divisibility = "is multiple of 26" 
     else divisibility = "is not multiple of 26"
  # Output result
  cat("Gematria of", word, "is", gematria_value, divisibility, "\n")
}

for (word in words) {
  gematria(word)
}

End

Monday, June 2, 2025

LGBTQ+ Pride Month

      This is the start of Pride month. I have friends, acquaintances, and family members in most if not all LGBTQ+ categories, and probably you do too, although you may not know it. I wish they all could live their lives as who they want to be without being socially or legally hassled. I made this flag in R.

      Here is the R code:

 # Draw empty plot
plot(NULL, col = "white", xlim = c(0, 12), ylim = c(0, 8), xlab="", xaxt="n", ylab="", yaxt="n", 
    main="LGBTQ+ Rainbow Flag" ) 

polygon(x = c(3.33, 12, 12, 2), y = c(6.67, 6.67, 8, 8), col = "#e40303")  # coordinates counter-clockwise; red
polygon(x = c(4.67, 12, 12, 3.33), y = c(5.33, 5.33, 6.67, 6.67), col = "#ff8c00") # orange
polygon(x = c(6, 12, 12, 4.67), y = c(4, 4, 5.33, 5.33), col = "#ffed00") # yellow
polygon(x = c(4.67, 12, 12, 6), y = c(2.67, 2.67, 4, 4), col = "#008026") # green
polygon(x = c(3.33, 12, 12, 4.67), y = c(1.33, 1.33, 2.67, 2.67), col = "#004dff") # blue
polygon(x = c(2, 12, 12, 3.33), y = c(0, 0, 1.33, 1.33), col = "#750787") # purple

polygon(x = c(1, 2, 6, 2, 1, 5), y = c(0, 0, 4, 8, 8,4), col = "#000000") # black
polygon(x = c(0, 1, 5, 1, 0, 4), y = c(0, 0, 4, 8, 8, 4), col = "#613915") # brown
polygon(x = c(0, 0, 4, 0, 0, 3), y = c(1.33, 0, 4, 8, 6.67, 4), col = "#74d7ee") # light blue
polygon(x = c(0, 0, 3, 0, 0, 2), y = c(2.67, 1.33, 4, 6.67, 5.33, 4), col = "#ffafc8") # pink
polygon(x = c(0, 0, 2), y = c(5.33, 2.67, 4), col = "#ffffff") # white

End

Monday, May 26, 2025

Self-intersecting Quadrilateral

      A quadrilateral is a polygon having four sides, four angles, and four vertices. A polygon is a closed figure: the last line segment connects back to the first one, enclosing an area.

      We usually think of quadrilaterals as squares, rectangles, parallelograms, trapezoids, rhombuses, or kites. (I was impressed that my four-year-old granddaughter knew the last one, although she called it a diamond!) It could also be irregularly shaped with no name.

      However, a polygon may intersect itself. A five-sided star is one example, where the sides are connected to alternating vertices.

      A quadrilateral may also intersect itself. In the following diagram, the original quadrilateral has points A (0,0), B (4,0), C (3,3), D (1,4). The self-intersecting quadrilateral is formed from the original quadrilateral by shifting point D from (1,4) to (2,-2), so that side CD crosses side AB.

      This self-intersecting quadrilateral is still four-sided and closed, so it is no less a quadrilateral than the original.

      Here is some R code:


# Self-intersecting quadrilateral
library(ggplot2)

# Define coordinates for original quadrilateral
original_quad <- data.frame(
  x1 = c(0, 4, 3, 1),
  y1 = c(0, 0, 3, 4),
  x2 = c(4, 3, 1, 0),
  y2 = c(0, 3, 4, 0),
  group = c("AB", "BC", "CD", "DA"),
  labels = c("A", "B", "C", "D")
)

# Define coordinates for self-intersecting quadrilateral
# Shift point D from (1,4) to (2, -2), so that side CD crosses side AB

stretched_quad <- data.frame(
  x1 = c(0, 4, 3, 2),  
  y1 = c(0, 0, 3, -2), 
  x2 = c(4, 3, 2, 0),
  y2 = c(0, 3, -2, 0),
  group = c("AB", "BC", "CD", "DA"),
  labels = c("A", "B", "C", "D")
)

# Define colors for each side
color_map <- c("AB" = "red", "BC" = "blue", "CD" = "green", "DA" = "purple")

# Function to plot the quadrilateral
plot_quad <- function(data, title, x_lim, y_lim) {
  ggplot(data) +
    geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2, color = group), size = 1.5) +  # Draw each side
    geom_point(aes(x = x1, y = y1), size = 3, color = "black") +  # Show points
    geom_text(aes(x = x1, y = y1, label = labels), vjust = -1, hjust = -0.5, size = 6, fontface = "bold") +  # Label A, B, C, D
    scale_color_manual(values = color_map) +
    coord_fixed() +
    xlim(x_lim[1], x_lim[2]) +
    ylim(y_lim[1], y_lim[2]) +
    theme_minimal() +
    ggtitle(title)
}

# Expanded limits for full visibility
x_range <- c(-1, 7)  
y_range <- c(-3, 7)  

# Plot the original quadrilateral
p1 <- plot_quad(original_quad, "Original Quadrilateral", x_range, y_range)

# Plot the self-intersecting quadrilateral
p2 <- plot_quad(stretched_quad, "Self-Intersecting Quadrilateral", x_range, y_range)

# Display both plots
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)

End

Monday, March 31, 2025

April Fool's Day, and the Pythagorean Theorem

      Here is a post in honor of April Fools' Day, which, unlike March 14, is celebrated by countries that use either the month-day format or the day-month format.

      Not everyone is a math person, but nearly everyone remembers the Pythagorean Theorem.

      Well, not everyone. The Scarecrow got it wrong in the 1939 movie, "The Wizard of Oz". It is arguable whether the writers accidentally or deliberately got it wrong, or whether the actor Ray Bolger flubbed the line. (See also Singh, "The Simpsons and Their Mathematical Secrets", pp. 119-121.)

      So here is a little Pythagorean Theorem problem. In the following right triangle, calculate C: A^2 + B^2 = C^2, and then take the square root. How hard can it be?

      Of course you have to know that if A = (√-1), then A^2 = (√-1)^2 = -1, but that seems reasonable.

      But does something seem wrong with the resulting value of C?

      I plotted the triangle in a complex number plane, not a real number plane.   (√-1) is not a "real" number, and perhaps it should be thought of as a non-real number. (I am intentionally avoiding the dreaded "i" word.)   Perhaps what is happening is that in the complex plane, the hypotenuse C is really a line segment of magnitude zero, and hence a point? No, not "really".

      So April Fool's. You calculated a hypotenuse of a right triangle and concluded it has length zero.

      A better answer is that a point (x, y) in the complex plane is represented as x + yi, and the Euclidean distance between two complex points (x, y) and (u, v) equals √((u - x)^2 + (v - y)^2), which here is the expected √2.
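
      A one-line R check, treating the hypotenuse's endpoints as the complex numbers i and 1, confirms this:

Mod(1 - 1i)   # modulus of the difference of the two endpoints: 1.414214 = √2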

      Here is the R code for the plot, followed by the Python code for the plot. Note that in R, the shading is done with geom_polygon, while in Python it is done with plt.fill_between:


library(ggplot2)

# Define the vertices of the triangle
triangle_data <- data.frame(
    x = c(0, 0, 1),    # x-coordinates of vertices
    y = c(0, 1, 0)     # y-coordinates of vertices
)

ggplot(data = NULL) +
    geom_polygon(data = triangle_data, aes(x = x, y = y), fill = "#9B111E", color = "black") +
    geom_segment(aes(x = 0, y = 0, xend = 1, yend = 0), color = "black", size = 2) +
    geom_segment(aes(x = 0, y = 0, xend = 0, yend = 1), color = "black", size = 2) +
    geom_segment(aes(x = 0, y = 1, xend = 1, yend = 0), color = "black", size = 2) +
    ggtitle("Use the Pythagorean Theorem to calculate C") +
    xlab(expression(bold(A == sqrt(-1)))) +  # Square root symbol for x-axis label
    ylab("B = 1") +     # Label for y-axis
    geom_text(aes(x = 0.55, y = 0.55, label = "C = ?"), fontface = "bold", size = 6) +
    theme_classic() +
    theme(
        legend.position = "none",
        axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_text(size = 20, face = "bold"),
        axis.title.y = element_text(size = 20, face = "bold"),  
        plot.title = element_text(size = 20, face = "bold")
    )       




import numpy as np
import matplotlib.pyplot as plt

# Create figure and axes
fig, ax = plt.subplots()
ax.set_title('Use the Pythagorean Theorem to calculate C', fontsize=15, fontweight='bold')

# Define the points for the line
x = [0, 1]
y = [1, 0]

# Plot the horizontal and vertical lines
ax.axhline(y=0, color='black')  # x-axis
ax.axvline(x=0, color='black')  # y-axis

# Plot the diagonal line
ax.plot(x, y, color='black', linewidth=1.5)

# Fill the region enclosed by the axes and the line
plt.fill_between(x, y, 0, color='#9B111E', alpha=0.5)

# Add axis labels with bold and enlarged fonts
ax.text(0.5, -0.05, r"$\mathbf{A = \sqrt{-1}}$", ha='center', va='center', fontsize=14, transform=ax.transAxes)
ax.text(-0.05, 0.5, r"$\mathbf{B = 1}$", ha='center', va='center', fontsize=14, rotation=90, transform=ax.transAxes)

# Add diagonal line label with bold text, enlarged font, and moved farther off the line
ax.text(.6, .5, r"$\mathbf{C}$", ha='center', va='center', fontsize=14, color='black')

# Turn off the axes' tick marks and spines
ax.spines['top'].set_color('none')
ax.spines['right'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False)

# Display the plot
plt.show()

# end

Friday, March 14, 2025

Pi and billiard balls; a different application of π for Pi Day

      So suppose there is some sort of physical experiment, and the first time you do it, the answer comes out as 3. Then you change the experiment to make it a little more complicated, and the answer comes out as 31. Then you make it even more complicated, and the answer comes out 314. Then you ... . Of course you see where I'm going with this.

      It's March 14, or 3 14 (in the US date format), and because 314 are the first digits of π, many people use this day to share interesting and unusual appearances of π. I think you will find Pi and the Billiard Balls to be a little different!

      But first, if your statistics course did not require calculus as a prerequisite, you may be unaware that π is contained within the formula for the probability density of the Normal distribution,
f(x) = (1 / √(2πσ^2)) exp( -(x - μ)^2 / (2σ^2) )
where the 2π is necessary for the integral of the pdf to equal 1.

      A common application of π on Pi Day is the Buffon's needle problem: Given a needle of length l dropped on a plane ruled with parallel lines t units apart, what is the probability that the needle will lie across a line upon landing? There are many references to this such as Buffon, so the solution will not be repeated here.

      π appears in many places in math. One reason is that π is defined in reference to a circle. The trigonometric functions can be defined in terms of triangles within a circle, and as a point traverses the circumference of the circle more than once, the trigonometric functions repeat in cycles of 2π. Therefore, phenomena that repeat and can be represented by trigonometric functions are likely to have π somewhere within them.

      I recently discovered a physics problem called Pi and Billiard Balls. The original article is by G. Galperin, and I will try to summarize it. His paper is not an easy read.

      Suppose we have the first quadrant of an xy coordinate system. Suppose we have a vertical wall at x = 0, y ≥ 0. We have ball 1 having mass m1 at initial position x1 > 0, and ball 2 having mass m2 ≥ m1 at initial position x2 > x1. Let the ratio of the masses be a power of 100:   m2 / m1 = 100^N for a fixed non-negative integer N, including N = 0. Suppose ball 2 moves from right to left along the x-axis and collides with ball 1, ball 1 moves from right to left and collides with the vertical wall, and assume all collisions are perfectly elastic. This perfect elasticity assumption implies the balls satisfy the law of conservation of momentum and the law of conservation of kinetic energy. We also assume the balls only move along the x-axis; there is no y movement.

      The amazing conclusion of all this is: the total number of collisions C(N) is the integer formed by the first digit of π (3) followed by its first N decimal digits!

      For case 1, let N = 0 so m2 = m1. Now push ball 2 from right to left at initial velocity v2 until it hits ball 1. This is collision 1. Ball 1 will move from right to left at the same velocity v1' = v2, while ball 2 will now be at rest. Eventually ball 1 hits the wall for collision 2 and now moves from left to right at velocity -v1', until it hits ball 2 for collision 3. Now ball 1 is at rest, and ball 2 is moving to the right with velocity −v1'. There have been 3 collisions (the first digit of π), and there will be no more.

      For case 2, let N = 1 so m2 / m1 = 100. This is more complicated because after collision 1, v1' will not equal v2, and ball 2 will not be at rest. The two conservation equations, m1v1 + m2v2 = m1v1' + m2v2' and .5m1(v1^2) + .5m2(v2^2) = .5m1(v1'^2) + .5m2(v2'^2), give v1' = [(m1 - m2)v1 + 2m2v2] / (m1 + m2) and v2' = [(m2 - m1)v2 + 2m1v1] / (m1 + m2). For collision 1, substituting v1 = 0 and m2 / m1 = 100 gives v1' = 200v2/101 ≈ 1.98v2 (nearly twice the initial velocity v2) and v2' = 99v2/101 ≈ .98v2 (slightly less than the initial velocity v2).

      Ball 1 hits the wall for collision 2 and now moves from left to right at velocity -v1', until it hits ball 2 for collision 3.

      After collision 3 (substituting the post-collision velocities back into the same formulas), v1''' ≈ 3.88v2 and v2''' ≈ 0.92v2. Ball 1 will bounce back and forth between ball 2 and the wall many times. At each ball-ball collision the heavier ball transfers some momentum to the lighter ball, so ball 2's velocity steadily shifts to the right: it decelerates, stops, and then reverses direction, while ball 1 keeps being reflected by the wall. Eventually ball 2 is moving to the right at least as fast as ball 1, and at that point no more collisions can occur. There will be 31 collisions (the first two digits of π).

      For case 3, let N = 2 so m2 / m1 = 100^2 = 10,000, and there are 314 collisions. And so on.
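
      A short velocity-only simulation, a sketch of my own using the elastic-collision formulas above, makes this easy to check for small N (floating-point precision eventually becomes an issue for large N):

count_collisions <- function(N) {
  m1 <- 1; m2 <- 100^N
  v1 <- 0; v2 <- -1              # ball 1 at rest, ball 2 moving left toward it
  collisions <- 0
  repeat {
    if (v2 < v1) {
      # ball-ball collision: the elastic-collision formulas above
      v1_new <- ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2)
      v2_new <- ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2)
      v1 <- v1_new
      v2 <- v2_new
    } else if (v1 < 0) {
      v1 <- -v1                  # ball 1 bounces off the wall
    } else break                 # both moving right, ball 2 at least as fast: done
    collisions <- collisions + 1
  }
  collisions
}

sapply(0:3, count_collisions)    # expect 3, 31, 314, 3141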

      Not surprisingly there is a circle lurking under all of this, due to the conservation of kinetic energy equation .5m1(v1^2) + .5m2(v2^2) = constant, and there is a trigonometric function and a calculation involving an angle and π. Galperin proves the conclusion in general: the total number of collisions C(N) is the integer formed by the first digit of π (3) followed by its first N decimal digits.

      In addition to the Galperin paper, a good explanation is here, but this is not an easy read.

      You are welcome to try this experiment yourself, if you can create the condition of perfect elasticity. Happy Pi Day.