
Saturday, June 21, 2025

It's a linear world - approximately; and the "Rule of 72"

      You are probably familiar with the "Rule of 72" in investing: if an investment compounds at an annual interest rate of i percent, then the number of years for the money to double is approximately 72 / i. For example, if an investment compounds at an annual interest rate of 9%, then the investment will double in approximately 72/9 = 8 years. The compound interest formula confirms this: 1 * 1.09^8 = 1.993 (rounded), which is approximately 2.
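
      As a quick check, here is a small R sketch of mine (not from the original post) comparing the exact doubling time LN 2 / LN (1 + i/100) with the Rule of 72 estimate for a few rates:

rates <- c(3, 6, 9, 12)                  # annual interest rates in percent
exact <- log(2) / log(1 + rates/100)     # exact doubling time in years
rule72 <- 72 / rates                     # Rule of 72 estimate
data.frame(rate = rates, exact = round(exact, 2), rule72 = round(rule72, 2))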

      For fun, ask your banker or investment advisor why the Rule of 72 works.

      It works because in the short run, it's a linear world - approximately. Put a little more mathematically: in a small enough neighborhood, any differentiable function is approximately linear.

      For the Rule of 72 we are solving for n in the equation 1 * (1 + i/100)^n = 2. We take natural logarithms of both sides:

1 * (1 + i/100)^n = 2
LN ((1 + i/100)^n) = LN 2
n * LN (1 + i/100) = LN 2
n = LN 2 / LN (1 + i/100)
n ≈ .693 / LN (1 + i/100)
n ≈ .693 / (i/100)
n ≈ 69.3 / i
n ≈ 72 / i

      The key step, replacing LN (1 + i/100) with i/100, works because for small values of x, LN (1 + x) is approximately equal to x. One way to see this is that the Taylor series (sorry, not invented by Taylor Swift) expansion around x = 0 is:

LN (1 + x) = x - x^2/2 + x^3/3 - x^4/4 + ...

For small x, the higher-order terms are negligible, so LN (1 + x) is approximately equal to x.
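
      A quick R check of the approximation (my addition; note that R's log1p(x) computes LN (1 + x)):

x <- c(0.01, 0.05, 0.10, 0.25)
data.frame(x = x,
           ln_1_plus_x = round(log1p(x), 5),  # LN(1 + x)
           error = round(x - log1p(x), 5))    # gap between x and LN(1 + x)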

      In the following plots, the plot on the left shows that the logarithmic function y = LN (1 + x) (Function 1) is clearly different from the linear function y = LN (1.05) + (1/1.05)*(x - .05) (Function 2). Function 2 is the tangent line to Function 1 at x = .05. (To derive the tangent line, recall from calculus that if y = LN (1 + x), then dy/dx = 1/(1 + x).) The plot on the right shows the same two functions, but with the x range shrunk to (0.0, 0.1). In this small range the two functions are indistinguishable: in a small enough neighborhood, the differentiable function y = LN (1 + x) is indistinguishable from its tangent line.

      Of course a "Rule of 69.3", since LN 2 is .693 to three decimal places, would be a better approximation to the compound interest formula than the "Rule of 72", but 72 is close enough and is useful because 72 has so many integer divisors.

      In the real world we often assume linearity holds within a small range. For example, in cooking a turkey (not that I have ever done this), one website says to allow about 15 minutes per pound (on average!) at 350°F to cook a stuffed turkey. But their time estimates by pound are not linear; for example, their time estimate to cook a 24-pounder is less than twice their estimate for a 12-pounder. Perhaps a better approximation is a power curve y = a * x^b. Presumably there are some physics considerations that are non-linear.
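
      Fitting such a power curve is easy in R because y = a * x^b is linear in logs. Here is a minimal sketch; the weight/time pairs below are hypothetical numbers for illustration, not the website's actual table:

wt <- c(8, 12, 16, 20, 24)              # hypothetical turkey weights (pounds)
mins <- c(150, 200, 245, 285, 320)      # hypothetical cooking times (minutes)
fit <- lm(log(mins) ~ log(wt))          # log y = log a + b * log x
b <- unname(coef(fit)[2])
a <- exp(unname(coef(fit)[1]))
round(c(a = a, b = b), 3)               # b < 1: doubling the weight less than doubles the time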

      The Newton-Raphson method, x_(n+1) = x_n - f(x_n) / f'(x_n), for finding approximations to roots of a function, is an example of a mathematical use of linearity: N-R iteratively finds the x-intercept of the tangent line to the graph of f. A number of R packages have N-R functions.
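
      Here is a minimal hand-rolled sketch of the iteration (the function name and arguments are my own, not from any particular package):

# Newton-Raphson: follow the tangent line to its x-intercept and repeat
newton_raphson <- function(f, fprime, x0, tol = 1e-8, max_iter = 100) {
      x <- x0
      for (k in 1:max_iter) {
            step <- f(x) / fprime(x)
            x <- x - step
            if (abs(step) < tol) return(x)
      }
      x
}
newton_raphson(function(x) x^2 - 2, function(x) 2*x, x0 = 1)  # converges to sqrt(2)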

      And before we end the subject of "it's a linear world - approximately": you can certainly measure the straight-line distance between two cities, but an airplane flies between two points on a sphere. Given the latitude and longitude coordinates of the two points, and hence the central angle between them, the spherical distance can be calculated by the Haversine Formula.

      R makes it easy not to settle for linearity. But for those of us who are not doing something where great precision is required, linearity may be just fine.

      Here is the R code to plot the graphs. Note that in R, log(x) is the natural log function. This is also true in Python; in Excel the natural log is LN(x), while LOG(x) defaults to base 10.



library(ggplot2)

common_theme <- theme(
     legend.position="right",
     plot.title = element_text(size=15, face="bold"),
     plot.subtitle = element_text(size=12.5, face="bold"),
     axis.title = element_text(size=15, face="bold"),
     axis.text = element_text(size=15, face="bold"),
     axis.title.y = element_text(angle = 0),
     legend.title = element_text(size=15, face="bold"),
     legend.text = element_text(size=15, face="bold"))

a <- 0.0
b <- 1.0
x_values <- seq(a, b, by = 0.01)
func1 <- function(x) {log(1 + x)}                      # Function 1: LN(1 + x)
func2 <- function(x) {log(1.05) + (1/1.05)*(x - .05)}  # Function 2: tangent line at x = .05
df1 <- data.frame(x = x_values,
           y = func1(x_values),
           func = "Function 1")
df2 <- data.frame(x = x_values,
           y = func2(x_values),
           func = "Function 2")
combined_df <- rbind(df1, df2)

ggplot(combined_df, aes(x = x, y = y, color = func)) +
   geom_line(size = 1.5) +
   scale_color_manual(values = c("Function 1" = "#E41A1C", 
            "Function 2" = "royalblue4")) +
   ylim(c(0, 1)) + 
   labs(title = "Plot of LN and Linear Functions",
              x = "x",
              y = "y",
              color = "Functions") +
  common_theme

ggplot(combined_df, aes(x = x, y = y, color = func)) +
   geom_line(size = 1.5) +
   scale_color_manual(values = c("Function 1" = "#E41A1C", 
            "Function 2" = "royalblue4")) +
   xlim(c(0, 0.1)) +
   ylim(c(0, 0.11)) + 
   labs(title = "Plot of LN Function and its Tangent Line at x = .05",
              x = "x",
              y = "y",
              color = "Functions") +
  common_theme

End

Tuesday, January 2, 2024

Using great circle distance to determine Twin Cities

In the US we think of Minneapolis and St. Paul as the Twin Cities, but Ben Orlin, author of Math with Bad Drawings, https://mathwithbaddrawings.com/, posed this data analysis question: Which U.S. cities are the true twin cities? He imposed three conditions:
  1. the cities must be at most 10 miles apart,
  2. each city must have at least 200,000 people, and
  3. the populations have to be within a ratio of 2:1.
This seemed like a nice data analysis problem, so I searched for a dataset containing both population and location. Interestingly, population and location each have their nuances, and I learned a lot more from this problem than I expected.

I found that https://simplemaps.com/data/us-cities has a dataset of 30,844 cities containing latitude, longitude, and population. However, when Minneapolis’ population came out as 2.9 million, Ben recognized that the population shown was for the broader metropolitan area, not the city proper. So I got a second dataset of city-proper populations from https://www.census.gov/data/tables/time-series/demo/popest/2020s-total-cities-and-towns.html, joined the two datasets, and used the populations from the second dataset.

How do you measure the distance between two cities? This is not as simple a question as it sounds.

As a start, I used https://www.distancecalculator.net/ and entered two cities I am familiar with, New York, NY and Hoboken, NJ. That website calculated a distance of 2.39 miles, and it provided a map. The site further clarified that it used the great circle distance formula. This raises two questions: how does the site decide which two points to measure from, and what is a great circle distance?

Hoboken has an area of 1.97 square miles, so it probably doesn’t matter too much which point in Hoboken to use. New York City has an area of 472.43 square miles, so it does matter which point to use. It is not obvious which point it used, and it certainly did not use the closest point, but from other work, I believe it used City Hall or something close.

Although some sites will measure distance between two cities as driving distance, it is more common to use great circle distance, which is the shortest distance along the surface of a sphere. The earth is not exactly a sphere, but it is approximately a sphere.

Latitude and longitude form a coordinate system to describe any point on the earth’s surface. Lines of latitude are horizontal lines parallel to the Equator, and they represent how far north or south a point is from the Equator. Lines of longitude represent how far a point is east or west of a vertical line called the Prime Meridian that runs through Greenwich, England. Both latitude and longitude are measured in degrees, which are broken down into smaller units called minutes and seconds. For convenience they are also expressed in decimal degrees: if D is degrees, M is minutes, and S is seconds, then the conversion to decimal degrees is D + M/60 + S/3600.
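
In R, both this decimal conversion and the radian conversion described next are one-liners (a tiny sketch of the formulas in this post; the function names are mine):

# Convert degrees/minutes/seconds to decimal degrees: D + M/60 + S/3600
dms_to_decimal <- function(d, m, s) d + m/60 + s/3600
# Convert decimal degrees to radians for use with R's trig functions
deg_to_rad <- function(deg) deg * pi / 180
dms_to_decimal(40, 41, 34)   # 40.69278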

When we use trig functions to calculate distances, we need to convert decimal degrees to radians by multiplying by π / 180. We also need the radius of the earth, which is approximately 3963 miles.

If point A is (lat1, long1) and point B is (lat2, long2), both in decimal degrees, then the great-circle distance d on a perfect sphere can be calculated with the haversine distance formula, which is derived from principles of three-dimensional spherical trigonometry. The haversine of an angle θ is defined as hav(θ) = sin²(θ/2), and this concept is used in the derivation. The formula below is the algebraically equivalent spherical-law-of-cosines form:

d = ACOS( SIN(PI()*[Lat_start]/180.0) * SIN(PI()*[Lat_end]/180.0)
        + COS(PI()*[Lat_start]/180.0) * COS(PI()*[Lat_end]/180.0)
        * COS(PI()*[Long_start]/180.0 - PI()*[Long_end]/180.0) ) * 3963
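
The same formula wrapped as an R function (my own sketch; the function name is mine, and the Hoboken coordinates are approximate values I supplied for illustration):

# Great-circle distance in miles between two points given in decimal degrees
gc_dist_miles <- function(lat1, long1, lat2, long2, r = 3963) {
      to_rad <- pi / 180
      r * acos(sin(lat1*to_rad) * sin(lat2*to_rad) +
               cos(lat1*to_rad) * cos(lat2*to_rad) *
               cos((long1 - long2)*to_rad))
}
gc_dist_miles(40.6943, -73.9249, 40.7440, -74.0324)  # New York, NY to Hoboken, NJ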

As I mentioned, I used https://simplemaps.com/data/us-cities, which contains cities with their latitude and longitude, and I applied the above distance formula to pairs of cities. But I was still curious about the choice of a latitude and longitude for a particular city. That file lists New York City as (40.6943, -73.9249). Another website, https://www.mapdevelopers.com/reverse_geocode_tool.php, which finds a street address from a decimal latitude and longitude, lists the address of (40.6943, -73.9249) as 871 Bushwick Avenue, Brooklyn, which is some distance from City Hall but does not appear to be the centroid of New York City either. Wikipedia's choice of latitude and longitude for New York City is 42 Park Row, which is close to City Hall.

The map at https://www.mapdevelopers.com/reverse_geocode_tool.php?lat=40.694300&lng=-73.924900&zoom=12 shows the approximate location of 871 Bushwick Avenue, Brooklyn.

I joined the dataset of locations with the city-proper populations from the second dataset, and I applied Ben’s three conditions. This produced 8 pairs of cities; of course this list depends on the first dataset’s choice of each city’s latitude and longitude and the distances resulting from that. Different choices of latitude and longitude would produce a different list.

Of these pairs, I actually like Hialeah and Miami as the true twin cities. Besides meeting the original three criteria, they share the same large ethnic population and they share a public transportation system.

Wikipedia has a much larger list of twin cities, https://en.wikipedia.org/wiki/Twin_cities, but it did not necessarily use Ben Orlin's three criteria. Also, Ben's problem is for US cities only, and Wikipedia has several pairs of Canada-US and Mexico-US cities that I had not thought about.

Here is the R code I used:

df1 <- read.csv("https://raw.githubusercontent.com/fcas80/jt_files/main/uscities.csv")
df1 <- subset(df1, select = c(city, lat, lng, state_name))
n1 <- nrow(df1) # 30844

library("readxl")
df2 <- read_excel("https://raw.githubusercontent.com/fcas80/jt_files/main/censuspop.xlsx", mode = "wb", skip = 3)
df2 <- df2[ -c(1,4:6) ]
colnames(df2) <- c("city", "pop")
# city appears in the format "Los Angeles city, California"
df2$state <- gsub(".*, ", "", df2$city)   # extract state: everything after the comma and space
df2$city <- gsub(",.*", "", df2$city)     # keep everything before the comma
df2$city <- gsub(" city$", "", df2$city)  # delete the trailing " city"

df <- merge(x = df1, y = df2, by = "city", all = TRUE)
df <- na.omit(df)
df <- df[df$pop >= 200000, ]
df <- df[df$state_name == df$state, ]  # drop improper merges of the same city name in two states
df <- subset(df, select = -c(state_name, state))
n <- nrow(df)  # 112

# Loop over all ordered pairs of cities; each qualifying pair appears twice and is deduplicated later
kount <- 1
df11 <- data.frame()
for (i in 1:n){
      Lat_start <- df[i,2]
      Long_start <- df[i,3]
      for (j in 1:n){
            Lat_end <- df[j,2]
            Long_end <- df[j,3]
            dist_miles <- acos(sin(pi*Lat_start/180.0)*sin(pi*Lat_end/180.0) +
                  cos(pi*Lat_start/180.0)*cos(pi*Lat_end/180.0)*
                  cos(pi*Long_start/180.0 - pi*Long_end/180.0))*3963
            dist_miles <- round(dist_miles, 0)
            pop_ratio <- round(max(df[i,4]/df[j,4], df[j,4]/df[i,4]),1)
            if (df[i,1] != df[j,1] & dist_miles > 0 & dist_miles <= 10 & pop_ratio <= 2){
            df11[kount,1] <- df[i,1]
            df11[kount,2] <- df[j,1]
            df11[kount,3] <- dist_miles
            df11[kount,4] <- df[i,4]
            df11[kount,5] <- df[j,4]
            df11[kount,6] <- pop_ratio
            df11[kount,7] <- df[i,4] + df[j,4]
            kount <- kount + 1
            }
      }
}
colnames(df11) <- c("City1", "City2", "Dist", "Pop1", "Pop2", "Ratio", "TotPop")
df11 <- df11[!duplicated(df11$TotPop), ] # remove duplicates: each pair appears twice, with cities swapped
df11 <- df11[ -c(7) ]
df11 <- data.frame(df11, row.names = NULL) # renumber rows consecutively
df11