How do I count thee? Let me count the ways?

Thursday, February 29, 2024

How easily can you be identified on the Internet?

Suppose you finish your meal at a restaurant, you are about to pay the check, and the manager asks, “Would you like a free dessert on your next birthday? Just fill out the bottom of the check with your date of birth, your email address and your zip code, and we’ll send you an email with a coupon just before your birthday.”

That sounds OK, but you ask, “You mean just the month and day, right?”

The manager replies, “No, we need the year too because we will send you other coupons that are age-appropriate. The zip code is just so we know how far our customers traveled.”

That sounds reasonable, but you are still a little skeptical. “Are you going to sell your mailing list to others with my name and home address?”

The manager replies, “We don’t need your name or your home address for this.” So you fill out the bottom of the check. Still skeptical, you pay with cash so the restaurant has no record of your name.

And of course, several months later you receive a coupon by snail mail at your home address, correctly addressed to you by name and home address.

Suppose you google your own name. I googled mine. Eliminating people with the same name who are not me, I found references to me at a number of free sites that state at least two of my age, mailing address, phone numbers, and family members.

Some of these have my correct birth date and some do not. Most of these have my correct age. Mailing addresses, phone numbers, email addresses, and family members may or may not be accurate, or current.

Computer scientist Latanya Sweeney, currently a dean at Harvard, is credited with a commonly quoted statement that 87% of the US population is uniquely identifiable from just 5-digit zip code, gender, and date of birth (Sweeney, 2000). She used many publicly available data sources in her research, including voter registration lists.

Sweeney provides this quick calculation: 365 days in a year x 100 years x 2 genders = 73,000 unique permutations (About My Info, 2013). But many 5-digit zip codes have fewer residents than that. My zip code from my youth in New York City has a population of about 67,000. My current zip code has about 31,000. Sweeney’s work, combined with census data on the number of people of my age and gender, estimates that there is likely to be only one person in my current zip code with my gender and date of birth.
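The arithmetic above can be sketched in a few lines of R. This is only a back-of-the-envelope check, not Sweeney's method: it assumes residents are spread evenly across all combinations, which real ages and birthdays are not, and the variable names are my own.

```r
# Sweeney's quick calculation: possible (birthday, birth year, gender) combinations.
combos <- 365 * 100 * 2

# Zip code populations quoted in the post.
zip_pop <- c(nyc_youth = 67000, current = 31000)

# Average number of residents per combination, under the (unrealistic)
# assumption that residents are spread evenly across all combinations.
residents_per_combo <- zip_pop / combos
print(combos)
print(round(residents_per_combo, 2))
```

When the average is below one resident per combination, most combinations that occur at all are likely to belong to a single person.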

So when that restaurant sells its customer database to some third party for a completely different use, that third party can probably identify me by name, even though I never gave my name or address to the restaurant. And that was the time I ordered the burger AND the fries.

About My Info. (2013). How unique am I? Retrieved from https://aboutmyinfo.org/identity/about

Sweeney, L. (2000). Simple demographics often identify people uniquely. Carnegie Mellon University, Data Privacy Working Paper 3. Pittsburgh. Retrieved from https://dataprivacylab.org/projects/identifiability/paper1.pdf

Sunday, February 25, 2024

Miss Google

Some time ago I reported that the voice of Google Maps, whom I call Miss Google, seemed to be getting lazy. I facetiously said that she stopped wearing mascara and lipstick, and that more importantly she stopped naming specific streets and exit names.

It turns out the real Miss Google was quite offended by this. She said she absolutely did not stop wearing mascara and lipstick. But more importantly, she told me to set my phone setting in Google Maps to DEFAULT English, not simply English. That did solve my Google Maps problem.

I was curious what Miss Google looks like. She refused to tell me. I thought I could cleverly get around this by asking Gemini, the Google AI tool, but Gemini said it would violate Google's AI principles to share a picture.

Undeterred, I tried several other AI tools. Gencraft was happy to answer my question of what Miss Google looks like, and here is the answer:

Surprisingly, I got a number of responses to my original comment about Miss Google and lipstick. Now that I have an authoritative picture, I went to another site to determine her lipstick shade. That site identifies it as #91454C, which Carol tells me is a burgundy.

Happy to be of service in the world of data analysis.

# set up an empty plot with no axes, then draw a swatch filled with #91454C
plot(-10:10, -10:10, type = 'n',
     axes = FALSE, xlab = NA, ylab = NA)
rect(-8, -8, 18, -4, col = '#91454C', border = "blue")

Sunday, February 11, 2024

Taylor Swift and Data Analysis

by Jerry Tuttle

Who will be the most talked-about celebrity before, during, and after the Super Bowl?

She is an accomplished performer, songwriter, businesswoman, and philanthropist. I think she is very pretty. And those lips!

So what can a data analyst add to everything that has been said about her?

I was curious whether R could identify her lipstick color. I should preface this by saying I have some degree of color-challengedness, although I am not colorblind. I am also aware that you can Google something like "what lipstick shade does taylor swift use" and you will get many replies. But I am more interested in an answer like E41D4F. I do wonder if I could visit a cosmetics store and say, "I'd like to buy a lipstick for my wife. Do you have anything in E41D4F?"

There are sites that take an image, let you hover over a particular point, and attempt to identify the computer color at that point. Here is one: RedKetchup. But I want a more R-related approach.

A note on computers and colors: A computer represents an image in units called pixels. Each pixel contains a combination of base 16 numbers for red, green and blue. A base 16 digit ranges from 0 through F. Each of red, green and blue is a two-digit base 16 number, so a full color is a six-digit base 16 number. There are 16^6 = 16,777,216 possible colors. E41D4F is one of those 16.8 million colors.
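To make the base 16 arithmetic concrete, here is a minimal sketch in base R that splits E41D4F into its red, green and blue parts. The variable names are my own; strtoi() and col2rgb() are standard base R functions.

```r
hex <- "E41D4F"

# Split the six hex digits into three two-digit channels and convert from base 16.
channels <- substring(hex, c(1, 3, 5), c(2, 4, 6))
rgb_values <- strtoi(channels, base = 16L)
names(rgb_values) <- c("red", "green", "blue")

# Base R's col2rgb() does the same conversion in one call.
rgb_check <- col2rgb(paste0("#", hex))[, 1]

# Number of distinct colors: 16 choices for each of 6 hex digits.
n_colors <- 16^6

print(rgb_values)
print(n_colors)
```

Each channel ends up as a number from 0 to 255, which is why you also see colors written as three decimal values.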

There are R packages that will take an image and identify the most frequent colors. I tried this with the image above, and I got unhelpful colors. This is because the image contains the background, her hair, her clothing, and lots of other things unrelated to her lips. If you think about it, the lips are really a small portion of a face anyway. So I need to narrow down the image to her lips.

I plotted the image on a rectangular grid, using the number of columns of the image file as the xlimit width, and the number of rows as the ylimit height. Then by trial and error I manually found the coordinates of a rectangle for the lips. The magick library allows you to crop an image, using this crop format:

<width>x<height>{+-}<xoffset>{+-}<yoffset>

The y offset is measured from the top of the image, not the bottom. The cropped image can then be printed.
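Since I located the rectangle on a plot whose y axis starts at the bottom, the offset has to be flipped before cropping. Here is a small sketch of that bookkeeping in base R. Both helper functions are hypothetical names of my own, not part of magick, and they only handle non-negative offsets.

```r
# Hypothetical helper: build a crop geometry string of the form
# <width>x<height>+<xoffset>+<yoffset> (non-negative offsets only).
crop_geometry <- function(width, height, x_left, y_top) {
  sprintf("%dx%d+%d+%d", as.integer(width), as.integer(height),
          as.integer(x_left), as.integer(y_top))
}

# Hypothetical helper: a plot with y = 0 at the bottom measures upward,
# so convert the rectangle's top edge to an offset from the top of the image.
y_from_top <- function(image_height, y_plot_top) image_height - y_plot_top

# Rectangle used in this post: 105 wide, 47 tall, 215 from the left, 300 from the top.
geom <- crop_geometry(105, 47, 215, 300)
print(geom)
```

The resulting string can be passed as the geometry argument of image_crop.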

The package colouR will then identify the most frequent colors. I found it necessary to save the cropped image to my computer and then read it back in because colouR would not accept it otherwise. The getTopCol command will extract the top colors by frequency. I assume it is counting frequency of hex color codes among the pixel elements. Here is a histogram of the result:

Really? I'm disappointed. Although I am color-challenged, this can't be right.

I have tried this with other photos of Taylor. I do get that she wears more than one lipstick color. I have also learned that with 16.8 million colors, perhaps the color is not identical on the entire lip - maybe some of you lipstick aficionados have always known this. All of these attempts have been somewhat unsatisfactory. There are too many colors on the graph that seem absolutely wrong, and no one color seems to really capture her shade, at least as I perceive it. Any suggestions from the R community?
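One possible explanation, offered as a guess rather than a claim about how colouR counts: if frequencies are tallied on exact hex codes, then many nearly identical lip shades each get a count of one, and a less interesting color that repeats exactly can come out on top. The sketch below uses five made-up pixels to illustrate this in base R. The quantize function and its step of 16 are my own arbitrary choices.

```r
# Five hypothetical pixels: three slightly different reds and two identical greys.
px <- data.frame(
  r = c(228, 227, 229, 120, 120),
  g = c( 29,  30,  28, 120, 120),
  b = c( 79,  80,  78, 120, 120)
)

# Exact hex codes: every red is a different code, so the grey "wins".
hex_exact <- rgb(px$r, px$g, px$b, maxColorValue = 255)
exact_counts <- sort(table(hex_exact), decreasing = TRUE)

# Coarser bins: round each channel to the nearest multiple of 16 before counting.
quantize <- function(v, step = 16) pmin(255, round(v / step) * step)
hex_binned <- rgb(quantize(px$r), quantize(px$g), quantize(px$b),
                  maxColorValue = 255)
binned_counts <- sort(table(hex_binned), decreasing = TRUE)

print(exact_counts)
print(binned_counts)
```

With exact codes the grey has the highest count, while with coarser bins the three reds merge into one shade and come out on top. Whether this helps with a real photo is something I have not verified.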

No matter who you root for in the Super Bowl - go Taylor.

Here is my R code:

library(png)
library(ggplot2)
library(grid)
library(colouR)
library(magick)

xpos <- c(0,0,0)
ypos <- c(0,0,0)
df <- data.frame(xpos = xpos, ypos = ypos)

# downloaded from
# https://img.etimg.com/thumb/msid-100921419,width-300,height-225,imgsize-50890,resizemode-75/taylor-swift-mitchell-taebel-from-indiana-arrested-for-stalking-threatening-singer.jpg

img_path <- "C:/Users/Jerry/Desktop/R_files/taylor/taylor_swift.png"
img <- readPNG(img_path, native = TRUE)
height <- nrow(img) # 457
width <- ncol(img) # 584
img <- rasterGrob(img, interpolate = TRUE)

# print onto grid
ggplot(data = df,
       aes(xpos, ypos)) +
  xlim(0, width) + ylim(0, height) +
  geom_blank() +
  annotation_custom(img, xmin = 0, xmax = width, ymin = 0, ymax = height)

#############################################
# choose dimensions of subset rectangle

crop_width <- 105
crop_height <- 47
x1 <- 215 # offset from left
y1 <- 300 # offset from top

# must read in as magick object (magick is already loaded above)
img <- image_read('C:/Users/Jerry/Desktop/R_files/taylor/taylor_swift.png')
# print(img)

# crop format: <width>x<height>{+-}<xoffset>{+-}<yoffset>
geometry <- sprintf("%dx%d+%d+%d", crop_width, crop_height, x1, y1) # "105x47+215+300"
cropped_img <- image_crop(img, geometry)
print(cropped_img) # lips only
image_write(cropped_img, path = "C:/Users/Jerry/Desktop/R_files/taylor/lips1.png", format = "png")

##############################################

# extract top colors of lips image

top10 <- colouR::getTopCol(path = "C:/Users/Jerry/Desktop/R_files/taylor/lips1.png",
n = 10, avgCols = FALSE, exclude = FALSE)
top10

# plot
ggplot(top10, aes(x = hex, y = freq, fill = hex)) +
  geom_bar(stat = 'identity') +
  labs(title = "Top 10 colors by frequency") +
  xlab("HEX colour code") + ylab("Frequency") +
  theme(
    legend.position = "none", # hide the legend
    plot.title = element_text(size = 15, face = "bold"),
    axis.title = element_text(size = 15, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 12, face = "bold"))

# End
##################################################################################

Tuesday, January 2, 2024

Using great circle distance to determine Twin Cities

In the US we think of Minneapolis and St. Paul as the Twin Cities, but Ben Orlin, author of Math with Bad Drawings, https://mathwithbaddrawings.com/, posed this data science question: Which U.S. cities are the true twin cities? He imposed three conditions:
  1. the cities must be at most 10 miles apart,
  2. each city must have at least 200,000 people, and
  3. the populations have to be within a ratio of 2:1.
This seemed like a nice data analysis problem, so I searched for a dataset containing both population and location. Interestingly, each of population and location has its nuances, and I learned a lot more from this problem than I expected.
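The three conditions can be written as a single test on a pair of cities. This is a minimal sketch; the function name is_twin and its argument names are my own, and the populations in the usage lines are made-up values for illustration.

```r
# Hypothetical helper: TRUE when a pair of cities meets all three conditions.
is_twin <- function(dist_miles, pop1, pop2,
                    max_dist = 10, min_pop = 200000, max_ratio = 2) {
  close_enough <- dist_miles <= max_dist
  big_enough   <- pop1 >= min_pop & pop2 >= min_pop
  similar_size <- pmax(pop1, pop2) / pmin(pop1, pop2) <= max_ratio
  close_enough & big_enough & similar_size
}

# Made-up pairs, for illustration only.
print(is_twin(5, 300000, 450000))   # close, both large, ratio 1.5
print(is_twin(5, 300000, 700000))   # ratio above 2
print(is_twin(12, 300000, 450000))  # too far apart
```

Because it uses the vectorized operators & and pmax/pmin, the same function also works on whole columns of distances and populations.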

I found that https://simplemaps.com/data/us-cities has a dataset of 30,844 cities containing latitude, longitude, and population. However, when Minneapolis’ population came out as 2.9 million, Ben recognized that the population shown was for the broader metropolitan area, not the city proper. I got a second dataset, with populations of cities proper, from https://www.census.gov/data/tables/time-series/demo/popest/2020s-total-cities-and-towns.html. I joined the two datasets and used the populations from the second one.
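One nuance of joining the two datasets is that city names repeat across states. The sketch below uses toy data with made-up coordinates and populations to show the difference between joining on city alone and joining on city and state together. The data frame and column names mirror my files only loosely and should be read as assumptions.

```r
# Toy data with made-up coordinates and populations, for illustration only.
locations <- data.frame(
  city       = c("Springfield", "Springfield", "Miami"),
  state_name = c("Illinois", "Missouri", "Florida"),
  lat        = c(39.8, 37.2, 25.8),
  lng        = c(-89.6, -93.3, -80.2)
)
populations <- data.frame(
  city  = c("Springfield", "Springfield", "Miami"),
  state = c("Illinois", "Missouri", "Florida"),
  pop   = c(110000, 170000, 450000)
)

# Joining on city alone pairs every Springfield with every Springfield.
by_city <- merge(locations, populations, by = "city")

# Joining on city and state keeps one row per real city.
by_city_state <- merge(locations, populations,
                       by.x = c("city", "state_name"),
                       by.y = c("city", "state"))

print(nrow(by_city))
print(nrow(by_city_state))
```

Joining on city alone and then dropping rows where the two state columns disagree gets to the same place; joining on both keys just avoids creating the mismatched rows in the first place.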

How do you measure the distance between two cities? This is not as simple a question as it sounds.

As a start, I used https://www.distancecalculator.net/ and entered two cities I am familiar with, New York, NY and Hoboken, NJ. That website calculated a distance of 2.39 miles, and it provided a map. The site further clarified that it used the great circle distance formula. So this raises two questions: How does it decide which two points to measure from, and what is a great circle distance?

Hoboken has an area of 1.97 square miles, so it probably doesn’t matter too much which point in Hoboken to use. New York City has an area of 472.43 square miles, so it does matter which point to use. It is not obvious which point it used, and it certainly did not use the closest point, but from other work, I believe it used City Hall or something close.

Although some sites will measure distance between two cities as driving distance, it is more common to use great circle distance, which is the shortest distance along the surface of a sphere. The earth is not exactly a sphere, but it is approximately a sphere.

Latitude and longitude form a coordinate system to describe any point on the earth’s surface. Lines of latitude are horizontal lines parallel to the Equator, and represent how far north or south a point is from the Equator. Lines of longitude represent how far a point is east or west from a vertical line called the Prime Meridian that runs through Greenwich, England. Both latitude and longitude are measured in degrees, which are broken down into smaller units called minutes and seconds. For convenience they are also expressed in decimal degrees. If D is degrees, M is minutes, and S is seconds, then the conversion to decimal degrees is D + M/60 + S/3600.
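The conversion is short enough to sketch as an R function. The name dms_to_decimal and the hemisphere argument are my own; by the usual convention, south latitudes and west longitudes are negative in decimal degrees.

```r
# Hypothetical helper: degrees, minutes, seconds to decimal degrees.
# hemisphere is one of "N", "S", "E", "W"; S and W give negative values.
dms_to_decimal <- function(d, m, s, hemisphere = "N") {
  value <- d + m / 60 + s / 3600
  ifelse(hemisphere %in% c("S", "W"), -value, value)
}

# Example: 40 degrees 41 minutes 39.48 seconds N, 73 degrees 55 minutes 29.64 seconds W
lat <- dms_to_decimal(40, 41, 39.48, "N")
lng <- dms_to_decimal(73, 55, 29.64, "W")
print(c(lat, lng))
```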

When we use trig functions to calculate distances, we need to convert decimal degrees to radians by multiplying by π / 180. We also need the radius of the earth, which is approximately 3,963 miles at the Equator.

If point A is (lat1, long1) and point B is (lat2, long2), both in decimal degrees, then the great-circle distance d between them on a perfect sphere can be computed with the formula below. Strictly speaking, the formula as written is the spherical law of cosines. The haversine formula is an equivalent form derived from the same principles of three-dimensional spherical trigonometry; it uses the haversine of an angle θ, defined as hav(θ) = sin²(θ/2).

d = ACOS(SIN(PI()*[Lat_start]/180.0)*SIN(PI()*[Lat_end]/180.0)+COS(PI()*[Lat_start]/180.0) *COS(PI()*[Lat_end]/180.0)*COS(PI()*[Long_start]/180.0-PI()*[Long_end]/180.0))*3963
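Here is a minimal R sketch of the formula above next to the haversine form, so the two can be compared. The function names are my own, 3963 miles is the radius used in this post, and the pmin/pmax guards are my addition to keep rounding error from pushing a value slightly outside the range that acos and asin accept.

```r
earth_radius_miles <- 3963  # radius used in this post

to_radians <- function(deg) deg * pi / 180

# Spherical law of cosines, matching the spreadsheet-style formula above.
dist_cosines <- function(lat1, long1, lat2, long2, r = earth_radius_miles) {
  p1 <- to_radians(lat1)
  p2 <- to_radians(lat2)
  dl <- to_radians(long1 - long2)
  cos_angle <- sin(p1) * sin(p2) + cos(p1) * cos(p2) * cos(dl)
  r * acos(pmin(1, pmax(-1, cos_angle)))
}

# Haversine form: hav(angle) = hav(dlat) + cos(lat1) * cos(lat2) * hav(dlong).
dist_haversine <- function(lat1, long1, lat2, long2, r = earth_radius_miles) {
  p1 <- to_radians(lat1)
  p2 <- to_radians(lat2)
  dl <- to_radians(long1 - long2)
  a <- sin((p2 - p1) / 2)^2 + cos(p1) * cos(p2) * sin(dl / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# One degree of latitude along a meridian should equal r * pi / 180.
one_degree <- dist_cosines(0, 0, 1, 0)
print(one_degree)
print(dist_haversine(0, 0, 1, 0))
```

Both functions return the same distance for ordinary inputs. The haversine form is generally considered better behaved when the two points are very close together.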

As I mentioned, I used https://simplemaps.com/data/us-cities which contains cities with their latitude and longitude, and I applied the above distance formula to pairs of cities. But I was still curious about the choice of a latitude and longitude for a particular city. That file lists New York City as (40.6943, -73.9249). Another website that finds a street address from a decimal latitude and longitude, https://www.mapdevelopers.com/reverse_geocode_tool.php , lists the address of (40.6943, -73.9249) as 871 Bushwick Avenue, Brooklyn, which is some distance from City Hall, but does not appear to be the centroid of New York City either. Wikipedia's choice of latitude and longitude for New York City is 42 Park Row which is close to City Hall.

The following map from https://www.mapdevelopers.com/reverse_geocode_tool.php?lat=40.694300&lng=-73.924900&zoom=12 shows the approximate location of 871 Bushwick Avenue, Brooklyn.

I joined the dataset of locations with the populations of cities proper from the second dataset, and I applied Ben’s three conditions. This produced 8 pairs of cities. Of course, this list depends on the first dataset’s choice of latitude and longitude for each city and the distances resulting from that; different choices of latitude and longitude would produce a different list.

Of these pairs, I actually like Hialeah and Miami as the true twin cities. Besides meeting the original three criteria, they both share the same large ethnic population and they share a public transportation system.

Wikipedia has a much larger list of twin cities, https://en.wikipedia.org/wiki/Twin_cities, but it does not necessarily use Ben Orlin's three criteria. Also, Ben's problem is for US cities only, and Wikipedia has several pairs of Canada-US and Mexico-US cities that I had not thought about.

Here is the R code I used:

df1 <- read.csv("https://raw.githubusercontent.com/fcas80/jt_files/main/uscities.csv")
df1 <- subset(df1, select = c(city, lat, lng, state_name))
n1 <- nrow(df1) # 30844

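library("readxl")
# read_excel() cannot read directly from a URL, so download to a temporary file first
tmp <- tempfile(fileext = ".xlsx")
download.file("https://raw.githubusercontent.com/fcas80/jt_files/main/censuspop.xlsx", destfile = tmp, mode = "wb")
df2 <- read_excel(tmp, skip = 3)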
df2 <- df2[ -c(1,4:6) ]
colnames(df2) <- c("city", "pop")
# city appears as format Los Angeles city, California
df2$state <- gsub(".*\\, ", "", df2$city) # extract state: everything after comma blank
df2$city <- gsub("\\,.*", "", df2$city) # extract everything before comma
df2$city <- gsub(" city*", "", df2$city) # delete: blank city

df <- merge(x = df1, y = df2, by = "city", all = TRUE)
df <- na.omit(df)
df <- df[df$pop >= 200000, ]
df <- df[df$state_name == df$state, ] # delete improper merge same city in 2 states
df <- subset(df, select = -c(state_name, state))
n <- nrow(df) # 112

kount <- 1
df11 <- data.frame()
for (i in 1:n){
      Lat_start <- df[i,2]
      Long_start <- df[i,3]
      for (j in 1:n){
            Lat_end <- df[j,2]
            Long_end <- df[j,3]
            # great-circle distance in miles (spherical law of cosines)
            dist_miles <- acos(sin(pi*(Lat_start)/180.0)*sin(pi*(Lat_end)/180.0)+
                  cos(pi*(Lat_start)/180.0)*cos(pi*(Lat_end)/180.0)*cos(pi*
                  (Long_start)/180.0-pi*(Long_end)/180.0))*3963
            dist_miles <- round(dist_miles, 0)
            pop_ratio <- round(max(df[i,4]/df[j,4], df[j,4]/df[i,4]),1)
            if (df[i,1] != df[j,1] & dist_miles > 0 & dist_miles <= 10 & pop_ratio <= 2){
            df11[kount,1] <- df[i,1]
            df11[kount,2] <- df[j,1]
            df11[kount,3] <- dist_miles
            df11[kount,4] <- df[i,4]
            df11[kount,5] <- df[j,4]
            df11[kount,6] <- pop_ratio
            df11[kount,7] <- df[i,4] + df[j,4]
            kount <- kount + 1
            }
      }
}
colnames(df11) <- c("City1", "City2", "Dist", "Pop1", "Pop2", "Ratio", "TotPop")
df11 <- df11[!duplicated(df11$TotPop), ] # remove duplicates
df11 <- df11[ -c(7) ]
df11 <- data.frame(df11, row.names = NULL) # renumber rows consecutively
df11