How do I count thee? Let me count the ways?


Saturday, October 29, 2022

Find the next number in the sequence

Between ages two and four, most children can count up to at least ten.

If you ask your child, "What number comes next after 1, 2, 3, 4, 5?" they will probably say "6."

But to math nerds, any number can be the next number in a finite sequence. I like -14.

Given a sequence of n real numbers f(x1), f(x2), f(x3), ... , f(xn), there is always a mathematical procedure that produces a next number f(xn+1), where n+1 is the subscript. The result may not satisfy students, but it is mathematically logical.

I can draw a smooth curve through the points (1,1), (2,2), (3,3), (4,4), (5,5), (6, -14). If I can find an equation for that smooth curve, then I know my answer of -14 has some logic to it. Actually many equations will work.

In my example, one equation is of the form y = (x-1)*(x-2)*(x-3)*(x-4)*(x-5)*(A/120) + x. The product is zero at x = 1, 2, 3, 4, and 5, so y = x at those points. When x is 6, the product is 5*4*3*2*1 = 120, so the first term reduces to A, and A is chosen so that A + 6 equals the -14 I want. So A is -20. This is called a collocation polynomial.
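As a quick check of that arithmetic, here is a small R sketch that evaluates the equation with A = -20 at x = 1 through 6. The function name collocation is only an illustrative label I made up, not something from a package.

```r
# Collocation polynomial from the text; `collocation` is a hypothetical helper name
collocation <- function(x, A = -20) {
  # the product term is zero at x = 1..5 and equals 120 at x = 6
  (x - 1) * (x - 2) * (x - 3) * (x - 4) * (x - 5) * (A / 120) + x
}

values <- collocation(1:6)
print(values)   # the six y values of the sequence, ending in -14
```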

There is a theorem that for n+1 distinct values of xi and their corresponding yi values, there is a unique polynomial P of degree at most n with P(xi) = yi. One method to find P is to use polynomial regression: with n+1 points and a degree-n model, the fitted curve passes through every point exactly. Another way is to use Newton's Forward Difference Formula (probably no longer taught in Numerical Analysis courses), which applies when the xi are equally spaced.
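For readers who have not seen Newton's Forward Difference Formula, the following is a minimal R sketch of it, assuming equally spaced x values. The function name newton_forward and its arguments (x0 for the first x, h for the spacing) are my own illustrative choices.

```r
# Newton's forward difference formula for equally spaced points.
# `newton_forward`, `x0` and `h` are hypothetical names used for this sketch.
newton_forward <- function(x0, h, y, x) {
  n <- length(y) - 1
  # leading forward differences: y0, delta y0, delta^2 y0, ..., delta^n y0
  deltas <- numeric(n + 1)
  d <- y
  for (k in 0:n) {
    deltas[k + 1] <- d[1]
    if (k < n) d <- diff(d)
  }
  s <- (x - x0) / h
  # P(x) = sum over k of choose(s, k) * delta^k y0
  sapply(s, function(si) sum(choose(si, 0:n) * deltas))
}

y <- c(1, 2, 3, 4, 5, -14)
fitted_values <- newton_forward(x0 = 1, h = 1, y = y, x = 1:6)
next_value <- newton_forward(x0 = 1, h = 1, y = y, x = 7)
print(fitted_values)   # reproduces the six given values
print(next_value)      # the degree-5 polynomial's value at x = 7
```

For this sequence the only nonzero leading differences are the first two and the last one, so the formula collapses to 1 + (x-1) - 20*choose(x-1, 5), which is the same collocation polynomial as above.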

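Polynomials of degree higher than n are one reason why additional equations will work: adding any multiple of a polynomial that is zero at every given x leaves the curve unchanged at the given points.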
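One way to see why higher-degree polynomials also work is to add a term that is zero at all six x values. In this sketch the constant B is arbitrary, and the function names base_curve and higher_degree are made up for illustration.

```r
# degree-5 curve through the six points (same equation as in the text)
base_curve <- function(x) {
  (x - 1) * (x - 2) * (x - 3) * (x - 4) * (x - 5) * (-20 / 120) + x
}

# degree-6 curve: the extra term vanishes at x = 1..6, so B can be anything
higher_degree <- function(x, B) {
  base_curve(x) + B * (x - 1) * (x - 2) * (x - 3) * (x - 4) * (x - 5) * (x - 6)
}

same_points <- sapply(c(0, 1, -2.5), function(B) higher_degree(1:6, B))
print(same_points)   # every column reproduces the six given points

at_seven <- c(higher_degree(7, B = 0), higher_degree(7, B = 1))
print(at_seven)      # but the curves disagree about what comes next
```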

The equation does not have to be a polynomial either; rational functions, among others, will also work.
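Here is one rational function that passes through the same six points. It is only one of many possible choices; the denominator 1 + (x-6)^2 is something I picked because it equals 1 at x = 6 and is never zero.

```r
# A rational function through (1,1), ..., (5,5), (6,-14); `rational_curve` is a hypothetical name
rational_curve <- function(x) {
  numerator <- 20 * (x - 1) * (x - 2) * (x - 3) * (x - 4) * (x - 5)
  denominator <- 120 * (1 + (x - 6)^2)   # equals 120 at x = 6, never zero
  x - numerator / denominator
}

rational_points <- rational_curve(1:6)
rational_next <- rational_curve(7)
print(rational_points)
print(rational_next)   # differs from the polynomial's value at x = 7
```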

Of course the next number after -14 can be any number. It could be 7 :)

There are many famous sequences, and of course someone catalogued them.

Here is some R code.

xpoints <- c(1,2,3,4,5,6)
ypoints <- c(1,2,3,4,5,-14)
x <- seq(from=1, to=6, by=.01)
# collocation polynomial with A = -20, evaluated on a fine grid
y <- (x-1)*(x-2)*(x-3)*(x-4)*(x-5)*(-20/120) + x
plot(xpoints, ypoints, pch=18, type="p", cex=2, col="blue", xlim=c(1,6), ylim=c(-14,6), xlab="x", ylab="y")
lines(x,y, pch = 19, cex=1.3, col = "red")
# degree-5 polynomial regression: 6 coefficients for 6 points
fit <- lm(ypoints ~ xpoints + I(xpoints^2) + I(xpoints^3) +I(xpoints^4) +I(xpoints^5) )
b <- unname(coef(fit))    # intercept first, then the coefficients of x, x^2, ..., x^5
x <- seq(from=1, to=6, by=.01)
z <- b[1]+b[2]*x+b[3]*x^2+b[4]*x^3+b[5]*x^4+b[6]*x^5
plot(xpoints, ypoints, pch=18, type="p", cex=2, col="blue", xlim=c(1,6), ylim=c(-14,6), xlab="x", ylab="y")
lines(x,z, pch = 19, cex=1.3, col = "red")

More great R blogs at r-bloggers.com

Saturday, September 17, 2022

How do I count thee? Let me count the ways?

by Jerry Tuttle   

In Major League Baseball, a player who hits 50 home runs in a single season has hit a lot of home runs. Suppose I want to count the number of 50-homer seasons by team, and also the number of 50-homer seasons by New York Yankees players. (I will count Maris and Mantle in 1961 as two.) The data, including Aaron Judge's 62 in 2022, is defined in the R code at the end of this post.

You would think base R would have a count function such as count(df$Team) and count(df$Team == "NYY") but this gives the error "could not find function 'count'". Base R does not have a count function.

Base R has at least four ways to perform a count:

1. The table function counts the items in a vector. table(df$Team) presents results horizontally, and data.frame(table(df$Team)) presents results vertically. table(df$Team == "NYY") displays 37 FALSE and 10 TRUE, while table(df$Team == "NYY")[2] displays just the 10 TRUE.

2. The sum function can count the number of rows meeting a condition. sum(df$Team == "NYY") displays the result 10. Here df$Team == "NYY" creates a logical vector, and sum adds it up, counting each TRUE as 1 and each FALSE as 0.

3. Similar to sum, nrow(df[df$Team == "NYY", ]) counts the number of rows meeting the NYY condition.

4. The length function counts the number of elements in an R object. length(which(df$Team == "NYY")), length(df$Team[df$Team == "NYY"]), and length(grep("NYY", df[ , "Team"])) all count the 10 Yankees.
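One caution about the grep approach: grep matches a pattern anywhere in the string, while == requires an exact match. The two agree for "NYY" in my data, but they can differ for a code that also appears inside a combined entry such as "OAK/SLC". The small vector below is a made-up toy example, not the real data.

```r
# toy vector of team codes (made up for illustration)
team <- c("OAK", "BAL", "OAK/SLC", "SEA", "OAK")

exact_count    <- sum(team == "OAK")               # exact matches only
pattern_count  <- length(grep("OAK", team))        # also matches "OAK/SLC"
anchored_count <- length(grep("^OAK$", team))      # anchors make grep exact

print(c(exact = exact_count, pattern = pattern_count, anchored = anchored_count))
```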

The more direct solution to counting uses the count function in the dplyr package. Note that dplyr's count function applies to a data frame or tibble, but not to a vector. After loading library(dplyr):

1. df %>% count(Team) lists the count for each team.

2. df %>% filter(Team == "NYY") lists each Yankee, and you can see there are 10.

3. df %>% count(Team == "NYY") displays 37 false and 10 true, while df %>% filter(Team == "NYY") %>% count() just displays the 10 true.
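Two small variations that I find handy are sketched below on a made-up toy data frame: count's sort argument puts the largest count first, and nrow after filter returns a plain number rather than a one-row data frame. This assumes the dplyr package is installed.

```r
library(dplyr)

# toy data frame (a made-up subset, for illustration only)
toy <- data.frame(Player = c("Ruth", "Maris", "Mays", "Sosa", "Judge"),
                  Team   = c("NYY", "NYY", "NYG", "CHC", "NYY"))

by_team <- toy %>% count(Team, sort = TRUE)          # one row per team, largest count first
n_nyy   <- toy %>% filter(Team == "NYY") %>% nrow()  # a single number

print(by_team)
print(n_nyy)
```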

The following is a bar chart of the results by team for teams with at least one 50-homer season:

Finally, "How do I count thee? Let me count the ways?" is of course adapted from Elizabeth Barrett Browning's poem "How do I love thee? Let me count the ways." But in her poem, just how would we count the number of times "love" is mentioned? The tidytext library makes counting words fairly easy, and the answer is ten, the same as the number of 50-homer Yankee seasons. Coincidence?

The following is all the R code. Happy counting!

library(dplyr)
library(ggplot2)
library(tidytext)

df <- data.frame(
   Player=c('Ruth','Ruth','Ruth','Ruth','Wilson','Foxx','Greenberg','Foxx','Kiner','Mize','Kiner','Mays','Mantle','Maris', 'Mantle','Mays','Foster','Fielder','Belle','McGwire','Anderson','McGwire','Griffey','McGwire','Sosa','Griffey', 'Vaughn','McGwire','Sosa','Sosa','Bonds','Sosa','Gonzalez','Rodriguez','Rodriguez','Thome','Jones','Howard','Ortiz', 'Rodriguez','Fielder','Bautista','Davis','Stanton','Judge','Alonso','Judge'),
   Year=c(1920,1921,1927,1928,1930,1932,1938,1938,1947,1947,1949,1955,1956,1961,1961,1965,1977,1990,1995,1996,1996,1997,1997, 1998,1998,1998,1998,1999,1999,2000,2001,2001,2001,2001,2002,2002,2005,2006,2006,2007,2007,2010,2013,2017,2017,2019,2022),
   Homers=c(54,59,60,54,56,58,58,50,51,51,54,51,52,61,54,52,52,51,50,52,50,58,56,70,66,56,50,65,63,50,73,64,57,52,57,52,51, 58,54,54,50,54,53,59,52,53,62),
   Team=c('NYY','NYY','NYY','NYY','CHC','PHA','DET','BOS','PIT','NYG','PIT','NYG','NYY','NYY','NYY','SF','CIN','DET','CLE', 'OAK','BAL','OAK/SLC','SEA','SLC','CHC','SEA','SD','SLC','CHC','CHC','SF','CHC','ARI','TEX','TEX','CLE','ATL','PHP', 'BOS','NYY','MIL','TOR','BAL','MIA','NYY','NYM','NYY'))

head(df)

# base R ways to count:

table(df$Team)    # shows results horizontally
data.frame(table(df$Team))    #shows results vertically
table(df$Team == "NYY")    # displays 37 false and 10 true
table(df$Team == "NYY")[2]

sum(df$Team == "NYY")    # displays the result 10.

nrow(df[df$Team == "NYY", ])    # counts the number of rows meeting the NYY condition.

length(which(df$Team == "NYY"))     # which returns a vector of indices which are true
length(df$Team[df$Team == "NYY"])
length(grep("NYY", df[ , "Team"]))     # grep returns a vector of indices that match the pattern

# dplyr R ways to count; remember to load library(dplyr):

df %>% count(Team)    # lists the count for each team.

df %>% filter(Team == "NYY")    # lists each Yankee, and you can see there are 10.

df %>% count(Team == "NYY")    # displays 37 false and 10 true, while
df %>% filter(Team == "NYY") %>% count()    # just displays the 10 true.

# barplot of all teams with at least 1 50 homer season; remember to load library(ggplot2)

df %>%
    group_by(Team) %>%
    summarise(count = n()) %>%
    ggplot(aes(x=reorder(Team, count), y=count, fill=Team)) +
    geom_bar(stat = 'identity') +
    ggtitle("Count of 50 Homer Seasons") +
    xlab("Team") +
    scale_y_continuous(breaks=1:10) +
    coord_flip() +
    theme(plot.title = element_text(face="bold", size=18)) +
    theme(axis.title.y = element_text(face="bold")) +
    theme(axis.title.x = element_blank()) +
    theme(axis.text.x = element_text(size=12, face="bold"),
    axis.text.y = element_text(size=12, face="bold")) +
    theme(legend.position="none")

# count number of times "love" is mentioned in Browning's poem; remember to load library(tidytext)

textfile <- c("How do I love thee? Let me count the ways.",
"I love thee to the depth and breadth and height",
"My soul can reach, when feeling out of sight",
"For the ends of being and ideal grace.",
"I love thee to the level of every day's",
"Most quiet need, by sun and candle-light.",
"I love thee freely, as men strive for right.",
"I love thee purely, as they turn from praise.",
"I love thee with the passion put to use",
"In my old griefs, and with my childhood's faith.",
"I love thee with a love I seemed to lose",
"With my lost saints. I love thee with the breath,",
"Smiles, tears, of all my life; and, if God choose,",
"I shall but love thee better after death.")

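df <- data.frame(line=seq_along(textfile), text=textfile)    # one row per line of the poem
df_words <- df %>% unnest_tokens(word, text)    # one row per lowercased word
cleaned_words <- df_words %>% anti_join(get_stopwords(), by = "word")    # drop common stop words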
cleaned_words %>% count(word, sort = TRUE) %>% head(6)
cleaned_words %>% filter(word == "love") %>% count()
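As a cross-check that does not need tidytext, here is a base R sketch of the same idea: lowercase the text, split it into words, and count the matches. To keep it short I use only two lines of the poem, so the count here is for those two lines, not the whole poem. The variable names are my own.

```r
# two lines of the poem, enough to show the technique
poem_lines <- c("I love thee with a love I seemed to lose",
                "With my lost saints. I love thee with the breath,")

# lowercase, then split on anything that is not a letter or an apostrophe
words <- unlist(strsplit(tolower(poem_lines), "[^a-z']+"))
words <- words[words != ""]    # drop any empty strings left by the split

love_count <- sum(words == "love")
print(love_count)
```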


Wednesday, June 22, 2022

What is the difference between statistics and data analysis?


Of course, to answer this we need to define those terms, and definitions of such things are hardly standard. But they are not particularly standard in other disciplines either. Can you define art? Music? How about mathematics?

Would you have defined mathematics as "including such topics as numbers, formulas and related structures, shapes and the spaces in which they are contained, and quantities and their changes," as in Wikipedia? Is this all-encompassing?

Statistics and data analysis have some overlaps. Both involve defining, exploring, cleaning, visualizing, and describing data. Data analysis students study some traditional statistics. Statistics students nowadays study some data analysis.

The father and daughter Larose team have suggested a working distinction of inferential statistics versus data mining (so neither of these is identical to the terms in the first sentence above) as follows:

Inferential statistics involves having a prior hypothesis about a population and testing that hypothesis with a sample from that population. The test may result in statistical significance, even if there is no practical significance.
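The gap between statistical and practical significance can be illustrated with simulated data. In the sketch below, every number (the seed, the sample size, the means, and the standard deviation) is my own assumption chosen for illustration; it is not taken from Larose. With a million observations per group, a difference of 0.2 on a scale whose standard deviation is 15 comes out statistically significant even though it is tiny in practical terms.

```r
set.seed(73)                                # arbitrary seed so the run is repeatable
n <- 1e6                                    # very large samples (assumed)
group_a <- rnorm(n, mean = 100,   sd = 15)
group_b <- rnorm(n, mean = 100.2, sd = 15)  # true difference is only 0.2

result <- t.test(group_a, group_b)          # Welch two-sample t-test
observed_diff <- unname(diff(result$estimate))   # mean of b minus mean of a

print(result$p.value)    # compare with the usual 0.05 cutoff
print(observed_diff)     # compare with the standard deviation of 15
```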

Data mining does not begin with a prior hypothesis, but rather the analyst "freely trolls through the data for actionable results." (Larose, p. 161)

Larose, D.T. & Larose, C.D. (2015). Data mining and predictive analytics. Wiley. Hoboken, NJ.