How do I count thee? Let me count the ways?

Sheldon Cooper's favorite number

      If you are a fan of the television series "The Big Bang Theory", then you know Sheldon often wears a shirt with 73 ...

Wednesday, July 24, 2024

The distribution has changed; and pretty tables in base R

The distribution has changed; and pretty tables in base R, by Jerry Tuttle

      So you spent hours, or maybe days, cranking out thousands of numbers, you submit it to your boss just at the deadline, your boss quickly peruses your exhibit of numbers, points to a single number and says, "This number doesn't look right." Bosses have an uncanny ability to do this.

      Your boss is pointing to something like this: Your company sells property insurance on both personal and commercial properties. The average personal property premium increased 10% in 2024. The average commercial property premium increased 10% in 2024. But you say the combined average property premium decreased 3% in 2024. You realize that negative 3% does not look right.

      You might have made an input error or a calculation error, but you don't want to admit to that. So you blurt out, "That's because the distribution has changed." And to your relief, the boss buys into that.

      "The distribution has changed" is probably a pretty good answer in more instances than you realize. A more common example is if your investment portfolio starts at 90% stocks and 10% bonds, but you have a good year with stocks, at year-end the distribution of your portfolio has changed to 94% stocks and 6% bonds, and you may want to rebalance your portfolio. How about the distribution of population by state? It has definitely changed since the prior Census. It might be hard to think of a non-trivial example of something where the distribution has not changed.

      The key calculations are the 2023 average premium is (2000 * 90 + 10000 * 10) / (90 + 10) = 2800, the 2024 average premium is (2200 * 110 + 11000 * 7) / (110 + 7) = 2726, and so the average premium change is 2726 / 2800 = .97 (rounded) = - 3%. And the distribution between personal and commercial DID change, as measured either by the distribution of number of policies (was 90% personal, now 94%) or by the distribution of premium (was 64% personal, now 76%). So because the distribution has changed towards more smaller personal policies, this pulls the combined average premium down, even though the average premium for each of the two separate subgroups has increased. (There are alternatives to the key calculations, such as weighting the percentage changes, instead of weighting the average premiums.)

      Years ago I posed a dilemma like this during job interviews for actuarial trainees to see how well they would respond to a sort of non-routine problem, but I decided it was too difficult.

      I did the above exhibit in Excel because it was quick. It was also pretty easy to add custom colors to alternating rows, which I did via FONT > FILL > MORE COLORS > CUSTOM > enter HEX CODE. Here I chose cornsilk #fff8dc and cyan #00ffff for fun.

      Then I wondered how easy it would be to make a pretty table in R. If you google something like "pretty tables in R", you will find a number of R packages that create HTML type code that can be saved as an HTML file, a PDF file, or another file format. Much has been written about these packages, but they seem a little complicated for basic work, and further, I like the idea of staying exclusively within the R environment. When I realized a table is just a collection of rectangles, it occurred to me that the base R commands of rectangle and text are pretty much all I need.

      Here is a table of sample rectangles with text, written in R. The rectangle syntax is rect(xleft, ybottom, xright, ytop, col, border) and the text syntax is text(x, y, labels, col, cex, font). The numerical axis is helpful when first defining the rectangles, but can be deleted by adding axes = FALSE to the plot function for the final table.

# rectangle syntax: rect(xleft, ybottom, xright, ytop)
plot(x = c(0, 500), y = c(0, 500), type= "n", xlab = "", ylab = "", main = "Sample rectangles with text", cex=2, font=2)

rect(0, 0, 250, 250, col = "#E41A1C", border = "blue")
text(125, 125, "red rectangle, white font", col="white", cex=1.15, font=2)

rect(250, 0, 500, 250, col = "yellow", border = "blue")
text(375, 125, "yellow rectangle, blue font", col="navyblue", cex=1.15, font=2)

rect(0, 250, 250, 500, col = "cornsilk", border = "blue")
text(125, 375, "cornsilk rectangle, black font", col="black", cex=1., font=2)

rect(250, 250, 500, 500, col = "cyan", border = "blue")
text(375, 375, "cyan rectangle, purple font", col="purple", cex=1.15, font=2)

      The R equivalent of the Excel exhibit is the following. Note that all code is in base R.

title <- c("Subgroups increase, but the average decreases")
df <- data.frame(Personal = c(2000, 90, 2200, 110, 10),
Commercial = c(10000, 10, 11000, 7, 10),
Weighted = c(2800, 100, 2726, 117, -3))

rownames(df) = c("2023 Avg Prem", "2023 No. Policies","2024 Avg Prem", "2024 No. Policies", "Avg Prem % Change")
df

# rectangle syntax: rect(xleft, ybottom, xright, ytop)
op <- par(bg = "thistle")
col1 = "cornsilk"
col2 = "cyan"

plot(x = c(0, 500), y = c(0, 700), type= "n", xlab = "", ylab = "", axes = FALSE)
height = 100

rect(0, 6*height, 500, 7*height, col = col1, border = "blue")
text(250, 6.5*height, title, col="black", cex=1.25, font=2)

rect(0, 5*height, 200, 6*height, col = col2, border = "blue")
rect(200, 5*height, 300, 6*height, col = col2, border = "blue")
rect(300, 5*height, 400, 6*height, col = col2, border = "blue")
rect(400, 5*height, 500, 6*height, col = col2, border = "blue")

text(100, 5.5*height, "", col="blue")
text(250, 5.5*height, colnames(df)[1], col="blue")
text(350, 5.5*height, colnames(df)[2], col="blue")
text(450, 5.5*height, colnames(df)[3], col="blue")

rect(0, 4*height, 200, 5*height, col = col1, border = "blue")
rect(200, 4*height, 300, 5*height, col = col1, border = "blue")
rect(300, 4*height, 400, 5*height, col = col1, border = "blue")
rect(400, 4*height, 500, 5*height, col = col1, border = "blue")

text(100, 4.5*height, rownames(df)[1], col="blue")
text(250, 4.5*height, df[1,1], col="blue")
text(350, 4.5*height, df[1,2], col="blue")
text(450, 4.5*height, df[1,3], col="blue")

rect(0, 3*height, 200, 4*height, col = col2, border = "blue")
rect(200, 3*height, 300, 4*height, col = col2, border = "blue")
rect(300, 3*height, 400, 4*height, col = col2, border = "blue")
rect(400, 3*height, 500, 4*height, col = col2, border = "blue")

text(100, 3.5*height, rownames(df)[2], col="blue")
text(250, 3.5*height, df[2,1], col="blue")
text(350, 3.5*height, df[2,2], col="blue")
text(450, 3.5*height, df[2,3], col="blue")

rect(0, 2*height, 200, 3*height, col = col1, border = "blue")
rect(200, 2*height, 300, 3*height, col = col1, border = "blue")
rect(300, 2*height, 400, 3*height, col = col1, border = "blue")
rect(400, 2*height, 500, 3*height, col = col1, border = "blue")

text(100, 2.5*height, rownames(df)[3], col="blue")
text(250, 2.5*height, df[3,1], col="blue")
text(350, 2.5*height, df[3,2], col="blue")
text(450, 2.5*height, df[3,3], col="blue")

rect(0, height, 200, 2*height, col = col2, border = "blue")
rect(200, height, 300, 2*height, col = col2, border = "blue")
rect(300, height, 400, 2*height, col = col2, border = "blue")
rect(400, height, 500, 2*height, col = col2, border = "blue")

text(100, 1.5*height, rownames(df)[4], col="blue")
text(250, 1.5*height, df[4,1], col="blue")
text(350, 1.5*height, df[4,2], col="blue")
text(450, 1.5*height, df[4,3], col="blue")

rect(0, 0, 200, height, col = col1, border = "blue")
rect(200, 0, 300, height, col = col1, border = "blue")
rect(300, 0, 400, height, col = col1, border = "blue")
rect(400, 0, 500, height, col = col1, border = "blue")

text(100, .5*height, rownames(df)[5], col="blue")
text(250, .5*height, paste(df[5,1], "%"), col="black", cex=1.5)
text(350, .5*height, paste(df[5,2], "%"), col="black", cex=1.5)
text(450, .5*height, paste(df[5,3], "%"), col="black", cex=2, font=2)

par(op)

# END

#######################################

2 comments:

  1. There were a few minor errors and inconsistencies in the code that won't let it be replicable:

    - The line `col1 = "cornsilk"v col2 = "cyan"` should be corrected to `col1 = "cornsilk"; col2 = "cyan"`.
    - Some of the rectangle and text plotting commands are concatenated with a v instead of being on separate lines or separated by a semicolon.
    - The original code repeatedly draws rectangles and places text in a very repetitive and verbose manner. This could be simplified using loops or functions to improve readability and maintainability.

    # Data preparation
    title <- "Subgroups increase, but the average decreases"
    df <- data.frame(
    Personal = c(2000, 90, 2200, 110, 10),
    Commercial = c(10000, 10, 11000, 7, 10),
    Weighted = c(2800, 100, 2726, 117, -3)
    )
    rownames(df) <- c("2023 Avg Prem", "2023 No. Policies", "2024 Avg Prem", "2024 No. Policies", "Avg Prem % Change")

    # Define colors
    col1 <- "cornsilk"
    col2 <- "cyan"

    # Function to draw rectangles and text
    draw_rect_text <- function(ybottom, ytop, colors, data_row = NULL, rowname = NULL, is_percentage = FALSE) {
    rect(0, ybottom, 200, ytop, col = colors[1], border = "blue")
    rect(200, ybottom, 300, ytop, col = colors[1], border = "blue")
    rect(300, ybottom, 400, ytop, col = colors[1], border = "blue")
    rect(400, ybottom, 500, ytop, col = colors[1], border = "blue")
    if (!is.null(rowname)) text(100, (ybottom + ytop) / 2, rowname, col = "blue")
    if (!is.null(data_row)) {
    text(250, (ybottom + ytop) / 2, if (is_percentage) paste(data_row[1], "%") else data_row[1], col = "blue")
    text(350, (ybottom + ytop) / 2, if (is_percentage) paste(data_row[2], "%") else data_row[2], col = "blue")
    text(450, (ybottom + ytop) / 2, if (is_percentage) paste(data_row[3], "%") else data_row[3], col = "blue")
    }
    }

    # Plot setup
    op <- par(bg = "thistle")
    plot(x = c(0, 500), y = c(0, 700), type = "n", xlab = "", ylab = "", axes = FALSE)
    height <- 100

    # Title
    rect(0, 6 * height, 500, 7 * height, col = col1, border = "blue")
    text(250, 6.5 * height, title, col = "black", cex = 1.25, font = 2)

    # Headers
    rect(0, 5 * height, 500, 6 * height, col = col2, border = "blue")
    text(100, 5.5 * height, "", col = "blue")
    text(250, 5.5 * height, colnames(df)[1], col = "blue")
    text(350, 5.5 * height, colnames(df)[2], col = "blue")
    text(450, 5.5 * height, colnames(df)[3], col = "blue")

    # Data rows
    draw_rect_text(4 * height, 5 * height, col1, df[1, ], rownames(df)[1])
    draw_rect_text(3 * height, 4 * height, col2, df[2, ], rownames(df)[2])
    draw_rect_text(2 * height, 3 * height, col1, df[3, ], rownames(df)[3])
    draw_rect_text(height, 2 * height, col2, df[4, ], rownames(df)[4])
    draw_rect_text(0, height, col1, NULL, NULL, is_percentage = TRUE)
    text(100, .5*height, rownames(df)[5], col="blue")
    text(250, .5*height, paste(df[5,1], "%"), col="black", cex=1.5)
    text(350, .5*height, paste(df[5,2], "%"), col="black", cex=1.5)
    text(450, .5*height, paste(df[5,3], "%"), col="black", cex=2, font=2)

    # Reset par
    par(op)

    I've left the last part the same, as the text format changes drastically from the rest. And indeed, it could be written even less verbose.

    ReplyDelete
  2. Thanks JOA. The v's were an error where I intended to type Ctrl V. I made a correction for this. I agree that a loop or function is preferable for repeatability.

    ReplyDelete