Note: Click on the Code button to see the R code for each table and figure. The R Markdown file that made this document is here, reproducible by cloning this repository: https://github.com/judgelord/813/

How do we model logit in R?

Logit is just a linear model predicting log odds.

load("data/EX4.Rdata") # loads data as "d"

# generalized linear model with logit link function
model <- glm(Probat ~ Take + Report + Night + Convict, 
             data=d, 
             family=binomial(link="logit"))

knitr::kable(tidy(model), digits = 3)
| term        | estimate | std.error | statistic | p.value |
|:------------|---------:|----------:|----------:|--------:|
| (Intercept) |    4.235 |     0.720 |     5.879 |   0.000 |
| Take        |    0.000 |     0.000 |    -1.302 |   0.193 |
| Report      |    1.696 |     0.511 |     3.318 |   0.001 |
| Night       |   -0.264 |     0.439 |    -0.600 |   0.548 |
| Convict     |   -2.202 |     0.281 |    -7.833 |   0.000 |

How do we interpret a logit?

Interpreting log odds is tricky. Thus we transform the estimated log odds, \(\hat{Y}\), into predicted probabilities at relevant potential values of our variables. Log odds can be converted into proportions/probabilities using the inverse logit function:

\(p = \frac{e^{\hat{Y}}}{1+e^{\hat{Y}}}\)

But most people have a hard time doing this in their head. A logistic regression intercept of 4 corresponds to odds of \(e^4 \approx\) 54.6, meaning that probation was about 55 times more likely than no probation for the reference group (0 convictions, no report, daytime crime, $0 taken, etc.). However, interpreting the other model coefficients is not as straightforward. This is because the simple inverse function calculates log odds relative to 0, whereas each regression coefficient is relative to the log odds of everything else in the regression, including the intercept. To get a meaningful estimate of the change in probability, we have to run the inverse logit function on the predicted value for each condition (i.e., plug the coefficients into the regression equation). Differences in predicted probabilities (e.g., between report and no report) thus depend on the values of the other coefficients.
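To make this concrete, here is a minimal sketch using the coefficients from the model above and base R's plogis(), which is the inverse logit:

b <- coef(model)

# Log odds for the reference group (no report, daytime, $0 taken, 0 convictions)
# and for the same case with a report
logodds_no_report <- b["(Intercept)"]
logodds_report    <- b["(Intercept)"] + b["Report"]

# plogis() is the inverse logit: p = exp(z) / (1 + exp(z))
plogis(logodds_no_report)                          # roughly 0.99
plogis(logodds_report)                             # roughly 1
plogis(logodds_report) - plogis(logodds_no_report) # a small difference, near the ceiling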

With linear OLS regression, model coefficients have a straightforward interpretation: a coefficient \(\beta\) means that for every one-unit increase in \(x\), the model predicts a \(\beta\)-unit increase in \(\hat{Y}\) (the predicted value of the dependent variable). Technically, a logistic regression coefficient means the same thing: as \(x\) goes up by 1, the log odds go up by \(\beta\). However, log odds are difficult to interpret, especially changes in log odds: an increase of \(\beta\) in the log odds translates into a different change in probability depending on the log odds it increased from.
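For example (purely illustrative numbers, not estimates from this model), the same increase in the log odds implies very different changes in probability depending on the starting point:

# An increase of 1.7 in the log odds, starting from log odds of 0 (p = 0.5)
plogis(0 + 1.7) - plogis(0)   # roughly +0.35

# The same increase, starting from log odds of 3 (p is already about 0.95)
plogis(3 + 1.7) - plogis(3)   # roughly +0.04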

tl;dr: interpreting logit means calculating marginal effects and predicted probabilities (just like last week).

Predicted probability

augment() gives us predicted outcomes and standard errors. The newdata argument specifies values at which we want to predict. The default fitted value is on the scale of the linear predictors. type.predict = "response" instead returns fitted values on the response scale, applying the inverse link function (for logit, this means transforming log odds into predicted probabilities).

# A data frame of values at which to estimate probabilities:
values <- d %>% 
  expand(Take = mean(Take), Report, Night = 1, Convict)

predicted <- augment(model,  
                     type.predict = "response",
                     newdata = values
                     ) 

# Naming things a bit better
predicted %<>% 
  mutate(Convict = str_c(Convict, " Conviction(s)")) %>%
  rename(Convictions = Convict) %>% 
  mutate(Report = ifelse(Report == 1, "Report", "No Report"))

# As a table
predicted %>% 
  select(Report, Convictions, .fitted) %>%
  mutate(.fitted = round(.fitted, 1)) %>% 
  group_by(Convictions, Report) %>%
  spread(key = Report, value = ".fitted") %>% 
  knitr::kable(caption = "Probability of probation")
Probability of probation

| Convictions     | No Report | Report |
|:----------------|----------:|-------:|
| 0 Conviction(s) |       1.0 |    1.0 |
| 1 Conviction(s) |       0.8 |    1.0 |
| 2 Conviction(s) |       0.3 |    0.7 |
| 3 Conviction(s) |       0.0 |    0.2 |
| 4 Conviction(s) |       0.0 |    0.0 |
| 5 Conviction(s) |       0.0 |    0.0 |
# As a plot
predicted %>% 
  ggplot() + 
  aes(x = Report, y = .fitted, color = Report) + 
  geom_pointrange(aes(ymin = .fitted - 1.96*.se.fit,
                  ymax = .fitted + 1.96*.se.fit) )  + 
  coord_flip() +
  facet_wrap("Convictions", ncol = 1) +
  labs(y = "Probability", x = "") + 
  scale_color_discrete() + 
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

How can we calculate the uncertainty in these estimates?

Instead of relying on functions like augment, Zelig, or margins, we can estimate uncertainty more directly by sampling the assumed distribution of our parameters.

The central limit theorem implies that, with a large enough sample and bounded variance, the sampling distribution of our parameter estimates is approximately multivariate normal, so we can simulate it with draws from a multivariate normal distribution.

Consider 10, 100, and 1,000 draws for one parameter (i.e., one dimension of our multivariate normal):

library(mvtnorm) # for multivariate normals

draw <- function(n){
  # rnorm(n) returns n draws from a normal with a given mean and sd.
  # rmvnorm(n) returns n draws from a multivariate normal, given a vector of
  # means (here, our betas) and sigma, a covariance matrix (here, the estimated
  # covariances between the parameter estimates).
  rmvnorm(n = n, 
          mean = model$coefficients, 
          sigma = vcov(model)) %>%
    as_tibble() %>% 
    rename_all(~str_c("beta_", .)) %>% 
    mutate(n = n)
}
  
draws <- map_dfr(c(10, 100, 1000), draw)

draws %>% 
  ggplot() +
  geom_dotplot(aes(x = beta_Report), binwidth = .02) +
  geom_vline(xintercept = tidy(model) %>% 
               filter(term == "Report") %>% 
               .$estimate, 
             color = "blue") +
  facet_grid(n ~ ., scales = "free_y") + 
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

The same draws, plotted on two parameters (i.e., two dimensions of our multivariate normal):

p <- draws %>% 
  filter(n == 1000) %>% 
  ggplot() +
  aes(x = beta_Report, y = beta_Convict) +
  geom_point()

p +   geom_vline(xintercept = tidy(model) %>% 
               filter(term == "Report") %>% 
               .$estimate, 
             color = "blue")

p +
  geom_rug(alpha = .2, color = "blue") + 
  # Report (beta_3) mean, variance
  geom_label(aes(x = mean(beta_Report), 
                 y = min(beta_Convict)), 
             label= str_c(round(model$coefficients[3], 1),
                          round(vcov(model),1)[3,3],
                          sep = ", "), 
             color = "blue", vjust = 0)+ 
  # Convict (beta_5) mean, variance 
  geom_label(aes(x = min(beta_Report), 
                 y = mean(beta_Convict)), 
             label = str_c(round(model$coefficients[5], 1),
                          round(vcov(model),1)[5,5],
                          sep = ", "), 
             color = "blue", hjust = 0)

p + geom_smooth(se = F, method = "lm") + # (just to illustrate, not exactly the covariance)
  geom_label(aes(x = mean(beta_Report), 
                 y = mean(beta_Convict)), 
             label= round(vcov(model),1)[3,5], 
             color = "blue")

# Select the estimated covariances between the parameter estimates for Report and Convict
vcov(model)[c(3,5),c(3,5)] %>% 
  round(1) %>% 
  knitr::kable(caption = "Variance-Covariance Matrix")
Variance-Covariance Matrix

|         | Report | Convict |
|:--------|-------:|--------:|
| Report  |    0.3 |    -0.1 |
| Convict |   -0.1 |     0.1 |

To understand uncertainty in logit, we must understand the relationship between covariance and predicted probability. Imagining a multivariate (multi-dimensional) distribution with more than two parameters can be tricky. The important thing to remember is that each parameter estimate has a variance and covaries with the other estimates. This means that, just as our estimated effects of one variable are conditional on the values of the others (we’re talking about marginal effects), the uncertainty around predicted probabilities is also conditional.

Using the variance-covariance matrix, we can write a function to estimate a distribution of predicted probabilities using a large number of random draws from the assumed multivariate normal distribution of our model estimates. That is, we draw from a multivariate normal distribution, setting the means to our model parameter estimates and \(\sigma\) to those parameters’ variance-covariance matrix, then calculate the log odds, and, finally, a predicted probability of probation for each draw.

In Stata:

/* Calculation of Logit Bounds in Exercise 4 */
/* for a particular set of values of the independent variables */
/* After estimating logit, get coefficient and covariances */
/* (Important: enter variables in order below) */
/* matrix define b=e(b) */
/* matrix define V=e(V) */
/* Execute this program, then execute the following */
/* Logit_bounds 1 2 3 4 */
/* where 1 is the value of report */
/* 2 is the value of convict */
/* 3 is the value of take */
/* 4 is the value of night */
/* */
/* */
capture program drop Logit_bounds
program define Logit_bounds
drop _all
drawnorm c_report c_convict c_take c_night c_cons, means(b) cov(V) cstorage(full) n(1000)
generate z=`1'*c_report+`2'*c_convict+`3'*c_take+`4'*c_night+c_cons
generate p = 1/(1+exp(-z))
sum p, d
end

In R:

Logit_bounds <- function(model, Take, Report, Night, Convict){
  predictions <- rmvnorm(n = 1000, 
                         mean = model$coefficients, 
                         sigma = vcov(model)) %>%
    as_tibble() %>% 
    # Add a prefix to be clear that these are our betas
    rename_all(~str_c("beta_", .)) %>% 
    # z = log odds (the linear combination of predictors)
    mutate(z = `beta_(Intercept)` + beta_Take*Take + beta_Report*Report + beta_Night*Night + beta_Convict*Convict) %>% 
    # p = probability. Apply the logistic (inverse logit) function to the log odds
    mutate(p = 1/(1+exp(-z)) ) %>%
    # Add values to the data frame
    mutate(Take = Take,
           Report = Report,
           Night = Night,
           Convict = Convict)
  return(predictions)
}

First, let’s test out our function:

Logit_bounds(model = model, 
             Take = mean(d$Take),
             Report = 1, 
             Night = 1, 
             Convict = 0) %>%
  glimpse() %>% 
  summarise(lower = quantile(p, probs = .1),
            upper = quantile(p, probs = .9),
            mean = mean(p))
## Observations: 1,000
## Variables: 11
## $ `beta_(Intercept)` <dbl> 3.587311, 3.983680, 3.695530, 4.682815, 4.438…
## $ beta_Take          <dbl> -5.904074e-05, -1.224851e-04, -8.322423e-05, …
## $ beta_Report        <dbl> 1.573238, 1.907170, 1.319511, 2.016811, 2.148…
## $ beta_Night         <dbl> -0.63383721, -0.42392537, -0.42813799, -0.148…
## $ beta_Convict       <dbl> -1.887404, -2.151516, -1.736672, -2.529019, -…
## $ z                  <dbl> 4.224367, 4.839684, 4.160715, 5.934871, 6.126…
## $ p                  <dbl> 0.9855765, 0.9921525, 0.9846431, 0.9973614, 0…
## $ Take               <dbl> 5120.95, 5120.95, 5120.95, 5120.95, 5120.95, …
## $ Report             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Night              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Convict            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## # A tibble: 1 x 3
##   lower upper  mean
##   <dbl> <dbl> <dbl>
## 1 0.986 0.998 0.993

We can apply this function to interesting possible values of our explanatory variables with pmap from purrr:

# a data frame of possible values
values <- d %>% 
  expand(Take = mean(Take), Report, Night, Convict)

# map values to Logit_bounds function, return a dataframe
predicted <- pmap_dfr(.l = values, # the values at which to run the function
                      .f = Logit_bounds, # the function
                      model = model) # an argument to the function that does not change

predicted %<>% 
  mutate(Convict = str_c(Convict, " Conviction(s)")) %>%
  mutate(Report = ifelse(Report == 1, "Report", "No Report"))

# A table
predicted %>%
  group_by(Take, Convict, Night, Report) %>% 
  summarise(lower = quantile(p, probs = .1),
            upper = quantile(p, probs = .9),
            mean = mean(p)) %>% 
  arrange(mean) %>%
  knitr::kable(caption = "Predicted probabilities of probation, with bounds") 
Predicted probabilities of probation, with bounds

| Take    | Convict         | Night | Report    |     lower |     upper |      mean |
|:--------|:----------------|------:|:----------|----------:|----------:|----------:|
| 5120.95 | 5 Conviction(s) |     1 | No Report | 0.0001358 | 0.0020936 | 0.0009260 |
| 5120.95 | 5 Conviction(s) |     0 | No Report | 0.0001974 | 0.0026952 | 0.0012614 |
| 5120.95 | 5 Conviction(s) |     1 | Report    | 0.0008729 | 0.0095759 | 0.0041741 |
| 5120.95 | 5 Conviction(s) |     0 | Report    | 0.0011967 | 0.0128164 | 0.0057894 |
| 5120.95 | 4 Conviction(s) |     1 | No Report | 0.0017787 | 0.0126782 | 0.0064857 |
| 5120.95 | 4 Conviction(s) |     0 | No Report | 0.0022684 | 0.0156508 | 0.0081924 |
| 5120.95 | 4 Conviction(s) |     1 | Report    | 0.0114604 | 0.0588400 | 0.0316975 |
| 5120.95 | 4 Conviction(s) |     0 | Report    | 0.0132104 | 0.0774012 | 0.0400476 |
| 5120.95 | 3 Conviction(s) |     1 | No Report | 0.0222813 | 0.0798581 | 0.0472974 |
| 5120.95 | 3 Conviction(s) |     0 | No Report | 0.0274964 | 0.1043505 | 0.0610827 |
| 5120.95 | 3 Conviction(s) |     1 | Report    | 0.1131040 | 0.3132733 | 0.2040027 |
| 5120.95 | 3 Conviction(s) |     0 | Report    | 0.1424017 | 0.3703989 | 0.2465397 |
| 5120.95 | 2 Conviction(s) |     1 | No Report | 0.1948618 | 0.3869562 | 0.2876755 |
| 5120.95 | 2 Conviction(s) |     0 | No Report | 0.2350821 | 0.4576084 | 0.3456722 |
| 5120.95 | 2 Conviction(s) |     1 | Report    | 0.5415061 | 0.8005519 | 0.6739648 |
| 5120.95 | 2 Conviction(s) |     0 | Report    | 0.6140356 | 0.8336342 | 0.7286002 |
| 5120.95 | 1 Conviction(s) |     1 | No Report | 0.7004944 | 0.8466938 | 0.7760056 |
| 5120.95 | 1 Conviction(s) |     0 | No Report | 0.7389226 | 0.8876758 | 0.8154005 |
| 5120.95 | 1 Conviction(s) |     1 | Report    | 0.9084832 | 0.9741654 | 0.9462674 |
| 5120.95 | 1 Conviction(s) |     0 | Report    | 0.9201637 | 0.9814726 | 0.9543743 |
| 5120.95 | 0 Conviction(s) |     1 | No Report | 0.9427498 | 0.9842333 | 0.9664082 |
| 5120.95 | 0 Conviction(s) |     0 | No Report | 0.9549286 | 0.9890337 | 0.9738756 |
| 5120.95 | 0 Conviction(s) |     1 | Report    | 0.9852347 | 0.9977282 | 0.9925138 |
| 5120.95 | 0 Conviction(s) |     0 | Report    | 0.9889254 | 0.9984255 | 0.9943564 |
# A plot of a few of those 
predicted %>%
  ggplot() +
  aes(x = p, 
      fill = Report,
      color = Report) +
  geom_density(alpha = .5, trim = T) +
  facet_wrap(.~ Convict, 
             ncol = 1,
             scales = "free_y") +
  scale_color_discrete()+ 
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

Look familiar?

To get distributions of predicted probability at specific values, you can either run the function on lots of values of interest and subset (like the plot above does) or run the function on just one set of values (i.e., for one condition) of interest.
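For instance, a sketch of the subsetting approach, pulling one condition out of the predictions made above (the particular condition chosen here is just an example):

predicted %>% 
  filter(Convict == "2 Conviction(s)", 
         Report == "Report", 
         Night == 1) %>% 
  summarise(lower = quantile(p, probs = .1),
            upper = quantile(p, probs = .9),
            mean = mean(p))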

How do we interpret the results? What are logit bounds?

Good question. “Logit_bounds” may not be the best name for the function above; “Logit_distributions” or “Logit_predictions” might be better, since it does not (yet) calculate bounds (i.e., confidence intervals). What it does is create distributions, each with a mean and, critically, a variance, for the predicted probabilities of interest. We can use these distributions to estimate confidence intervals around particular probabilities of interest in order to comment on the extent to which the department should continue writing reports (i.e., under what conditions, if any, can we say these reports matter).
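For example, one quantity we might care about is how much a report changes the probability of probation for a given type of case. A sketch of one way to get an interval around that difference, reusing rmvnorm as above and computing both probabilities from the same draws so the covariance between the estimates is respected (the specific values plugged in here are just an illustration):

# Draw betas once, then compute both predicted probabilities from the same draws
beta_draws <- rmvnorm(n = 1000, 
                      mean = model$coefficients, 
                      sigma = vcov(model)) %>% 
  as_tibble() %>% 
  rename_all(~str_c("beta_", .))

diffs <- beta_draws %>% 
  # log odds with and without a report, for a daytime crime with the mean
  # amount taken and two prior convictions
  mutate(z_report    = `beta_(Intercept)` + beta_Take*mean(d$Take) + beta_Report*1 + beta_Night*0 + beta_Convict*2,
         z_no_report = `beta_(Intercept)` + beta_Take*mean(d$Take) + beta_Report*0 + beta_Night*0 + beta_Convict*2,
         # plogis() is the inverse logit, 1/(1 + exp(-z))
         difference = plogis(z_report) - plogis(z_no_report))

# An interval around the change in probability due to a report
quantile(diffs$difference, probs = c(.05, .5, .95))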

What are the probabilities of interest?

This is a theoretical question. What might policymakers care about?

Can we just use confint?

Yes, in practice, we just use the functions under the hood of augment(), Zelig, margins, and confint(). The point of this exercise is to understand the concept of a confidence interval around a predicted probability coming from a logit model. Unlike with linear regression, we can’t just calculate standard errors directly from the variance. Notice that confint uses simple \(t\) values for models of class "lm", but for class "glm" it requires the variance-covariance matrix and calls on more complex functions to compute profile-likelihood intervals. In short, the math for logit is not easy; it is easier to understand by looking at how the assumed distribution of the parameters translates into a distribution of probabilities.
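A quick sketch of the comparison (confint() on a glm may take a moment while it profiles the likelihood):

# Profile-likelihood intervals (what confint() computes for glm objects)
confint(model)

# Wald-type intervals built directly from the variance-covariance matrix,
# roughly estimate ± 1.96 * standard error
confint.default(model)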

How do we run the following commands from Stata in R? estat gof, estat classification, lroc, lsens.

All of these functions return statistics about a model; the latter three about how well it predicts observed outcomes.

estat returns various model statistics. For many of these statistics, we might use summary(), glance(), augment(), or vcov(), depending on what exactly we are after.

estat classification returns a classification table.

lroc computes the area under the receiver operating characteristic (ROC) curve, and lsens runs a sensitivity analysis. I know ROCR and other packages do both of these things.

estat gof is a goodness-of-fit test (after logit, Pearson’s chi-squared, or Hosmer-Lemeshow with the group() option, rather than the F-test we saw last week).
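For example, rough R analogues of estat classification, lroc, and lsens might look like this (a sketch using the ROCR package mentioned above; these approximate what the Stata commands report rather than replicate them exactly):

library(ROCR)

# estat classification: observed outcomes vs. classifications at a .5 cutoff
table(observed = d$Probat,
      classified = as.numeric(fitted(model) > .5))

# lroc: the ROC curve and the area under it
roc_pred <- prediction(predictions = fitted(model), labels = d$Probat)
plot(performance(roc_pred, measure = "tpr", x.measure = "fpr"))
performance(roc_pred, measure = "auc")@y.values[[1]]

# lsens: sensitivity and specificity across probability cutoffs
plot(performance(roc_pred, measure = "sens"))
plot(performance(roc_pred, measure = "spec"))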

When should we do sensitivity analyses?

We do a sensitivity analysis to evaluate a classification method. Any classification method has a sensitivity/specificity tradeoff; some give relatively more false negatives and others more false positives. Before we can evaluate our classification, we need to do some classifying. Let’s start by classifying each case to which our model gave a probability of less than .5 as a 0 (no probation) and the rest as a 1 (probation). We can just round our predicted probabilities.

predicted <- augment(model,  
                     type.predict = "response"
                     ) %>% 
  mutate(classification = round(.fitted)) %>% 
  mutate(type = ifelse(classification == 0 & Probat == 0, "True negative",
                ifelse(classification == 1 & Probat == 1, "True positive",
                ifelse(classification == 1 & Probat == 0, "False positive",
                "False negative")))
         )

predicted %>% select(Probat, .fitted, classification, type)
## # A tibble: 280 x 4
##    Probat .fitted classification type          
##     <dbl>   <dbl>          <dbl> <chr>         
##  1      0 0.793                1 False positive
##  2      1 0.978                1 True positive 
##  3      0 0.253                0 True negative 
##  4      0 0.0389               0 True negative 
##  5      1 0.683                1 True positive 
##  6      0 0.00845              0 True negative 
##  7      1 0.958                1 True positive 
##  8      1 0.997                1 True positive 
##  9      1 0.242                0 False negative
## 10      1 0.970                1 True positive 
## # … with 270 more rows
table <- predicted %>% group_by(type) %>%
  count() 
table
## # A tibble: 4 x 2
## # Groups:   type [4]
##   type               n
##   <chr>          <int>
## 1 False negative    15
## 2 False positive    16
## 3 True negative    103
## 4 True positive    146

Sensitivity (recall) = True Positives / All Actual Positives = Correctly predicted probation / all probation cases = 146 / (146 + 15) ≈ 0.91

Specificity = True Negatives / All Actual Negatives = Correctly predicted no probation / all no-probation cases = 103 / (103 + 16) ≈ 0.87
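A sketch of computing these directly from the classification counts above:

counts <- predicted %>% 
  count(type) %>% 
  spread(key = type, value = n)

# Sensitivity: correctly classified probation cases / all actual probation cases
counts$`True positive` / (counts$`True positive` + counts$`False negative`)

# Specificity: correctly classified no-probation cases / all actual no-probation cases
counts$`True negative` / (counts$`True negative` + counts$`False positive`)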

How do we interpret sensitivity analyses?

An ROC curve measures these statistics at different possible cut-off points.

# devtools::install_github("sachsmc/plotROC")
library(plotROC)

predicted %>% 
  mutate(Report = ifelse(Report == 1, "Report", "No Report")) %>%
  ggplot() + 
  geom_roc(aes(d = Probat, m = .fitted)) + 
  style_roc() + 
  facet_wrap("Report")

One use for this sensitivity-specificity analysis is choosing a cut-off point (e.g., for decision-making or for assigning cases to groups).

Another is to assess model performance. If a model performs well across possible cut-off points, the ROC curve will hug the upper-left corner of the plot. Area Under the Curve (AUC) is a summary measure of model performance.
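plotROC also provides calc_auc() to pull the AUC out of a plot built with geom_roc() (a sketch, computing the overall AUC rather than by report status):

roc_plot <- predicted %>% 
  ggplot() + 
  geom_roc(aes(d = Probat, m = .fitted)) + 
  style_roc()

calc_auc(roc_plot)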

More resources

A tutorial on interpreting logit coefficients

An R-bloggers post on simulating confidence intervals

A tutorial with some model diagnostics

Interactions in Logit with margins

Logistic regression is core to deep learning; conversely, logistic regression can be viewed as a simple kind of neural network

Logistic Regression with a Neural Networks Mindset

A nice walk-through of classification with logit

A great video tutorial explaining Support Vector Machines

Another example from a post on Stack Exchange:

df <- read.csv("https://sciences.ucf.edu/biology/d4lab/wp-content/uploads/sites/125/2018/11/parasites.txt", header = T)

m1 <- glm(data=df, infected ~ age + weight + sex, family = "binomial") # add spaces to variables separated by arithmetic operators
link_func <- m1$family$linkinv # maybe this could become a generic function

# anonymous functions are quick and easy to type, my preference if only one input arg
newdat_func <- . %>% # meant to start with df
  dplyr::select(weight, age) %>% # keep only column of interest
  map(~ round(seq(min(.), max(.), length.out = 15))) %>% # don't repeat yourself and call the same operation on both columns in one line
  c(list(sex = c("female", "male"))) %>% # prep a 3-element list for expand.grid to process
  expand.grid()

newdat2 <- newdat_func(df)

# fall back to traditional function format for multiple inputs
x_func <- function(model, newdata, link_func) {
  predict.glm(model, newdata = newdata, type="link", se=TRUE) %>% # obviously this only works on glm objects, you could add checks to be defensive
    keep(~ length(.) == nrow(newdata)) %>% # drop the third element that is length 1
    bind_cols() %>% # build data frame with a column from each list element
    mutate(low = fit - 1.96 * se.fit,
           high = fit + 1.96 * se.fit) %>%
    mutate_all(funs(link_func)) %>% # again don't repeat yourself
    bind_cols(newdata) %>% # bolt back on simulated predictors
    mutate(category = cut(age,
                          breaks = c(0, 69, 138, 206),
                          labels = c("0-69", "70-138", "139-206")),
           age = as.factor(age))
}

x2 <- x_func(m1, newdat2, link_func)

ggplot(data = x2, aes(x = weight)) + # always use spaces around '+' and '=', do ggplot(data = data) +
  geom_line(aes(y = fit, color = age)) +
  geom_ribbon(aes(ymin = low, ymax = high, fill = age), alpha = 0.1) + # okay is all on one line (<80 chars)
  facet_grid(category ~ sex)