Note: Click on the Code
button to see the R code for each table and figure. The R Markdown file that made this document is here, reproducible by cloning this repository: https://github.com/judgelord/813/
load("data/EX4.Rdata") # loads data as "d"
# general linear model with logit link function
model <- glm(Probat ~ Take + Report + Night + Convict,
data=d,
family=binomial(link="logit"))
knitr::kable(tidy(model), digits = 3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.235 | 0.720 | 5.879 | 0.000 |
Take | 0.000 | 0.000 | -1.302 | 0.193 |
Report | 1.696 | 0.511 | 3.318 | 0.001 |
Night | -0.264 | 0.439 | -0.600 | 0.548 |
Convict | -2.202 | 0.281 | -7.833 | 0.000 |
Interpreting log odds is tricky. Thus we transform the estimated log odds, \(\hat{Y}\), into predicted probabilities at relevant potential values of our variables. Log odds can be converted into proportions/probabilities using the inverse logit function:
\(p = e^\hat{Y} / 1+e^\hat{Y}\)
But most people have a hard time doing this in their head. A logistic regression intercept of 4 corresponds to odds of \(e^2=\) 54.59815, meaning that probation was about 55 times more likely than no probation for the reference group (0 convictions, no report, daytime crime, 0$ taken, etc.). However, interpreting other model coefficients is not as straightforward. This is because the simple inverse function calculates log odds relative to 0, whereas the regression coefficients are relative to the log odds of everything else in the regression including the intercept. To get a meaningful estimate of the change in probability, we have to run the logit function on the predicted value for each condition (i.e., plugging the coefficients into the regression equation). Differences in predicted probabilities (e.g., between report and no report) depend on the value of the other coefficients.
With linear OLS regression, model coefficients have a straightforward interpretation: a model coefficient \(\beta\) means that for every one-unit increase in \(x\), the model predicts a \(\beta\)-unit increase in \(\hat{Y}\) (the predicted value of the dependent variable). Technically, the logistic regression coefficient means the same thing: as \(x\) goes up by 1, the log odds go up by \(\beta\). However, log odds are difficult to interpret, especially relative log odds: while we can straightforwardly calculate \(\beta\) log odds on its own, an increase of \(\beta\) in log odds means something different depending on what the log odds increased from.
augment()
gives us predicted outcomes and standard errors. The newdata
argument specifies values at which we want to predict. The default fitted value is on the scale of the linear predictors. type.predict = "response"
applies the link function to the fitted values (for logit, this means transforming log odds into predicted probability).
# A data frame of values at which to estimate probabilities:
values <- d %>%
expand(Take = mean(Take), Report, Night = 1, Convict)
predicted <- augment(model,
type.predict = "response",
newdata = values
)
# Naming things a bit better
predicted %<>%
mutate(Convict = str_c(Convict, " Conviction(s)")) %>%
rename(Convictions = Convict) %>%
mutate(Report = ifelse(Report == 1, "Report", "No Report"))
# As a table
predicted %>%
select(Report, Convictions, .fitted) %>%
mutate(.fitted = round(.fitted, 1)) %>%
group_by(Convictions, Report) %>%
spread(key = Report, value = ".fitted") %>%
knitr::kable(caption = "Probability of probation")
Convictions | No Report | Report |
---|---|---|
0 Conviction(s) | 1.0 | 1.0 |
1 Conviction(s) | 0.8 | 1.0 |
2 Conviction(s) | 0.3 | 0.7 |
3 Conviction(s) | 0.0 | 0.2 |
4 Conviction(s) | 0.0 | 0.0 |
5 Conviction(s) | 0.0 | 0.0 |
# As a plot
predicted %>%
ggplot() +
aes(x = Report, y = .fitted, color = Report) +
geom_pointrange(aes(ymin = .fitted - 1.96*.se.fit,
ymax = .fitted + 1.96*.se.fit) ) +
coord_flip() +
facet_wrap("Convictions", ncol = 1) +
labs(y = "Probability", x = "") +
scale_color_discrete() +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
Instead of relying on functions like augment
, Zelig
, or margins
, we can estimate uncertainty more directly by sampling the assumed distribution of our parameters.
The central limit theorem shows that, with a large enough sample and bounded variance, we can simulate the distribution of parameters with draws from a multivariate normal distribution.
Consider draws of 10, 100, and 1000 for one paramater (i.e., one dimension of our multivariate normal):
library(mvtnorm)# for multivariate normals
draw <- function(n){
# rnorm(n) returns n draws from a normal with a given mean and sd
# mvrnorm(n) returns n draws from a multivariate normal with a given a vector of means (here, our betas) and sigma, a covariance matrix (here, the estimated covariances between the parameter estimates)
rmvnorm(n = n,
mean = model$coefficients,
sigma = vcov(model)) %>%
as_tibble() %>%
rename_all(~str_c("beta_", .)) %>%
mutate(n = n)
}
draws <- map_dfr(c(10, 100, 1000), draw)
draws %>%
ggplot() +
geom_dotplot(aes(x = beta_Report), binwidth = .02) +
geom_vline(xintercept = tidy(model) %>%
filter(term == "Report") %>%
.$estimate,
color = "blue") +
facet_grid(n ~ ., scales = "free_y") +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
The same draws, plotted on two parameters (i.e., two dimensions of our multivariate normal):
p <- draws %>%
filter(n == 1000) %>%
ggplot() +
aes(x = beta_Report, y = beta_Convict) +
geom_point()
p + geom_vline(xintercept = tidy(model) %>%
filter(term == "Report") %>%
.$estimate,
color = "blue")
p +
geom_rug(alpha = .2, color = "blue") +
# Report (beta_3) mean, variance
geom_label(aes(x = mean(beta_Report),
y = min(beta_Convict)),
label= str_c(round(model$coefficients[3], 1),
round(vcov(model),1)[3,3],
sep = ", "),
color = "blue", vjust = 0)+
# Convict (beta_5) mean, variance
geom_label(aes(x = min(beta_Report),
y = mean(beta_Convict)),
label = str_c(round(model$coefficients[5], 1),
round(vcov(model),1)[5,5],
sep = ", "),
color = "blue", hjust = 0)
p + geom_smooth(se = F, method = "lm") + # (just to illustrate, not exactly the covariance)
geom_label(aes(x = mean(beta_Report),
y = mean(beta_Convict)),
label= round(vcov(model),1)[3,5],
color = "blue")
# Select the estimated covariances between the parameter estimates for Report and Convict
vcov(model)[c(3,5),c(3,5)] %>%
round(1) %>%
knitr::kable(caption = "Variance-Covariance Matrix")
Report | Convict | |
---|---|---|
Report | 0.3 | -0.1 |
Convict | -0.1 | 0.1 |
To understand uncertainty in logit, we must understand the relationship between covariance and predicted probability. Imagining a multivariate (multi-dimensional) distribution with more than two parameters can be tricky. The important thing to remember is that each variable has variance-covariance with the other variables. This means that, just our estimated effects of one variable are conditional on the values of the others (we’re talking about marginal effects), uncertainty around predicted probabilities is also conditional.
Using the variance-covariance matrix, we can write a function to estimate a distribution of predicted probabilities using a large number of random draws from the assumed multivariate normal distribution of our model estimates. That is, we draw from a multivariate normal distribution, setting the means to our model parameter estimates and \(\sigma\) to those parameters’ variance-covariance matrix, then calculate the log odds, and, finally, a predicted probability of probation for each draw.
In Stata:
* Calculation of Logit Bounds in Exercise 4 */
/* for a particular set of values of the independent variables */
/* After estimating logit, get coefficient and covariances */
/* (Important: enter variables in order below) */
/* matrix define b=e(b) */
/* matrix define V=e(V) */
/* Excute this program then execute the following */
/* Logit_bounds 1 2 3 4 */
/* where 1 is the value of report */
/* 2 is the value of convict */
/* 3 is the value of take */
/* 4 is the value of night */
/* */
/* */
capture program drop Logit_bounds
program define Logit_bounds
drop _all
drawnorm c_report c_convict c_take c_night c_cons, means(b) cov(V) cstorage(full) n(1000)
generate z=`1'*c_report+`2'*c_convict+`3'*c_take+`4'*c_night+c_cons
generate p = 1/(1+exp(-z))
sum p, d
end
In R:
Logit_bounds <- function(model, Take, Report, Night, Convict){
predictions <- rmvnorm(n = 1000,
mean = model$coefficients,
sigma = vcov(model)) %>%
as_tibble() %>%
# Add a prefix to be clear that these are our betas
rename_all(~str_c("beta_", .)) %>%
# z = log odds (the linear combination of predictors)
mutate(z = `beta_(Intercept)` + beta_Take*Take + beta_Report*Report + beta_Night*Night + beta_Convict*Convict) %>%
# p = probabilty. Apply the logistic (inverse logit) function to the log odds
mutate(p = 1/(1+exp(-z)) ) %>%
# Add values to the data frame
mutate(Take = Take,
Report = Report,
Night = Night,
Convict = Convict)
return(predictions)
}
First, let’s test out our function:
Logit_bounds(model = model,
Take = mean(d$Take),
Report = 1,
Night = 1,
Convict = 0) %>%
glimpse() %>%
summarise(lower = quantile(p, probs = .1),
upper = quantile(p, probs = .9),
mean = mean(p))
## Observations: 1,000
## Variables: 11
## $ `beta_(Intercept)` <dbl> 3.587311, 3.983680, 3.695530, 4.682815, 4.438…
## $ beta_Take <dbl> -5.904074e-05, -1.224851e-04, -8.322423e-05, …
## $ beta_Report <dbl> 1.573238, 1.907170, 1.319511, 2.016811, 2.148…
## $ beta_Night <dbl> -0.63383721, -0.42392537, -0.42813799, -0.148…
## $ beta_Convict <dbl> -1.887404, -2.151516, -1.736672, -2.529019, -…
## $ z <dbl> 4.224367, 4.839684, 4.160715, 5.934871, 6.126…
## $ p <dbl> 0.9855765, 0.9921525, 0.9846431, 0.9973614, 0…
## $ Take <dbl> 5120.95, 5120.95, 5120.95, 5120.95, 5120.95, …
## $ Report <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Night <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Convict <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## # A tibble: 1 x 3
## lower upper mean
## <dbl> <dbl> <dbl>
## 1 0.986 0.998 0.993
We can apply this function to interesting possible values of our explainatory variables with pmap
from purrr
:
# a data frame of possible values
values <- d %>%
expand(Take = mean(Take), Report, Night, Convict)
# map values to Logit_bounds function, return a dataframe
predicted <- pmap_dfr(.l = values, # the values at which to run the function
.f = Logit_bounds, # the function
model = model) # an argument to the function that does not change
predicted %<>%
mutate(Convict = str_c(Convict, " Conviction(s)")) %>%
mutate(Report = ifelse(Report == 1, "Report", "No Report"))
# A table
predicted %>%
group_by(Take, Convict, Night, Report) %>%
summarise(lower = quantile(p, probs = .1),
upper = quantile(p, probs = .9),
mean = mean(p)) %>%
arrange(mean) %>%
knitr::kable(caption = "Predicted probabilities of probation, with bounds")
Take | Convict | Night | Report | lower | upper | mean |
---|---|---|---|---|---|---|
5120.95 | 5 Conviction(s) | 1 | No Report | 0.0001358 | 0.0020936 | 0.0009260 |
5120.95 | 5 Conviction(s) | 0 | No Report | 0.0001974 | 0.0026952 | 0.0012614 |
5120.95 | 5 Conviction(s) | 1 | Report | 0.0008729 | 0.0095759 | 0.0041741 |
5120.95 | 5 Conviction(s) | 0 | Report | 0.0011967 | 0.0128164 | 0.0057894 |
5120.95 | 4 Conviction(s) | 1 | No Report | 0.0017787 | 0.0126782 | 0.0064857 |
5120.95 | 4 Conviction(s) | 0 | No Report | 0.0022684 | 0.0156508 | 0.0081924 |
5120.95 | 4 Conviction(s) | 1 | Report | 0.0114604 | 0.0588400 | 0.0316975 |
5120.95 | 4 Conviction(s) | 0 | Report | 0.0132104 | 0.0774012 | 0.0400476 |
5120.95 | 3 Conviction(s) | 1 | No Report | 0.0222813 | 0.0798581 | 0.0472974 |
5120.95 | 3 Conviction(s) | 0 | No Report | 0.0274964 | 0.1043505 | 0.0610827 |
5120.95 | 3 Conviction(s) | 1 | Report | 0.1131040 | 0.3132733 | 0.2040027 |
5120.95 | 3 Conviction(s) | 0 | Report | 0.1424017 | 0.3703989 | 0.2465397 |
5120.95 | 2 Conviction(s) | 1 | No Report | 0.1948618 | 0.3869562 | 0.2876755 |
5120.95 | 2 Conviction(s) | 0 | No Report | 0.2350821 | 0.4576084 | 0.3456722 |
5120.95 | 2 Conviction(s) | 1 | Report | 0.5415061 | 0.8005519 | 0.6739648 |
5120.95 | 2 Conviction(s) | 0 | Report | 0.6140356 | 0.8336342 | 0.7286002 |
5120.95 | 1 Conviction(s) | 1 | No Report | 0.7004944 | 0.8466938 | 0.7760056 |
5120.95 | 1 Conviction(s) | 0 | No Report | 0.7389226 | 0.8876758 | 0.8154005 |
5120.95 | 1 Conviction(s) | 1 | Report | 0.9084832 | 0.9741654 | 0.9462674 |
5120.95 | 1 Conviction(s) | 0 | Report | 0.9201637 | 0.9814726 | 0.9543743 |
5120.95 | 0 Conviction(s) | 1 | No Report | 0.9427498 | 0.9842333 | 0.9664082 |
5120.95 | 0 Conviction(s) | 0 | No Report | 0.9549286 | 0.9890337 | 0.9738756 |
5120.95 | 0 Conviction(s) | 1 | Report | 0.9852347 | 0.9977282 | 0.9925138 |
5120.95 | 0 Conviction(s) | 0 | Report | 0.9889254 | 0.9984255 | 0.9943564 |
# A plot of a few of those
predicted %>%
ggplot() +
aes(x = p,
fill = Report,
color = Report) +
geom_density(alpha = .5, trim = T) +
facet_wrap(.~ Convict,
ncol = 1,
scales = "free_y") +
scale_color_discrete()+
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
Look familiar?
To get distributions of predicted probability at specific values, you can either run the function on lots of values of interest and subset (like the plot above does) or run the function on just one set of values (i.e., for one condition) of interest.
Good question. “Logit_bounds” may not be the best name for the above function. “Logit_distributions” or “Logit_predictions” might be better, since it does not (yet) calculate bounds (i.e. confidence intervals). What it does do is create distributions with a mean and, critically, variance, for predicted probabilities of interest. We can use these distributions to estimate confidence intervals around certain probabilities of interest in order comment on the extent to which the department should continue writing reports (i.e., under what conditions, if any, can we say these reports matter).
This is a theoretical question. What might policymakers care about?
confint
?Yes, in practice, we just use functions under the hood of augment()
, Zelig
, margins
, and confit()
. The point of this exercise is to understand the concept of a confidence interval around a predicted probability coming from a logit model. Unlike linear regression, we can’t just calculate standard errors from variance. Notice that confint
uses simple \(t\) values for models of class "lm"
, but it requires the variance-covariance matrix for "glm"
and calls on more complex functions to compute the profile likelihood. In short, the math for Logit is not easy; it is easier to understand by looking at how the assumed distribution of the parameters translates to the distribution of probabilities.
estat gof
, estat classification
, lroc
, lsens
.All of these functions return statistics about a model; the latter three about how well it predicts observed outcomes.
estat
returns various model statistics. For many of these statistics, we might use summary()
, glance()
, augment()
, vcov()
depending on what exactly we are after.
estat classfication
returns a classification table.
lroc
computes the area under the receiver operating character (ROC) curve and lsens
runs a sensitivity analysis. I know ROCR
and other packages do both of these things.
estat gof
is a goodness of fit test, specifically the F-test (recall last week).
We do a sensitivity analysis to evaluate a classification method. Any classification method has a sensitivity/specificity tradeoff; some give relatively more false negatives and others more false positives. Before we can evaluate our classification, we need to do some classifying. Let’s start by classifying each case that our model gave probability less than .5 as a (no probation) and the rest a 1 (probation). We can just round our predicted probabilities.
predicted <- augment(model,
type.predict = "response"
) %>%
mutate(classification = round(.fitted)) %>%
mutate(type = ifelse(classification == 0 & Probat == 0, "True negative",
ifelse(classification == 1 & Probat == 1, "True positive",
ifelse(classification == 1 & Probat == 0, "False positive",
"False negative")))
)
predicted %>% select(Probat, .fitted, classification, type)
## # A tibble: 280 x 4
## Probat .fitted classification type
## <dbl> <dbl> <dbl> <chr>
## 1 0 0.793 1 False positive
## 2 1 0.978 1 True positive
## 3 0 0.253 0 True negative
## 4 0 0.0389 0 True negative
## 5 1 0.683 1 True positive
## 6 0 0.00845 0 True negative
## 7 1 0.958 1 True positive
## 8 1 0.997 1 True positive
## 9 1 0.242 0 False negative
## 10 1 0.970 1 True positive
## # … with 270 more rows
table <- predicted %>% group_by(type) %>%
count()
table
## # A tibble: 4 x 2
## # Groups: type [4]
## type n
## <chr> <int>
## 1 False negative 15
## 2 False positive 16
## 3 True negative 103
## 4 True positive 146
Specifictity = True Positive / Positive = Correctly predicted probation / probation = 0.91
Sensitivity (recall) = True Negative / Negative = Correctly predicted no probation / no probation = 0.87
An ROC curve measures these statistics at different possible cut-off points.
# devtools::install_github("sachsmc/plotROC")
library(plotROC)
predicted %>%
mutate(Report = ifelse(Report == 1, "Report", "No Report")) %>%
ggplot() +
geom_roc(aes(d = Probat, m = .fitted)) +
style_roc() +
facet_wrap("Report")
One use for this sensitivity-specificity analysis is choosing a cut-off point (e.g., for decision-making or for assigning cases to groups).
Another is to assess model performance. If a model that performs well across possible cut-off points, the ROC curve will hug the upper-left corner of the plot. Area Under the Curve (AUC) is a measure of model performance.
A tutorial on interpreting Logit coeficients
An R-bloggers post on simulating confidence intervals
A tutorial with some model diagnostics
Interactions in Logit with margins
Logistic Regression is core to deep learning, or, conversely logistic regression can be viewed as a simple kind of neural network
Logistic Regression with a Neural Networks Mindset
A nice walk-through of classification with logit
A great video tutorial explaining Support Vector Machines
Another example from a post on Stack Exchange:
df <- read.csv("https://sciences.ucf.edu/biology/d4lab/wp-content/uploads/sites/125/2018/11/parasites.txt", header = T)
m1 <- glm(data=df, infected ~ age + weight + sex, family = "binomial") # add spaces to variables separated by arithmetic operators
link_func <- m1$family$linkinv # maybe this could become a generic function
# anonymous functions are quick and easy to type, my preference if only one input arg
newdat_func <- . %>% # meant to start with df
dplyr::select(weight, age) %>% # keep only column of interest
map(~ round(seq(min(.), max(.), length.out = 15))) %>% # don't repeat yourself and call the same operation on both columns in one line
c(list(sex = c("female", "male"))) %>% # prep a 3-element list for expand.grid to process
expand.grid()
newdat2 <- newdat_func(df)
# fall back to traditional function format for multiple inputs
x_func <- function(model, newdata, link_func) {
predict.glm(model, newdata = newdata, type="link", se=TRUE) %>% # obviously this only works on glm objects, you could add checks to be defensive
keep(~ length(.) == nrow(newdata)) %>% # drop the third element that is length 1
bind_cols() %>% # build data frame with a column from each list element
mutate(low = fit - 1.96 * se.fit,
high = fit + 1.96 * se.fit) %>%
mutate_all(funs(link_func)) %>% # again don't repeat yourself
bind_cols(newdata) %>% # bolt back on simulated predictors
mutate(category = cut(age,
breaks = c(0, 69, 138, 206),
labels = c("0-69", "70-139", "139-206")),
age = as.factor(age))
}
x2 <- x_func(m1, newdat2, link_func)
ggplot(data = x2, aes(x = weight)) + # always use spaces around '+' and '=', do ggplot(data = data) +
geom_line(aes(y = fit, color = age)) +
geom_ribbon(aes(ymin = low, ymax = high, fill = age), alpha = 0.1) + # okay is all on one line (<80 chars)
facet_grid(category ~ sex)