Data from “How Legislators Actually Spend Their Time: Constituents, donors, or policy” with Justin Grimmer and Eleanor Powell (APSA 2018)

In addition to tidyverse, this example uses the tidytext, topicmodels, textfeatures, and cleanNLP packages.

library(tidyverse)
library(magrittr)    # for the %<>% assignment pipe used below
library(tidytext)
library(topicmodels)
library(textfeatures)
library(cleanNLP) ## https://statsmaths.github.io/cleanNLP/ 
load(url("https://github.com/judgelord/correspondence/raw/master/data/correspondenceTexts.Rdata"))
d <- correspondenceTexts 

# Naming things! 
d %<>% rename(Party = party_name)

First, we clean up the text a bit:

# Make a single string of stopwords separated by regex "OR" ("|")
stopwords <- str_c(stop_words$word, collapse = " | ")
# Add to the list of things to exclude
stopwords <- paste("[0-9]|", stopwords, "| senator", "| representative ", "| write", "| writes", "| writing", "| letter ")

d$SUBJECT %<>% 
  # To lower case
  tolower() %>% 
  # Remove stopwords
  str_replace_all(stopwords, " ") %>%
  # Remove numbers 
  str_replace_all("[0-9]", "")

Tokens and word counts

Single words

With tidy text, counting words is simple:

  1. unnest_tokens() splits each response into tokens (by word by default, but we can also tokenize by phrases of length n, called n-grams).

  2. [optional] anti_join(stop_words) removes words that often have little meaning, like “a” and “the”, called stop words. We can also do this with filter(!(word %in% stop_words$word))

  3. count() how many times each word appears (count(word) is like group_by(word) %>% summarize(n = n()) %>% ungroup() )

words <- d %>% 
 unnest_tokens(word, SUBJECT) %>% 
 filter(!(word %in% stop_words$word)) %>% 
 group_by(Department) %>%
 count(word, sort = TRUE) %>% 
 top_n(10) %>% 
 mutate(word = fct_inorder(word))

ggplot(words, aes(x = fct_rev(word), y = n)) + 
 geom_col() + 
 coord_flip() +
 scale_y_continuous(labels = scales::comma) +
 labs(y = "Count", x = NULL, title = "10 most frequent words") +
 facet_wrap("Department", scales = "free")

Bigrams

We can also look at the frequency of pairs of words. First, we’ll look at common bigrams, filtering out stop words again (since we don’t want things like “of the” and “in the”):

bigrams <- d %>% 
  group_by(Department) %>% 
  unnest_tokens(bigram, SUBJECT, token = "ngrams", n = 2) %>% 
  # Split the bigram column into two columns
  separate(bigram, c("word1", "word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>% 
  # Put the two word columns back together
  unite(bigram, word1, word2, sep = " ") %>% 
  count(bigram, sort = TRUE) %>% 
  top_n(10)

ggplot(bigrams, aes(x = reorder(bigram, n), y = n)) + 
  geom_col() + 
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  labs(y = "Count", x = NULL, title = "10 most frequent word pairs") +
  facet_wrap("Department", scales = "free")

Bigrams and probability

We can replicate the “She Giggles, He Gallops” idea, looking for gendered verbs by counting the bigrams that match “he X” and “she X”.

The log ratio is the factor by which a word is more likely to follow “she” than “he” in the text (positive values mean the word more often follows “she”).

pronouns <- c("he", "she")

bigram_binary_counts <- d %>%
  group_by(Party) %>% 
  unnest_tokens(bigram, SUBJECT, token = "ngrams", n = 2) %>%
  # count(bigram, sort = TRUE) %>%
  # Split the bigram column into two columns
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Only choose rows where the first word is he or she
  filter(word1 %in% pronouns) %>%
  count(word1, word2, sort = TRUE) %>%
  rename(total = n)

word_ratios <- bigram_binary_counts %>%
  # Spread out the word1 column so that there's a column named "he" and one named "she"
  spread(word1, total, fill = 0) %>%
  # Add 1 to each number so that logs work (just in case any are zero)
  mutate_if(is.numeric, funs((. + 1) / sum(. + 1))) %>%
  # Create a new column that is the logged ratio of the she counts to he counts
  mutate(logratio = log2(she / he)) %>%
  # take the absolute value
  mutate(abslogratio = abs(logratio))

word_ratios %>% 
  top_n(10, abslogratio) %>%
  mutate(word = reorder(word2, logratio)) %>%
ggplot(aes(word, logratio, color = logratio < 0)) +
  geom_segment(aes(x = word, xend = word,
  y = 0, yend = logratio), 
  size = 1.1, alpha = 0.6) +
  geom_point(size = 3.5) +
  coord_flip() +
  labs(y = "How much more/less likely", x = NULL) +
  scale_color_discrete(name = "", labels = c("More 'she'", "More 'he'")) +
  scale_y_continuous(breaks = seq(-3, 3), labels = c("8x", "4x", "2x", "Same", "2x", "4x", "8x")) +
  theme(legend.position = "bottom") + 
  facet_grid(. ~ Party)

by <- c("support", "oppose")

bigram_support_oppose_counts <- d %>% 
  # Regular expression (regex) to match any suffix
  mutate(SUBJECT = str_replace_all(SUBJECT, "support[a-z]*", "support") ) %>%
  mutate(SUBJECT = str_replace_all(SUBJECT, "oppos[a-z]*", "oppose") ) %>%
  group_by(Party) %>% 
  unnest_tokens(bigram, SUBJECT, token = "ngrams", n = 2) %>%
  # Split the bigram column into two columns
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Only choose rows where the first word is support or oppose
  filter(word1 %in% by) %>%
  count(word1, word2, sort = TRUE) %>%
  rename(total = n)

word_ratios <- bigram_support_oppose_counts %>%
  group_by(Party) %>%
  # Spread out the word1 column so that there's a column named "support" and one named "oppose"
  spread(word1, total, fill = 0) %>%
  # Add 1 to each number so that logs work (just in case any are zero)
  mutate_if(is.numeric, funs((. + 1) / sum(. + 1))) %>%
  # Create a new column that is the logged ratio of the support counts to oppose counts
  mutate(logratio = log2(support / oppose)) %>%
  # take the absolute value
  mutate(abslogratio = abs(logratio))

word_ratios %>%
  top_n(5, abslogratio) %>%
  ungroup() %>%
  group_by(Party) %>% 
  mutate(word = reorder(word2, logratio))  %>% 
ggplot( aes(word, logratio, color = logratio < 0)) +
  geom_segment(aes(x = word, xend = word,
  y = 0, yend = logratio), 
  size = 1.1, alpha = 0.6) +
  geom_point(size = 3.5) +
  coord_flip() +
  labs(y = "How much more/less likely", x = NULL) +
  scale_color_discrete(name = "", labels = c("More 'support ___'", "More 'oppose ___'")) +
  scale_y_continuous(breaks = seq(-3, 3),
  labels = c("8x", "4x", "2x",
  "Same", "2x", "4x", "8x")) +
  theme(legend.position = "bottom") + 
  facet_grid(. ~ Party)

Sentiment analysis

Sentiment analysis generally involves dictionaries (lists of words) scored as negative or positive. Some sentiment dictionaries mark whether a word is “negative” or “positive”; some give words a score from -5 to 5; some label emotions like “sadness” or “anger”. You can see what the different dictionaries look like with get_sentiments().

# Dictionaries
get_sentiments("afinn") # Scoring system
## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,466 more rows
# get_sentiments("bing") # Negative/positive
# get_sentiments("nrc") # Specific emotions
# get_sentiments("loughran") # Designed for financial statements; positive/negative

Here we split the documents into words, join a sentiment dictionary (bing) to them, and calculate the net number of positive words for each Party-Department group.

sentiment <- d %>% 
 # Split into individual words
 unnest_tokens(word, SUBJECT) %>% 
 # Join the bing sentiment dictionary
 inner_join(get_sentiments("bing")) %>% 
 # Count how many positive/negative words appear for each Party and Department
 count(Party, Department, sentiment) %>% 
 # Spread the count into two columns named positive and negative
 spread(sentiment, n, fill = 0) %>% 
 # Subtract the positive words from the negative words
 mutate(net_sentiment = positive - negative)

sentiment %>% 
ggplot() +
  aes(x = reorder(Department, abs(net_sentiment)), 
      y = net_sentiment, fill = net_sentiment > 0) +
  geom_col() +
  guides(fill = FALSE) +
  coord_flip() +
  labs(x = "Department", y = "Net sentiment") + 
  facet_wrap("Party")

tf-idf

The tf-idf (term frequency-inverse document frequency) score for each term determines which words are the most unique for each document in our corpus:

\[
\begin{aligned}
tf(\text{term}) &= \frac{n_{\text{term}}}{n_{\text{terms in document}}} \\
idf(\text{term}) &= \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)} \\
tf\text{-}idf(\text{term}) &= tf(\text{term}) \times idf(\text{term})
\end{aligned}
\]

The bind_tf_idf() function calculates this. The higher the tf-idf number, the more unique the term is in the document, but these numbers are unitless—you can’t convert them to a percentage or anything.

Here are the most unique words per Department:

# Get a list of words
words <- d %>% 
  unnest_tokens(word, SUBJECT) %>% 
  group_by(Department) %>% 
  count(word, sort = TRUE) %>% 
  ungroup()

# Add the tf-idf for these words
tf_idf <- words %>% 
 bind_tf_idf(word, Department, n) %>% 
 arrange(desc(tf_idf))

# Get the top 10 most unique words
tf_idf %>% 
  group_by(Department) %>% 
  top_n(10) %>% 
  ungroup() %>% 
  # order by word
  mutate(word = fct_inorder(word)) %>%
  # Plot by tf_idf
  ggplot(aes(x = fct_rev(word), y = tf_idf, fill = Department)) +
  geom_col() +
  guides(fill = FALSE) +
  labs(y = "tf-idf", x = NULL) +
  facet_wrap(~ Department, scales = "free") +
  coord_flip()


Text features

Words are only one of many features of text.

  • Bi-grams, tri-grams…n-grams can help preserve context. While often impractical to model, matching n-grams are a key method for identifying text reuse.
  • We can also capture the number of punctuation marks, hashtags, mentions, etc. using the textfeatures() function:

# For textfeatures() to work, the column with the text in it has to be named text
features <- d %>% 
  rename(text = SUBJECT) %>% 
  ## Don't calculate sentiment because it takes longer. 
  ## Also, don't calculate word2vec dimensions, since these take longer to do and they're kinda weird and uninterpretable. 
  ## Also, don't normalize the final numbers---keep them as raw numbers
  textfeatures(sentiment = FALSE, word_dims = 0, normalize = FALSE) %>% 
  # Add the text back to the data frame since textfeatures wiped it out
  bind_cols(d)
# Look at all these columns you can work with now!
glimpse(features)
## Observations: 57,029
## Variables: 39
## $ ID               <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "1…
## $ n_urls           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_hashtags       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_mentions       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_chars          <int> 98, 182, 108, 98, 129, 122, 99, 125, 71, 100, 1…
## $ n_commas         <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2,…
## $ n_digits         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_exclaims       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_extraspaces    <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,…
## $ n_lowers         <int> 97, 179, 102, 97, 127, 118, 97, 124, 68, 98, 13…
## $ n_lowersp        <dbl> 0.9898990, 0.9836066, 0.9449541, 0.9898990, 0.9…
## $ n_periods        <int> 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1,…
## $ n_words          <int> 17, 27, 18, 12, 19, 18, 16, 21, 12, 17, 22, 22,…
## $ n_caps           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_nonasciis      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_puncts         <int> 0, 2, 3, 0, 1, 2, 1, 0, 0, 1, 1, 1, 2, 1, 2, 6,…
## $ n_capsp          <dbl> 0.010101010, 0.005464481, 0.009174312, 0.010101…
## $ n_charsperword   <dbl> 5.500000, 6.535714, 5.736842, 7.615385, 6.50000…
## $ n_polite         <dbl> -0.50000000, 0.18750000, -0.31250000, 0.0000000…
## $ n_first_person   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_first_personp  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_second_person  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_second_personp <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_third_person   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_tobe           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_prepositions   <int> 2, 5, 2, 1, 2, 2, 2, 3, 1, 2, 4, 3, 2, 1, 5, 3,…
## $ ID1              <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "1…
## $ DATE             <date> 2008-02-25, 2008-01-09, 2008-01-09, 2008-01-08…
## $ year             <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008,…
## $ congress         <dbl> 110, 110, 110, 110, 110, 110, 110, 110, 110, 11…
## $ bioname          <chr> "MICHAUD, Michael H.", "SCHUMER, Charles Ellis …
## $ state            <chr> "maine", "new york", "rhode island", "michigan"…
## $ Party            <chr> "Democratic", "Democratic", "Democratic", "Demo…
## $ department       <chr> "DHHS", "DHHS", "DHHS", "DHHS", "DHHS", "DHHS",…
## $ Department       <chr> "Health and Human Services", "Health and Human …
## $ agency           <chr> "DHHS_HRSA", "DHHS_HRSA", "DHHS_HRSA", "DHHS_HR…
## $ SUBJECT          <chr> "forwards recommendation dr. david hartley to s…
## $ Type             <fct> To be coded, To be coded, 501c3 or Local Gov., …
## $ POLICY_EVENT     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Topic modeling

(Counting words, classifying tokens)

With topic modeling, we first count the words in each document and then use unsupervised Bayesian machine learning to find clusters of words that tend to hang together.

# Pick an agency
selected_agency <- "EPA"

dtm <- d %>%
  # Subset to that agency (the object needs a name different from the column)
  filter(agency == selected_agency) %>% 
  unnest_tokens(word, SUBJECT) %>%
  # Count words per document ID number
  count(ID, word, sort = TRUE) %>%
  # Convert this to a document-term matrix (each row is a document, each column is a term, and the value is the count of words in that document)
  cast_dtm(term = word, document = ID, value = n)

# Find 3 topics (or clusters of words)
lda <- LDA(dtm, k = 3, control = list(seed = 1234))

# Convert the LDA object into a tidy data frame 
# The beta column is essentially a measure of word importance within the
# topic---the higher the number, the more important the word is in the topic
topics <- tidy(lda, matrix = "beta")

The algorithm finds \(k\) clusters of words. Meaning is in the eye of the beholder. We can only look at the most common words in each topic (i.e. the estimated probability that a token drawn from that topic is that word or, more simply, the probability of seeing that word given that topic).

# Here are the most important words in each of the clusters
top_terms <- topics %>%
  filter(!is.na(term)) %>% 
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

# Make a comma-separated list of the top terms in each topic
top_terms %>% 
  group_by(topic) %>% 
  nest(term) %>% 
  mutate(words = data %>% map_chr(~ paste(.$term, collapse = ", "))) %>% 
  select(-data) 
## # A tibble: 3 x 2
##   topic words                                                         
##   <int> <chr>                                                         
## 1     1 for, to, the, application, grant, a, and, support, on, program
## 2     2 of, the, and, a, in, regarding, for, by, writes, b            
## 3     3 the, and, in, of, support, a, on, his, an, behalf
top_terms %>%
 mutate(term = reorder(term, beta)) %>%
 ggplot(aes(term, beta, fill = factor(topic))) +
 geom_col(show.legend = FALSE) +
 labs(x = NULL, y = "LDA beta (word prevelence in topic)") +
 facet_wrap(~ topic, scales = "free") +
 coord_flip()

From “Measuring Change and Influence in Budget Texts” (APSA 2017)

Topic Models

Topic models generally require one to set the number of topics. There are a number of approaches to topic selection, including algorithmic ones, but none are objective. @{Chuang2014} argue that topic modeling is done best with a “human-in-the-loop” approach where researchers review multiple candidate models. However, \(k\) can also be estimated using the methods described in @{Lee2014}, which select the number of topics based on t-distributed stochastic neighbor embedding.
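
A rough way to put the “human-in-the-loop” idea into practice is to fit several candidate values of \(k\) and compare them, reading the topics alongside a fit statistic. A minimal sketch with topicmodels, assuming the document-term matrix dtm built in the tutorial above (perplexity is only one input; more topics almost always fit better):

candidate_k <- c(3, 5, 10, 20)

# Fit one LDA model per candidate number of topics
fits <- lapply(candidate_k, function(k) LDA(dtm, k = k, control = list(seed = 1234)))

# Compare fit; lower perplexity is better, but always read the topics too
data.frame(k = candidate_k,
           perplexity = sapply(fits, perplexity))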

The two core types of topic models, single-member models and mixed-member models, emerge from distinct assumptions about the texts. Single-member models are classical mixture models where each document is assumed to belong to a single topic; they are best suited to classifying texts as belonging to one type or another. Mixed-member models assume that documents are generated from a latent distribution of topics. The most prominent approach to estimating topic distributions is the Latent Dirichlet Allocation (LDA) model.

Mechanics of LDA

In LDA, texts are assumed to be random mixtures of latent topics and topics are distributions of words. Both the distribution of topics over documents and the distribution of words over topics are estimated. Each word in the vocabulary may be associated with multiple topics, but each token in a document is assigned to exactly one topic. Thus, the output includes both the estimated distribution of words in each topic (the content of that topic) and a representation of each document or collection of documents as a vector of topic proportions (the topic content of the documents), which is simply the fraction of the words in each document belonging to each topic.

The main difference between LDA and a classic two-level model common in hierarchical Bayes is that LDA has a latent variable of topics (distributions of words) \(z\) between the beliefs about topic proportions \(\pi\) and observed words.

The document generating process is assumed to have two steps. First, the topic proportions for each document are drawn from a Dirichlet distribution. Second, the topic for each word is drawn from a multinomial distribution reflecting its document’s topic proportions, and the specific word is drawn from the distribution of words in that topic (a toy simulation of this process follows the list below):

  • Draw topic proportions \(\pi_d | \alpha \sim Dir(\alpha).\)
  • For each token \(w_{d,n}\):
    • Draw topic assignment \(z_{d, n} | \pi_d \sim Mult(\pi_d)\)
    • Draw word \(w_{d,n} | z_{d,n}, \beta_{1:T} \sim Mult(\beta_{z_{d,n}}).\)
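
To make the generative story concrete, here is a toy simulation of this process in R. The vocabulary, number of topics, \(\alpha\), and \(\beta\) values are invented for illustration and have nothing to do with the correspondence data above.

set.seed(1234)

vocab <- c("grant", "application", "support", "program", "regulation",
           "permit", "water", "funding", "agency", "constituent")
n_topics <- 2
alpha <- rep(0.5, n_topics)   # prior on per-document topic proportions
beta <- rbind(                # each row: distribution of words for one topic
  c(.25, .20, .15, .15, .05, .05, .02, .08, .03, .02),
  c(.02, .03, .05, .05, .20, .20, .20, .05, .10, .10))

# Draw from a Dirichlet distribution via normalized gamma draws
rdirichlet1 <- function(a) { x <- rgamma(length(a), a); x / sum(x) }

simulate_document <- function(n_tokens = 20) {
  pi_d <- rdirichlet1(alpha)                                     # topic proportions
  z <- sample(1:n_topics, n_tokens, replace = TRUE, prob = pi_d) # topic per token
  words <- sapply(z, function(t) sample(vocab, 1, prob = beta[t, ]))
  paste(words, collapse = " ")
}

simulate_document()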

We observe the total number of unique words (\(w_1,...,w_W\)) in the vocabulary of all documents, and \(w_{d,n}\) is the word observed at the \(n\)th token in document \(d\). All texts are “tokenized” by giving each word a unique index \(n\). If token \(n\) belongs to topic \(z\), then the probability that the token is word \(w\) is the topic-specific probability \(\beta_{z, w}\). At the document level, \(\pi_{d, z}\) is our belief about the proportion of topic \(z\) in document \(d\).

\(T\), \(\alpha\), and \(\beta\) are defined as follows. \(T\) is the number of topics \((z_1,...z_T)\), where \(z_{d, n}\) is the topic assignment of the \(n\)th token in document \(d\); each token comes from exactly one topic. \(\alpha\) is the parameter of the prior on the per-document topic distributions, and \(\beta\) is the parameter of the prior on the per-topic word distributions.

One major problem is that topic estimation in LDA and its extensions can be arbitrary and unstable. Computer scientists have extended the core LDA model, incorporating available information to improve the estimation of meaningful topics. This includes document attributes such as the author, year, or treatment condition; document network structure such as citations; word correlations; the sentiment of words @{Lin2009JointAnalysis}; constraints based on intuitions (“lexical priors”) on the distribution of words over topics; or combinations of these kinds of information.

In one of the most promising steps to help topic models measure influence, @{Chang2009} develop a version of LDA they call a Relational Topic Model. This model includes a binary random variable for each document pair that is conditioned on the latent space that also determines topic proportions. Thus, the document generating process has an extra step:

  • Draw topic proportions \(\pi_d | \alpha \sim Dir(\alpha).\)
  • For each word \(w_{d,n}\):
    • Draw assignment \(z_{d, n} | \pi_d \sim Mult(\pi_d)\)
    • Draw word \(w_{d,n} | z_{d,n}, \beta_{1:K} \sim Mult(\beta_{z_{d,n}}).\)
  • For each pair of documents \(d, d'\): draw \(y_{d,d'} | z_d, z_{d'} \sim \psi(\cdot | z_d, z_{d'})\).

@{Chang2009} demonstrate this model with respect to academic documents and citations, but political influence may be modeled similarly. For example, documents may be linked by one citing the other or plagiarizing the other. While @{Bode2014CandidateMidterms} use multidimensional scaling to explore topics of tweets that use common hashtags, a Relational Topic Model may be an ideal way to model tweet content linked by common hashtags.

Another variation on LDA that may help measure influence is Structural Topic Modeling (STM). Instead of covariates only being used post hoc (estimating effects after naively estimating topics), STM brings information contained in covariates into the topic model by (1) assigning unique priors by covariate value, (2) allowing topics to be correlated, and (3) allowing word use within a topic to vary by covariate values. Instead of \(\pi \sim Dirichlet(\alpha)\), topic proportions can be influenced by covariates \(X\) through a regression model, \(\pi \sim LogisticNormal(X\gamma, \Sigma)\). This helps the model avoid having to develop a categorization scheme from scratch and improves the consistency of estimated covariate effects.
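
For reference, this is roughly what an STM specification looks like with the stm package. The covariates below (Party, year) come from the correspondence data in the tutorial above, but this is an illustrative sketch, not a model estimated in this document.

library(stm)

# Standard stm preprocessing: tokenize, remove stopwords, build the vocabulary
processed <- textProcessor(d$SUBJECT, metadata = d)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Topic prevalence varies by party and (smoothed) year; adding a
# content = ~ Party argument would also let word use within topics vary by party
fit_stm <- stm(documents = out$documents, vocab = out$vocab, K = 20,
               prevalence = ~ Party + s(year),
               data = out$meta)

# Estimate how topic proportions differ across covariate values
prevalence_effects <- estimateEffect(1:20 ~ Party, fit_stm, metadata = out$meta)
summary(prevalence_effects)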

Whereas lexical priors include word-level information and STM includes document-level information, @{Kang2014} propose what they call Hetero-Labeled LDA, which is able to incorporate both document and feature labels, even when only some document and feature types are known.

Finally, a variety of approaches have built on the Dynamic Topic Model proposed by @{Blei2006}. This version allows topic content to change over time. @{Brookhart2015} use a version of this model in their analysis of the emergence of political issues. In their model, an autoregressive latent variable allows the words used to discuss the same issue to change over time.

Computation of LDA

The basic inferential task for LDA is computing the posterior distribution of the latent variables: the topic proportions of each document, \(\pi\), and the topic assignments of tokens, \(z\).

Because the likelihood function contains latent variables, \(p(w|\alpha, \beta)\) cannot be estimated exactly for LDA (unlike many hierarchical models where there are direct links between estimated and observed parameters). A lower bound on the log-likelihood function can be estimated deterministically with variational inference, but many prefer to estimate LDA stochastically with Gibbs sampling, and many extensions require it. Additionally, Gibbs sampling allows the algorithm to jump out of local optima and does not require the assumption that all documents are independent, as variational Bayes does (because the likelihood of seeing the corpus is the product of the likelihood for each document).
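
For reference, the topicmodels package used in the tutorial above supports both strategies. A minimal sketch, with illustrative (untuned) Gibbs control values:

# LDA() defaults to variational EM (VEM); the same model can instead be
# estimated with collapsed Gibbs sampling
lda_vem <- LDA(dtm, k = 3, method = "VEM",
               control = list(seed = 1234))

lda_gibbs <- LDA(dtm, k = 3, method = "Gibbs",
                 control = list(seed = 1234, burnin = 1000, iter = 2000, thin = 100))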

Applications of Topic Models in Political Science

Most studies in the social sciences and humanities using topic models, especially LDA, rely on observational data and focus on descriptive findings, often called text mining.

Single-member models are often used to classify texts in order to create a variable. In one of the most well-known examples, @{Grimmer2010} analyzed over 100,000 press releases to identify policy agendas, characterize representational style, and predict voting behavior. Similarly, @{Quinn2010} use a single-member model to identify the topics of floor speeches, and @{Wilkerson2016} use a range of LDA models to identify meta topics in one-minute floor speeches, finding support for the theory of issue ownership. @{Boussalis2016} describe discourse around climate change denial, focusing on the relationship between topics about politics and topics about science, and on how discourse has changed over time. They find an increased proportion of text focusing on scientific integrity. While not explicitly an independent-dependent variable setup, they conclude that the observed discourse is often a reaction to scientific claims.

While Structural Topic Models (STMs) have the potential to estimate differences in policy content across types of texts, they have not been widely applied in this area. Notable recent contributions on this front include @{Bagozzi2016}, who use an STM to examine how attention to different issues varies over time in State Department reports, and @{Genovese2017SectorsFrom}, who uses an STM to study country positions in international climate change negotiations.

The concept of influence implies a change from a baseline condition or counterfactual. In the case of topic models, this means a significantly different distribution of words than was observed previously or than would be expected given some external information. If the hypothesized cause of change is also observed as a text (e.g. a survey experiment vignette or an interest group letter), influence will often mean that the outcome text changed significantly toward using a distribution of words more similar to the original text.

Model setup and interpretation depend on whether the texts are theorized to each contain a single topic or a mixture of several topics. If texts represent actor positions, single-member models may be used to capture the coalition to which a text belongs, and the distribution of topics from a mixed-member model may represent the distribution of priorities. The difference between influencing which cluster an actor is in and influencing the distribution of topics an actor emphasizes could be seen as a form of the distinction @{Carsey2006} make between changing sides and changing minds.

Text as the DV: Survey Experiments

A few studies using structural topic models have aimed at measuring influence by estimating treatment effects on survey experiment responses. The text of an open response question is the dependent variable. @{Mildenberger2015} find that framing climate change in different ways affects the topics discussed by survey respondents.

In an experimental context, the inference is based on finding a credible difference in word distributions between treatment conditions. A naive mixed-member model can estimate the content of a given number of topics and the differences between treatment conditions on each topic. Treatments are almost always text, but these texts are not generally used in modeling.

In survey experiments, treatment texts (e.g. the text of vignettes) could be used to improve topic estimation and inference in several ways. First, the topic model could include priors over the words in the treatments such that each word is expected to be assigned to the same topic as other words unique to the same condition and not to the same topic as words from different conditions. This would improve model estimation and make it more likely that topics reflect treatments. This may be helpful, for instance, when studying framing effects: if the same words used in the treatment appear in responses, that may indicate successful manipulation or cueing. In other contexts, repeated words may represent uninformative parroting of the vignette.

Text as the DV: Observational Data

Recent scholarship has begun to use mixed-member models to describe political relationships using the relative emphasis of different topics in texts as the dependent variable. @{Genovese2017SectorsFrom} uses a Structural Topic Model to investigate the relationship between statements made by businesses and governments regarding climate change and sectoral levels of pollution and trade. She finds that high-polluting sectors with more exposure to trade are less likely to support international cooperation on climate change, and low-polluting sectors with high exposure to trade are more supportive. Importantly, government statements mirror domestic industry positions, suggesting that industry preferences influence the government’s approach to climate change.

Instead of a topic model, @{Kluver2015} use a two-step process that first classifies interest groups with k-means clustering and then uses multidimensional scaling to identify the latent dimensions that best distinguish them, thus identifying multiple dimensions of disagreement. An alternative approach would be to use a single-member model to identify interest group coalitions and then use a version of LDA incorporating elements of the Structural Topic Model and Relational Topic Model to identify the dimensions of policy disagreement given this coalition structure. Though they do not use topic models, @{Kluver2015} make an important contribution by including the text of the outcome policy in the analysis. As discussed in the final section, placing policy outputs in the space of policy debate (whether measured through multidimensional scaling or latent topics) should be a priority for researchers aiming to estimate influence over policy.
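
To fix ideas, here is a minimal sketch of the cluster-then-scale logic (not @{Kluver2015}’s actual pipeline), using the document-term matrix from the topic modeling section above; the number of clusters and dimensions are illustrative.

# K-means assigns each document to a single cluster; classical
# multidimensional scaling places documents in a low-dimensional space
m <- as.matrix(dtm)
m <- m[rowSums(m) > 0, ]   # drop empty documents
m <- m / rowSums(m)        # normalize counts to word shares

clusters <- kmeans(m, centers = 3, nstart = 10)
coords <- cmdscale(dist(m), k = 2)

plot(coords, col = clusters$cluster,
     xlab = "Dimension 1", ylab = "Dimension 2")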

Political scientists may especially benefit from computer science work developing models that use both the content of texts and networked structure between them. Much of this work thus far has been demonstrated with respect to webpage structure or academic citation networks. @{Neiswanger2014} model Wikipedia content in a latent variable setup (a topic model) with corrections, or weights, based on the content of linked articles to get a better picture of what each article is really about. As noted, @{Chang2009} accomplish a similar task, explicitly building academic citation information into an LDA model, allowing them to predict the distribution of words for a new article based on its citations or its citations based only on its content. This could be especially useful in modeling influence in political networks when we observe some causal processes linking some pairs of documents and aim to estimate other linkages based on observed content. For example, politicians may not always cite news sources by name when policy action is motivated by media attention, but the occasions when a particular source is cited can be seen as a partially observed network of influence. By modeling the content of news stories that are cited and corresponding policy statements, we may be able to infer additional media-politician relationships from the words they use.

Validation

Given the arbitrariness of the number of topics selected and the potential instability in estimation, one key validation step is to demonstrate that topics make sense. There is no guarantee that an unsupervised model will identify meaningful topics. As noted, modeling strategies can help but are still no guarantee that meaningful latent topics will be recovered.

When topic models are used to generate independent variables, interpretation of topic content naturally receives attention. However, if scholarship moves more in the direction of hypothesis testing where theories focus on the similarity, difference, or change in topic distribution between documents, there is a risk that topic content could become hidden behind correlations of topic proportion means. To avoid this, two validation strategies have been suggested: (1) demonstrate that different algorithms produce similar topics and (2) establish that variations in topic emphasis across time or venues correlate with real-world events. @{Beauchamp2017PredictingData} validates his topic model of political opinions on Twitter against opinion polls. @{Quinn2010} validate their classification of Senate speeches against previous findings from hand-coding approaches. @{Wilkerson2016} caution that, because topic selection is arbitrary, reporting and validating a single model is insufficient. They argue that topic selection and validation should explicitly address the robustness of topics across specifications. Beyond validation, they illustrate that focusing on topic robustness can help interpret results by grouping potential topics into robust meta topics. Thus, in addition to interpretability and face validity, reliability across model specifications is an important standard for unsupervised approaches.

Text Reuse

In contrast to classification methods like topic models, recent advances in text reuse and semantic analysis have received less attention in political science. For text reuse, the unit of analysis is a pair of texts. Text reuse methods produce at least one of two statistics: a global alignment score and local alignment scores. Global alignment measures how well two documents align overall and generally aims to reflect the amount of content they share; some global methods are based on word frequencies and others on sequence. Local alignment methods identify, and often score, matching portions of two documents; they are necessarily based on sequence and pick out specific matching sequences of words.

If affecting which words people use is a measure of influence, then affecting the distribution and sequence of words may be even stronger evidence. Thus I focus on local alignment and only briefly discuss global alignment and semantic methods. Additionally, for measuring influence, global alignment has a major disadvantage compared to both topic modeling and local alignment: results are much less informative about the content that is similar. Whereas topic models share the bag-of-words assumption with methods like cosine similarity, they can identify specific distributions of words on which documents are similar or dissimilar, and local alignment methods identify specific sequences of words on which documents are similar or dissimilar. Whereas @{Acree2016} suggest that if cosine similarity is to be used to estimate document similarity, it should be weighted with local alignment, incorporating local alignment into topic modeling strategies could accomplish the same aim with more interpretable results.
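
For comparison, a global bag-of-words measure such as cosine similarity can be computed directly from a document-term matrix. A minimal sketch using the dtm from the topic modeling section above (a sparse-matrix implementation would be needed for a large corpus):

# Pairwise cosine similarity between documents (global, bag-of-words)
m <- as.matrix(dtm)
m <- m[rowSums(m) > 0, ]   # drop empty documents

norms <- sqrt(rowSums(m^2))
cosine_sim <- (m %*% t(m)) / (norms %o% norms)

# The most similar pair of distinct documents
max(cosine_sim[upper.tri(cosine_sim)])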

Text reuse and semantic attributes are usually discovered using deterministic matching. Text reuse methods identify matching strings and often score these matches based on defined criteria, for example, the relative priority a researcher places on the exactness and the length of the matching sequence. Semantic attributes, such as tone or verb tense, are usually captured using dictionary-based methods: words (or occasionally phrases) are compared against a dictionary of terms determined to have that attribute. One may use an existing dictionary or create a custom one for the research purpose, either by hand or with a learning algorithm. For example, @{Acree2014} hand-codes texts into ideological categories and then uses an algorithm to create a dictionary of phrases reflective of each group of texts, allowing him to measure the proportion of different ideological phrases used by candidates.

Matching algorithms loop over strings of words to identify words or sequences of words (or characters) that align with a reference pattern. For dictionary-based methods, this is a simple set operation: a word or phrase is either in the dictionary or it is not. For text reuse methods, this is often much more difficult, because there are usually multiple ways that two texts with shared sequences can be aligned. Algorithms generate alignment scores based on user-defined points and penalties for the length and exactness of a match.

Mechanics of Smith-Waterman Local Alignment Algorithm

Early work on alignment algorithms was advanced in the field of genetic sequencing. One common algorithm was developed by @{Smith1981} for matching gene sequences.

The Smith-Waterman (SW) algorithm’s running score goes up when the next character (or word) is a match and down when it is not. One must set the size of the increase when the next character in a string is a match, the decrease when it is not, the size of the penalty for inserting a gap, and a condition for when the algorithm will terminate. Penalties for mismatches or gaps allow one to be more or less tolerant of differences in local alignments. Finally, if the goal is to identify a subset of the text pairs that match, one must set a minimum threshold score for the inclusion of each local alignment in the output dataset.

Given these values, the SW algorithm identifies the optimal local alignment by selecting the highest score from the various “paths” through a matrix of possible word pairs in the two documents, generally maximizing the inclusion of matching tokens and minimizing the inclusion of mismatches. It does this in a three-step process. First, given two documents of length \(a\) and \(b\), an \((a+1)\) by \((b+1)\) matrix is initialized with its first row and column equal to 0. Second, the matrix is scored based on matches, nonmatches, and gaps. If matches and mismatches are assigned scores of 1 and -1, respectively, the substitution matrix can be described as

\(s(a_i, b_j) = \begin{Bmatrix} +1, a_i = b_j\\ -1, a_i \neq b_j \end{Bmatrix}\)

If the penalty for a gap is \(W\), the matrix \(H\) is filled in such that

\(H_{ij} = \max \begin{Bmatrix} H_{i-1,j-1} + s(a_i, b_j),\\ \max_{k\geq 1} \{H_{i-k,j} - W_k\},\\ \max_{l\geq 1} \{H_{i,j-l} - W_l\},\\ 0 \end{Bmatrix}\).

Finally, a traceback algorithm starts at the highest value in the matrix and follows the path that maximizes the cumulative score until reaching the termination threshold, usually 0. This is illustrated in Figure 1 where orange arrows indicate the path taken by the traceback algorithm from each cell with a positive score to the next cell that maximizes the cumulative score. The first match has a cumulative score of 4 and the second match has a cumulative score of 6.
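
A minimal word-level version of this scoring step in R, using a fixed per-step gap penalty rather than the general \(W_k\) above; the match, mismatch, and gap values are illustrative:

smith_waterman <- function(text_a, text_b, match = 1, mismatch = -1, gap = 1) {
  a <- strsplit(tolower(text_a), "\\s+")[[1]]
  b <- strsplit(tolower(text_b), "\\s+")[[1]]
  H <- matrix(0, nrow = length(a) + 1, ncol = length(b) + 1)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      s <- ifelse(a[i] == b[j], match, mismatch)
      H[i + 1, j + 1] <- max(H[i, j] + s,        # extend the diagonal (match/mismatch)
                             H[i, j + 1] - gap,  # gap in one document
                             H[i + 1, j] - gap,  # gap in the other
                             0)                  # scores never drop below zero
    }
  }
  max(H)  # score of the best local alignment
}

smith_waterman("the agency shall issue a permit within 90 days",
               "the agency must issue a permit within 90 days")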

Computation

Compared to topic modeling approaches, text reuse algorithms tend to involve more straightforward computations, but in greater numbers. Because text alignment algorithms involve multiple operations on all possible word pairs in all possible document pairs, these methods can quickly become computationally expensive for larger numbers of longer documents. Thus, one of the key ways to reduce computational cost is to cut down the number of document pairs considered.

If many documents in a corpus do not share any content, these pairs can be excluded. One way to do this is to filter out pairs that do not meet matching criteria that are easier to compute. For example, @{Wilkerson2015} exclude pairs that do not share at least five 10-grams, greatly reducing the number of pairs considered. As noted, comparing n-grams is a simple set operation and is cheap to compute. Similarly, pairs that do not share a baseline level of similarity can be eliminated by converting text strings to hash codes, whereby duplicate strings are given a common index and are thus easy to detect. @{Collins2015} and @{Eshbaugh-Soha2013} use the popular plagiarism detection program WCopyfind, which uses hashing to detect exact matches between texts; but, unlike Smith-Waterman, the algorithm it uses is only able to skip over non-matching segments of near-identical length and does not attempt to optimize alignment.

Local alignment calculations can also be sped up by requiring that alignments include certain exact n-gram matches. If these anchoring n-gram matches are long enough, this is unlikely to lead to suboptimal matches. @{Wilkerson2015} find that 10-word matches worked well for identifying matching strings in sections of bills.
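
A sketch of this kind of prefilter with tidytext, using the correspondence data from the tutorial above; the five-shared-10-grams threshold follows @{Wilkerson2015}, and everything else is illustrative.

# Prefilter document pairs by shared 10-grams before running a more
# expensive local alignment
tengrams <- d %>%
  unnest_tokens(tengram, SUBJECT, token = "ngrams", n = 10) %>%
  filter(!is.na(tengram)) %>%
  distinct(ID, tengram)

# Self-join on the 10-gram to find document pairs that share them
candidate_pairs <- tengrams %>%
  inner_join(tengrams, by = "tengram", suffix = c("_a", "_b")) %>%
  filter(ID_a < ID_b) %>%
  count(ID_a, ID_b, name = "shared_10grams") %>%
  filter(shared_10grams >= 5)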

Applications of Text Reuse in Political Science

Few studies focus directly on the concept of influence. @{Collins2015} uses WCopyfind to assess the influence of amicus curiae briefs on Supreme Court opinions, and @{Eshbaugh-Soha2013} uses it to measure the influence of White House press conferences on news coverage. These inferences depend on the assumption that text matching the language of amicus briefs or press conferences originated there, which requires information about how those documents are created.

@{Hertel-Fernandez} use text reuse to detect the proposal and adoption of legislative language proposed by the American Legislative Exchange Council (ALEC), a group that drafts model state legislation and advocates for it. They find that states with less legislative professionalism are more likely to introduce and pass bills that contain ALEC language. Because they have external information about the document generating process (they know that ALEC plays a major role in writing and disseminating model legislation to state legislators), @{Hertel-Fernandez} can infer that the similarities they find are evidence of ALEC influence. Even if some state legislators are copying other states, ALEC is influential if it is the originator or disseminator of the policy text, a plausible assumption for many of these texts.

Recent research has begun to incorporate information about the network of relationships among actors, greatly improving the plausibility that text similarity reflects influence. @{Linder2017} find that state legislators with similar voting behavior introduce bills with matching text and that text reuse reflects policy diffusion networks found by other scholars. @{Garrett2015} assess the relative influence of interest groups and early adopting states on policy diffusion using text similarity as an attribute of a network model. They note that “Knowing which states adopted similar policies does not inform our understanding of interest groups’ impact on policy adoption and emulation…studies cannot distinguish the influence of interest group model legislation from the impact of other state and national actors” (pg. 7). Using a network model allows them to measure the centrality of different texts. However, because they use global alignment scores rather than local alignment or topics, their model cannot describe the nature of copied content.