This document covers how to collect, store, and summarize word-level (token-level) information about the relationship among texts. The general approach applies to any word or phrase-level (token-level) attribute (topic, relative frequency, citations, plagiarism), but here, I focus on a simple form of plagiarism detection using matching 10-word phrases.

Just as we can use repeated 10-word phrases to identify change between draft and final rules (see the Volker Rule example), we can use a 10-word (10-gram) moving window to identify which words in a comment is part of a 10-word phrase that also appears in the other comments or the proposed rule.

Building on Casas et al. 2019, I first used this method to detect similarity and change in agency budget justifications and congressional appropriations texts (see my 2017 polmeth poster and APSA paper). I then adapted it to identify coalitions and form letters with repeated text in public comments (SPSA 2019 paper).

Below, I walk through R functions to parse and summarize repeated text, applying them to CFPB’s Payday Loan rule.

The result is information about each word in each comment. Was this word part of a 10-word phrase that also appeared in the NPRM? Was this word part of a 10-word phrase that also appeared in other comments? If so, which ones? What percent of the text of each comment matches each other comment? Computation and data storage are trivial for a few comments but expands exponentially, approximately the square of the number of words in all comments. I use CFPB’s Payday Loan Rule to illustrate how these methods help us examine a rule with a great deal of comments.

Method

I filter out most mass comments (continually improving my methods for detecting mass comments) and focus on text from attachments submitted by identified organizations. I extract the text from these attachments and eliminate files less than 10 KB (the size of a short paragraph, often a failed OCR). Raw text files for CFPB’s Payday Loan Rule are available here.

The comment_tengrams function that I define below requires two inputs:

  1. file paths comment txt files
  2. a link to the NPRM text (to identify NPRM text repeated in comments)

This function also relies on a few custom helper functions to parse rule text, clean text, and parse and match ngrams.

# load required functions from https://github.com/judgelord/rulemaking/blob/master/functions
source(here::here("functions", "xml_rule_text.R"))
source(here::here("functions", "clean_string.R"))
source(here::here("functions", "tengram.R"))
source(here::here("functions", "read_comment.R"))

# a function to parse comments into 10-word phrases and identify matching phrases in other comments or the NPRM
comment_tengrams <- function(nprm, comments){
  
  # read in rule text from federal register
  pr_text <- xml_rule_text(nprm) %>% 
    summarise(text = text %>% clean_string() ) %>%
    unnest_tokens(tengram, text, token = "ngrams", n = 10)  %>% 
    filter(!is.na(tengram))
  
  # filter to file paths ending in txt
  d <- comments %>% 
    filter( str_detect(path, "txt")) %>% 
    # in SQL, CFPB file names are regs_dot_gov_document_id, shortened to document_id for now
    mutate( document_id = path %>% 
              str_remove(".*/")  %>% 
              str_remove("\\..*") 
            )
  
  # parse each document with the read_grams function 
  d %<>% 
    mutate(tengrams = path %>% map(possibly(read_grams, 
                                            otherwise = list(tengram = "404error")
                                            )
                                   ) 
           )
  
  # map each document to all others, including the NPRM
  d %<>% 
    mutate(
      text = tengrams %>% 
        # diff with the NPRM
        map2( list(pr_text$tengram), match_tibble) %>% 
        map(tengram_match_dfr) %>% 
        map(~rename(., nprm_match = match) ) %>% 
        # reassemble text from the first word of each ngram
        map(~mutate(., word = str_extract(ngram, "\\w+") ) ) %>% 
        # drop ngrams to save space 
        map(~select(., -ngram) ),
      # diff with all other comments 
      other_docs = tibble(document_id2 = list(document_id),
                          match = tengrams %>% map(~map2(., d$tengrams, match) %>% map(tengram_match))
                          )
      ) %>% 
    # turn the tibble of lists into a list of tibbles
    group_by(document_id) %>% 
    mutate(other_docs = other_docs %>% purrr::flatten() %>%  as_tibble() %>% list() ) %>% 
    ungroup()
  
  # drop variables that we no longer need
  d %<>%  select(-path, -tengrams)
  
  return(d)
} # end function

Data

agency <- "CFPB"
docket <- "CFPB-2016-0025"
nprm <- "https://www.federalregister.gov/documents/full_text/xml/2016/07/22/2016-13490.xml"

# get txt file names from a directory, here called "comment_text"
comments <- tibble( path = list.files( here::here("comment_text", agency, docket), full.names = T) ) 

# apply comment_tengrams function
d <- comment_tengrams(nprm, 
                      comments %>% top_n(50) # just a few comments for now 
                      )

Output

d
## # A tibble: 50 x 3
##    document_id            text                  reuse            
##    <chr>                  <list>                <list>           
##  1 CFPB-2016-0025-90947-1 <tibble [3,172 × 2]>  <tibble [50 × 2]>
##  2 CFPB-2016-0025-90967-1 <tibble [2,967 × 2]>  <tibble [50 × 2]>
##  3 CFPB-2016-0025-90980-1 <tibble [2,680 × 2]>  <tibble [50 × 2]>
##  4 CFPB-2016-0025-90995-1 <tibble [974 × 2]>    <tibble [50 × 2]>
##  5 CFPB-2016-0025-90999-1 <tibble [1,295 × 2]>  <tibble [50 × 2]>
##  6 CFPB-2016-0025-91049-1 <tibble [4,418 × 2]>  <tibble [50 × 2]>
##  7 CFPB-2016-0025-91053-1 <tibble [2,678 × 2]>  <tibble [50 × 2]>
##  8 CFPB-2016-0025-91058-1 <tibble [4,245 × 2]>  <tibble [50 × 2]>
##  9 CFPB-2016-0025-91130-1 <tibble [18,922 × 2]> <tibble [50 × 2]>
## 10 CFPB-2016-0025-91131-1 <tibble [16,642 × 2]> <tibble [50 × 2]>
## # … with 40 more rows

Each document_id now has two associated data frames:

  1. text contains two columns:
    • word is the document’s first 10,000 words
    • nprm_match is whether the that word is part of a ten-word phrase that is in the NPRM
  2. other_docs contains a data frame for each other comment:
    • document_id2
    • match indicates whether each word in the document,document_id, is part of a ten-word phrase that appears in the second document, document_id2.

1. Comparing comment text to the NPRM

d$text[1:2]
## [[1]]
## # A tibble: 3,172 x 2
##    nprm_phrase word     
##    <lgl>       <chr>    
##  1 FALSE       short    
##  2 FALSE       h        
##  3 FALSE       ome      
##  4 FALSE       mortgage 
##  5 FALSE       lenders  
##  6 FALSE       have     
##  7 FALSE       long     
##  8 FALSE       enjoyed  
##  9 FALSE       favorable
## 10 FALSE       treat    
## # … with 3,162 more rows
## 
## [[2]]
## # A tibble: 2,967 x 2
##    nprm_phrase word       
##    <lgl>       <chr>      
##  1 FALSE       an         
##  2 FALSE       analysis   
##  3 FALSE       of         
##  4 FALSE       pawn       
##  5 FALSE       shops      
##  6 FALSE       legislation
##  7 FALSE       series     
##  8 FALSE       overview   
##  9 FALSE       of         
## 10 FALSE       article    
## # … with 2,957 more rows
nprm_percent_match <- d %>% 
  select(-other_docs) %>% # select just the doc name and text table (otherwise unnest duplicates the other_docs table for every observation)
  unnest(text) %>%
  group_by(document_id) %>% 
  summarise(percent_match_with_NPRM = sum(nprm_match)/n() ) 
nprm_percent_match %>% 
  mutate(document_id = document_id %>% str_remove(".*0025-") ) %>% 
  ggplot() + 
  aes(x = document_id, y = percent_match_with_NPRM) + 
  geom_col() + 
  theme(panel.border = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 1))

The attachment 9521-1 is several pages of the NPRM. Because it was read in from the PDF attachment, only 80% of these tengrams appeared in the text of the NPRM read directly from the federal register’s XML.

Comments 95976 and 95977 were duplicate uploads of an email containing the text of a different NPRM, which had ~2% overlap with this NPRM.

Comment 90980 was a policy brief from the LBJ School of Public Affairs, which contained text that also appeared in the NPRM:

d %>% 
  filter(document_id == "CFPB-2016-0025-90980-1") %>% 
  select(text) %>% 
  unnest() %>% 
  mutate(word = ifelse(nprm_match == T & lead(nprm_match) == F, str_c(word, "..."), word) ) %>%
  filter(nprm_match) %>%
  summarise(matching_text = word %>% paste(collapse = " ") %>% {str_c("\"...", ., "\"")} ) %>% 
  kable()
matching_text
“…pew charitable trusts payday lending in america who borrows where they borrow and why… a complex portrait an examination of small dollar credit consumers…”

These 24 words come from two citations, each at least 10-words long, that were also cited in the NPRM:

  1. Pew Charitable Trusts. Payday Lending in America: Who Borrows, Where They Borrow, and Why. Washington, DC, 2012. 6.
  2. Levy, Rob, and Joshua Sledge. A Complex Portrait: An Examination of Small-Dollar Credit Consumers. Chicago: Center for Financial Services Innovation, 2012. 11.

The ability to quickly recall shared text is a key advantage of storing the output in this format.

2. Text reuse across comments

Just as nprm_match indicates whether each word in a comment was part of a tengram that also appeared in the NPRM, the column match indicates whether each word was part of a tengram that also appeared in each other comment in the data.

d$other_docs[1:2]
## [[1]]
## # A tibble: 50 x 2
##    document_id2           match        
##    <chr>                  <list>       
##  1 CFPB-2016-0025-90947-1 <lgl [3,171]>
##  2 CFPB-2016-0025-90967-1 <lgl [3,171]>
##  3 CFPB-2016-0025-90980-1 <lgl [3,171]>
##  4 CFPB-2016-0025-90995-1 <lgl [3,171]>
##  5 CFPB-2016-0025-90999-1 <lgl [3,171]>
##  6 CFPB-2016-0025-91049-1 <lgl [3,171]>
##  7 CFPB-2016-0025-91053-1 <lgl [3,171]>
##  8 CFPB-2016-0025-91058-1 <lgl [3,171]>
##  9 CFPB-2016-0025-91130-1 <lgl [3,171]>
## 10 CFPB-2016-0025-91131-1 <lgl [3,171]>
## # … with 40 more rows
## 
## [[2]]
## # A tibble: 50 x 2
##    document_id2           match        
##    <chr>                  <list>       
##  1 CFPB-2016-0025-90947-1 <lgl [2,967]>
##  2 CFPB-2016-0025-90967-1 <lgl [2,967]>
##  3 CFPB-2016-0025-90980-1 <lgl [2,967]>
##  4 CFPB-2016-0025-90995-1 <lgl [2,967]>
##  5 CFPB-2016-0025-90999-1 <lgl [2,967]>
##  6 CFPB-2016-0025-91049-1 <lgl [2,967]>
##  7 CFPB-2016-0025-91053-1 <lgl [2,967]>
##  8 CFPB-2016-0025-91058-1 <lgl [2,967]>
##  9 CFPB-2016-0025-91130-1 <lgl [2,967]>
## 10 CFPB-2016-0025-91131-1 <lgl [2,967]>
## # … with 40 more rows
comment_percent_match <- d %>% 
  top_n(50, document_id) %>% 
  select(-text) %>% # select just the doc name and other_docs table (otherwise unnest duplicates the text table for every observation)
  unnest(other_docs) %>%             # unnest other_docs tibble
  unnest(document_id2, match) %>% # unnest document_id and match lists 
  group_by(document_id, document_id2) %>% 
  summarise(percent_match = sum(match)/n() )
comment_percent_match %>% 
  mutate(document_id = document_id %>% str_remove(".*0025-"), 
         document_id2 = document_id2 %>% str_remove(".*0025-") ) %>% 
  ggplot() +
  aes(x = document_id, 
      y = document_id2, 
      fill = percent_match) + 
  geom_tile(color = "grey") + 
  scale_fill_gradient(low = "white", high = "black") + 
  theme(panel.grid = element_blank(),
        panel.border = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 1))

Notice that 100% of the words from CFPB-2016-0025-95976 (pdf, txt ) are part of some tengram that also appears in CFPB-2016-0025-95977 (pdf, txt) because the exact same comment was uploaded twice.

Comments 91130 through 91156 are all partial or exact matches. Most are part of a mass comment campaign by Access Financial. The percent of the identical text is lower than many mass-comment campaigns because these are hand-written comments, but the n-gram method still picks up overlap in the OCRed text in the header and footer.

Partial matches are not always exactly symmetrical; the percent of comment x in comment y may not be the same as the percent of y in x. For example, 100% of A is in B, but only 50% of B is in A.

A. “While I commend the Bureau’s efforts to protect consumers, the proposed protections could be stronger.”

B. “While I commend the Bureau’s efforts to protect consumers, the proposed protections could be stronger. As part of its mandate, the Bureau has authority to prevent unfair, deceptive, and abusive acts or practices.”

Minimize the size of stored output

The aim here is to identify standard data formats to allow for plug-and-play analysis.

Token matches stored as Run Length Encoding take up about half the space as a logical vector of token matches and can easily be converted back. Download the RLE version of these data here.

Here is an example of what matches look like when stored as RLE, how to convert them back to a logical vector, and ultimately, the matching text:

# functions to convert between logical and run-length encoding
lgl_to_rle <- . %>% mutate(match = map(match, rle)) 

rle_to_lgl <- . %>% mutate(match = map(match, inverse.rle)) 

# put the data above into RLE 
d_rle <- d %>% 
  mutate(other_docs = map(other_docs, lgl_to_rle)) 

# back out logical format
d_lgl <- d_rle %>% 
  mutate(other_docs = map(other_docs, rle_to_lgl)) 

# compare object_size()
object_size(d_rle)
## 6.49 MB
object_size(d_lgl)
## 65.4 MB
# compare file size on disk 
save(d_rle, file =  here::here("data", "payday_comments_rle.Rdata"))
save(d_lgl, file = here::here("data", "payday_comments_lgl.Rdata"))

# inspect comment 91153 matched with 91154
example_rle <- d_rle %>% 
  filter(document_id == "CFPB-2016-0025-91153-1") %>% 
  unnest(other_docs) %>% 
  filter(document_id2 == "CFPB-2016-0025-91154-1") %>% 
  flatten()

# matching sequences as rle
example_rle$match
## Run Length Encoding
##   lengths: int [1:259] 2 13 43 19 2 17 7 19 167 19 ...
##   values : logi [1:259] FALSE TRUE FALSE TRUE FALSE TRUE ...
# matching sequences as logical 
example_rle$match %>% inverse.rle() %>% head()
## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE
# words in CFPB-2016-0025-91153 where match is TRUE 
example_rle$text$word[example_rle$match %>% inverse.rle()] %>% head(10) 
##  [1] "i"          "docket"     "no"         "cfp"        "consumer"  
##  [6] "financial"  "protection" "bureau"     "g"          "street"