This document covers how to collect, store, and summarize word-level (token-level) information about the relationship among texts. The general approach applies to any word or phrase-level (token-level) attribute (topic, relative frequency, citations, plagiarism), but here, I focus on a simple form of plagiarism detection using matching 10-word phrases.
Just as we can use repeated 10-word phrases to identify change between draft and final rules (see the Volker Rule example), we can use a 10-word (10-gram) moving window to identify which words in a comment is part of a 10-word phrase that also appears in the other comments or the proposed rule.
Building on Casas et al. 2019, I first used this method to detect similarity and change in agency budget justifications and congressional appropriations texts (see my 2017 polmeth poster and APSA paper). I then adapted it to identify coalitions and form letters with repeated text in public comments (SPSA 2019 paper).
Below, I walk through R functions to parse and summarize repeated text, applying them to CFPB’s Payday Loan rule.
The result is information about each word in each comment. Was this word part of a 10-word phrase that also appeared in the NPRM? Was this word part of a 10-word phrase that also appeared in other comments? If so, which ones? What percent of the text of each comment matches each other comment? Computation and data storage are trivial for a few comments but expands exponentially, approximately the square of the number of words in all comments. I use CFPB’s Payday Loan Rule to illustrate how these methods help us examine a rule with a great deal of comments.
I filter out most mass comments (continually improving my methods for detecting mass comments) and focus on text from attachments submitted by identified organizations. I extract the text from these attachments and eliminate files less than 10 KB (the size of a short paragraph, often a failed OCR). Raw text files for CFPB’s Payday Loan Rule are available here.
The comment_tengrams
function that I define below requires two inputs:
This function also relies on a few custom helper functions to parse rule text, clean text, and parse and match ngrams.
# load required functions from https://github.com/judgelord/rulemaking/blob/master/functions
source(here::here("functions", "xml_rule_text.R"))
source(here::here("functions", "clean_string.R"))
source(here::here("functions", "tengram.R"))
source(here::here("functions", "read_comment.R"))
# a function to parse comments into 10-word phrases and identify matching phrases in other comments or the NPRM
function(nprm, comments){
comment_tengrams <-
# read in rule text from federal register
xml_rule_text(nprm) %>%
pr_text <- summarise(text = text %>% clean_string() ) %>%
unnest_tokens(tengram, text, token = "ngrams", n = 10) %>%
filter(!is.na(tengram))
# filter to file paths ending in txt
comments %>%
d <- filter( str_detect(path, "txt")) %>%
# in SQL, CFPB file names are regs_dot_gov_document_id, shortened to document_id for now
mutate( document_id = path %>%
str_remove(".*/") %>%
str_remove("\\..*")
)
# parse each document with the read_grams function
%<>%
d mutate(tengrams = path %>% map(possibly(read_grams,
otherwise = list(tengram = "404error")
)
)
)
# map each document to all others, including the NPRM
%<>%
d mutate(
text = tengrams %>%
# diff with the NPRM
map2( list(pr_text$tengram), match_tibble) %>%
map(tengram_match_dfr) %>%
map(~rename(., nprm_match = match) ) %>%
# reassemble text from the first word of each ngram
map(~mutate(., word = str_extract(ngram, "\\w+") ) ) %>%
# drop ngrams to save space
map(~select(., -ngram) ),
# diff with all other comments
other_docs = tibble(document_id2 = list(document_id),
match = tengrams %>% map(~map2(., d$tengrams, match) %>% map(tengram_match))
)%>%
) # turn the tibble of lists into a list of tibbles
group_by(document_id) %>%
mutate(other_docs = other_docs %>% purrr::flatten() %>% as_tibble() %>% list() ) %>%
ungroup()
# drop variables that we no longer need
%<>% select(-path, -tengrams)
d
return(d)
# end function }
"CFPB"
agency <- "CFPB-2016-0025"
docket <- "https://www.federalregister.gov/documents/full_text/xml/2016/07/22/2016-13490.xml"
nprm <-
# get txt file names from a directory, here called "comment_text"
tibble( path = list.files( here::here("comment_text", agency, docket), full.names = T) )
comments <-
# apply comment_tengrams function
comment_tengrams(nprm,
d <-%>% top_n(50) # just a few comments for now
comments )
d
## # A tibble: 50 x 3
## document_id text reuse
## <chr> <list> <list>
## 1 CFPB-2016-0025-90947-1 <tibble [3,172 × 2]> <tibble [50 × 2]>
## 2 CFPB-2016-0025-90967-1 <tibble [2,967 × 2]> <tibble [50 × 2]>
## 3 CFPB-2016-0025-90980-1 <tibble [2,680 × 2]> <tibble [50 × 2]>
## 4 CFPB-2016-0025-90995-1 <tibble [974 × 2]> <tibble [50 × 2]>
## 5 CFPB-2016-0025-90999-1 <tibble [1,295 × 2]> <tibble [50 × 2]>
## 6 CFPB-2016-0025-91049-1 <tibble [4,418 × 2]> <tibble [50 × 2]>
## 7 CFPB-2016-0025-91053-1 <tibble [2,678 × 2]> <tibble [50 × 2]>
## 8 CFPB-2016-0025-91058-1 <tibble [4,245 × 2]> <tibble [50 × 2]>
## 9 CFPB-2016-0025-91130-1 <tibble [18,922 × 2]> <tibble [50 × 2]>
## 10 CFPB-2016-0025-91131-1 <tibble [16,642 × 2]> <tibble [50 × 2]>
## # … with 40 more rows
Each document_id
now has two associated data frames:
text
contains two columns:
word
is the document’s first 10,000 wordsnprm_match
is whether the that word is part of a ten-word phrase that is in the NPRMother_docs
contains a data frame for each other comment:
document_id2
match
indicates whether each word in the document,document_id
, is part of a ten-word phrase that appears in the second document, document_id2
.$text[1:2] d
## [[1]]
## # A tibble: 3,172 x 2
## nprm_phrase word
## <lgl> <chr>
## 1 FALSE short
## 2 FALSE h
## 3 FALSE ome
## 4 FALSE mortgage
## 5 FALSE lenders
## 6 FALSE have
## 7 FALSE long
## 8 FALSE enjoyed
## 9 FALSE favorable
## 10 FALSE treat
## # … with 3,162 more rows
##
## [[2]]
## # A tibble: 2,967 x 2
## nprm_phrase word
## <lgl> <chr>
## 1 FALSE an
## 2 FALSE analysis
## 3 FALSE of
## 4 FALSE pawn
## 5 FALSE shops
## 6 FALSE legislation
## 7 FALSE series
## 8 FALSE overview
## 9 FALSE of
## 10 FALSE article
## # … with 2,957 more rows
d %>%
nprm_percent_match <- select(-other_docs) %>% # select just the doc name and text table (otherwise unnest duplicates the other_docs table for every observation)
unnest(text) %>%
group_by(document_id) %>%
summarise(percent_match_with_NPRM = sum(nprm_match)/n() )
%>%
nprm_percent_match mutate(document_id = document_id %>% str_remove(".*0025-") ) %>%
ggplot() +
aes(x = document_id, y = percent_match_with_NPRM) +
geom_col() +
theme(panel.border = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 1))
The attachment 9521-1 is several pages of the NPRM. Because it was read in from the PDF attachment, only 80% of these tengrams appeared in the text of the NPRM read directly from the federal register’s XML.
Comments 95976 and 95977 were duplicate uploads of an email containing the text of a different NPRM, which had ~2% overlap with this NPRM.
Comment 90980 was a policy brief from the LBJ School of Public Affairs, which contained text that also appeared in the NPRM:
%>%
d filter(document_id == "CFPB-2016-0025-90980-1") %>%
select(text) %>%
unnest() %>%
mutate(word = ifelse(nprm_match == T & lead(nprm_match) == F, str_c(word, "..."), word) ) %>%
filter(nprm_match) %>%
summarise(matching_text = word %>% paste(collapse = " ") %>% {str_c("\"...", ., "\"")} ) %>%
kable()
matching_text |
---|
“…pew charitable trusts payday lending in america who borrows where they borrow and why… a complex portrait an examination of small dollar credit consumers…” |
These 24 words come from two citations, each at least 10-words long, that were also cited in the NPRM:
The ability to quickly recall shared text is a key advantage of storing the output in this format.
Just as nprm_match
indicates whether each word in a comment was part of a tengram that also appeared in the NPRM, the column match
indicates whether each word was part of a tengram that also appeared in each other comment in the data.
$other_docs[1:2] d
## [[1]]
## # A tibble: 50 x 2
## document_id2 match
## <chr> <list>
## 1 CFPB-2016-0025-90947-1 <lgl [3,171]>
## 2 CFPB-2016-0025-90967-1 <lgl [3,171]>
## 3 CFPB-2016-0025-90980-1 <lgl [3,171]>
## 4 CFPB-2016-0025-90995-1 <lgl [3,171]>
## 5 CFPB-2016-0025-90999-1 <lgl [3,171]>
## 6 CFPB-2016-0025-91049-1 <lgl [3,171]>
## 7 CFPB-2016-0025-91053-1 <lgl [3,171]>
## 8 CFPB-2016-0025-91058-1 <lgl [3,171]>
## 9 CFPB-2016-0025-91130-1 <lgl [3,171]>
## 10 CFPB-2016-0025-91131-1 <lgl [3,171]>
## # … with 40 more rows
##
## [[2]]
## # A tibble: 50 x 2
## document_id2 match
## <chr> <list>
## 1 CFPB-2016-0025-90947-1 <lgl [2,967]>
## 2 CFPB-2016-0025-90967-1 <lgl [2,967]>
## 3 CFPB-2016-0025-90980-1 <lgl [2,967]>
## 4 CFPB-2016-0025-90995-1 <lgl [2,967]>
## 5 CFPB-2016-0025-90999-1 <lgl [2,967]>
## 6 CFPB-2016-0025-91049-1 <lgl [2,967]>
## 7 CFPB-2016-0025-91053-1 <lgl [2,967]>
## 8 CFPB-2016-0025-91058-1 <lgl [2,967]>
## 9 CFPB-2016-0025-91130-1 <lgl [2,967]>
## 10 CFPB-2016-0025-91131-1 <lgl [2,967]>
## # … with 40 more rows
d %>%
comment_percent_match <- top_n(50, document_id) %>%
select(-text) %>% # select just the doc name and other_docs table (otherwise unnest duplicates the text table for every observation)
unnest(other_docs) %>% # unnest other_docs tibble
unnest(document_id2, match) %>% # unnest document_id and match lists
group_by(document_id, document_id2) %>%
summarise(percent_match = sum(match)/n() )
%>%
comment_percent_match mutate(document_id = document_id %>% str_remove(".*0025-"),
document_id2 = document_id2 %>% str_remove(".*0025-") ) %>%
ggplot() +
aes(x = document_id,
y = document_id2,
fill = percent_match) +
geom_tile(color = "grey") +
scale_fill_gradient(low = "white", high = "black") +
theme(panel.grid = element_blank(),
panel.border = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 1))
Notice that 100% of the words from CFPB-2016-0025-95976 (pdf, txt ) are part of some tengram that also appears in CFPB-2016-0025-95977 (pdf, txt) because the exact same comment was uploaded twice.
Comments 91130 through 91156 are all partial or exact matches. Most are part of a mass comment campaign by Access Financial. The percent of the identical text is lower than many mass-comment campaigns because these are hand-written comments, but the n-gram method still picks up overlap in the OCRed text in the header and footer.
Partial matches are not always exactly symmetrical; the percent of comment x in comment y may not be the same as the percent of y in x. For example, 100% of A is in B, but only 50% of B is in A.
A. “While I commend the Bureau’s efforts to protect consumers, the proposed protections could be stronger.”
B. “While I commend the Bureau’s efforts to protect consumers, the proposed protections could be stronger. As part of its mandate, the Bureau has authority to prevent unfair, deceptive, and abusive acts or practices.”
The aim here is to identify standard data formats to allow for plug-and-play analysis.
Token matches stored as Run Length Encoding take up about half the space as a logical vector of token matches and can easily be converted back. Download the RLE version of these data here.
Here is an example of what matches look like when stored as RLE, how to convert them back to a logical vector, and ultimately, the matching text:
# functions to convert between logical and run-length encoding
. %>% mutate(match = map(match, rle))
lgl_to_rle <-
. %>% mutate(match = map(match, inverse.rle))
rle_to_lgl <-
# put the data above into RLE
d %>%
d_rle <- mutate(other_docs = map(other_docs, lgl_to_rle))
# back out logical format
d_rle %>%
d_lgl <- mutate(other_docs = map(other_docs, rle_to_lgl))
# compare object_size()
object_size(d_rle)
## 6.49 MB
object_size(d_lgl)
## 65.4 MB
# compare file size on disk
save(d_rle, file = here::here("data", "payday_comments_rle.Rdata"))
save(d_lgl, file = here::here("data", "payday_comments_lgl.Rdata"))
# inspect comment 91153 matched with 91154
d_rle %>%
example_rle <- filter(document_id == "CFPB-2016-0025-91153-1") %>%
unnest(other_docs) %>%
filter(document_id2 == "CFPB-2016-0025-91154-1") %>%
flatten()
# matching sequences as rle
$match example_rle
## Run Length Encoding
## lengths: int [1:259] 2 13 43 19 2 17 7 19 167 19 ...
## values : logi [1:259] FALSE TRUE FALSE TRUE FALSE TRUE ...
# matching sequences as logical
$match %>% inverse.rle() %>% head() example_rle
## [1] FALSE FALSE TRUE TRUE TRUE TRUE
# words in CFPB-2016-0025-91153 where match is TRUE
$text$word[example_rle$match %>% inverse.rle()] %>% head(10) example_rle
## [1] "i" "docket" "no" "cfp" "consumer"
## [6] "financial" "protection" "bureau" "g" "street"