get_cr_df.Rd
Scrape URLs for each subsection of the Congressional Record. As we do so, we retain some helpful metadata.
This function collects the links (`html_nodes("a")`) on the page for each date and section of the record, along with their URLs (`html_attr("href")`) and link text (`html_text()`), and puts them into a data frame. With `map_dfr` from the `purrr` package, it can be applied across a range of dates.
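The link-extraction step described above can be tried offline. The following is a minimal sketch, not the package's own code: the HTML snippet, the "Example article" text, and the `/help` link are made up for illustration, and `minimal_html()` from `rvest` stands in for reading a live page.

```r
library(rvest)
library(tibble)

# Made-up HTML standing in for a daily Congressional Record page (assumption)
page <- minimal_html('
  <a href="/congressional-record/2017/6/6/senate-section/article/S3253-6">Example article</a>
  <a href="/help">Help</a>
')

# Every link on the page
links <- html_nodes(page, "a")

# Link text and absolute URL, one row per link
d <- tibble(
  header = html_text(links),
  url    = paste0("https://www.congress.gov", html_attr(links, "href"))
)

# Keep only links that point to articles
d <- d[grepl("article", d$url), ]
d
```

Only the first link survives the filter, because its URL contains "article".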
get_cr_df(date, section)
date: A vector of dates to scrape in YYYY-MM-DD format
section: "senate-section", "house-section", "all"
A data frame of URLs and metadata for the Congressional Record for specified date(s) and section(s)
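Because each call returns one data frame, results for several dates can be row-bound into a single data frame. The sketch below shows that pattern with `purrr::map_dfr`. To keep it runnable without network access it uses `scrape_stub()`, a hypothetical stand-in for `get_cr_df()` that returns the same four columns; the article ID "S1-1" and the header text are invented placeholders.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for get_cr_df(); returns one made-up row per call
scrape_stub <- function(date, section) {
  tibble(
    header  = "Example article",
    date    = date,
    section = section,
    url     = paste("https://www.congress.gov/congressional-record",
                    gsub("-", "/", date, fixed = TRUE),
                    section, "article/S1-1", sep = "/")
  )
}

# Three consecutive dates in YYYY-MM-DD format
dates <- as.character(seq(as.Date("2017-06-06"), as.Date("2017-06-08"), by = "day"))

# Call the scraper once per date and bind the rows together
cr_urls <- map_dfr(dates, scrape_stub, section = "senate-section")
cr_urls
```

With the real function, the same call shape would be `map_dfr(dates, get_cr_df, section = "senate-section")`, which issues one request to congress.gov per date.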
https://judgelord.github.io/congressionalrecord/
- The record is divided by date
- The record is divided into three sections: Senate, House, Extensions of Remarks (text submitted to the record later)
- Articles are identified by page number ("S3253" is the 3,253rd page of the record, featuring remarks from the Senate)
The Congressional Record has a page for each day: https://www.congress.gov/congressional-record/2017/6/6/senate-section
On this page are URLs for each subsection. These URLs look like this: https://www.congress.gov/congressional-record/2017/6/6/senate-section/article/S3253-6
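The URL patterns above can be built and taken apart with base R. This is an illustrative sketch: `build_cr_url()` and `extract_page()` are hypothetical helper names, not part of the package, and the regular expression assumes article IDs always look like letters, digits, a hyphen, then more digits.

```r
# Build the daily page URL by swapping "-" for "/" in the date
build_cr_url <- function(date, section) {
  paste("https://www.congress.gov/congressional-record",
        gsub("-", "/", date, fixed = TRUE), section, sep = "/")
}

# Pull the page identifier (e.g. "S3253") out of an article URL.
# URLs that do not match the assumed pattern are returned unchanged.
extract_page <- function(url) {
  sub(".*/article/([A-Z]+[0-9]+)-[0-9]+$", "\\1", url)
}

build_cr_url("2017-06-06", "senate-section")
#> "https://www.congress.gov/congressional-record/2017/06/06/senate-section"

extract_page("https://www.congress.gov/congressional-record/2017/6/6/senate-section/article/S3253-6")
#> "S3253"
```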
# example date and section
date <- "2007/03/01"
section <- "senate-section"
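## The function is currently defined as
## (assumes stringr, rvest, and dplyr are attached)
function (date, section)
{
    message(date)
    # Build the daily page URL; dashes in the date become slashes
    url <- str_c("https://www.congress.gov/congressional-record",
        date %>% str_replace_all("-", "/"), section, sep = "/")
    # Collect every link on the page
    pages <- read_html(url) %>% html_nodes("a")
    # Keep link text, date, section, and absolute URL; retain only article links
    d <- tibble(header = html_text(pages), date = date, section = section,
        url = str_c("https://www.congress.gov", html_attr(pages,
            "href"))) %>% filter(url %>% str_detect("article"))
    return(d)
}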