Scrape URLs for each subsection of the Congressional Record. As we do so, we retain some helpful metadata.

This function puts the link text (`html_text()`) and URL (`html_attr("href")`) of each link (`html_nodes("a")`) for each date and each section of the record into a data frame. With `map_dfr` from the `purrr` package, it can be applied across a range of dates.
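For example, a sketch of mapping the function over a short range of dates (assuming `get_cr_df()`, as defined under Examples below, is loaded along with its dependencies; the date range here is illustrative):

```r
library(purrr)

# an illustrative week of dates, formatted YYYY-MM-DD as the function expects
dates <- as.character(seq(as.Date("2017-06-05"), as.Date("2017-06-09"), by = "day"))

# map_dfr() calls get_cr_df() once per date and row-binds the results
# into a single data frame of URLs and metadata
cr <- map_dfr(dates, get_cr_df, section = "senate-section")
```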

Usage

get_cr_df(date, section)

Arguments

date

A vector of dates to scrape in YYYY-MM-DD format

section

One of "senate-section", "house-section", or "all"

Details

Value

A data frame of URLs and metadata for the Congressional Record for the specified date(s) and section(s)

References

https://judgelord.github.io/congressionalrecord/

Author

Devin Judge-Lord

Note

- The record is divided by date
- The record is divided into three sections: Senate, House, and Extensions of Remarks (text submitted to the record later)
- Page numbers are prefixed by chamber ("S3253" is the 3,253rd page of the record, featuring remarks from the Senate)

The Congressional Record has a page for each day: https://www.congress.gov/congressional-record/2017/6/6/senate-section

On this page are URLs for each subsection. These URLs look like this: https://www.congress.gov/congressional-record/2017/6/6/senate-section/article/S3253-6
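As a sketch, the page identifier at the end of a subsection URL can be pulled out with a regular expression (using the example URL above; the assumption that identifiers begin with S, H, or E for the Senate, House, and Extensions of Remarks sections follows the page-number convention noted above):

```r
library(stringr)

url <- "https://www.congress.gov/congressional-record/2017/6/6/senate-section/article/S3253-6"

# match a chamber prefix, page number, and article number at the end of the URL
page_id <- str_extract(url, "[SHE][0-9]+-[0-9]+$")  # "S3253-6"
chamber <- str_sub(page_id, 1, 1)                   # "S" = Senate
```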

See also

Examples

# example date and section
date <- "2007-03-01"
section <- "senate-section"

## The function is currently defined as
library(rvest)    # read_html(), html_nodes(), html_attr(), html_text()
library(dplyr)    # tibble(), filter(), %>%
library(stringr)  # str_c(), str_replace_all(), str_detect()

get_cr_df <- function(date, section) {
    message(date)
    # build the daily section URL, e.g.
    # https://www.congress.gov/congressional-record/2017/6/6/senate-section
    url <- str_c("https://www.congress.gov/congressional-record",
        date %>% str_replace_all("-", "/"), section, sep = "/")
    # collect all links on the page
    pages <- read_html(url) %>% html_nodes("a")
    # keep only links to articles (subsections of the record)
    d <- tibble(header = html_text(pages), date = date, section = section,
        url = str_c("https://www.congress.gov", html_attr(pages, "href"))) %>%
      filter(url %>% str_detect("article"))
    return(d)
}
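With the example `date` and `section` above, a call would look like the following (network access to congress.gov required; column names follow the definition above):

```r
# scrape one day's Senate section; returns a data frame with
# columns header, date, section, and url
d <- get_cr_df(date, section)
```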