rvest
In addition to tidyverse packages, this example uses the rvest, magrittr, pdftools, and here packages.
library(tidyverse)
library(rvest)
library(magrittr)
library(pdftools)
library(here)
The aim is to make a table with both the text of letters and information from the FERC website, such as the date the document was received.
Here are the major steps:
Scrape the table of document metadata
Extract the links to download the files
Convert the pdfs to text and add this text to the table
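As a rough illustration of these three steps, here is a minimal sketch that runs on a tiny inline HTML page rather than a real FERC results page. The table layout, the column names, and the helper `pdf_to_text()` are assumptions made up for this example; the actual FERC markup will differ and will likely need more specific selectors.

```r
library(rvest)

# Hypothetical stand-in for one saved results page (not real FERC markup)
page <- minimal_html('
  <table>
    <tr><th>Date</th><th>Description</th><th>Files</th></tr>
    <tr><td>2018-01-02</td><td>Letter A</td><td><a href="files/a.pdf">PDF</a></td></tr>
    <tr><td>2018-01-03</td><td>Letter B</td><td><a href="files/b.pdf">PDF</a></td></tr>
  </table>')

# Step 1: scrape the table of document metadata
metadata <- page %>%
  html_element("table") %>%
  html_table()

# Step 2: extract the links to the files (one link per row in this toy page)
links <- page %>%
  html_elements("a") %>%
  html_attr("href")
metadata$link <- links

# Step 3: hypothetical helper that collapses the pages of one pdf into a
# single string; it is defined here but not called, since no pdf is bundled
pdf_to_text <- function(path) {
  paste(pdftools::pdf_text(path), collapse = "\n")
}
```

With real downloaded files, something like `metadata$text <- sapply(local_paths, pdf_to_text)` would add the text column, where `local_paths` is a hypothetical vector of downloaded pdf paths.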
Some challenges:
FERC’s database can only be accessed in an interactive .asp session where the URL does not vary.
- SOLUTION: I just downloaded the raw HTML for the results pages. One could use the navigation functions of rvest or selenium, but with only ~30 pages, it was faster to click through and download them.
tidyr’s gather() makes the table into one row per file; spread() then reverses this after merging in the text from the pdfs.
Note: The R Markdown file that made this document is here, and you can download an example .htm file here (put it in a folder “FERC/html” or change the file path arguments below to match the location of your .htm files).
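The gather() and spread() reshaping mentioned above can be sketched on toy data. The column names (`doc_id`, `file_1`, `file_2`) and the fake text column are assumptions for illustration only, not the structure of the real FERC table.

```r
library(dplyr)
library(tidyr)

# Hypothetical metadata: one row per document, up to two files each
docs <- tibble(
  doc_id = c("A", "B"),
  file_1 = c("a1.pdf", "b1.pdf"),
  file_2 = c("a2.pdf", NA)
)

# gather(): one row per file, dropping documents' missing files
long <- docs %>%
  gather(key = "file", value = "path", file_1, file_2, na.rm = TRUE)

# Stand-in for the pdf text that would be merged in at this point
long <- long %>%
  mutate(text = paste("text of", path))

# spread(): back to one row per document, now holding text instead of paths
wide <- long %>%
  select(-path) %>%
  spread(key = file, value = text)
```

In newer versions of tidyr, pivot_longer() and pivot_wider() are the recommended replacements for gather() and spread(), though both older functions still work.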
In this case, the HTML is search results from the FERC elibrary saved in a folder named “FERC/html”.
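# pattern is a regular expression; escape the dot and anchor the end so only .htm/.html files match
web_pages <- list.files(here("FERC", "html"), pattern = "\\.html?$")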
# web_pages <- web_pages[1] ## for testing the whole function
# web_page <- web_pages[1] ## for testing parts of the function
web_pages
## [1] "1.htm"
html <- read_html(here("FERC", "html", web_page))
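To show how reading one saved page might extend to a whole folder, here is a small self-contained sketch. It writes two toy .htm files to a temporary directory (a stand-in for “FERC/html”) and reads each one; the helper `read_results()` and the one-column table are assumptions for this example only.

```r
library(rvest)

# Hypothetical folder standing in for "FERC/html"
html_dir <- file.path(tempdir(), "ferc_html_demo")
dir.create(html_dir, showWarnings = FALSE)

# Two toy results pages plus one file that should be ignored
toy_page <- function(date) {
  paste0("<html><body><table><tr><th>Date</th></tr><tr><td>",
         date, "</td></tr></table></body></html>")
}
writeLines(toy_page("2018-01-02"), file.path(html_dir, "1.htm"))
writeLines(toy_page("2018-01-03"), file.path(html_dir, "2.htm"))
writeLines("not a page", file.path(html_dir, "notes.txt"))

# full.names = TRUE returns paths that can be passed straight to read_html()
web_pages <- list.files(html_dir, pattern = "\\.html?$", full.names = TRUE)

# Hypothetical helper: read one saved page and return its first table
read_results <- function(path) {
  read_html(path) %>%
    html_element("table") %>%
    html_table()
}

# One combined table with a row for every result on every page
all_results <- dplyr::bind_rows(lapply(web_pages, read_results))
```

Wrapping the per-page work in a function like this makes it easy to test on a single file first, in the spirit of the commented-out testing lines above.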
html_table
to turn "