This example uses the rvest, stringr, purrr, here and magrittr packages.

library(tidyverse)
library(rvest)
library(magrittr)
library(here)

The aim is to get the names of oil, gas, electric, and hydropower companies that lobby FERC from the FERC website and match them to (1) letters Members of Congress wrote to FERC and (2) corporate and executive campaign contributions from DIME.

Note: The R Markdown file that made this document is here.

Get corporations and corporate officials from the FERC website

1: Instpect the URLs and source html

(in Chrome: view -> developer -> view source)

Here is the first page under the “Oil Matters” tab, oil companies with “A” names:

html <- read_html("https://www.ferc.gov/docs-filing/corp-off/oil.asp") 

html %>% 
  html_nodes("strong") %>% # company names appear with the HTML tag "<strong>"
  html_text() %>% # to plain text
  .[1:6] # the first 3 rows seem to be web page headers

## [1] "(Revised 10/16/2019)A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z "
## [2] "(Revised 10/16/2019)"                                                                                                      
## [3] "A "                                                                                                                        
## [4] "Alpine Transportation Company"                                                                                             
## [5] "Anadarko Gulf of Mexico Pipeline, Inc."                                                                                    
## [6] "Anschutz Ranch East Pipeline, Inc."

Gas companies with “A” names. Notice how the URL changed:

html <- read_html("https://www.ferc.gov/docs-filing/corp-off/gas.asp") 

html %>% 
  html_nodes("strong") %>%
  html_text() %>% 
  .[4:6]

## [1] "A "                          "Acacia Natural Gas, L.L.C." 
## [3] "Acadian Gas Pipeline System"

Gas companies with “B” names. Notice how the URL changed:

html <- read_html("https://www.ferc.gov/docs-filing/corp-off/gas/gas-B.asp") 

# "strong" is bold in HTML
# in this case bolded text is company names
html %>% 
  html_nodes("strong") %>% 
  html_text() %>%
  .[4:6]

## [1] "Barclays Bank PLC"                 
## [2] "Beacon Generating LLC"             
## [3] "Bear Creek Storage Company, L.L.C."

In addition to “oil” and “gas,” we have “electric” and “hydro” with pages A-Z (4 x 26 = 104 pages!), so let’s make a function to grab the names from each page.

2. Make a function, turning things that vary into variables.

A test run using str_c to combine strings:

industry <- "gas"
letter <- "B"
str_c("https://www.ferc.gov/docs-filing/corp-off/", industry, "/", industry, "-", letter, ".asp")

## [1] "https://www.ferc.gov/docs-filing/corp-off/gas/gas-B.asp"

Looks good! Forward!

cross() lists each possible combination of URL parts. Because it returns a list, we use lift() to allow str_c() to take a list and paste it together. map() then does this for each.

url_parts <- list(
  base_url = "https://www.ferc.gov/docs-filing/corp-off/",
  industry = c("oil/oil-", "gas/gas-", "electric/electric-", "hydro/hydro-"),
  letter = LETTERS, # LETTERS (A-Z) is built in to R
  end = ".asp"
)

url <- url_parts %>%
  cross() %>%
  map_chr(lift(str_c)) %>%
  str_remove("/[a-z]*-A") # remove "/industry-A"

## Add odd ULS
url <- c(url,
         "https://www.ferc.gov/docs-filing/corp-off/gas/gas-U-V.asp",
         "https://www.ferc.gov/docs-filing/corp-off/gas/gas-X-Y-Z.asp")

head(url)

## [1] "https://www.ferc.gov/docs-filing/corp-off/oil.asp"      
## [2] "https://www.ferc.gov/docs-filing/corp-off/gas.asp"      
## [3] "https://www.ferc.gov/docs-filing/corp-off/electric.asp" 
## [4] "https://www.ferc.gov/docs-filing/corp-off/hydro.asp"    
## [5] "https://www.ferc.gov/docs-filing/corp-off/oil/oil-B.asp"
## [6] "https://www.ferc.gov/docs-filing/corp-off/gas/gas-B.asp"

Define a function to apply to each URL.

scraper <- function(url){
  html <-  read_html(url) 

  ## Extract company names
  corps <- html %>% 
    html_nodes("strong") %>% 
    html_text() %>%
    .[-c(1:3)]
  
  ## Extract the industry back out of the URL
  industry <- str_extract(url, "off/[a-z]*") %>% 
    str_remove("off/")

  ## Return a data frame with two variables, company and industry
  d <- tibble(company = str_squish(corps),
              industry = industry) %>% 
    distinct()
  
  ## Get the names and addresses of company reps
  ## "ul" is an unordered list in HTML
  ## in this case, ul nodes contain the names and addresses of company representatives 
  contacts <- html %>% 
    html_nodes("strong, ul") %>%
    # instead of using html_text(), here I am just collapse the raw HTML into a single string
    str_c(collapse = "") %>% 
    # and split it out by the strong node so that each string starts with the company names I extracted above
    str_split("<strong>") %>% 
    unlist()
  
  ## Make a second dataframe of company names and addresses
  d2 <- tibble(contacts = contacts) %>% 
    separate(contacts, sep = "</strong>",
             into = c("company", "contacts"),
             extra = "merge") %>% 
    ## Clean up a bit
    mutate(company = str_remove(company, "^[^A-Z]*")) %>%
      mutate(company = str_remove(company, "amp;")) %>% # remove & symbol
    mutate(contacts = str_remove(contacts, "^[^A-Z]*")) %>% 
    mutate_all(str_squish) # remove extra white space
  
  ## keep just one contact per company
  d2 %<>% 
    distinct() %>% 
    filter(!is.na(contacts)) %>% 
    group_by(company) %>% 
    top_n(1)
  
  ## Join the company-contact data to the company-industry data
  # FIXME may fail where specials other than anp&, also Penola line break
  d %<>% left_join(d2) %>% distinct()
  
  return(d)
}

3. Apply the function to each URL

map() each URL into the function, i.e. apply the function to each. I wrap the scraper function in possibly() so that if a URL does not work, it keeps going, just returning NA rather than stopping at an error. Specifically, I use map_dfr which returns a data frame, combining results with rbind(). See ?map() and the purrr cheatsheet.

# Test function on one url
FERC_corps <- map_dfr(url[1], possibly(scraper, NA_real_) ) 

# Apply to all urls
FERC_corps <- map_dfr(url, possibly(scraper, NA_real_) )  %>% distinct()
head(FERC_corps)

## # A tibble: 6 x 3
##   company                 industry contacts                                
##   <chr>                   <chr>    <chr>                                   
## 1 Alpine Transportation … oil      "Edward R. Hendrickson <br> Vice Presid…
## 2 Anadarko Gulf of Mexic… oil      "Bradley E. Boister <br> Manager, Domes…
## 3 Anschutz Ranch East Pi… oil      "Richard M. Jones <br> Vice President <…
## 4 Apache Corporation      oil      "Stephen S. Williams<br> Senior Counsel…
## 5 Apache GOM Pipeline, I… oil      "Stephen S. Williams<br> Senior Counsel…
## 6 B                       oil      <NA>

Clean up a bit, using stringr functions to make this list of companies more likely to be found in other datasets.

FERC_corps$company %<>% 
  str_replace_all("\r\n", ", ") %>% 
  str_remove_all("^, ") %>%
  trimws() # trim white space before and after company names

FERC_corps %<>% 
  filter(nchar(company)>1) %>% # filter out names less than 2 characters 
  filter(!company %in% c("FERC Online Support", "Note:","LLC", "Transportation, LLC")) %>% # not companies
  filter(!str_detect(company, "[0-9]{2}/[0-9]{2}/[0-9]{4}")) %>% # dates not companies
  distinct() # drop duplicates

## For matching, remove endings like ", Inc." etc. 
trim <- . %>%
  str_remove_all("Companies of |Village of |\\(.*|,.*| -.*| Pipe.*| Company.*| Corp.*| Energy.*| Holdings.*| Gas$| Project.*| Co\\.| Co$| Gen.| LLC| L.L.C.|  L.P.| Inc.| L.P.| Ltd.| LP| Limited Partnership| Limited Liability| Transportation and Storage| Storage and Transportation| Arkansas$") %>% 
  str_remove_all("^, ") %>%
  trimws()  # filter out names less than 2 characters 

FERC_corps$company_short <- trim(FERC_corps$company) %>% trim()

FERC_corps %<>% filter(nchar(company_short)>1,
                       company_short != "Â")

head(FERC_corps %>% select(company_short, everything()))

## # A tibble: 6 x 4
##   company_short     company           industry contacts                    
##   <chr>             <chr>             <chr>    <chr>                       
## 1 Alpine Transport… Alpine Transport… oil      "Edward R. Hendrickson <br>…
## 2 Anadarko Gulf of… Anadarko Gulf of… oil      "Bradley E. Boister <br> Ma…
## 3 Anschutz Ranch E… Anschutz Ranch E… oil      "Richard M. Jones <br> Vice…
## 4 Apache            Apache Corporati… oil      "Stephen S. Williams<br> Se…
## 5 Apache GOM        Apache GOM Pipel… oil      "Stephen S. Williams<br> Se…
## 6 Barclays Bank PLC Barclays Bank PLC oil      "Mark D. Higgins<br> Direct…

Inspect especially short and long company names:

## Short to long names refrence variable
FERC_corps %<>% group_by(company_short) %>% 
  mutate(company_short_refs = str_c(company, collapse = "; "),
         contacts_refs = str_c(contacts, collapse = "; ")) %>% 
  ungroup() %>% 
  distinct()

FERC_corps %>% count(company_short, company_short_refs) %>% arrange(-n)

## # A tibble: 1,873 x 3
##    company_short    company_short_refs                                    n
##    <chr>            <chr>                                             <int>
##  1 Emera            Emera Energy Services Inc.; Emera Energy Service…    18
##  2 Duke             Duke Energy Carolinas, LLC; Duke Energy Indiana,…    13
##  3 South Jersey     South Jersey Energy Company; South Jersey Energy…    10
##  4 Stream Energy    Stream Energy Columbia, LLC; Stream Energy Conne…    10
##  5 Dominion         Dominion Energy Cove Point LNG, LP; Dominion Ene…     9
##  6 American Ref-Fu… American Ref-Fuel Company; American Ref-Fuel Com…     8
##  7 DC Energy        DC Energy California, LLC; DC Energy Dakota, LLC…     8
##  8 Liberty Utiliti… Liberty Utilities (EnergyNorth Natural Gas) Corp…     8
##  9 AP Gas & Electr… AP Gas & Electric (TX), LLC; AP Gas & Electric (…     7
## 10 Direct Energy    Direct Energy Business, LLC; Direct Energy Servi…     6
## # … with 1,863 more rows

Get congressional districts by company zipcode

Here we use psvR to access VoteSmart data via their API (this requires an API key)

These data are saved as DOE_FERC-corps.Rdata in the data folder of this repository.

# Save 
save(FERC_corps, file = here("data", "DOE_FERC-corps.Rdata"))

Tidy text example: web scraping and RegEx

Companies that lobby the U.S. Federal Energy Regulatory Commission (FERC)

Get corporations and corporate officials from the FERC website

1: Instpect the URLs and source html

2. Make a function, turning things that vary into variables.

3. Apply the function to each URL

Get congressional districts by company zipcode