This example uses the rvest
, stringr
, purrr
, here
and magrittr
packages.
library(tidyverse)
library(rvest)
library(magrittr)
library(here)
The aim is to get the names of oil, gas, electric, and hydropower companies that lobby FERC from the FERC website and match them to (1) letters Members of Congress wrote to FERC and (2) corporate and executive campaign contributions from DIME.
Note: The R Markdown file that made this document is here.
(in Chrome: view -> developer -> view source)
Here is the first page under the “Oil Matters” tab, oil companies with “A” names:
html <- read_html("https://www.ferc.gov/docs-filing/corp-off/oil.asp")
html %>%
html_nodes("strong") %>% # company names appear with the HTML tag "<strong>"
html_text() %>% # to plain text
.[1:6] # the first 3 rows seem to be web page headers
## [1] "(Revised 10/16/2019)A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z "
## [2] "(Revised 10/16/2019)"
## [3] "A "
## [4] "Alpine Transportation Company"
## [5] "Anadarko Gulf of Mexico Pipeline, Inc."
## [6] "Anschutz Ranch East Pipeline, Inc."
Gas companies with “A” names. Notice how the URL changed:
html <- read_html("https://www.ferc.gov/docs-filing/corp-off/gas.asp")
html %>%
html_nodes("strong") %>%
html_text() %>%
.[4:6]
## [1] "A " "Acacia Natural Gas, L.L.C."
## [3] "Acadian Gas Pipeline System"
Gas companies with “B” names. Notice how the URL changed:
html <- read_html("https://www.ferc.gov/docs-filing/corp-off/gas/gas-B.asp")
# "strong" is bold in HTML
# in this case bolded text is company names
html %>%
html_nodes("strong") %>%
html_text() %>%
.[4:6]
## [1] "Barclays Bank PLC"
## [2] "Beacon Generating LLC"
## [3] "Bear Creek Storage Company, L.L.C."
In addition to “oil” and “gas,” we have “electric” and “hydro” with pages A-Z (4 x 26 = 104 pages!), so let’s make a function to grab the names from each page.
A test run using str_c
to combine strings:
industry <- "gas"
letter <- "B"
str_c("https://www.ferc.gov/docs-filing/corp-off/", industry, "/", industry, "-", letter, ".asp")
## [1] "https://www.ferc.gov/docs-filing/corp-off/gas/gas-B.asp"
Looks good! Forward!
cross()
lists each possible combination of URL parts. Because it returns a list, we use lift()
to allow str_c()
to take a list and paste it together. map()
then does this for each.
url_parts <- list(
base_url = "https://www.ferc.gov/docs-filing/corp-off/",
industry = c("oil/oil-", "gas/gas-", "electric/electric-", "hydro/hydro-"),
letter = LETTERS, # LETTERS (A-Z) is built in to R
end = ".asp"
)
url <- url_parts %>%
cross() %>%
map_chr(lift(str_c)) %>%
str_remove("/[a-z]*-A") # remove "/industry-A"
## Add odd ULS
url <- c(url,
"https://www.ferc.gov/docs-filing/corp-off/gas/gas-U-V.asp",
"https://www.ferc.gov/docs-filing/corp-off/gas/gas-X-Y-Z.asp")
head(url)
## [1] "https://www.ferc.gov/docs-filing/corp-off/oil.asp"
## [2] "https://www.ferc.gov/docs-filing/corp-off/gas.asp"
## [3] "https://www.ferc.gov/docs-filing/corp-off/electric.asp"
## [4] "https://www.ferc.gov/docs-filing/corp-off/hydro.asp"
## [5] "https://www.ferc.gov/docs-filing/corp-off/oil/oil-B.asp"
## [6] "https://www.ferc.gov/docs-filing/corp-off/gas/gas-B.asp"
Define a function to apply to each URL.
scraper <- function(url){
html <- read_html(url)
## Extract company names
corps <- html %>%
html_nodes("strong") %>%
html_text() %>%
.[-c(1:3)]
## Extract the industry back out of the URL
industry <- str_extract(url, "off/[a-z]*") %>%
str_remove("off/")
## Return a data frame with two variables, company and industry
d <- tibble(company = str_squish(corps),
industry = industry) %>%
distinct()
## Get the names and addresses of company reps
## "ul" is an unordered list in HTML
## in this case, ul nodes contain the names and addresses of company representatives
contacts <- html %>%
html_nodes("strong, ul") %>%
# instead of using html_text(), here I am just collapse the raw HTML into a single string
str_c(collapse = "") %>%
# and split it out by the strong node so that each string starts with the company names I extracted above
str_split("<strong>") %>%
unlist()
## Make a second dataframe of company names and addresses
d2 <- tibble(contacts = contacts) %>%
separate(contacts, sep = "</strong>",
into = c("company", "contacts"),
extra = "merge") %>%
## Clean up a bit
mutate(company = str_remove(company, "^[^A-Z]*")) %>%
mutate(company = str_remove(company, "amp;")) %>% # remove & symbol
mutate(contacts = str_remove(contacts, "^[^A-Z]*")) %>%
mutate_all(str_squish) # remove extra white space
## keep just one contact per company
d2 %<>%
distinct() %>%
filter(!is.na(contacts)) %>%
group_by(company) %>%
top_n(1)
## Join the company-contact data to the company-industry data
# FIXME may fail where specials other than anp&, also Penola line break
d %<>% left_join(d2) %>% distinct()
return(d)
}
map()
each URL into the function, i.e. apply the function to each. I wrap the scraper
function in possibly()
so that if a URL does not work, it keeps going, just returning NA
rather than stopping at an error. Specifically, I use map_dfr
which returns a data frame, combining results with rbind()
. See ?map()
and the purrr
cheatsheet.
# Test function on one url
FERC_corps <- map_dfr(url[1], possibly(scraper, NA_real_) )
# Apply to all urls
FERC_corps <- map_dfr(url, possibly(scraper, NA_real_) ) %>% distinct()
head(FERC_corps)
## # A tibble: 6 x 3
## company industry contacts
## <chr> <chr> <chr>
## 1 Alpine Transportation … oil "Edward R. Hendrickson <br> Vice Presid…
## 2 Anadarko Gulf of Mexic… oil "Bradley E. Boister <br> Manager, Domes…
## 3 Anschutz Ranch East Pi… oil "Richard M. Jones <br> Vice President <…
## 4 Apache Corporation oil "Stephen S. Williams<br> Senior Counsel…
## 5 Apache GOM Pipeline, I… oil "Stephen S. Williams<br> Senior Counsel…
## 6 B oil <NA>
Clean up a bit, using stringr
functions to make this list of companies more likely to be found in other datasets.
FERC_corps$company %<>%
str_replace_all("\r\n", ", ") %>%
str_remove_all("^, ") %>%
trimws() # trim white space before and after company names
FERC_corps %<>%
filter(nchar(company)>1) %>% # filter out names less than 2 characters
filter(!company %in% c("FERC Online Support", "Note:","LLC", "Transportation, LLC")) %>% # not companies
filter(!str_detect(company, "[0-9]{2}/[0-9]{2}/[0-9]{4}")) %>% # dates not companies
distinct() # drop duplicates
## For matching, remove endings like ", Inc." etc.
trim <- . %>%
str_remove_all("Companies of |Village of |\\(.*|,.*| -.*| Pipe.*| Company.*| Corp.*| Energy.*| Holdings.*| Gas$| Project.*| Co\\.| Co$| Gen.| LLC| L.L.C.| L.P.| Inc.| L.P.| Ltd.| LP| Limited Partnership| Limited Liability| Transportation and Storage| Storage and Transportation| Arkansas$") %>%
str_remove_all("^, ") %>%
trimws() # filter out names less than 2 characters
FERC_corps$company_short <- trim(FERC_corps$company) %>% trim()
FERC_corps %<>% filter(nchar(company_short)>1,
company_short != "Â")
head(FERC_corps %>% select(company_short, everything()))
## # A tibble: 6 x 4
## company_short company industry contacts
## <chr> <chr> <chr> <chr>
## 1 Alpine Transport… Alpine Transport… oil "Edward R. Hendrickson <br>…
## 2 Anadarko Gulf of… Anadarko Gulf of… oil "Bradley E. Boister <br> Ma…
## 3 Anschutz Ranch E… Anschutz Ranch E… oil "Richard M. Jones <br> Vice…
## 4 Apache Apache Corporati… oil "Stephen S. Williams<br> Se…
## 5 Apache GOM Apache GOM Pipel… oil "Stephen S. Williams<br> Se…
## 6 Barclays Bank PLC Barclays Bank PLC oil "Mark D. Higgins<br> Direct…
Inspect especially short and long company names:
## Short to long names refrence variable
FERC_corps %<>% group_by(company_short) %>%
mutate(company_short_refs = str_c(company, collapse = "; "),
contacts_refs = str_c(contacts, collapse = "; ")) %>%
ungroup() %>%
distinct()
FERC_corps %>% count(company_short, company_short_refs) %>% arrange(-n)
## # A tibble: 1,873 x 3
## company_short company_short_refs n
## <chr> <chr> <int>
## 1 Emera Emera Energy Services Inc.; Emera Energy Service… 18
## 2 Duke Duke Energy Carolinas, LLC; Duke Energy Indiana,… 13
## 3 South Jersey South Jersey Energy Company; South Jersey Energy… 10
## 4 Stream Energy Stream Energy Columbia, LLC; Stream Energy Conne… 10
## 5 Dominion Dominion Energy Cove Point LNG, LP; Dominion Ene… 9
## 6 American Ref-Fu… American Ref-Fuel Company; American Ref-Fuel Com… 8
## 7 DC Energy DC Energy California, LLC; DC Energy Dakota, LLC… 8
## 8 Liberty Utiliti… Liberty Utilities (EnergyNorth Natural Gas) Corp… 8
## 9 AP Gas & Electr… AP Gas & Electric (TX), LLC; AP Gas & Electric (… 7
## 10 Direct Energy Direct Energy Business, LLC; Direct Energy Servi… 6
## # … with 1,863 more rows
Here we use psvR
to access VoteSmart data via their API (this requires an API key)
These data are saved as DOE_FERC-corps.Rdata
in the data
folder of this repository.
# Save
save(FERC_corps, file = here("data", "DOE_FERC-corps.Rdata"))