Corporations Regextable

Introduction

This vignette demonstrates how to extract mentions of corporations using a regex lookup table of corporation names, aliases, tickers, and other metadata. The regextable::extract() workflow allows consistent identification of corporation names across datasets, even when multiple aliases or variants are used.

Load the package and the helper libraries used in this vignette:

library(regextable)
library(kableExtra)
library(googledrive)

Regex Table of Corporations

The corporations_regex table contains a sample of corporations, their common aliases, identifiers (the cik, FED_RSSD, and naics columns in the preview below), and the reference sources used for matching and standardization. The aliases column lists alternative names, abbreviations, and common variants for each corporation, and the pattern column combines those aliases into a single word-bounded regex.

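# Skip Google sign-in; drive_deauth() lets googledrive read publicly shared files
googledrive::drive_deauth()

# Download the regex table to the working directory
googledrive::drive_download(
  googledrive::as_id("1CgrHKy3jkWFffJN-cjNBzlHdoIIxO62t"),
  path = "corporations_regex.csv",
  overwrite = TRUE
)

# Read the table and preview the first rows
corporations_regex <- read.csv("corporations_regex.csv", stringsAsFactors = FALSE)
kable(head(corporations_regex))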

aliases	cik	FED_RSSD	naics	sources	pattern
booz allen & hamilton\|booz allen hamilton\|booz allen hamilton	13222	NA	NA	cik	\b(?:booz allen & hamilton\|booz allen hamilton)\b
taft stettinius & hollister	909789	NA	NA	cik	\b(?:taft stettinius & hollister)\b
case\|case	922321	NA	NA	cik	\b(?:case)\b
young america	1058951	NA	NA	cik	\b(?:young america)\b
alliance	1086796	NA	NA	cik	\b(?:alliance)\b
boyden	1124262	NA	NA	cik	\b(?:boyden)\b
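The relationship between the aliases and pattern columns can be sketched in base R. This is only an illustration of the pattern shape visible in the table, not the package's own code: build_pattern() is a hypothetical helper, and it assumes aliases are pipe-delimited (the `\|` in the rendered table being an escaped `|`) and contain no regex metacharacters that need escaping.

```r
# Hypothetical helper (not part of regextable): turn a pipe-delimited alias
# string into a word-bounded, non-capturing alternation.
build_pattern <- function(aliases) {
  # Split on literal "|" and drop surrounding whitespace and duplicate aliases
  parts <- unique(trimws(strsplit(aliases, "|", fixed = TRUE)[[1]]))
  # Wrap the alternatives in \b(?: ... )\b so only whole words match
  paste0("\\b(?:", paste(parts, collapse = "|"), ")\\b")
}

pattern <- build_pattern(
  "booz allen & hamilton|booz allen hamilton|booz allen hamilton"
)
cat(pattern, "\n")
```

Deduplicating before pasting keeps the alternation short, which is consistent with the first table row, where a repeated alias appears only once in the pattern.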

Sample Dataset: Project 2025

The following dataset contains organizations and contributors involved in Project 2025. It is used to demonstrate matching and standardizing corporation names using the regex table.

# Download the .rda file to a temporary location
project_2025_url <- "https://raw.githubusercontent.com/judgelord/corporations/main/data/project_2025_coalition_and_contributors.rda"
tmp <- tempfile(fileext = ".rda")
download.file(project_2025_url, tmp, mode = "wb")
# load() restores the saved object under its original name
load(tmp)
# Clean up the temporary file and helper variables
unlink(tmp)
remove(tmp, project_2025_url)
kable(head(project_2025_coalition_and_contributors))

type	organization	role
Organization	Alabama Policy Institute	Advisory Board Member
Organization	Alliance Defending Freedom	Advisory Board Member
Organization	American Accountability Foundation	Advisory Board Member
Organization	American Center for Law and Justice	Advisory Board Member
Organization	American Compass	Advisory Board Member
Organization	The American Conservative	Advisory Board Member
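Because load() restores objects under the names they were saved with, the data frame above appears in the workspace without an explicit assignment. A minimal sketch of a more defensive variant follows. It uses a made-up stand-in data frame (demo_orgs, not the Project 2025 data) so that it runs without network access, and loads the file into a separate environment so nothing in the global workspace is overwritten.

```r
# Stand-in data with invented rows, used only to create an .rda file locally
demo_orgs <- data.frame(
  organization = c("Example Org A", "Example Org B"),
  stringsAsFactors = FALSE
)
tmp <- tempfile(fileext = ".rda")
save(demo_orgs, file = tmp)

# Load into a fresh environment; load() returns the names it restored
env <- new.env()
loaded_names <- load(tmp, envir = env)
unlink(tmp)

print(loaded_names)
print(head(env[[loaded_names[1]]]))
```

Checking the returned names first is a cheap way to confirm which objects a downloaded .rda file actually contains before using them.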

Extracting Corporation Names

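The extract() function searches the organization column of the dataset using the pattern column of the regex table. In the output below, each row is one match: the row_id of the source row, the columns requested in data_return_cols, the pattern that matched, and the matched text. A single organization can therefore appear more than once if several patterns match it. Setting remove_acronyms = TRUE drops acronym-only matches to reduce false positives.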

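# Start the spaCy backend (presumably required because use_ner = TRUE below)
spacyr::spacy_initialize()

# Match each organization name against every pattern in the regex table
corp_df <- extract(
  data = project_2025_coalition_and_contributors,
  col_name = "organization",
  regex_table = corporations_regex,
  data_return_cols = "organization",
  remove_acronyms = TRUE,
  use_ner = TRUE
)
kable(head(corp_df))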

row_id	organization	pattern	match
3	American Accountability Foundation	\b(?:american)\b	American
4	American Center for Law and Justice	\b(?:american)\b	American
5	American Compass	\b(?:american)\b	American
5	American Compass	\b(?:urban compass\|compass)\b	Compass
7	American Cornerstone Institute	\b(?:american)\b	American
7	American Cornerstone Institute	\b(?:cornerstone)\b	Cornerstone
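The shape of this output can be reproduced with a small base R sketch. This is not how extract() is implemented: match_patterns() is a hypothetical helper, the three organizations and patterns are copied from the tables above, and case-insensitive matching is an assumption inferred from the output (lowercase patterns matching capitalized names). It applies no acronym filtering or NER.

```r
orgs <- c(
  "Alabama Policy Institute",
  "American Compass",
  "American Cornerstone Institute"
)
patterns <- c(
  "\\b(?:american)\\b",
  "\\b(?:urban compass|compass)\\b",
  "\\b(?:cornerstone)\\b"
)

# Hypothetical helper: return one row per (text, pattern) pair that matches
match_patterns <- function(text, patterns) {
  rows <- lapply(seq_along(text), function(i) {
    hits <- lapply(patterns, function(p) {
      # First match of pattern p in text[i]; character(0) if there is none
      m <- regmatches(
        text[i],
        regexpr(p, text[i], perl = TRUE, ignore.case = TRUE)
      )
      if (length(m) == 0) return(NULL)
      data.frame(
        row_id = i, organization = text[i], pattern = p, match = m,
        stringsAsFactors = FALSE
      )
    })
    do.call(rbind, hits)
  })
  do.call(rbind, rows)
}

res <- match_patterns(orgs, patterns)
print(res)
```

In this sketch, as in the output above, a broad single-word pattern such as `american` matches many unrelated organizations, so results built on generic aliases are worth reviewing before they are treated as corporation mentions.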