Introduction
This vignette demonstrates how to extract mentions of corporations using a regex lookup table of corporation names, aliases, tickers, and other metadata. The regextable::extract() workflow allows consistent identification of corporation names across datasets, even when multiple aliases or variants are used.
Install and load the package:
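A minimal setup sketch. The GitHub repository path below is an assumption and is not confirmed by this vignette, so the install lines are left commented out. knitr is loaded because the examples call kable().

```r
# Assumption: the package is installed from GitHub; adjust the repository
# path to wherever regextable is actually hosted.
# install.packages("remotes")
# remotes::install_github("judgelord/regextable")

library(regextable)  # provides extract()
library(knitr)       # provides kable() for printing tables
```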
Regex Table of Corporations
The corporations_regex table contains a sample of corporations, their common aliases, ticker symbols, and reference sources used for matching and standardization. The aliases column lists alternative names, abbreviations, and common variants for each corporation, separated by |. The pattern column holds the corresponding regular expression, wrapped in word boundaries, that is used for matching.
corporations_regex <- read.csv("/Users/shirl/Downloads/corporations_regex.csv", stringsAsFactors = FALSE)
kable(head(corporations_regex))

| aliases | cik | FED_RSSD | ticker | naics | sources | pattern |
|---|---|---|---|---|---|---|
| booz allen & hamilton\|booz allen hamilton\|booz allen hamilton | 13222 | NA | | NA | cik | \b(?:booz allen & hamilton\|booz allen hamilton)\b |
| taft stettinius & hollister | 909789 | NA | | NA | cik | \b(?:taft stettinius & hollister)\b |
| case\|case | 922321 | NA | | NA | cik | \b(?:case)\b |
| young america | 1058951 | NA | | NA | cik | \b(?:young america)\b |
| alliance | 1086796 | NA | | NA | cik | \b(?:alliance)\b |
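The relationship between the aliases and pattern columns can be sketched in base R. This is an illustration only: build_pattern() is a hypothetical helper, not a regextable function, and the package may construct its patterns differently. The sketch also assumes aliases contain no regex metacharacters that need escaping.

```r
# Hypothetical helper (not part of regextable): build a word-bounded
# alternation from a pipe-separated alias string.
build_pattern <- function(aliases) {
  # Split on the literal "|" and drop duplicates and stray whitespace
  parts <- unique(trimws(strsplit(aliases, "|", fixed = TRUE)[[1]]))
  # Wrap the alternatives in a non-capturing group between word boundaries
  paste0("\\b(?:", paste(parts, collapse = "|"), ")\\b")
}

pattern <- build_pattern(
  "booz allen & hamilton|booz allen hamilton|booz allen hamilton"
)
cat(pattern, "\n")

# Word boundaries keep a short alias such as "case" from matching
# inside a longer word such as "Showcase"
hits <- grepl(build_pattern("case|case"),
              c("Case Western", "Showcase Inc"),
              ignore.case = TRUE, perl = TRUE)
hits
```

The duplicated alias collapses to a single alternative, which reproduces the pattern shown for the first row of the table.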
Corporation Data
The following dataset contains organizations and contributors involved in Project 2025. It is used to demonstrate matching and standardizing corporation names using the regex table.
project_2025_url <- "https://raw.githubusercontent.com/judgelord/corporations/main/data/project_2025_coalition_and_contributors.rda"
tmp <- tempfile(fileext = ".rda")
download.file(project_2025_url, tmp, mode = "wb")
load(tmp)
unlink(tmp)
remove(tmp, project_2025_url)
kable(head(project_2025_coalition_and_contributors))

| type | organization | individual | role |
|---|---|---|---|
| Organization | Alabama Policy Institute | | Advisory Board Member |
| Organization | Alliance Defending Freedom | | Advisory Board Member |
| Organization | American Accountability Foundation | | Advisory Board Member |
| Organization | American Center for Law and Justice | | Advisory Board Member |
| Organization | American Compass | | Advisory Board Member |
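The data frame appears under the name project_2025_coalition_and_contributors because load() restores objects under the names they were saved with. A small offline sketch of that behaviour, using a made-up toy_orgs object in place of the real download:

```r
# toy_orgs is a hypothetical stand-in for the downloaded dataset
toy_orgs <- data.frame(type = "Organization",
                       organization = "American Compass",
                       stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".rda")
save(toy_orgs, file = tmp)   # write the object to an .rda file
rm(toy_orgs)                 # remove it from the workspace

# load() restores the object and invisibly returns the names it created
loaded <- load(tmp)
unlink(tmp)

loaded            # name(s) of the restored object(s)
head(toy_orgs)    # the data frame is available again under its saved name
```

Capturing the return value of load() is a convenient way to check which objects an unfamiliar .rda file contains.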
Extracting Aliases from Corporation Crosswalk
The extract() function searches the organization column of the dataset using the patterns in the regex table. It returns one row per match, with the row ID, the requested data columns (here, organization), the pattern that matched, and the matched text. Setting remove_acronyms = TRUE drops acronym-only matches to reduce false positives.
corp_df <- extract(data = project_2025_coalition_and_contributors,
                   col_name = "organization",
                   regex_table = corporations_regex,
                   data_return_cols = "organization",
                   remove_acronyms = TRUE,
                   use_ner = TRUE)
kable(head(corp_df))

| row_id | organization | pattern | match |
|---|---|---|---|
| 3 | American Accountability Foundation | \b(?:american)\b | American |
| 4 | American Center for Law and Justice | \b(?:american)\b | American |
| 5 | American Compass | \b(?:american)\b | American |
| 5 | American Compass | \b(?:urban compass\|compass)\b | Compass |
| 7 | American Cornerstone Institute | \b(?:american)\b | American |
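To see why one organization can produce several rows, the matching step can be approximated in base R. This is a simplified sketch, not the regextable implementation: match_patterns() is a hypothetical helper, it assumes case-insensitive matching, and it ignores the remove_acronyms and use_ner options.

```r
# Hypothetical helper: test every pattern against every text value and
# return one row per (text, pattern) pair that matches.
match_patterns <- function(text, patterns) {
  hits <- list()
  for (i in seq_along(text)) {
    for (p in patterns) {
      # First match of pattern p in text[i]; character(0) if none
      m <- regmatches(text[i],
                      regexpr(p, text[i], perl = TRUE, ignore.case = TRUE))
      if (length(m) > 0) {
        hits[[length(hits) + 1]] <- data.frame(
          row_id = i, organization = text[i], pattern = p, match = m,
          stringsAsFactors = FALSE
        )
      }
    }
  }
  do.call(rbind, hits)  # NULL when nothing matched
}

orgs <- c("Alabama Policy Institute",
          "Alliance Defending Freedom",
          "American Accountability Foundation",
          "American Center for Law and Justice",
          "American Compass")
patterns <- c("\\b(?:american)\\b", "\\b(?:urban compass|compass)\\b")

res <- match_patterns(orgs, patterns)
res
```

"American Compass" matches both patterns, so it appears twice. Generic single-word aliases such as "american" match many unrelated organizations, so results like these are worth reviewing before they are treated as confirmed corporation matches.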
