Introduction
This vignette demonstrates how to extract mentions of corporations using a regex lookup table of corporation names, aliases, tickers, and other metadata. The regextable::extract() workflow allows consistent identification of corporation names across datasets, even when multiple aliases or variants are used.
Install and load the package:
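A minimal setup sketch follows. It assumes regextable and knitr (which provides kable()) are already installed, because the install source for regextable is not stated here. The variable names `pkgs` and `not_installed` are illustrative only.

```r
# Packages used in this vignette: regextable (extract) and knitr (kable).
pkgs <- c("regextable", "knitr")

# Assumption: both packages are installed beforehand; the install source
# for regextable is not given in this vignette, so none is hard-coded.
not_installed <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(not_installed) > 0) {
  message("Install before continuing: ", paste(not_installed, collapse = ", "))
} else {
  library(regextable)
  library(knitr)
}
```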
Regex Table of Corporations
The corporations_regex table contains a sample of corporations. Each row holds a corporation's aliases, its identifiers (cik, FED_RSSD, ticker, naics), the reference sources used for matching and standardization, and the regex pattern built from the aliases. The aliases column lists alternative names, abbreviations, and common variants, separated by |, so that one pattern covers every known form of the name.
corporations_regex <- read.csv("/Users/shirl/Downloads/corporations_regex.csv", stringsAsFactors = FALSE)
kable(head(corporations_regex))

| aliases | cik | FED_RSSD | ticker | naics | sources | pattern |
|---|---|---|---|---|---|---|
| booz allen & hamilton\|booz allen hamilton\|booz allen hamilton | 13222 | NA | | NA | cik | \b(?:booz allen & hamilton\|booz allen hamilton)\b |
| taft stettinius & hollister | 909789 | NA | | NA | cik | \b(?:taft stettinius & hollister)\b |
| case\|case | 922321 | NA | | NA | cik | \b(?:case)\b |
| young america | 1058951 | NA | | NA | cik | \b(?:young america)\b |
| alliance | 1086796 | NA | | NA | cik | \b(?:alliance)\b |
| boyden | 1124262 | NA | | NA | cik | \b(?:boyden)\b |
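In the rows above, each pattern looks like the row's de-duplicated aliases joined with | and wrapped in word boundaries. The sketch below reproduces that shape in base R. `build_pattern` is a hypothetical helper written for illustration, not a function from regextable, and the package may construct its patterns differently.

```r
# Hypothetical helper: turn a "|"-separated alias string into a pattern of the
# form \b(?:alias1|alias2)\b, dropping duplicate aliases along the way.
# Assumption: aliases contain no regex metacharacters that need escaping.
build_pattern <- function(aliases) {
  parts <- unique(strsplit(aliases, "|", fixed = TRUE)[[1]])
  paste0("\\b(?:", paste(parts, collapse = "|"), ")\\b")
}

booz <- build_pattern("booz allen & hamilton|booz allen hamilton|booz allen hamilton")
case <- build_pattern("case|case")

cat(booz, "\n")
cat(case, "\n")
```

The word boundaries keep a short alias such as case from matching inside a longer word, though it will still match wherever it appears as a whole word.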
Sample Dataset: Project 2025
The following dataset contains organizations and contributors involved in Project 2025. It is used to demonstrate matching and standardizing corporation names using the regex table.
project_2025_url <- "https://raw.githubusercontent.com/judgelord/corporations/main/data/project_2025_coalition_and_contributors.rda"
tmp <- tempfile(fileext = ".rda")
download.file(project_2025_url, tmp, mode = "wb")
load(tmp)
unlink(tmp)
remove(tmp, project_2025_url)
kable(head(project_2025_coalition_and_contributors))

| type | organization | individual | role |
|---|---|---|---|
| Organization | Alabama Policy Institute | | Advisory Board Member |
| Organization | Alliance Defending Freedom | | Advisory Board Member |
| Organization | American Accountability Foundation | | Advisory Board Member |
| Organization | American Center for Law and Justice | | Advisory Board Member |
| Organization | American Compass | | Advisory Board Member |
| Organization | The American Conservative | | Advisory Board Member |
Extracting Corporation Names
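The extract() function searches the organization column of the dataset using the pattern column of the regex table. It returns one row per match, giving the row_id of the source row, the requested data columns, the pattern that matched, and the matched text. A single organization can therefore appear several times if it matches more than one pattern. Setting remove_acronyms = TRUE drops acronym-only matches to reduce false positives.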
corp_df <- extract(
  data = project_2025_coalition_and_contributors,
  col_name = "organization",
  regex_table = corporations_regex,
  data_return_cols = "organization",
  remove_acronyms = TRUE,
  use_ner = TRUE
)
kable(head(corp_df))

| row_id | organization | pattern | match |
|---|---|---|---|
| 3 | American Accountability Foundation | \b(?:american)\b | American |
| 4 | American Center for Law and Justice | \b(?:american)\b | American |
| 5 | American Compass | \b(?:american)\b | American |
| 5 | American Compass | \b(?:urban compass\|compass)\b | Compass |
| 7 | American Cornerstone Institute | \b(?:american)\b | American |
| 7 | American Cornerstone Institute | \b(?:cornerstone)\b | Cornerstone |
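The output above shows two behaviors worth noting: one organization can match several patterns, and lowercase patterns match capitalized text. The base-R sketch below imitates that result shape on toy data, assuming case-insensitive PCRE matching. `find_matches` is a hypothetical helper for illustration; it is not how regextable::extract() is implemented and it ignores the remove_acronyms and use_ner options.

```r
# Toy inputs (two organization names and two patterns taken from the tables above).
orgs <- c("American Compass", "Alabama Policy Institute")
patterns <- c("\\b(?:american)\\b", "\\b(?:urban compass|compass)\\b")

# Hypothetical helper: return one row per (text, pattern) match.
# Assumption: matching is case-insensitive and uses Perl-compatible regex.
find_matches <- function(text, patterns) {
  rows <- lapply(patterns, function(p) {
    hits <- regmatches(text, gregexpr(p, text, ignore.case = TRUE, perl = TRUE))
    n <- lengths(hits)  # number of matches found in each text element
    data.frame(
      row_id = rep(seq_along(text), n),
      organization = rep(text, n),
      pattern = rep(p, sum(n)),
      match = as.character(unlist(hits)),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  out <- out[order(out$row_id), ]  # group matches by source row
  rownames(out) <- NULL
  out
}

res <- find_matches(orgs, patterns)
print(res)
```

Generic aliases such as american match many unrelated organization names, so rows like these are better treated as candidates to review than as confirmed corporate mentions.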
