Introduction
regextable extracts regex-based pattern matches from a data frame or
character vector using a pattern lookup table. For each match it returns
the matched pattern, the exact matched text, an internal row identifier,
and, unless you specify otherwise, all original columns from the input
data. Columns from the pattern table can optionally be added. To improve
performance on large datasets, the search for a row stops once a match
is found.
Install and load the package:
devtools::install_github("judgelord/regextable")
library(regextable)
Data
For demonstration, we use two included datasets:
- members: A lookup table of regex patterns for member names.
- cr2007_03_01: A sample text dataset to search.
| congress | chamber | bioname | pattern | icpsr | state_abbrev | district_code | first_name | last_name |
|---|---|---|---|---|---|---|---|---|
| 110 | President | BUSH, George Walker | george bush\|george walker bush\|bush\|george w bush\|bush\|(^\|senator \|representative )bush\|bush, george\|bush george\|bush, g\|president bush\|g w bush | 99910 | USA | 0 | George | BUSH |
| 110 | House | BONNER, Jr., Josiah Robins (Jo) | josiah bonner\|josiah josiah robins bonner\|bonner\|josiah j bonner\|jo bonner\|jo josiah robins bonner\|jo j bonner\|(^\|senator \|representative )bonner\|bonner, jo\|bonner, josiah\|bonner josiah\|bonner, j\|representative bonner\|j j bonner | 20300 | AL | 1 | Josiah | BONNER |
| 110 | House | ROGERS, Mike Dennis | mike rogers\|mike dennis rogers\|rogers.{1,4}al\|mike d rogers\|michael rogers\|michael dennis rogers\|michael d rogers\|(^\|senator \|representative )rogers{1,4}al\|rogers, michael\|rogers, mike\|rogers mike\|representative rogers{1,4}al\|m d rogers | 20301 | AL | 3 | Mike | ROGERS |
| 110 | House | DAVIS, Artur | artur davis\|davis\|(^\|senator \|representative )davis{1,4}al\|davis, artur\|davis artur\|davis, a\|representative davis{1,4}al | 20302 | AL | 7 | Artur | DAVIS |
| 110 | House | CRAMER, Robert E. (Bud), Jr. | robert cramer\|robert e cramer\|cramer\|bud cramer\|bud e cramer\|cramer\|(^\|senator \|representative )cramer\|cramer, bud\|cramer, robert\|cramer robert\|cramer, r\|cramer, b\|representative cramer\|r e cramer | 29100 | AL | 5 | Robert | CRAMER |
Text Cleaning
extract() cleans text by default, so you do not need to call
clean_text() yourself. Cleaning standardizes spacing, punctuation,
and capitalization, which makes regex patterns match more reliably.
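As a rough illustration of what this kind of cleaning involves, the base-R sketch below lowercases text, replaces runs of punctuation with a space, and collapses whitespace. The function name clean_text_sketch is hypothetical, and the steps are an assumption about typical cleaning rather than the package's actual implementation of clean_text().

```r
# Hypothetical helper: an assumption about typical cleaning steps,
# not the actual source of regextable's clean_text().
clean_text_sketch <- function(x) {
  x <- tolower(x)                     # standardize capitalization
  x <- gsub("[[:punct:]]+", " ", x)   # replace runs of punctuation with a space
  x <- gsub("\\s+", " ", x)           # collapse repeated whitespace
  trimws(x)                           # drop leading and trailing spaces
}

clean_text_sketch(" HELLO---WORLD ")
#> [1] "hello world"
clean_text_sketch("HON. SAM GRAVES;Mr. GRAVES")
#> [1] "hon sam graves mr graves"
```

Lowercasing and stripping punctuation is what lets a lowercase pattern such as "sam graves" match a heading written as "HON. SAM GRAVES".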
Example of clean_text():
text <- " HELLO---WORLD "
clean_text(text)
#> [1] "hello world"
Basic Extraction
The simplest use of extract():
result <- extract(
  data = cr2007_03_01,
  regex_table = members,
  return_cols = c("text"),
  regex_return_cols = c("icpsr")
)
knitr::kable(head(result))
| text | pattern | match | row_id | icpsr |
|---|---|---|---|---|
| HON. SAM GRAVES;Mr. GRAVES | samuel graves\|graves\|sam graves\|(^\|senator \|representative )graves\|graves, sam\|graves, samuel\|graves samuel\|graves, s\|representative graves | sam graves | 1 | 20124 |
| HON. MARK UDALL;Mr. UDALL | mark udall\|udall\|mark e udall\|udall\|(^\|senator \|representative )udall{1,4}co\|udall, mark\|udall mark\|udall, m\|representative udall{1,4}co\|m e udall | mark udall | 2 | 29906 |
| HON. JAMES R. LANGEVIN;Mr. LANGEVIN | james langevin\|langevin\|james r langevin\|jim langevin\|jim r langevin\|(^\|senator \|representative )langevin\|langevin, jim\|langevin, james\|langevin james\|langevin, j\|representative langevin\|j r langevin | james r langevin | 3 | 20136 |
| HON. JIM COSTA;Mr. COSTA | jim costa\|costa\|james costa\|(^\|senator \|representative )costa\|costa, james\|costa, jim\|costa jim\|costa, j\|representative costa | jim costa | 4 | 20501 |
| HON. SAM GRAVES;Mr. GRAVES | samuel graves\|graves\|sam graves\|(^\|senator \|representative )graves\|graves, sam\|graves, samuel\|graves samuel\|graves, s\|representative graves | sam graves | 5 | 20124 |
Explanation:
- data: the text dataset to search.
- col_name: the column that contains the text.
- regex_table: the lookup table of patterns.
- return_cols: columns from data to include in the result.
- regex_return_cols: additional columns from the pattern table to attach.
Each row in the output corresponds to a detected match and includes both the original text and the matching pattern.
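To see how the pattern and match columns relate, the sketch below applies one pattern to one already-cleaned text using only base R. It illustrates the regex behaviour and is not a call into the package. The variable names are hypothetical, and the pattern is a shortened version of the Graves pattern shown in the table.

```r
# One already-cleaned text and one pattern (shortened for illustration).
text    <- "hon sam graves mr graves"
pattern <- "samuel graves|graves|sam graves|(^|senator |representative )graves"

# regexpr() locates the first (leftmost) match; regmatches() returns its text.
hit     <- regexpr(pattern, text, perl = TRUE)
matched <- regmatches(text, hit)
matched
#> [1] "sam graves"
```

The whole alternation is what would appear in the pattern column, while the substring it actually hit ("sam graves") is what would appear in the match column.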
Advanced Usage
extract() can also filter data by date, remove acronyms
(all-uppercase patterns with 2+ characters), and select specific output
columns. This is useful for more controlled extraction.
Explanation:
- date_col, date_start, date_end: filter rows by date.
- remove_acronyms: skip patterns like “NASA” or “USA”.
You can combine these filters with any subset of columns for flexible outputs.
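What these two filters do can be sketched in base R. The toy data, the column names, and the inclusive date window below are all assumptions made for illustration. They mirror the description above (a date window on the data, and dropping all-uppercase patterns of two or more characters) and are not the package's internals.

```r
# Hypothetical toy data; column names are assumptions for illustration.
docs <- data.frame(
  date = as.Date(c("2007-02-27", "2007-03-01", "2007-03-05")),
  text = c("nasa budget hearing", "remarks by mr graves", "usa trade bill"),
  stringsAsFactors = FALSE
)
patterns <- c("NASA", "USA", "graves", "Udall")

# Date filter: keep rows inside an (assumed inclusive) start/end window.
date_start    <- as.Date("2007-03-01")
date_end      <- as.Date("2007-03-31")
docs_in_range <- docs[docs$date >= date_start & docs$date <= date_end, ]

# Acronym filter: drop patterns made only of 2+ uppercase letters.
is_acronym    <- grepl("^[[:upper:]]{2,}$", patterns)
kept_patterns <- patterns[!is_acronym]

docs_in_range$text
#> [1] "remarks by mr graves" "usa trade bill"
kept_patterns
#> [1] "graves" "Udall"
```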
How the Matching Works
Internally, extract() processes each row of text against
the regex patterns in order:
- Start with all text rows.
- Check each regex pattern against unmatched rows.
- When a match is found, remove that row from further checks.
- Continue until all patterns have been applied.
This approach improves performance on large datasets because rows stop being checked once a match is found.
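The steps above can be sketched in base R as follows. This is a minimal illustration of the first-match-wins idea, not the package's source. The function first_match_sketch, its arguments, and the toy inputs are hypothetical.

```r
# Hypothetical sketch of first-match-wins matching; not regextable's source.
first_match_sketch <- function(texts, patterns) {
  result <- data.frame(row_id = seq_along(texts), text = texts,
                       pattern = NA_character_, match = NA_character_,
                       stringsAsFactors = FALSE)
  unmatched <- seq_along(texts)            # start with all text rows

  for (p in patterns) {
    if (length(unmatched) == 0) break      # nothing left to check
    hits  <- regexpr(p, texts[unmatched], perl = TRUE)
    found <- hits > 0                      # regexpr() returns -1 for no match
    rows  <- unmatched[found]
    result$pattern[rows] <- p
    result$match[rows]   <- regmatches(texts[unmatched], hits)
    unmatched <- unmatched[!found]         # matched rows leave the pool
  }
  result[!is.na(result$pattern), ]         # one output row per matched text
}

texts    <- c("hon sam graves mr graves", "hon mark udall mr udall", "no member here")
patterns <- c("sam graves|graves", "mark udall|udall", "mr")
first_match_sketch(texts, patterns)
```

In this toy run the first two texts also contain "mr", but the third pattern never sees them because each row was removed from the pool as soon as an earlier pattern matched. The shrinking pool is where the performance gain on large datasets comes from.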
Summary
- regextable is a tool for extracting data from text.
- Use the included datasets to get started or supply your own lookup tables.
- extract() by default handles text cleaning and efficient matching.
- Optional parameters allow advanced control over filtering and output.
