regextable

Introduction

regextable extracts regex-based pattern matches from a data frame or character vector using a pattern lookup table. The result contains the matched pattern, the exact matched text, an internal row identifier, and, unless you specify a subset, all original columns from the input data. Columns from the pattern table can optionally be attached as well. To improve performance on large datasets, the search for a row stops once a match is found.

Install and load the package:

devtools::install_github("judgelord/regextable")

library(regextable)
library(kableExtra)

Data

For demonstration, we use two included datasets:

members: A lookup table of regex patterns for member names.
cr2007_03_01: A sample text dataset to search.

data("members")
kable(members)

congress	chamber	bioname	pattern	icpsr	state_abbrev	district_code	first_name	last_name
110	President	BUSH, George Walker	george bush\|george walker bush\|bush\|george w bush\|bush\|(^\|senator \|representative )bush\|bush, george\|bush george\|bush, g\|president bush\|g w bush	99910	USA	0	George	BUSH
110	House	BONNER, Jr., Josiah Robins (Jo)	josiah bonner\|josiah josiah robins bonner\|bonner\|josiah j bonner\|jo bonner\|jo josiah robins bonner\|jo j bonner\|(^\|senator \|representative )bonner\|bonner, jo\|bonner, josiah\|bonner josiah\|bonner, j\|representative bonner\|j j bonner	20300	AL	1	Josiah	BONNER
110	House	ROGERS, Mike Dennis	mike rogers\|mike dennis rogers\|rogers.{1,4}al\|mike d rogers\|michael rogers\|michael dennis rogers\|michael d rogers\|(^\|senator \|representative )rogers{1,4}al\|rogers, michael\|rogers, mike\|rogers mike\|representative rogers{1,4}al\|m d rogers	20301	AL	3	Mike	ROGERS
110	House	DAVIS, Artur	artur davis\|davis\|(^\|senator \|representative )davis{1,4}al\|davis, artur\|davis artur\|davis, a\|representative davis{1,4}al	20302	AL	7	Artur	DAVIS
110	House	CRAMER, Robert E. (Bud), Jr.	robert cramer\|robert e cramer\|cramer\|bud cramer\|bud e cramer\|cramer\|(^\|senator \|representative )cramer\|cramer, bud\|cramer, robert\|cramer robert\|cramer, r\|cramer, b\|representative cramer\|r e cramer	29100	AL	5	Robert	CRAMER


data("cr2007_03_01")
kable(cr2007_03_01)

date	text	header	url	url_txt
2007-03-01	HON. SAM GRAVES;Mr. GRAVES	RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT; Congressional Record Vol. 153, No. 35	https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-2	https://www.congress.gov/117/crec/2007/03/01/modified/CREC-2007-03-01-pt1-PgE431-2.htm
2007-03-01	HON. MARK UDALL;Mr. UDALL	INTRODUCING A CONCURRENT RESOLUTION HONORING THE 50TH ANNIVERSARY OF THE INTERNATIONAL GEOPHYSICAL YEAR (IGY); Congressional Record Vol. 153, No. 35	https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-3	https://www.congress.gov/117/crec/2007/03/01/modified/CREC-2007-03-01-pt1-PgE431-3.htm
2007-03-01	HON. JAMES R. LANGEVIN;Mr. LANGEVIN	BIOSURVEILLANCE ENHANCEMENT ACT OF 2007; Congressional Record Vol. 153, No. 35	https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-4	https://www.congress.gov/117/crec/2007/03/01/modified/CREC-2007-03-01-pt1-PgE431-4.htm
2007-03-01	HON. JIM COSTA;Mr. COSTA	A TRIBUTE TO THE LIFE OF MRS. VERNA DUTY; Congressional Record Vol. 153, No. 35	https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-5	https://www.congress.gov/117/crec/2007/03/01/modified/CREC-2007-03-01-pt1-PgE431-5.htm
2007-03-01	HON. SAM GRAVES;Mr. GRAVES	RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT	https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-1	https://www.congress.gov/117/crec/2007/03/01/modified/CREC-2007-03-01-pt1-PgE431.htm

Text Cleaning

extract() cleans text by default, so you do not need to call clean_text() manually. Cleaning standardizes spacing, punctuation, and capitalization, which makes regex pattern matching more reliable.

Example of clean_text():

text <- "  HELLO---WORLD  "
clean_text(text)
#> [1] "hello world"
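
As a rough illustration of the kind of normalization described above, the base R sketch below lowercases text, replaces runs of punctuation with a space, and collapses whitespace. clean_text_sketch() is a hypothetical helper written for this example. It is not the package's clean_text() and may differ from it in edge cases.

```r
# Hypothetical helper: an approximation of the cleaning described above,
# not the package's clean_text() implementation.
clean_text_sketch <- function(x) {
  x <- tolower(x)                      # standardize capitalization
  x <- gsub("[[:punct:]]+", " ", x)    # replace runs of punctuation with a space
  x <- gsub("\\s+", " ", x)            # collapse repeated whitespace
  trimws(x)                            # drop leading and trailing spaces
}

clean_text_sketch("  HELLO---WORLD  ")
#> [1] "hello world"
clean_text_sketch("HON. SAM GRAVES;Mr. GRAVES")
#> [1] "hon sam graves mr graves"
```

Under this kind of cleaning, a lowercase pattern such as sam graves can match an uppercase, punctuated record without any extra flags.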

Basic Extraction

The simplest use of extract():

result <- extract(
  data = cr2007_03_01,
  regex_table = members,
  return_cols = c("text"),
  regex_return_cols = c("icpsr")
)

kable(head(result))

text	pattern	match	row_id	icpsr
HON. SAM GRAVES;Mr. GRAVES	samuel graves\|graves\|sam graves\|(^\|senator \|representative )graves\|graves, sam\|graves, samuel\|graves samuel\|graves, s\|representative graves	sam graves	1	20124
HON. MARK UDALL;Mr. UDALL	mark udall\|udall\|mark e udall\|udall\|(^\|senator \|representative )udall{1,4}co\|udall, mark\|udall mark\|udall, m\|representative udall{1,4}co\|m e udall	mark udall	2	29906
HON. JAMES R. LANGEVIN;Mr. LANGEVIN	james langevin\|langevin\|james r langevin\|jim langevin\|jim r langevin\|(^\|senator \|representative )langevin\|langevin, jim\|langevin, james\|langevin james\|langevin, j\|representative langevin\|j r langevin	james r langevin	3	20136
HON. JIM COSTA;Mr. COSTA	jim costa\|costa\|james costa\|(^\|senator \|representative )costa\|costa, james\|costa, jim\|costa jim\|costa, j\|representative costa	jim costa	4	20501
HON. SAM GRAVES;Mr. GRAVES	samuel graves\|graves\|sam graves\|(^\|senator \|representative )graves\|graves, sam\|graves, samuel\|graves samuel\|graves, s\|representative graves	sam graves	5	20124

Explanation:

data: the text dataset to search.
col_name: the column that contains the text (not set in the call above, so its default applies).
regex_table: the lookup table of patterns.
return_cols: columns from data to include in the result.
regex_return_cols: additional columns from the pattern table to attach.

Each row in the output corresponds to a detected match and includes both the original text and the matching pattern.

Advanced Usage

extract() can also filter data by date, remove acronyms (all-uppercase patterns with 2+ characters), and select specific output columns. This is useful for more controlled extraction.

Explanation:

date_col, date_start, date_end: filter rows by date.
remove_acronyms: skip patterns like “NASA” or “USA”.

You can combine these filters with any subset of columns for flexible outputs.
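
A call combining these options might look like the following. The argument names are the ones described in this document, but the argument values are assumptions: the Date objects passed to date_start and date_end, the "date" column name, and remove_acronyms = TRUE have not been checked against the package, so consult ?extract for the accepted types and defaults. The call is wrapped in a requireNamespace() check so that the snippet does nothing when regextable is not installed.

```r
# Sketch only: argument values below are assumptions, not documented defaults.
if (requireNamespace("regextable", quietly = TRUE)) {
  data("members", package = "regextable")
  data("cr2007_03_01", package = "regextable")

  result_filtered <- regextable::extract(
    data = cr2007_03_01,
    regex_table = members,
    date_col = "date",                    # column holding each record's date
    date_start = as.Date("2007-03-01"),   # assumed to accept Date values
    date_end = as.Date("2007-03-01"),
    remove_acronyms = TRUE,               # skip all-uppercase patterns
    return_cols = c("date", "text"),
    regex_return_cols = c("icpsr")
  )

  head(result_filtered)
}
```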

How the Matching Works

Internally, extract() processes each row of text against the regex patterns in order:

Start with all text rows.
Check each regex pattern against unmatched rows.
When a match is found, remove that row from further checks.
Continue until all patterns have been applied.

This approach improves performance on large datasets because rows stop being checked once a match is found.
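
The steps above can be sketched in base R. This is an illustration under stated assumptions, not the package's source: match_first_sketch() and the toy inputs are hypothetical, the text is assumed to be cleaned already, and the sketch omits the joining of extra columns that extract() performs.

```r
# Toy inputs (hypothetical, already lowercased and stripped of punctuation)
texts <- c(
  "hon sam graves mr graves",
  "hon jim costa mr costa",
  "no member named here"
)
patterns <- c("sam graves|graves", "jim costa|costa", "graves, sam")

# Hypothetical helper illustrating the "stop after the first match" loop
match_first_sketch <- function(texts, patterns) {
  result <- data.frame(
    row_id = seq_along(texts),
    pattern = NA_character_,
    match = NA_character_,
    stringsAsFactors = FALSE
  )
  unmatched <- seq_along(texts)                # start with all text rows
  for (p in patterns) {
    if (length(unmatched) == 0) break          # nothing left to check
    hit <- regexpr(p, texts[unmatched], perl = TRUE)
    found <- hit > 0
    if (any(found)) {
      rows <- unmatched[found]
      result$pattern[rows] <- p
      # regmatches() returns the matched text for the rows that matched
      result$match[rows] <- regmatches(texts[unmatched], hit)
      unmatched <- unmatched[!found]           # remove matched rows from further checks
    }
  }
  result[!is.na(result$pattern), ]             # keep only rows with a match
}

res <- match_first_sketch(texts, patterns)
print(res)
```

Because the set of unmatched rows shrinks after each pattern, later patterns are tested against fewer rows. In this sketch the third pattern is only checked against the one row that is still unmatched.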

Summary

regextable extracts regex pattern matches from text using a pattern lookup table.
Use the included datasets to get started or supply your own lookup tables.
By default, extract() cleans the text and stops checking a row once it has matched.
Optional parameters control date filtering, acronym removal, and which columns are returned.

Shirlyn Dong