Uses a regex lookup table to extract pattern matches.
Usage
extract(
data,
col_name = "text",
regex_table,
pattern_col = "pattern",
data_return_cols = NULL,
regex_return_cols = NULL,
date_col = NULL,
date_start = NULL,
date_end = NULL,
remove_acronyms = FALSE,
do_clean_text = TRUE,
verbose = TRUE,
unique_match = FALSE,
cl = NULL,
use_ner = FALSE,
ner_entity_types = c("ORG")
)
Arguments
- data
A data frame or character vector containing the text to search.
- col_name
Name of the column in data that contains the text to search.
- regex_table
A regex lookup table with a pattern column.
- pattern_col
Name of the regex pattern column in regex_table.
- data_return_cols
Optional vector of column names to include from 'data'.
- regex_return_cols
Optional vector of column names to include from 'regex_table'.
- date_col
Optional column in 'data' for date filtering.
- date_start
Optional start date for filtering 'data'.
- date_end
Optional end date for filtering 'data'.
- remove_acronyms
Logical; if TRUE, removes all-uppercase patterns from regex_table.
- do_clean_text
Logical; if TRUE, applies basic text cleaning to the input before matching.
- verbose
Logical; if TRUE, displays progress messages.
- unique_match
Logical; if TRUE, stops searching a row after its first match, so each row yields at most one match. If FALSE, returns all matches for all patterns.
- cl
A cluster object created by parallel::makeCluster(), or an integer indicating the number of child processes (integer values are ignored on Windows) for parallel evaluation. Passed to pbapply::pblapply().
- use_ner
Logical; if TRUE, uses the 'spacyr' package to validate that matches are actual Named Entities (e.g., organizations). Requires 'spacyr' to be installed and initialized.
- ner_entity_types
Character vector; the types of Named Entities to keep if use_ner is TRUE. Default is "ORG".
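The remove_acronyms and date arguments both narrow the inputs before any matching happens. The base R sketch below illustrates that idea only. The data, the column name published, the rule "a pattern made up solely of capital letters A-Z", and the inclusive date window are all assumptions made for illustration; the function's actual rules may differ.

```r
# Hypothetical inputs; the column names are illustrative only.
docs <- data.frame(
  text = c("NASA launched a probe", "Apples are tasty", "The WHO met today"),
  published = as.Date(c("2023-01-15", "2023-06-01", "2024-02-10")),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  pattern = c("NASA", "apples", "WHO", "probe"),
  stringsAsFactors = FALSE
)

# Assumed acronym rule: the pattern consists only of capital letters A-Z.
# perl = TRUE keeps the [A-Z] range independent of the locale.
is_acronym <- grepl("^[A-Z]+$", lookup$pattern, perl = TRUE)
lookup_no_acronyms <- lookup[!is_acronym, , drop = FALSE]

# Assumed date rule: keep rows inside an inclusive [start, end] window.
date_start <- as.Date("2023-01-01")
date_end <- as.Date("2023-12-31")
in_window <- docs$published >= date_start & docs$published <= date_end
docs_in_window <- docs[in_window, , drop = FALSE]

print(lookup_no_acronyms$pattern)
print(docs_in_window$text)
```

Filtering first keeps the number of pattern-by-row comparisons down, which matters when the lookup table is large.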
Value
A tibble (data frame) with columns:
- row_id
Integer row identifier corresponding to the input data.
- Additional columns from data, if data_return_cols is specified.
- Additional columns from regex_table, if regex_return_cols is specified.
- pattern
The matched regex pattern(s).
- match
The specific text extracted from the data (original casing preserved).
Details
Pattern matching is performed using R's regular expression engine and is
case-insensitive by default. For each input row, the function checks patterns
in regex_table and returns matches based on the unique_match parameter.
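To make the matching behaviour described above concrete, here is a minimal base R sketch of case-insensitive lookup-table matching that preserves the original casing of the matched text. The helper name match_lookup is hypothetical, and the loop is a simplified stand-in, not the package's implementation; it assumes patterns are tried in table order and ignores text cleaning, NER, and parallelism.

```r
# Hypothetical helper for illustration; not the package's code.
match_lookup <- function(text, patterns, unique_match = FALSE) {
  rows <- list()
  for (i in seq_along(text)) {
    for (p in patterns) {
      # Case-insensitive search for the first occurrence of pattern p.
      hit <- regexpr(p, text[i], ignore.case = TRUE, perl = TRUE)
      if (hit != -1) {
        rows[[length(rows) + 1]] <- data.frame(
          row_id = i,
          pattern = p,
          # regmatches() returns the text as written, so casing is preserved.
          match = regmatches(text[i], hit),
          stringsAsFactors = FALSE
        )
        # With unique_match, stop checking patterns for this row.
        if (unique_match) break
      }
    }
  }
  if (length(rows) == 0) {
    return(data.frame(row_id = integer(0), pattern = character(0),
                      match = character(0), stringsAsFactors = FALSE))
  }
  do.call(rbind, rows)
}

texts <- c("I love apples", "Bananas are great", "Oranges and apples")
pats <- c("apples", "bananas", "oranges")

all_matches <- match_lookup(texts, pats)
one_per_row <- match_lookup(texts, pats, unique_match = TRUE)
print(all_matches)
print(one_per_row)
```

In this sketch the third text yields two rows when unique_match is FALSE and only the first-found pattern when it is TRUE.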
Examples
# Create sample data
data <- data.frame(
id = 1:3,
text = c("I love apples", "Bananas are great", "Oranges and apples"),
stringsAsFactors = FALSE
)
# Create regex patterns
patterns <- data.frame(
pattern = c("apples", "bananas", "oranges"),
category = c("fruit", "fruit", "fruit")
)
# Extract all matches
extract(data, "text", patterns)
#> Scanning 3 patterns against 3 text entries...
#> Number of rows with matches: 4
#> # A tibble: 4 × 3
#> row_id pattern match
#> <int> <chr> <chr>
#> 1 1 apples apples
#> 2 2 bananas Bananas
#> 3 3 apples apples
#> 4 3 oranges Oranges
# Extract one match per row
extract(data, "text", patterns, unique_match = TRUE)
#> Scanning: 3 patterns against 3 text entries...
#> Number of rows with matches: 3
#> # A tibble: 3 × 3
#> row_id pattern match
#> <int> <chr> <chr>
#> 1 1 apples apples
#> 2 2 bananas Bananas
#> 3 3 apples apples
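The cl argument takes a cluster created by parallel::makeCluster(). As a rough illustration of that pattern, the sketch below creates a two-worker cluster and spreads a simple case-insensitive check across it. parallel::parLapply() is used here as a stand-in for pbapply::pblapply(), which accepts a cluster through its own cl argument; the worker count and the check itself are arbitrary choices for this example.

```r
library(parallel)

texts <- c("I love apples", "Bananas are great", "Oranges and apples")

# Start two worker processes; the count is arbitrary for this sketch.
cl <- makeCluster(2)

# Each worker tests whether its text mentions "apples", ignoring case.
has_apples <- parLapply(cl, texts, function(x) {
  grepl("apples", x, ignore.case = TRUE)
})

# Always release the workers when done.
stopCluster(cl)

print(unlist(has_apples))
```

A cluster created this way would be passed as cl = cl in place of the default NULL; remember to call stopCluster() afterwards.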
