Find U.S. legislator names in messy text with typos and inconsistent name formats
Install this package with:
devtools::install_github("judgelord/legislators")
Data
This package relies on a dataframe of permutations of the names of
members of Congress. This dataframe builds on the basic structure of
voteview.com data, especially the bioname field. From this
and other corrections, it constructs a regular expression search
pattern and the conditions under which this pattern should
yield a match (e.g., when that pattern has a unique match to a member of
Congress in a given Congress). The pattern differs from
Congress to Congress because some members move from the House to the
Senate and because members with similar names join or leave Congress.
Users can customize the provided members data and supply
their updated version to extractMemberName().
dim(members)
#> [1] 51053 11
head(members)

| chamber | congress | bioname | pattern | icpsr | state | state_abbrev | district_code | bioguide_id | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|
| House | 119 | ROGERS, Mike Dennis | mike rogers|mike dennis rogers|rogers|mike d rogers|michael rogers|michael dennis rogers|michael d rogers|(^|senator |representative )rogers{1,4}al|rogers, michael|rogers, mike|rogers mike|rogers, mrepresentative rogers{1,4}al|m d rogers | 20301 | 1 | AL | 3 | R000575 | Mike | ROGERS |
| House | 119 | SEWELL, Terri | terri sewell|sewell|terri a sewell|sewell|(^|senator |representative )sewellsewell, terri|sewell terri|sewell, trepresentative sewellt a sewell | 21102 | 1 | AL | 7 | S001185 | Terri | SEWELL |
| House | 119 | PALMER, Gary James | gary palmer|gary james palmer|palmer|gary j palmer|palmer|(^|senator |representative )palmerpalmer, gary|palmer gary|palmer, grepresentative palmerg j palmer | 21500 | 1 | AL | 6 | P000609 | Gary | PALMER |
| House | 119 | MOORE, Barry | barry moore|moore.{1,4}al|(^|senator |representative )moore{1,4}al|moore, barry|moore barry|representative moore{1,4}al | 22140 | 1 | AL | 1 | M001212 | Barry | MOORE |
| House | 119 | STRONG, Dale | dale strong|strong|strong|(^|senator |representative )strongstrong, dale|strong dale|strong, drepresentative strong | 22366 | 1 | AL | 5 | S001220 | Dale | STRONG |
| House | 119 | FIGURES, Shomari | shomari figures|figures|figures|(^|senator |representative )figuresfigures, shomari|figures shomari|figures, srepresentative figures | 22515 | 1 | AL | 2 | F000481 | Shomari | FIGURES |
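As a rough illustration of how such patterns behave, the sketch below tests a string against two simplified patterns using base R. It is not the package's implementation: `toy_members` and `text` are hypothetical names, the patterns are trimmed versions of those shown above, and the uniqueness condition is left out.

```r
# Toy stand-in for two rows of the members data (patterns simplified)
toy_members <- data.frame(
  congress = c(119, 119),
  bioname  = c("MOORE, Barry", "SEWELL, Terri"),
  pattern  = c("barry moore|moore.{1,4}al", "terri sewell|sewell, terri"),
  icpsr    = c(22140, 21102),
  stringsAsFactors = FALSE
)

text <- "letter signed by rep. moore (al) and terri sewell"

# Test each member's pattern against the (already lower-cased) text
hits <- vapply(toy_members$pattern, grepl, logical(1),
               x = text, perl = TRUE, USE.NAMES = FALSE)
matched_icpsr <- toy_members$icpsr[hits]

# Customizing: append a hypothetical extra variant to one member's pattern
toy_members$pattern[2] <- paste(toy_members$pattern[2],
                                "sewell.{1,4}al", sep = "|")
```

A customized data frame like this one would then be passed through the `members` argument of extractMemberName().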
Before searching the text, several functions clean it and “fix”
common human typos and OCR errors that frustrate matching. Some of these
corrections are currently supplied by MemberNameTypos.R. In
future versions, typos will be supplied as a dataframe
instead, and all types of corrections (cleaning, typos, OCR errors) will
be optional. Additionally, users will be able to customize the
typos dataframe and provide it as an argument to
extractMemberName().
| typos | correct |
|---|---|
| ( 0 | 0, ) | o |
| aaron( | [a-z]* )s chock($| |,|;)|s chock, aaron | aaron schock |
| adam( | [a-z]* )(schif|sdxiff)($| |,|;)|(schif|sdxiff), adam | adam schiff |
| adrian( | [a-z]* )espaillat|espaillat, adrian|adriano( | [a-z]* )espaillet($| |,|;)|espaillet, adriano | adriano espaillat |
| al( | [a-z]* )fianken($| |,|;)|fianken, al | al franken |
| (alccc|ateee)( | [a-z]* )hastings|hastings, (alccc|ateee) | alcee hastings |
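The way a typos table like this can be applied is sketched below with base R. This is an assumption about the general approach, not the code in MemberNameTypos.R; `fix_typos()` is a hypothetical helper, and the two rows are copied from the table above.

```r
# Two rows from the table above, as a small data frame
typos <- data.frame(
  typos = c("al( | [a-z]* )fianken($| |,|;)|fianken, al",
            "adam( | [a-z]* )(schif|sdxiff)($| |,|;)|(schif|sdxiff), adam"),
  correct = c("al franken", "adam schiff"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply each typo pattern in turn
fix_typos <- function(text, typos) {
  for (i in seq_len(nrow(typos))) {
    text <- gsub(typos$typos[i], typos$correct[i], text, perl = TRUE)
  }
  text
}

fixed <- fix_typos(c("statement by al fianken", "remarks of adam sdxiff"), typos)
```

In this sketch, a trailing delimiter matched by the pattern (a space, comma, or semicolon) is consumed by the replacement, so real use would need to account for that.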
Basic Usage
The main function, extractMemberName(), returns a
dataframe of the names and ICPSR ID numbers of members of Congress detected in a
supplied vector of text.
- In the future, extractMemberName() may default to returning a list of dataframes the same length as the supplied data.
For example, we can use extractMemberName() to detect
the names of members of Congress in the text of the Congressional
Record. Let’s start with the text of the Congressional Record from
3/1/2007, scraped and parsed using methods described here.
| date | speaker | header | url | congress |
|---|---|---|---|---|
| 2007-03-01 | HON. SAM GRAVES;Mr. GRAVES | RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT; Congressional Record Vol. 153, No. 35 | https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-2 | 110 |
| 2007-03-01 | HON. MARK UDALL;Mr. UDALL | INTRODUCING A CONCURRENT RESOLUTION HONORING THE 50TH ANNIVERSARY OF THE INTERNATIONAL GEOPHYSICAL YEAR (IGY); Congressional Record Vol. 153, No. 35 | https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-3 | 110 |
| 2007-03-01 | HON. JAMES R. LANGEVIN;Mr. LANGEVIN | BIOSURVEILLANCE ENHANCEMENT ACT OF 2007; Congressional Record Vol. 153, No. 35 | https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-4 | 110 |
| 2007-03-01 | HON. JIM COSTA;Mr. COSTA | A TRIBUTE TO THE LIFE OF MRS. VERNA DUTY; Congressional Record Vol. 153, No. 35 | https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-5 | 110 |
| 2007-03-01 | HON. SAM GRAVES;Mr. GRAVES | RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT | https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-1 | 110 |
| 2007-03-01 | HON. SANFORD D. BISHOP;Mr. BISHOP | IN HONOR OF SYNOVUS BEING NAMED ONE OF THE BEST COMPANIES IN AMERICA; Congressional Record Vol. 153, No. 35 | https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E432-2 | 110 |
This is an extremely simple example because the text strings
containing the names of the members of Congress (speaker)
are short and do not contain much other text. However,
extractMemberName() is also capable of searching longer and
much messier texts, including text where names are not consistently
formatted or where they contain common typos introduced by humans or
common OCR errors. Indeed, these functions were developed to identify
members of Congress in ugly text data like this.
To better match member names, this function currently requires either:
- a column “congress” (this can be created from a date) or
- a vector of congresses to limit the search to (congresses)
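One way to create a "congress" column from a date is sketched below. `congress_from_date()` is a hypothetical helper, not part of this package, and it assumes the modern calendar in which a new Congress convenes on January 3 of each odd-numbered year (so it is not reliable for dates before 1935).

```r
# Hypothetical helper: Congress number for a given date
congress_from_date <- function(date) {
  date <- as.Date(date)
  year <- as.integer(format(date, "%Y"))
  # January 1-2 still belong to the Congress that covers the previous year
  before_jan3 <- format(date, "%m-%d") < "01-03"
  year <- ifelse(before_jan3, year - 1L, year)
  # The 1st Congress began in 1789; each Congress spans two years
  (year - 1789L) %/% 2L + 1L
}

congress_from_date("2007-03-01")
#> [1] 110
```

Applied to a data frame, this would look like `d$congress <- congress_from_date(d$date)`.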
extractMemberName()
cr2007_03_01$congress <- 110
# extract legislator names and match to voteview ICPSR numbers
cr <- extractMemberName(data = cr2007_03_01,
                        col_name = "speaker", # The text strings to search
                        congress = "congress", # This argument is not required in this case because the data contain a "congress" column
                        members = members # Member names augmented from voteview come with this package, but users can also supply a customized data frame
                        )
#> Fixing typos...
#> Searching data for members of the 110th congress, n = 154 (123 distinct strings).
head(cr)

| data_id | icpsr | bioname | last_name | first_name | congress | chamber | state_abbrev | district_code | date | speaker | header | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 20124 | GRAVES, Samuel | GRAVES | Samuel | 110 | House | MO | 6 | 2007-03-01 | hon sam graves;mr graves | RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT; Congressional Record Vol. 153, No. 35 | https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-2 |
| 2 | 29906 | UDALL, Mark | UDALL | Mark | 110 | House | CO | 2 | 2007-03-01 | hon mark udall;mr udall | INTRODUCING A CONCURRENT RESOLUTION HONORING THE 50TH ANNIVERSARY OF THE INTERNATIONAL GEOPHYSICAL YEAR (IGY); Congressional Record Vol. 153, No. 35 | https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-3 |
| 3 | 20136 | LANGEVIN, James | LANGEVIN | James | 110 | House | RI | 2 | 2007-03-01 | hon james r langevin;mr langevin | BIOSURVEILLANCE ENHANCEMENT ACT OF 2007; Congressional Record Vol. 153, No. 35 | https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-4 |
| 4 | 20501 | COSTA, Jim | COSTA | Jim | 110 | House | CA | 20 | 2007-03-01 | hon jim costa;mr costa | A TRIBUTE TO THE LIFE OF MRS. VERNA DUTY; Congressional Record Vol. 153, No. 35 | https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-5 |
| 5 | 20124 | GRAVES, Samuel | GRAVES | Samuel | 110 | House | MO | 6 | 2007-03-01 | hon sam graves;mr graves | RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT | https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-1 |
| 6 | 29339 | BISHOP, Sanford Dixon, Jr. | BISHOP | Sanford | 110 | House | GA | 2 | 2007-03-01 | hon sanford d bishop;mr bishop | IN HONOR OF SYNOVUS BEING NAMED ONE OF THE BEST COMPANIES IN AMERICA; Congressional Record Vol. 153, No. 35 | https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E432-2 |
In this example, all observations are in the 110th Congress, so we
only search for members who served in the 110th. Because each row’s
speaker text contains only one member in this case,
data_row_id and match_id are the same. Where
multiple members are detected, there may be multiple matches per
data_row_id.
Because extractMemberName() links each detected name to
ICPSR IDs from voteview.com, we already have some information, such as
state and district, for each legislator detected in the text (scroll all
the way to the right).
Other data from voteview.com and other sources can be merged in on
icpsr.
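A minimal sketch of such a merge, using base R's merge() on toy data: `cr_toy` stands in for two rows of the output above, and `extra` is a hypothetical table of additional member-level data (here, voteview-style party codes).

```r
# Toy stand-in for two rows of extractMemberName() output
cr_toy <- data.frame(
  data_id = c(1, 2),
  icpsr   = c(20124, 29906),
  bioname = c("GRAVES, Samuel", "UDALL, Mark"),
  stringsAsFactors = FALSE
)

# Hypothetical extra data keyed by ICPSR ID
# (voteview-style party codes: 200 = Republican, 100 = Democrat)
extra <- data.frame(
  icpsr      = c(20124, 29906),
  party_code = c(200, 100),
  stringsAsFactors = FALSE
)

# Left join: keep every detected name, add columns where the ID matches
merged <- merge(cr_toy, extra, by = "icpsr", all.x = TRUE)
```

If the other data have one row per member per Congress, merging on both icpsr and congress avoids duplicated rows.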
