Find U.S. legislator names in messy text with typos and inconsistent name formats

Install this package with

devtools::install_github("judgelord/legislators")

Data

This package relies on a dataframe of permutations of the names of members of Congress. This dataframe builds on the basic structure of voteview.com data, especially the bioname field. From this and other corrections, it constructs a regular expression search pattern and conditions under which this pattern should yield a match (e.g., when that pattern has a unique match to a member of Congress in a given Congress). The pattern differs from Congress to Congress because some members move from the House to the Senate and because members with similar names join or leave Congress. Users can customize the provided members data and supply their updated version to extractMemberName().

data("members")

dim(members)
#> [1] 51053    11
head(members)
chamber congress bioname pattern icpsr state state_abbrev district_code bioguide_id first_name last_name
House 119 ROGERS, Mike Dennis mike rogers|mike dennis rogers|rogers|mike d rogers|michael rogers|michael dennis rogers|michael d rogers|(^|senator |representative )rogers{1,4}al|rogers, michael|rogers, mike|rogers mike|rogers, mrepresentative rogers{1,4}al|m d rogers 20301 1 AL 3 R000575 Mike ROGERS
House 119 SEWELL, Terri terri sewell|sewell|terri a sewell|sewell|(^|senator |representative )sewellsewell, terri|sewell terri|sewell, trepresentative sewellt a sewell 21102 1 AL 7 S001185 Terri SEWELL
House 119 PALMER, Gary James gary palmer|gary james palmer|palmer|gary j palmer|palmer|(^|senator |representative )palmerpalmer, gary|palmer gary|palmer, grepresentative palmerg j palmer 21500 1 AL 6 P000609 Gary PALMER
House 119 MOORE, Barry barry moore|moore.{1,4}al|(^|senator |representative )moore{1,4}al|moore, barry|moore barry|representative moore{1,4}al 22140 1 AL 1 M001212 Barry MOORE
House 119 STRONG, Dale dale strong|strong|strong|(^|senator |representative )strongstrong, dale|strong dale|strong, drepresentative strong 22366 1 AL 5 S001220 Dale STRONG
House 119 FIGURES, Shomari shomari figures|figures|figures|(^|senator |representative )figuresfigures, shomari|figures shomari|figures, srepresentative figures 22515 1 AL 2 F000481 Shomari FIGURES
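
To get a feel for how an entry in the pattern column behaves, the sketch below tests a simplified pattern against a few lowercased strings with base R's grepl(). The pattern is a shortened, hand-written excerpt modeled on the Barry Moore row above, and the example strings are made up for illustration; this is not necessarily how extractMemberName() applies patterns internally.

```r
# Simplified excerpt of a name pattern (an assumption for illustration,
# modeled on the Barry Moore row; not the full pattern from `members`)
pattern <- "barry moore|moore.{1,4}al|moore, barry"

# Made-up example strings
texts <- c(
  "Letter from Rep. Barry Moore",
  "rep moore (al) and others",
  "senator moore of kansas"
)

# Lowercase the text first, since the patterns are lowercase
matched <- grepl(pattern, tolower(texts))
matched
#> [1]  TRUE  TRUE FALSE
```

The second string matches only because the state abbreviation follows the surname within four characters, which is how a common surname can be tied to one member.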

Before searching the text, several functions clean it and “fix” common human typos and OCR errors that frustrate matching. Some of these corrections are currently supplied by MemberNameTypos.R. In future versions, typos will be supplied as a dataframe instead, and all types of corrections (cleaning, typos, OCR errors) will be optional. Additionally, users will be able to customize the typos dataframe and provide it as an argument to extractMemberName().

data("typos")

head(typos)
typos correct
( 0 | 0, ) o
aaron( | [a-z]* )s chock($| |,|;)|s chock, aaron aaron schock
adam( | [a-z]* )(schif|sdxiff)($| |,|;)|(schif|sdxiff), adam adam schiff
adrian( | [a-z]* )espaillat|espaillat, adrian|adriano( | [a-z]* )espaillet($| |,|;)|espaillet, adriano adriano espaillat
al( | [a-z]* )fianken($| |,|;)|fianken, al al franken
(alccc|ateee)( | [a-z]* )hastings|hastings, (alccc|ateee) alcee hastings
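
As a rough illustration of how a typos table like this could be applied, the sketch below loops over two of the rows shown above and replaces each match with its correction using base R's gsub(). The helper fix_typos_demo() and the input strings are hypothetical; the package's own cleaning functions may work differently.

```r
# Two rows copied from the typos table above
typos_demo <- data.frame(
  typos = c(
    "al( | [a-z]* )fianken($| |,|;)|fianken, al",
    "adam( | [a-z]* )(schif|sdxiff)($| |,|;)|(schif|sdxiff), adam"
  ),
  correct = c("al franken", "adam schiff"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lowercase the text, then apply each correction in turn
fix_typos_demo <- function(text, typos) {
  text <- tolower(text)
  for (i in seq_len(nrow(typos))) {
    text <- gsub(typos$typos[i], typos$correct[i], text)
  }
  text
}

fixed <- fix_typos_demo(c("Senator Al Fianken", "Rep. Adam B Sdxiff"), typos_demo)
fixed
#> [1] "senator al franken" "rep. adam schiff"
```

Note that in this naive version the replacement also consumes whatever delimiter the pattern matched after the name, so a real implementation would need to handle trailing spaces and punctuation.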

Basic Usage

The main function, extractMemberName(), returns a dataframe of the names and ICPSR ID numbers of members of Congress detected in a supplied vector of text.

  • in the future, extractMemberName() may default to returning a list of dataframes the same length as the supplied data

For example, we can use extractMemberName() to detect the names of members of Congress in the text of the Congressional Record. Let’s start with the text of the Congressional Record from March 1, 2007, scraped and parsed using methods described here.

data("cr2007_03_01")

head(cr2007_03_01)
date speaker header url congress
2007-03-01 HON. SAM GRAVES;Mr. GRAVES RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT; Congressional Record Vol. 153, No. 35 https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-2 110
2007-03-01 HON. MARK UDALL;Mr. UDALL INTRODUCING A CONCURRENT RESOLUTION HONORING THE 50TH ANNIVERSARY OF THE INTERNATIONAL GEOPHYSICAL YEAR (IGY); Congressional Record Vol. 153, No. 35 https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-3 110
2007-03-01 HON. JAMES R. LANGEVIN;Mr. LANGEVIN BIOSURVEILLANCE ENHANCEMENT ACT OF 2007; Congressional Record Vol. 153, No. 35 https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-4 110
2007-03-01 HON. JIM COSTA;Mr. COSTA A TRIBUTE TO THE LIFE OF MRS. VERNA DUTY; Congressional Record Vol. 153, No. 35 https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-5 110
2007-03-01 HON. SAM GRAVES;Mr. GRAVES RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-1 110
2007-03-01 HON. SANFORD D. BISHOP;Mr. BISHOP IN HONOR OF SYNOVUS BEING NAMED ONE OF THE BEST COMPANIES IN AMERICA; Congressional Record Vol. 153, No. 35 https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E432-2 110

This is an extremely simple example because the text strings containing the names of the members of Congress (speaker) are short and do not contain much other text. However, extractMemberName() is also capable of searching longer and much messier texts, including text where names are not consistently formatted or where they contain common typos introduced by humans or common OCR errors. Indeed, these functions were developed to identify members of Congress in ugly text data like this.

To better match member names, this function currently requires either:

  • a column “congress” (this can be created from a date) or
  • a vector of congresses to limit the search to (congresses)
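
If the data have a date but no congress column, one can be derived. The following is a minimal sketch, assuming each Congress begins on January 3 of an odd-numbered year (the convention since 1935); congress_from_date() is a hypothetical helper written for this example, not a function provided by the package.

```r
# Hypothetical helper: approximate the Congress number for a date.
# Assumes a January 3 start in odd-numbered years, so it is only
# appropriate for dates from 1935 onward.
congress_from_date <- function(date) {
  date  <- as.Date(date)
  year  <- as.integer(format(date, "%Y"))
  month <- as.integer(format(date, "%m"))
  day   <- as.integer(format(date, "%d"))
  # January 1-2 still belong to the Congress that was sitting the year before
  before_start <- month == 1L & day < 3L
  year <- ifelse(before_start, year - 1L, year)
  # The 1st Congress began in 1789; each Congress spans two years
  (year - 1789L) %/% 2L + 1L
}

congress_from_date(c("2007-03-01", "2007-01-02", "2009-01-05"))
#> [1] 110 109 111
```

The result can be assigned to a congress column on the data before calling extractMemberName().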

extractMemberName()

cr2007_03_01$congress <- 110

# extract legislator names and match to voteview ICPSR numbers
cr <- extractMemberName(data = cr2007_03_01, 
                        col_name = "speaker", # The text strings to search
                        congress = "congress", # This argument is not required in this case because the data contain a "congress" column
                        members = members # Member names augmented from voteview come with this package, but users can also supply a customized data frame
                        )
#> Fixing typos...
#> Searching data for members of the 110th congress, n = 154 (123 distinct strings).
head(cr)
data_id icpsr bioname last_name first_name congress chamber state_abbrev district_code date speaker header url
1 20124 GRAVES, Samuel GRAVES Samuel 110 House MO 6 2007-03-01 hon sam graves;mr graves RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT; Congressional Record Vol. 153, No. 35 https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-2
2 29906 UDALL, Mark UDALL Mark 110 House CO 2 2007-03-01 hon mark udall;mr udall INTRODUCING A CONCURRENT RESOLUTION HONORING THE 50TH ANNIVERSARY OF THE INTERNATIONAL GEOPHYSICAL YEAR (IGY); Congressional Record Vol. 153, No. 35 https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-3
3 20136 LANGEVIN, James LANGEVIN James 110 House RI 2 2007-03-01 hon james r langevin;mr langevin BIOSURVEILLANCE ENHANCEMENT ACT OF 2007; Congressional Record Vol. 153, No. 35 https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-4
4 20501 COSTA, Jim COSTA Jim 110 House CA 20 2007-03-01 hon jim costa;mr costa A TRIBUTE TO THE LIFE OF MRS. VERNA DUTY; Congressional Record Vol. 153, No. 35 https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-5
5 20124 GRAVES, Samuel GRAVES Samuel 110 House MO 6 2007-03-01 hon sam graves;mr graves RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E431-1
6 29339 BISHOP, Sanford Dixon, Jr.  BISHOP Sanford 110 House GA 2 2007-03-01 hon sanford d bishop;mr bishop IN HONOR OF SYNOVUS BEING NAMED ONE OF THE BEST COMPANIES IN AMERICA; Congressional Record Vol. 153, No. 35 https://www.congress.gov/congressional-record/2007/03/01/extensions-of-remarks-section/article/E432-2

In this example, all observations are in the 110th Congress, so we only search for members who served in the 110th. Because each row’s speaker text contains only one member in this case, data_row_id and match_id are the same. Where multiple members are detected, there may be multiple matches per data_row_id.

Because extractMemberName() links each detected name to ICPSR IDs from voteview.com, we already have some information about each legislator detected in the text, such as state (state_abbrev) and district (district_code).

Other data from voteview.com and other sources can be merged in on icpsr.
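
A merge on icpsr might look like the sketch below, which uses base R's merge(). Both data frames are small stand-ins built by hand: detected mimics a few columns of the output above, and extra is a toy table with a party_code column in the style of voteview.com member data. Merging on congress as well as icpsr is an assumption made here because voteview.com member data have one row per member per Congress.

```r
# Stand-in for a few columns of the extractMemberName() output shown above
detected <- data.frame(
  data_id  = 1:3,
  icpsr    = c(20124, 29906, 20124),
  congress = 110
)

# Toy stand-in for additional member-level data keyed by icpsr and congress
extra <- data.frame(
  icpsr      = c(20124, 29906),
  congress   = 110,
  party_code = c(200, 100)
)

# Left join: keep every detected name, adding party_code where available
merged <- merge(detected, extra, by = c("icpsr", "congress"), all.x = TRUE)

# merge() sorts by the key columns, so restore the original row order
merged <- merged[order(merged$data_id), ]
merged
```

Using all.x = TRUE keeps rows whose icpsr has no counterpart in the second table, with NA in the added columns.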