rvest
Devon Cantwell had a fun idea.
Anyone interested in a virtual happy hour for grad students at APSA in a few weeks? I can organize a few games and we can keep it to like an hour or so! You wouldn’t have to be presenting or registered to participate either!
— Devon Cantwell (@devon_cantwell) August 22, 2020
This reminded me of playing Pictionary with my research lab.
After lab meetings, we play https://t.co/ILwcWbPhT9 Pictionary 🧑🎨🖼️ using words from our data:
— Devin Judge-Lord (@JudgeLord) June 22, 2020
d %>%
  unnest_tokens(word, text) %>% # tidytext!
  inner_join(get_sentiments("nrc")) %>%
  count(word) %>%
  slice_sample(n = 100, weight_by = n) %>%
  .$word %>%
  str_c(collapse = ",")
Scraping websites with rvest is easy! (I’ll also use tidyverse and magrittr functions here.)
A fun minimal example is the UN website:
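library(rvest)
library(tidyverse) # tibble, stringr, and dplyr functions used below
library(magrittr)  # for the %<>% assignment pipe used below
html <- read_html("https://UN.org") # The UN homepage
links <- html_nodes(html, "a") # "a" nodes are linked text
html_text(links)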
## [1] "مرحبا بكم في موقع الأمم المتحدة"
## [2] "欢迎来到联合国网站"
## [3] "Welcome to the United Nations website"
## [4] "Bienvenue sur le site Internet des Nations Unies"
## [5] "Добро пожаловать в ООН!"
## [6] "Bienvenido al sitio web de las Naciones Unidas"
## [7] "\r\n عربي\r\n "
## [8] "\r\n 中文\r\n "
## [9] "\r\n English\r\n "
## [10] "\r\n Français\r\n "
## [11] "\r\n Русский\r\n "
## [12] "\r\n Español\r\n "
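The same steps can be tried offline on a small HTML string. This is just a minimal sketch: the page content below is made up and only stands in for a real site. It also shows two things the UN example hints at: `trim = TRUE` strips the padding whitespace around linked text, and `html_attr` pulls out where each link points.

```r
library(rvest)

# A made-up page standing in for a real website (hypothetical content)
page <- read_html('
  <html><body>
    <a href="/en/">
        English
    </a>
    <a href="/de/">
        Deutsch
    </a>
  </body></html>')

links <- html_nodes(page, "a")             # select every <a> node
link_text <- html_text(links, trim = TRUE) # linked text, whitespace trimmed
link_urls <- html_attr(links, "href")      # the target of each link

link_text
link_urls
```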
APSA complicates things by making us select a timezone:
url <- "https://convention2.allacademic.com/one/apsa/apsa20/"
read_html(url) %>% html_text()
## [1] "APSA Annual Meeting & Exhibition 2020\n<!--\nform, ul, dl, dt, dd, li, h1, h2, h3, h4, h5, h6 { margin: 0; padding: 0; }\n\n\n-->\nSet your timezoneThe online program contains content that is remote in nature. Setting your timezone will allow for localization of the times.To change your timezone later, select \"Change Preferences\" from the \"Navigation Menu\" on the left side of the online program home page.Please select your timezone:-- Please select a timezone --[UTC+00:00] - Africa/Abidjan[UTC+00:00] - Africa/Accra[UTC+03:00] - Africa/Addis_Ababa[UTC+01:00] - Africa/Algiers[UTC+03:00] - Africa/Asmara[UTC+00:00] - Africa/Bamako[UTC+01:00] - Africa/Bangui[UTC+00:00] - Africa/Banjul[UTC+00:00] - Africa/Bissau[UTC+02:00] - Africa/Blantyre[UTC+01:00] - Africa/Brazzaville[UTC+02:00] - Africa/Bujumbura[UTC+02:00] - Africa/Cairo[UTC+01:00] - Africa/Casablanca[UTC+02:00] - Africa/Ceuta[UTC+00:00] - Africa/Conakry[UTC+00:00] - Africa/Dakar[UTC+03:00] - Africa/Dar_es_Salaam[UTC+03:00] - Africa/Djibouti[UTC+01:00] - Africa/Douala[UTC+01:00] - Africa/El_Aaiun[UTC+00:00] - Africa/Freetown[UTC+02:00] - Africa/Gaborone[UTC+02:00] - Africa/Harare[UTC+02:00] - Africa/Johannesburg[UTC+03:00] - Africa/Juba[UTC+03:00] - Africa/Kampala[UTC+02:00] - Africa/Khartoum[UTC+02:00] - Africa/Kigali[UTC+01:00] - Africa/Kinshasa[UTC+01:00] - Africa/Lagos[UTC+01:00] - Africa/Libreville[UTC+00:00] - Africa/Lome[UTC+01:00] - Africa/Luanda[UTC+02:00] - Africa/Lubumbashi[UTC+02:00] - Africa/Lusaka[UTC+01:00] - Africa/Malabo[UTC+02:00] - Africa/Maputo[UTC+02:00] - Africa/Maseru[UTC+02:00] - Africa/Mbabane[UTC+03:00] - Africa/Mogadishu[UTC+00:00] - Africa/Monrovia[UTC+03:00] - Africa/Nairobi[UTC+01:00] - Africa/Ndjamena[UTC+01:00] - Africa/Niamey[UTC+00:00] - Africa/Nouakchott[UTC+00:00] - Africa/Ouagadougou[UTC+01:00] - Africa/Porto-Novo[UTC+00:00] - Africa/Sao_Tome[UTC+02:00] - Africa/Tripoli[UTC+01:00] - Africa/Tunis[UTC+02:00] - Africa/Windhoek[UTC-09:00] - America/Adak[UTC-08:00] - 
America/Anchorage[UTC-04:00] - America/Anguilla[UTC-04:00] - America/Antigua[UTC-03:00] - America/Araguaina[UTC-03:00] - America/Argentina/Buenos_Aires[UTC-03:00] - America/Argentina/Catamarca[UTC-03:00] - America/Argentina/Cordoba[UTC-03:00] - America/Argentina/Jujuy[UTC-03:00] - America/Argentina/La_Rioja[UTC-03:00] - America/Argentina/Mendoza[UTC-03:00] - America/Argentina/Rio_Gallegos[UTC-03:00] - America/Argentina/Salta[UTC-03:00] - America/Argentina/San_Juan[UTC-03:00] - America/Argentina/San_Luis[UTC-03:00] - America/Argentina/Tucuman[UTC-03:00] - America/Argentina/Ushuaia[UTC-04:00] - America/Aruba[UTC-04:00] - America/Asuncion[UTC-05:00] - America/Atikokan[UTC-03:00] - America/Bahia[UTC-05:00] - America/Bahia_Banderas[UTC-04:00] - America/Barbados[UTC-03:00] - America/Belem[UTC-06:00] - America/Belize[UTC-04:00] - America/Blanc-Sablon[UTC-04:00] - America/Boa_Vista[UTC-05:00] - America/Bogota[UTC-06:00] - America/Boise[UTC-06:00] - America/Cambridge_Bay[UTC-04:00] - America/Campo_Grande[UTC-05:00] - America/Cancun[UTC-04:00] - America/Caracas[UTC-03:00] - America/Cayenne[UTC-05:00] - America/Cayman[UTC-05:00] - America/Chicago[UTC-06:00] - America/Chihuahua[UTC-06:00] - America/Costa_Rica[UTC-07:00] - America/Creston[UTC-04:00] - America/Cuiaba[UTC-04:00] - America/Curacao[UTC+00:00] - America/Danmarkshavn[UTC-07:00] - America/Dawson[UTC-07:00] - America/Dawson_Creek[UTC-06:00] - America/Denver[UTC-04:00] - America/Detroit[UTC-04:00] - America/Dominica[UTC-06:00] - America/Edmonton[UTC-05:00] - America/Eirunepe[UTC-06:00] - America/El_Salvador[UTC-07:00] - America/Fort_Nelson[UTC-03:00] - America/Fortaleza[UTC-03:00] - America/Glace_Bay[UTC-03:00] - America/Goose_Bay[UTC-04:00] - America/Grand_Turk[UTC-04:00] - America/Grenada[UTC-04:00] - America/Guadeloupe[UTC-06:00] - America/Guatemala[UTC-05:00] - America/Guayaquil[UTC-04:00] - America/Guyana[UTC-03:00] - America/Halifax[UTC-04:00] - America/Havana[UTC-07:00] - America/Hermosillo[UTC-04:00] - 
America/Indiana/Indianapolis[UTC-05:00] - America/Indiana/Knox[UTC-04:00] - America/Indiana/Marengo[UTC-04:00] - America/Indiana/Petersburg[UTC-05:00] - America/Indiana/Tell_City[UTC-04:00] - America/Indiana/Vevay[UTC-04:00] - America/Indiana/Vincennes[UTC-04:00] - America/Indiana/Winamac[UTC-06:00] - America/Inuvik[UTC-04:00] - America/Iqaluit[UTC-05:00] - America/Jamaica[UTC-08:00] - America/Juneau[UTC-04:00] - America/Kentucky/Louisville[UTC-04:00] - America/Kentucky/Monticello[UTC-04:00] - America/Kralendijk[UTC-04:00] - America/La_Paz[UTC-05:00] - America/Lima[UTC-07:00] - America/Los_Angeles[UTC-04:00] - America/Lower_Princes[UTC-03:00] - America/Maceio[UTC-06:00] - America/Managua[UTC-04:00] - America/Manaus[UTC-04:00] - America/Marigot[UTC-04:00] - America/Martinique[UTC-05:00] - America/Matamoros[UTC-06:00] - America/Mazatlan[UTC-05:00] - America/Menominee[UTC-05:00] - America/Merida[UTC-08:00] - America/Metlakatla[UTC-05:00] - America/Mexico_City[UTC-02:00] - America/Miquelon[UTC-03:00] - America/Moncton[UTC-05:00] - America/Monterrey[UTC-03:00] - America/Montevideo[UTC-04:00] - America/Montserrat[UTC-04:00] - America/Nassau[UTC-04:00] - America/New_York[UTC-04:00] - America/Nipigon[UTC-08:00] - America/Nome[UTC-02:00] - America/Noronha[UTC-05:00] - America/North_Dakota/Beulah[UTC-05:00] - America/North_Dakota/Center[UTC-05:00] - America/North_Dakota/New_Salem[UTC-02:00] - America/Nuuk[UTC-06:00] - America/Ojinaga[UTC-05:00] - America/Panama[UTC-04:00] - America/Pangnirtung[UTC-03:00] - America/Paramaribo[UTC-07:00] - America/Phoenix[UTC-04:00] - America/Port-au-Prince[UTC-04:00] - America/Port_of_Spain[UTC-04:00] - America/Porto_Velho[UTC-04:00] - America/Puerto_Rico[UTC-03:00] - America/Punta_Arenas[UTC-05:00] - America/Rainy_River[UTC-05:00] - America/Rankin_Inlet[UTC-03:00] - America/Recife[UTC-06:00] - America/Regina[UTC-05:00] - America/Resolute[UTC-05:00] - America/Rio_Branco[UTC-03:00] - America/Santarem[UTC-03:00] - America/Santiago[UTC-04:00] - 
America/Santo_Domingo[UTC-03:00] - America/Sao_Paulo[UTC+00:00] - America/Scoresbysund[UTC-08:00] - America/Sitka[UTC-04:00] - America/St_Barthelemy[UTC-02:30] - America/St_Johns[UTC-04:00] - America/St_Kitts[UTC-04:00] - America/St_Lucia[UTC-04:00] - America/St_Thomas[UTC-04:00] - America/St_Vincent[UTC-06:00] - America/Swift_Current[UTC-06:00] - America/Tegucigalpa[UTC-03:00] - America/Thule[UTC-04:00] - America/Thunder_Bay[UTC-07:00] - America/Tijuana[UTC-04:00] - America/Toronto[UTC-04:00] - America/Tortola[UTC-07:00] - America/Vancouver[UTC-07:00] - America/Whitehorse[UTC-05:00] - America/Winnipeg[UTC-08:00] - America/Yakutat[UTC-06:00] - America/Yellowknife[UTC+08:00] - Antarctica/Casey[UTC+07:00] - Antarctica/Davis[UTC+10:00] - Antarctica/DumontDUrville[UTC+11:00] - Antarctica/Macquarie[UTC+05:00] - Antarctica/Mawson[UTC+12:00] - Antarctica/McMurdo[UTC-03:00] - Antarctica/Palmer[UTC-03:00] - Antarctica/Rothera[UTC+03:00] - Antarctica/Syowa[UTC+02:00] - Antarctica/Troll[UTC+06:00] - Antarctica/Vostok[UTC+02:00] - Arctic/Longyearbyen[UTC+03:00] - Asia/Aden[UTC+06:00] - Asia/Almaty[UTC+03:00] - Asia/Amman[UTC+12:00] - Asia/Anadyr[UTC+05:00] - Asia/Aqtau[UTC+05:00] - Asia/Aqtobe[UTC+05:00] - Asia/Ashgabat[UTC+05:00] - Asia/Atyrau[UTC+03:00] - Asia/Baghdad[UTC+03:00] - Asia/Bahrain[UTC+04:00] - Asia/Baku[UTC+07:00] - Asia/Bangkok[UTC+07:00] - Asia/Barnaul[UTC+03:00] - Asia/Beirut[UTC+06:00] - Asia/Bishkek[UTC+08:00] - Asia/Brunei[UTC+09:00] - Asia/Chita[UTC+08:00] - Asia/Choibalsan[UTC+05:30] - Asia/Colombo[UTC+03:00] - Asia/Damascus[UTC+06:00] - Asia/Dhaka[UTC+09:00] - Asia/Dili[UTC+04:00] - Asia/Dubai[UTC+05:00] - Asia/Dushanbe[UTC+03:00] - Asia/Famagusta[UTC+03:00] - Asia/Gaza[UTC+03:00] - Asia/Hebron[UTC+07:00] - Asia/Ho_Chi_Minh[UTC+08:00] - Asia/Hong_Kong[UTC+07:00] - Asia/Hovd[UTC+08:00] - Asia/Irkutsk[UTC+07:00] - Asia/Jakarta[UTC+09:00] - Asia/Jayapura[UTC+03:00] - Asia/Jerusalem[UTC+04:30] - Asia/Kabul[UTC+12:00] - Asia/Kamchatka[UTC+05:00] - 
Asia/Karachi[UTC+05:45] - Asia/Kathmandu[UTC+09:00] - Asia/Khandyga[UTC+05:30] - Asia/Kolkata[UTC+07:00] - Asia/Krasnoyarsk[UTC+08:00] - Asia/Kuala_Lumpur[UTC+08:00] - Asia/Kuching[UTC+03:00] - Asia/Kuwait[UTC+08:00] - Asia/Macau[UTC+11:00] - Asia/Magadan[UTC+08:00] - Asia/Makassar[UTC+08:00] - Asia/Manila[UTC+04:00] - Asia/Muscat[UTC+03:00] - Asia/Nicosia[UTC+07:00] - Asia/Novokuznetsk[UTC+07:00] - Asia/Novosibirsk[UTC+06:00] - Asia/Omsk[UTC+05:00] - Asia/Oral[UTC+07:00] - Asia/Phnom_Penh[UTC+07:00] - Asia/Pontianak[UTC+09:00] - Asia/Pyongyang[UTC+03:00] - Asia/Qatar[UTC+06:00] - Asia/Qostanay[UTC+05:00] - Asia/Qyzylorda[UTC+03:00] - Asia/Riyadh[UTC+11:00] - Asia/Sakhalin[UTC+05:00] - Asia/Samarkand[UTC+09:00] - Asia/Seoul[UTC+08:00] - Asia/Shanghai[UTC+08:00] - Asia/Singapore[UTC+11:00] - Asia/Srednekolymsk[UTC+08:00] - Asia/Taipei[UTC+05:00] - Asia/Tashkent[UTC+04:00] - Asia/Tbilisi[UTC+04:30] - Asia/Tehran[UTC+06:00] - Asia/Thimphu[UTC+09:00] - Asia/Tokyo[UTC+07:00] - Asia/Tomsk[UTC+08:00] - Asia/Ulaanbaatar[UTC+06:00] - Asia/Urumqi[UTC+10:00] - Asia/Ust-Nera[UTC+07:00] - Asia/Vientiane[UTC+10:00] - Asia/Vladivostok[UTC+09:00] - Asia/Yakutsk[UTC+06:30] - Asia/Yangon[UTC+05:00] - Asia/Yekaterinburg[UTC+04:00] - Asia/Yerevan[UTC+00:00] - Atlantic/Azores[UTC-03:00] - Atlantic/Bermuda[UTC+01:00] - Atlantic/Canary[UTC-01:00] - Atlantic/Cape_Verde[UTC+01:00] - Atlantic/Faroe[UTC+01:00] - Atlantic/Madeira[UTC+00:00] - Atlantic/Reykjavik[UTC-02:00] - Atlantic/South_Georgia[UTC+00:00] - Atlantic/St_Helena[UTC-03:00] - Atlantic/Stanley[UTC+09:30] - Australia/Adelaide[UTC+10:00] - Australia/Brisbane[UTC+09:30] - Australia/Broken_Hill[UTC+10:00] - Australia/Currie[UTC+09:30] - Australia/Darwin[UTC+08:45] - Australia/Eucla[UTC+10:00] - Australia/Hobart[UTC+10:00] - Australia/Lindeman[UTC+10:30] - Australia/Lord_Howe[UTC+10:00] - Australia/Melbourne[UTC+08:00] - Australia/Perth[UTC+10:00] - Australia/Sydney[UTC+02:00] - Europe/Amsterdam[UTC+02:00] - Europe/Andorra[UTC+04:00] 
- Europe/Astrakhan[UTC+03:00] - Europe/Athens[UTC+02:00] - Europe/Belgrade[UTC+02:00] - Europe/Berlin[UTC+02:00] - Europe/Bratislava[UTC+02:00] - Europe/Brussels[UTC+03:00] - Europe/Bucharest[UTC+02:00] - Europe/Budapest[UTC+02:00] - Europe/Busingen[UTC+03:00] - Europe/Chisinau[UTC+02:00] - Europe/Copenhagen[UTC+01:00] - Europe/Dublin[UTC+02:00] - Europe/Gibraltar[UTC+01:00] - Europe/Guernsey[UTC+03:00] - Europe/Helsinki[UTC+01:00] - Europe/Isle_of_Man[UTC+03:00] - Europe/Istanbul[UTC+01:00] - Europe/Jersey[UTC+02:00] - Europe/Kaliningrad[UTC+03:00] - Europe/Kiev[UTC+03:00] - Europe/Kirov[UTC+01:00] - Europe/Lisbon[UTC+02:00] - Europe/Ljubljana[UTC+01:00] - Europe/London[UTC+02:00] - Europe/Luxembourg[UTC+02:00] - Europe/Madrid[UTC+02:00] - Europe/Malta[UTC+03:00] - Europe/Mariehamn[UTC+03:00] - Europe/Minsk[UTC+02:00] - Europe/Monaco[UTC+03:00] - Europe/Moscow[UTC+02:00] - Europe/Oslo[UTC+02:00] - Europe/Paris[UTC+02:00] - Europe/Podgorica[UTC+02:00] - Europe/Prague[UTC+03:00] - Europe/Riga[UTC+02:00] - Europe/Rome[UTC+04:00] - Europe/Samara[UTC+02:00] - Europe/San_Marino[UTC+02:00] - Europe/Sarajevo[UTC+04:00] - Europe/Saratov[UTC+03:00] - Europe/Simferopol[UTC+02:00] - Europe/Skopje[UTC+03:00] - Europe/Sofia[UTC+02:00] - Europe/Stockholm[UTC+03:00] - Europe/Tallinn[UTC+02:00] - Europe/Tirane[UTC+04:00] - Europe/Ulyanovsk[UTC+03:00] - Europe/Uzhgorod[UTC+02:00] - Europe/Vaduz[UTC+02:00] - Europe/Vatican[UTC+02:00] - Europe/Vienna[UTC+03:00] - Europe/Vilnius[UTC+04:00] - Europe/Volgograd[UTC+02:00] - Europe/Warsaw[UTC+02:00] - Europe/Zagreb[UTC+03:00] - Europe/Zaporozhye[UTC+02:00] - Europe/Zurich[UTC+03:00] - Indian/Antananarivo[UTC+06:00] - Indian/Chagos[UTC+07:00] - Indian/Christmas[UTC+06:30] - Indian/Cocos[UTC+03:00] - Indian/Comoro[UTC+05:00] - Indian/Kerguelen[UTC+04:00] - Indian/Mahe[UTC+05:00] - Indian/Maldives[UTC+04:00] - Indian/Mauritius[UTC+03:00] - Indian/Mayotte[UTC+04:00] - Indian/Reunion[UTC+13:00] - Pacific/Apia[UTC+12:00] - 
Pacific/Auckland[UTC+11:00] - Pacific/Bougainville[UTC+12:45] - Pacific/Chatham[UTC+10:00] - Pacific/Chuuk[UTC-05:00] - Pacific/Easter[UTC+11:00] - Pacific/Efate[UTC+13:00] - Pacific/Enderbury[UTC+13:00] - Pacific/Fakaofo[UTC+12:00] - Pacific/Fiji[UTC+12:00] - Pacific/Funafuti[UTC-06:00] - Pacific/Galapagos[UTC-09:00] - Pacific/Gambier[UTC+11:00] - Pacific/Guadalcanal[UTC+10:00] - Pacific/Guam[UTC-10:00] - Pacific/Honolulu[UTC+14:00] - Pacific/Kiritimati[UTC+11:00] - Pacific/Kosrae[UTC+12:00] - Pacific/Kwajalein[UTC+12:00] - Pacific/Majuro[UTC-09:30] - Pacific/Marquesas[UTC-11:00] - Pacific/Midway[UTC+12:00] - Pacific/Nauru[UTC-11:00] - Pacific/Niue[UTC+11:00] - Pacific/Norfolk[UTC+11:00] - Pacific/Noumea[UTC-11:00] - Pacific/Pago_Pago[UTC+09:00] - Pacific/Palau[UTC-08:00] - Pacific/Pitcairn[UTC+11:00] - Pacific/Pohnpei[UTC+10:00] - Pacific/Port_Moresby[UTC-10:00] - Pacific/Rarotonga[UTC+10:00] - Pacific/Saipan[UTC-10:00] - Pacific/Tahiti[UTC+12:00] - Pacific/Tarawa[UTC+13:00] - Pacific/Tongatapu[UTC+12:00] - Pacific/Wake[UTC+12:00] - Pacific/Wallis\n©2020 All Academic, Inc. 
| Privacy Policy\n .ui-li-link-alt-left {\n left: 0;\n right: auto;\n }\n \n .ul-li-has-alt-left {\n padding-right: auto !important;\n padding-left: 48px !important;\n margin-right: 0 !important;\n }\n \n ul.program_content {\n margin-left: 40px;\n }\n\n\n @media all and (max-width: 50em) {\n \t.my-breakpoint .ui-block-a, \n \t.my-breakpoint .ui-block-b, \n \t.my-breakpoint .ui-block-c,\n \t.my-breakpoint .ui-block-d,\n \t.my-breakpoint .ui-block-e { \n \t\twidth: 100%; \n \t\tfloat:none; \n \t}\n }\n\n\n\n @media print {\n .non-printable {\n display: none;\n }\n\n #online_program1_panel_open_btn,\n #online_program_back_btn,\n #online_program_search_btn,\n #online_program_home_btn {\n display: none;\n }\n\n #back {\n display: none;\n }\n\n .ui-btn {\n display: none;\n }\n }\n \n<!--\n\n window.onload = function updateTimezoneSelect() {\n // (new Date()).getTimezoneOffset()/60 will return the current number of hours offset from UTC.\n user_timezone = Intl.DateTimeFormat().resolvedOptions().timeZone;\n selectObject = $(\"#new_timezone\");\n \n if (selectObject.val() == \"unselected\") {\n selectObject.val(user_timezone).attr(\"selected\", true).siblings(\"option\").removeAttr(\"selected\");\n selectObject.selectmenu(\"refresh\", true);\n alert(\"Set to \" + user_timezone);\n }\n }\n \n (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){\n (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\n m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');\n \n ga('create', 'UA-79637004-2', 'auto');\n ga('send', 'pageview');\n ga('create', 'UA-55209081-6', 'auto', 'extraTracker');\n ga('extraTracker.send', 'pageview');\n-->\n"
So, we must set a timezone for our session. rvest has several tools that allow us to submit web forms.
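# start a browser-like session
mysession <- html_session(url)
# fill in the first form on the page with a timezone
timezone_form <- html_form(mysession)[[1]] %>%
  set_values(new_timezone = "Africa/Abidjan")
# submit the completed form
submit_form(mysession, timezone_form)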
## <session> https://convention2.allacademic.com/one/apsa/apsa20/
## Status: 200
## Type: text/html; charset=utf-8
## Size: 15344
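A note for readers on newer versions: in rvest 1.0 and later, these functions were renamed. `html_session()` became `session()`, `set_values()` became `html_form_set()`, and `submit_form()` became `session_submit()`. The sketch below assumes rvest 1.0 or later and uses a made-up form that only mimics the timezone selector (the markup and the `action` path are hypothetical), so the fill-in step can be tried offline.

```r
library(rvest)

# A made-up form that mimics the timezone selector (hypothetical markup)
page <- read_html('
  <form action="/set_timezone" method="post">
    <select name="new_timezone">
      <option value="unselected">-- Please select a timezone --</option>
      <option value="Africa/Abidjan">[UTC+00:00] - Africa/Abidjan</option>
    </select>
  </form>')

# html_form() returns a list of forms; take the first one
timezone_form <- html_form(page)[[1]]

# html_form_set() is the rvest >= 1.0 name for set_values()
timezone_form <- html_form_set(timezone_form, new_timezone = "Africa/Abidjan")

# Inspect the value that would be sent when the form is submitted
timezone_form$fields$new_timezone$value
```

With a live session, the filled-in form would then go to `session_submit()` along with the session object.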
After providing our newly-created browser session info, we can now navigate to APSA’s “Created Panels” page with rvest’s follow_link function. We can then read the html and grab the linked text nodes.
html <- jump_to(mysession, url) %>%
  follow_link("Browse By Session or Event Type") %>%
  follow_link("Created Panel") %>%
  read_html()
links <- html_nodes(html, "a") # "a" nodes are linked text
html_text(links) %>% head(20)
## [1] "Search"
## [2] "Browse By Day"
## [3] "Browse By Time"
## [4] "Browse By Person"
## [5] "Browse By Mini-Conference"
## [6] "Browse By Division"
## [7] "Browse By Session or Event Type"
## [8] "Change Preferences"
## [9] "Sign In"
## [10] "Search Tips"
## [11] "Twitter"
## [12] ""
## [13] "Back"
## [14] ""
## [15] "Home"
## [16] "2:00 to 3:30pm MDT (8:00 to 9:30pm GMT)TBAThat's Entertainment! Celebrities, Comedy, and Crime in Political CommunicationSub Unit: Division 38: Political CommunicationSession Submission Type: Created Panel"
## [17] "2:00 to 3:30pm MDT (8:00 to 9:30pm GMT)TBAPolitical Effects of Social MediaSub Unit: Division 40: Information Technology, & PoliticsSession Submission Type: Created Panel"
## [18] "2:00 to 3:30pm MDT (8:00 to 9:30pm GMT)TBAAre Women Electable?Sub Unit: Division 31: Women and Politics ResearchSession Submission Type: Created Panel"
## [19] "2:00 to 3:30pm MDT (8:00 to 9:30pm GMT)TBAThe Politics and Economics of Bilateral Investment TreatiesSub Unit: Division 16: International Political EconomySession Submission Type: Created Panel"
## [20] "2:00 to 3:30pm MDT (8:00 to 9:30pm GMT)TBAMedia & AutocracySub Unit: Division 44: Democracy and AutocracySession Submission Type: Created Panel"
html_text extracts text from HTML nodes. On this page, the linked text contains the title of each panel (except for the first 15 links, which are navigation).
To clean up the panel titles, I remove all text before “TBA” or after “Sub Unit” using the one regular expression to rule them all, .*, which matches anything (.) any number of times (*).
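To see what that pattern does, here is a small sketch applied to a single link text copied from the output above. The variable names `raw_title` and `clean_title` are only for illustration.

```r
library(stringr)

# One raw link text, copied from the output above
raw_title <- "2:00 to 3:30pm MDT (8:00 to 9:30pm GMT)TBAAre Women Electable?Sub Unit: Division 31: Women and Politics ResearchSession Submission Type: Created Panel"

# ".*TBA" matches everything up to and including "TBA";
# "Sub Unit.*" matches "Sub Unit" and everything after it;
# "|" means either pattern, and str_remove_all() deletes every match
clean_title <- str_remove_all(raw_title, ".*TBA|Sub Unit.*")

clean_title
# "Are Women Electable?"
```

One caveat: .* is greedy, so a title that itself contained "TBA" or "Sub Unit" would get cut too.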
html_attr extracts other HTML attributes. Linked URLs are in the “href” attribute.
Let’s put both into a tidy dataframe:
d <- tibble(title = html_text(links) %>%
              str_remove_all(".*TBA|Sub Unit.*"),
            url = html_attr(links, "href")
)
# filter to rows that contain a "session_id" in their URL
d %<>% filter( str_detect(url, "session_id") )
d
## # A tibble: 578 x 2
## title url
## <chr> <chr>
## 1 That's Entertainment! Celebrities, Co… https://convention2.allacademic.com//…
## 2 Political Effects of Social Media https://convention2.allacademic.com//…
## 3 Are Women Electable? https://convention2.allacademic.com//…
## 4 The Politics and Economics of Bilater… https://convention2.allacademic.com//…
## 5 Media & Autocracy https://convention2.allacademic.com//…
## 6 NGO’s in Politics: Mobilization, Ineq… https://convention2.allacademic.com//…
## 7 Group Representation https://convention2.allacademic.com//…
## 8 Governing AAPI People Through Immigra… https://convention2.allacademic.com//…
## 9 The Politics of Fear and Violence https://convention2.allacademic.com//…
## 10 Populism and Discursive Governance https://convention2.allacademic.com//…
## # … with 568 more rows
Now that we have a tidy dataframe with a column of text, the world is our oyster. We could follow each URL to get more details on each panel using purrr’s map_dfr like I did here, but I should get back to writing my APSA paper.
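For the curious, the map_dfr pattern looks roughly like this. It is only a sketch under loud assumptions: the two "pages" are made-up HTML strings, and `get_panel_details` and its selectors are hypothetical, not the structure of the real panel pages.

```r
library(rvest)
library(purrr)
library(tibble)

# Made-up detail pages standing in for real panel pages (hypothetical markup)
panel_pages <- c(
  '<html><body><h1>Panel A</h1><p class="chair">Chair One</p></body></html>',
  '<html><body><h1>Panel B</h1><p class="chair">Chair Two</p></body></html>'
)

# Hypothetical helper: parse one page and return a one-row tibble
get_panel_details <- function(page) {
  html <- read_html(page)
  tibble(
    title = html %>% html_node("h1") %>% html_text(),
    chair = html %>% html_node(".chair") %>% html_text()
  )
}

# map_dfr() applies the helper to each page and row-binds the results
panel_details <- map_dfr(panel_pages, get_panel_details)
panel_details
```

With real data, the URLs in d$url would take the place of `panel_pages`, though the timezone form suggests each request would probably need to go through the session.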
For Pictionary, we just need a sample of common words. The tidytext package has a number of helpful tools for doing this. Most importantly, unnest_tokens “tokenizes” text, here breaking it up by word. filter, count, and slice_sample from dplyr help us clean up, sample, and collapse these words into a block of text.
Tip: to get drawable words from messier text, try keeping only words in the NRC dictionary by adding inner_join(get_sentiments("nrc")) anywhere between unnesting and sampling them.
library(tidytext)
word_counts <- d %>%
  # get words from a column "title"
  unnest_tokens(word, title) %>%
  # remove common words (such as "the")
  anti_join(stop_words) %>%
  # filter out words with 5 or fewer letters or with apostrophes
  filter( nchar(word) > 5, !str_detect(word, "\\'") ) %>%
  # count how often each word appears
  count(word)
word_counts %>% arrange(-n)
## # A tibble: 780 x 2
## word n
## <chr> <int>
## 1 political 91
## 2 politics 73
## 3 public 34
## 4 policy 32
## 5 social 20
## 6 gender 19
## 7 international 19
## 8 conflict 18
## 9 democratic 18
## 10 democracy 16
## # … with 770 more rows
word_counts %>%
  # sample 500 words, weighted by their frequency
  slice_sample(n = 500, weight_by = n) %>%
  # collapse to a block of text, separating words with commas
  .$word %>%
  str_c(collapse = ", ")
## [1] "methods, consequences, appointments, parties, global, twitter, entering, environmental, empire, thinking, ethnicity, gaining, resentment, dimensionality, undergraduate, political, regime, injustice, innovative, canadian, research, hybrid, backsliding, approaches, causal, theory, politics, quality, sexual, intervention, influence, authoritarian, freedom, nativity, online, economy, digital, networks, engagement, conflict, crisis, expedience, foundations, generalization, opinion, vulnerable, resistance, public, behavior, immigration, perceptions, security, foreign, poetry, technology, policy, building, broader, movement, unpacking, supreme, rights, understanding, classroom, diversity, responds, leadership, measures, responsibility, nationalism, citizenship, disinformation, packing, electoral, immigrant, ethnic, prejudice, elections, illiberal, civilians, environment, identity, bureaucrats, voting, constitutive, female, perspective, localities, campaign, fragility, social, rollback, biology, communication, strategic, rebels, issues, emergency, judicial, invisibilized, creative, development, divides, contracting, globalization, popular, legitimacy, strategies, perspectives, experiments, donors, ukraine, diffusion, change, tension, investment, populism, persistent, economic, inequality, discursive, alternative, speech, violence, negotiations, chinese, nativism, governance, mobilization, substantive, resilience, reconsidered, indigenous, executive, bargaining, claiming, tendencies, municipal, controversies, lending, resurgence, framing, pulpit, russia, question, models, jurisprudence, pedagogies, contemporary, physical, institutions, presidents, administration, taiwanese, capital, representatives, iberia, challenges, population, conservation, system, turnout, destabilization, participation, simulations, ethnographic, effects, democracy, redistricting, dynamics, europe, inclusion, trends, presidency, integration, elites, adversity, migration, positions, 
transitions, experimental, democratic, qualitative, student, american, regimes, ballot, activism, behaving, gerrymandering, complicated, insecurity, unrepresentability, comparative, refracting, eurasia, communities, societal, tactics, crises, international, follow, service, matter, multilevel, upheaval, socialist, agendas, blaming, america, alliances, insider, populist, formation, autocracy, minorities, inference, voters, toleration, climate, alliance, future, comparing, people, escalation, organized, diverse, control, coding, systems, difficult, military, current, legislative, territorial, religion, contexts, suffering, assessing, effectiveness, outsider, sovereignty, strangers, memory, loyalty, personal, constituents, bodies, information, european, leaving, directions, legislature, practices, emotions, civility, organizations, scientists, lobbying, constitutional, cities, accounting, difference, effect, constitutionalism, executives, institutional, gender, borders, transition, conversion, taiwan, psychology, experience, automation, conditions, relations, reactions, congress, intergovernmental, financial, education, movements, attitudes, backlash, deception, processing, science, conference, ambition, respond, character, structural, classrooms, progressive, seeking, sexuality, countries, transnationalism, misinformation, legislatures, precarity, imagination, visibility, campaigns, responsiveness, protests, learning, connections, collaboration, colonial, advancement, standards, delivery, modern, narratives, celebrities, promote, impacts, dissatisfaction, militaries, intersectionality, sexism, policies, nuclear, comedy, support, aspirations, combatants, protest, analysis, heterogeneity, collection, normative, studies, context, minority, motivations, labeling, sources, geography, measurement, branch, feminism, brexit, polarization, tariffs, spring, evolving, cooperation, management, communist, feminist, equality, advances, durability, segregation, accountability, 
forced, worlds, evidence, neoliberalism, tonight, addressing, credibility, private, financing, forecasts, intelligence, representation, brazil, performance, incivility, provocation, responses, influences, epistemological, sentiment, division, organization, teaching, policymaking, proliferation, dimensions, authoritarianism, epistography, relevance, feedback, lessons, sacralization, africa, theories, technological, welfare, markets, constitution, retreat, violent, process, objects, tradition, advantage, entertainment, evaluations, neighborhoods, actors, representativeness, coalition, peacekeeping, middle, differences, vulnerability, displaying, boundaries, health, unassimilable, terrorism, identities, patterns, judges, eastern, ontology, adults, genera, candidates, station, democratization, interactions, society, opportunities, schemes, exclusion, strategy, environments, duration, developing, governing, forces, mexico, lenses, repression, machiavelli, colonialism, origins, natural, pretty, procedures, challenge, emerging, disputes, fiction, reproduction, women’s, negotiating, communicating, indexing, signaling, housing, fooled, capitalism, disaster, experiential, deterrence, relationship, constraints, services, futures, organizing, corruption, content, confronting, liberal, history, legacy, administrative, dilemmas, shadow, response, persecution, forecasting, pedagogy, surveillance, playing, racism, rhetoric, collapse, deaths, boards, fieldwork, machiavelli’s, spatial, community, russian, vision, simulation, mobilizing, domestic, paleolithic, judging, beliefs"
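To try the sampling step without scraping anything, here is a minimal sketch on made-up word counts. `toy_counts` is hypothetical data, and pull(word) is used as an equivalent of .$word.

```r
library(dplyr)
library(stringr)

# Made-up word counts (hypothetical data)
toy_counts <- tibble(
  word = c("political", "policy", "gender", "conflict", "poetry"),
  n    = c(91, 32, 19, 18, 1)
)

set.seed(1) # make the random draw repeatable

word_block <- toy_counts %>%
  # draw 3 rows; frequent words are more likely to be picked
  slice_sample(n = 3, weight_by = n) %>%
  # pull() extracts the column as a vector, like .$word
  pull(word) %>%
  # collapse to one comma-separated string
  str_c(collapse = ", ")

word_block
```

Because slice_sample draws rows without replacement by default, each word shows up at most once, with frequent words more likely to make the cut.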