class: center, middle, inverse, title-slide # Reproducable scripts ### Devin Judge-Lord --- # Be nice to others ![](https://imgs.xkcd.com/comics/exploits_of_a_mom.png) --- # Be nice to yourself ![](http://www.phdcomics.com/comics/archive/phd031214s.gif) --- class: center inverse ## How to break a research project ![](https://media.giphy.com/media/xUA7aLL1iKj62wBYTC/giphy.gif) --- ## 1. `setwd("user\path\only\I\have")` -- ![](https://media.giphy.com/media/146SJ7jy4BFXHy/giphy.gif) ### We don't have your file paths! You won't have those file paths on your next computer, a lab computer, a remote server. --- ## 2. [Factors](https://www.r-bloggers.com/factors-are-not-first-class-citizens-in-r/) ```r strings <- c("a","a","b",NA,"c","c") f <- factor(strings) f ``` ``` ## [1] a a b <NA> c c ## Levels: a b c ``` ```r ifelse(is.na(f),'b',f) ``` ``` ## [1] "1" "1" "2" "b" "3" "3" ``` ### If working with text, begin with: `options(stringsAsFactors = F)` Let R coerce strings to factors as needed. Eventually, learn to master factors with [`forcats`](https://forcats.tidyverse.org/) (e.g., if you want to plot things in a specific order.) --- ## 3. ### `library(SomoneoneOnTheInternetWroteThis)` -- ![](https://media.giphy.com/media/l2R05FZf4dVOrgIaQ/giphy.gif) ### I may not have your packages! --- ## Tip: A seperate script to set it all up ### `source("setup.R")` ``` options(stringsAsFactors = F) ## Add any R packages you require. ## Here are some we will use in 811: requires <- c("tidyverse", # tidyverse includes dplyr and ggplot2 "magrittr", "here") ## Install any you don't have to_install <- c(requires %in% rownames(installed.packages()) == FALSE) install.packages(c(requires[to_install], "NA"), repos = "https://cloud.r-project.org/" ) ``` ``` ## Load all required R packages library(tidyverse) library(ggplot2); theme_set(theme_bw()) ## Set global ggplot theme library(magrittr) library(here) ``` --- ### `source("setup.R")` continued: ``` ## Set defaults for R chunks knitr::opts_chunk$set(echo = TRUE, # echo = TRUE means that your code will show warning=FALSE, message=FALSE, fig.retina = 2, fig.align = "center", dpi = 100, # fig.path='Figs/', ## where to save figures fig.height = 3, fig.width = 3) ``` --- ## 4. `data[2:7,]`: always the same observations? -- ## Use `filter()` Subset to rows (observations!) matching certain conditions ![](https://media.giphy.com/media/hO0L2DHNW1h4c/giphy.gif) --- ## 5. `data[, 2:7]`: always the same variables? -- ## Use `select()` Select columns (variables!) or column names matching certain conditions --- ### What is the UN up to? ```r html <- read_html("https://www.UN.org/") # The UN homepage links <- html_nodes(html, "a") # "a" nodes are linked text tibble(text = html_text(links), url = html_attr(links, "href") ) %>% # "href"" attributes are urls filter(str_detect(url, "events")) %>% # urls containing "events" select(text) %>% # or select(-url) distinct() ``` ``` ## # A tibble: 6 x 1 ## text ## <chr> ## 1 "اليوم الدولي للتوعية بخطر الألغام\r\n\r\n4 نيسان/أبريل\r\n" ## 2 国际提高地雷意识和协助地雷行动日-4月4日 ## 3 International Mine Awareness Day - 4 April ## 4 Journée internationale pour la sensibilisation au problème des mines et … ## 5 Международный день просвещения по вопросам минной опасности и помощи в д… ## 6 Día Internacional de información sobre el peligro de las minas y de asis… ``` --- ### To do something more than once, use [functions](https://r4ds.had.co.nz/functions.html#functions). ```r top_story <- function(x){ html <- read_html(str_c("https://www.UN.org/", x)) # The UN homepage links <- html_nodes(html, "a") # "a" nodes are linked text d <- tibble(text = html_text(links), url = html_attr(links, "href") # href attributes are urls ) %>% filter(text != "", str_detect(url, "/story")) %>% # urls that contain "story" .[1,] return(d) } ``` ```r top_story(x = "en") ``` ``` ## # A tibble: 1 x 2 ## text url ## <chr> <chr> ## 1 ‘The welfare of the Libyan people’ the UN’s … https://news.un.org/en/sto… ``` ```r top_story(x = "ru") ``` ``` ## # A tibble: 1 x 2 ## text url ## <chr> <chr> ## 1 "Генсек ООН покинул Ливию «с тяжелым … https://news.un.org/ru/story/2019… ``` --- ## Apply functions with [`purrr`](https://www.rstudio.com/resources/cheatsheets/#purrr) (Instead of writing loops where we have to index everything.) ```r languages <- c("ar", "zh", "en", "fr", "ru", "es") map_dfr(languages, top_story) ``` ``` ## # A tibble: 6 x 2 ## text url ## <chr> <chr> ## 1 غوتيريش: طفرة في العمل الدبلوماسي لحل الصراع… https://news.un.org/ar/sto… ## 2 利比亚国民军向首都挺进 联合国秘书长呼吁避免“血腥对抗”… https://news.un.org/zh/sto… ## 3 ‘The welfare of the Libyan people’ the UN’s … https://news.un.org/en/sto… ## 4 En visite en Libye, le chef de l’ONU se dit … https://news.un.org/fr/sto… ## 5 "Генсек ООН покинул Ливию «с тяжелым сердцем… https://news.un.org/ru/sto… ## 6 Guterres reitera el compromiso de la ONU de … https://news.un.org/es/sto… ``` --- ## Simpler is often stronger. ![](Figs/data_pipeline.png) -- - Use `%>%` and `%<>%` instead of making a million objects - Use `map()` and other `purrr` functions instead of loops --- ## Computing Have long scripts [send you a text or email](https://github.com/jennybc/send-email-with-r) when they are done with a task. Spread computation across several cores with [`multidplyr`](https://github.com/hadley/multidplyr/blob/master/vignettes/multidplyr.md) Use a remote, like [SSCC's linstat cluster](https://www.ssc.wisc.edu/sscc/pubs/linstat.htm) - Easy to `git clone` if you have a self-contained project folder on git