R Workshop

Module 3: Data analysis with R (2)

2018-03-28
Bobae Kang
(Bobae.Kang@illinois.gov)

Agenda

Part 1: Getting started with tidyverse
Part 2: More on data analysis

More on Data Analysis

plot of chunk unnamed-chunk-1

Source: Iconfinder.com

Working with Strings

plot of chunk unnamed-chunk-2

Source: Iconfinder.com

stringr from tidyverse

plot of chunk unnamed-chunk-3

Source: RStudio

Key stringr functions

str_to_upper() str_to_lower() str_to_title()
str_trim() str_squish()
str_c()
str_detect()
str_subset()
str_sub()

Note: Many stringr functions have base R alternatives

Convert letter case

str_to_upper(string, locale = "en")
str_to_lower(string, locale = "en")
str_to_title(string, locale = "en")

string input is a character vector
- Or something that can be coerced into a character vector
The default locale is “en”, for English
str_to_title() capitalizes only the first letter of each word

Example

str_to_upper("hello world")

[1] "HELLO WORLD"

str_to_lower("HELLO WORLD")

[1] "hello world"

str_to_title("hello WORLD")

[1] "Hello World"

Base R alternative

# equivalent to str_to_upper()
toupper(string)

# equivalent to str_to_lower()
tolower(string)

Trim whitespace

str_trim(string, side = c("both", "left", "right"))
str_squish(string)

string input is a character vector
side input determines which side of a string to trim
- “both” trims whitespaces on both the beginning and the end
- “left” trims whitespaces only on the beginning
- “right” trims whitespaces only on the end

Example

str_trim("  trim both  ", side = "both")

[1] "trim both"

str_trim("  trim left only  ", side = "left")

[1] "trim left only  "

str_trim("  trim right only  ", side = "right")

[1] "  trim right only"

str_squish("  whitespaces all    over the  place   ")

[1] "whitespaces all over the place"

Base R alternative

# equivalent to str_trim()
trimws(x, which = c("both", "left", "right"))

Concatenate strings into one

str_c(..., sep = "", collapse = NULL)

The first argument (...) is one more more character vectors
sep is a separator string between input vectors; default value is none ("").
collapse is an optional string used to combined input vectors into a single string

Example

str_c(c("one", "two"), c("plus three", "minus four"), sep = " ")

[1] "one plus three" "two minus four"

str_c(c("one", "two", "three"), "plus four", sep = " ")

[1] "one plus four"   "two plus four"   "three plus four"

str_c(c("one", "two", "three"), collapse = " plus ")

[1] "one plus two plus three"

str_c(c("one", "two", "three"), "plus four", sep = " ", collapse = " and ")

[1] "one plus four and two plus four and three plus four"

Base R alternative

# equivalent to str_c()
paste (..., sep = " ", collapse = NULL)

Detect a pattern

str_detect(string, pattern)

string input is a character vector
pattern input is a character vector of length 1 that is a pattern to look for. A pattern input can include regualr expressions.
Output is a boolean vector of the same length (TRUE or FALSE)

Note: We will discuss regular expressions later.

Example

str = c("I like apple", "You like apple", "Apple, I like")
pat = "I like"
str_detect(str, pat)

[1]  TRUE FALSE  TRUE

Base R alternative

# equivalent to str_detect()
grepl(pattern, x, ...)

Get strings/positions matching a pattern

str_subset(string, pattern)
str_which(string, pattern)

string input is a character vector
pattern input is a character vector of length 1 that is a pattern to look for. A pattern input can include regualr expressions.
str_subset() returns the matching strings while str_which() returns the index for the matches

Example

str = c("I like apple", "You like apple", "Apple, I like")
pat = "I like"
str_subset(str, pat)

[1] "I like apple"  "Apple, I like"

str_which(str, pat)

[1] 1 3

Base R alternative

# equivalent to str_subset()
grep(pattern, x, value = TRUE, ...)

# equivalent to str_which()
grep(pattern, x, value = FALSE, ...)

Extract and replace substrings

str_sub(string, start = 1L, end = -1L)
str_sub(string, start = 1L, end = -1L, omit_na = FALSE) <- value

string input is a character vector
start and end are integer vectors
- start is the position of the first substring character; default is the first character
- end is the position of the last substring character; default is the last character
Output is a character vector of substring from start to end.
str_sub() can be used to replace substrings when used with the assignment operator (<-)

Example

str <- "Hello world"
str_sub(str, start = 7)

[1] "world"

str_sub(str, end = 5) <- "Hi"
str

[1] "Hi world"

Base R alternative

# equivalent to str_sub()
substr(x, start, stop)

# equivalent to str_sub() <- value
substr(x, start, stop) <- value

Regular expression (regex)

“Regular expressions are a concise and flexible tool for describing patterns in strings.”
-stringr.tidyverse.org

Regex allows us to use more complex and dynamic patterns for manipulating and working with strings

Regular expression in R

Character classes
Metacharacters
Groups
Anchors
Quantifiers

Character classes

Class	Description
`[[:digit:]]` or `\d`	Any digits; i.e. `[0-9]`
`\\D`	Non-digits; i.e. `[^0-9]`
`[[:lower:]]`	Lower-case letters; i.e. `[a-z]`
`[[:upper:]]`	Upper-case letters; i.e. `[A-Z]`
`[[:alpha:]]`	Alphabetic characters; `[A-z]`
`[[:alnum:]]`	Alphanumeric characters; i.e. `[A-z0-9]`
`\\w`	Any Word characters; i.e. `[A-z0-9_]`
`\\W`	Non-word characters
`[[:blank:]]`	Space and tab
`[[:space:]]` or `\s`	Space, tab, vertical tab, newline, form feed, carriage return
`\\S`	Not space; i.e. `[^[:space:]]`

Example

str <- c("HELLO", "world", "123", "\n")
str_detect(str, "\\d") # has any digit

[1] FALSE FALSE  TRUE FALSE

str_detect(str, "\\D") # has no digit

[1]  TRUE  TRUE FALSE  TRUE

str_detect(str, "\\w") # has any alphanumetic character

[1]  TRUE  TRUE  TRUE FALSE

str_detect(str, "\\s") # has any whitespate

[1] FALSE FALSE FALSE  TRUE

Metacharacters

Metacharacter	Description
`\n`	New line
`\r`	Carriage return
`\t`	Tab
`\v`	Vertical tab
`\f`	Form feed

Groups

Group	Description
`.`	Any character except `\n`
\|	Or, e.g. `(a`\|`b)`
`[...]`	List permitted characters, e.g. `[abc]`
`[a-z]`	Specify character ranges
`[^...]`	List excluded characters
`(...)`	Grouping, enables back referencing using `\\N` where `N` is integer

Example

str <- c("HELLO", "world", "123", "\n")
str_detect(str, ".") # has any character except \n

[1]  TRUE  TRUE  TRUE FALSE

str_detect(str, "(d|1)") # has d or 1

[1] FALSE  TRUE  TRUE FALSE

str_detect(str, "[Oo]") # has O or o

[1]  TRUE  TRUE FALSE FALSE

str_detect(str, "[^HELLO123]") # has characters other than...

[1] FALSE  TRUE FALSE  TRUE

Anchors

Anchor	Description
`^`	Start of the string
`$`	End of the string
`\\b`	Empty string at either edge of a word
`\\B`	NOT the edge of a word
`\\<`	Beginning of a word
`\\>`	End of a word

Example

str <- c("apple", "apricot", "banana", "pineapple")
str_detect(str, "^(a|ba)")

[1]  TRUE  TRUE  TRUE FALSE

str_detect(str, "apple$")

[1]  TRUE FALSE FALSE  TRUE

Quantifiers

Quantifier	Description
`*`	Matches at least 0 times
`+`	Matches at least 1 time
`?`	Matches at most 1 time; optional string
`{n}`	Matches extactly `n` times
`{n,}`	Matches at least `n` times
`{,n}`	Matches at most `n` times
`{n,m}`	Matches between `n` and `m` times

Example

str <- c("apple", "apricot", "banana", "pineapple")
str_detect(str, "p*")

[1] TRUE TRUE TRUE TRUE

str_detect(str, "p+")

[1]  TRUE  TRUE FALSE  TRUE

str_detect(str, "p{2,}")

[1]  TRUE FALSE FALSE  TRUE

Working with Dates/Datetimes

plot of chunk unnamed-chunk-31

Source: Iconfinder.com

Dates/Datetimes basics

Date class
POSIXct and POSIXlt classes

lubridate from tidyverse

plot of chunk unnamed-chunk-32

Source: RStudio

Key lubridate functions

as_date() as_datetime()
year(), month(), day(), hour(), …
parse_date_time() fast_strptime()
ymd_hms(), ymd(), …

Convert to a date or date-time

as_date(x, tz = NULL, origin = lubridate::origin)
as_datetime(x, tz = NULL, origin = lubridate::origin)

x is a vector of POSIXt, numeric or character objects
tz is a time zone name
origin is a Date object or something that can be coerced into a Date object
- Default value corresponds to "1970-01-01"

Example

as_date(17618)

[1] "2018-03-28"

class(as_date("20180328"))

[1] "Date"

as_datetime("2018/03/28")

[1] "2018-03-28 UTC"

class(as_datetime("2018-03-28"))

[1] "POSIXct" "POSIXt"

Base R alternative

# equivalent to as_date()
as.Date(x, ...)

# equivalent to as_datetime()
as.POSIXct(x, tz = "", ...)

Get/set time component

year(x)
year(x) <- value

x is a date-time object
value is a numeric object

Function	Description
`year()`	Get/set year component of a date-time
`month()`	Get/set months component of a date-time
`week()`	Get/set weeks component of a date-time
`day()`	Get/set days component of a date-time
`hour()`	Get/set hours component of a date-time
`minute()`	Get/set minutes component of a date-time
`second()`	Get/set seconds component of a date-time
`tz()`	Get/set time zone component of a date-time

Any of these can be used to set the corresponding time component when used with the assignment operator (<-).

Example

today <- as_date("2018-03-28")
year(today)

[1] 2018

month(today) <- 4
today

[1] "2018-04-28"

Parse date-time

parse_date_time(x, orders, tz = "UTC", truncated = 0, locale = Sys.getlocale("LC_TIME"), exact = FALSE, drop = FALSE, ...)
fast_strptime(x, format, tz = "UTC", lt = TRUE, cutoff_2000 = 68L)

x is a character or numeric vector of dates
orders is a character vector of date-time order format
- e.g. "ymd" for year-month-date format
exact is a boolean value for using the “exact” match for the date-time format specificed by orders
drop is a boolean value for dropping, or removing the values not matching the format
format is a character string of formats

Date format symbols

Symbol	Description	Example
%Y	Year in 4 digits	2018
%y	Year in 2 digits	18
%B	Month in words	March
%b	Month in words, abbriviated	Mar
%m	Month in 2 digits	03
%d	Date in 2 digits	28

Example

dates = c("2018-03-28", "2018/03/28", "20180328")
parse_date_time(dates, "ymd")

[1] "2018-03-28 UTC" "2018-03-28 UTC" "2018-03-28 UTC"

fast_strptime(dates[1], "%Y-%m-%d")

[1] "2018-03-28 UTC"

fast_strptime(dates[2], "%Y/%m/%d")

[1] "2018-03-28 UTC"

fast_strptime(dates[3], "%Y%m%d")

[1] "2018-03-28 UTC"

Base R alternative

# equivalent to fast_strptime()
strptime(x, format = "", tz = "")

No base R alternative for parse_date_time()
- In fact, being able to handle multiple dates formats with the same order is one of the advantages of using parse_date_time()

Quickly parse date-time

ymd_hms(..., quiet = FALSE, tz = NULL, ...)
ymd(..., quiet = FALSE, tz = "UTC", ...)

The first ... argument is a character vector of dates in appropriate format
quiet is a boolean value for displaying messages
tz is a character string speficiying time zone
ymd_hms and other similar functions does the same work parse_date_time(), but with a predefined order.

Date-time	Date only	Time only
`ymd_hms()`	`ymd()`	`hms()`
`ymd_hm()`	`ydm()`	`hm()`
`ymd_h()`	`mdy()`	`ms()`
`mdy_hms()`	`myd()`
`mdy_hm()`	`dmy()`
`mdy_h()`	`dym()`
`dmy_hms()`
`dmy_hm()`
`dmy_h()`

y is year
first m is month
d is date
h is hour
second m is minute
s is second

Importing/Exporting Data

plot of chunk unnamed-chunk-45

Source: Wikimedia Commons

Comma-separated values (.csv)

readr package (tidyverse)
- read_csv()
- write_csv()
data.table package
- fread()
- fwrite()

Using readr

read_csv(file, col_names = TRUE, col_types = NULL, na = c("", "NA"), trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), ...)
write_csv(x, path, na = "NA", append = FALSE, col_names = !append)

file is a path to the .csv file to import
x is a data object to export
path is a path to the directory where the exported data will be created
The output of read_csv() is a tibble object
The output of write_csv() is a .csv file

Using data.table

fread(input, sep="auto", sep2="auto", nrows=-1L, header="auto", na.strings="NA", stringsAsFactors=FALSE, skip=0L, colClasses=NULL, col.names,
strip.white=TRUE, fill=FALSE, ...)
fwrite(x, file = "", append = FALSE, quote = "auto", sep = ",", na = "", row.names = FALSE, col.names = TRUE, ...)

input is a path to the .csv file to import
x is a data object to export
file is a path to the directory where the exported data will be created
The output of fread is a data.table object
The output of fwrite is a .csv file in a directory

Base R alternative

read.csv(file,  header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, ...)
write.csv(x, file = "", append = FALSE, quote = TRUE, sep = ",", row.names = TRUE, col.names = TRUE, ...)

Excel spreadsheets (.xlsx/.xls)

readxl package (tidyverse)
- read_excel()
- read_xls() read_xlsx()

read_excel(path, sheet = NULL, range = NULL, col_names = TRUE, col_types = NULL, na = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max))
read_xls(path, ...)
read_xlsx(path, ...)

path is a path to the excel file (.xls or .xlsx) to import
sheet is the name of a sheet in the excel file to import
col_names is a boolean value for using the first row to import as column names
skip is a number of rows to skip
guess_max is a number of rows to use to guess the class of each column
The output is a tibble object

SPSS data files (.sav)

haven package (tidyverse)
- read_sav() read_spss()
- write_sav()
haven also has functions to import/export the file formats of other statistical softwares
- STATA
- SAS

read_sav(file, user_na = FALSE)
read_spss(file, user_na = FALSE)
write_sav(data, path)

file is a path to the SPSS file (.sav) to import in read_sav(), or a path to export the data in write_sav()
data is a data object to export
The output of read_sav() is a tibble object
- read_spss() is a simple alias for read_sav()
The output of write_sav() is an SPSS data file

A "fast-on-disk" data frame storage (.feather)

feather package (tidyverse)
- read_feather()
- write_feather()
The .feather format is also supported in Python!

read_feather(path, columns = NULL)
write_feather(x, path)

path is a path to the .feather file to import in read_feather(), or a path to export the data in write_feather()
x is the data object to export
The output of read_feather() is a tibble object
The output of write_feather() is a feather file

Questions?

plot of chunk unnamed-chunk-52

Source: Joel Ploz

References

Grolemund, G. & Wickham, H. (2017). R for Data Science.
Kabacoff, R. (2017). “Date Values”. Quick-R.
RStudio. (2017). “Dates and Times Cheat Sheet”.
RStudio. (2017). “Work with Strings Cheat Sheet”.
Tidyverse. (n.d.). haven.tidyverse.org
Tidyverse. (n.d.). lubridate.tidyverse.org
Tidyverse. (n.d.). readr.tidyverse.org
Tidyverse. (n.d.). readxl.tidyverse.org
Tidyverse. (n.d.). stringr.tidyverse.org

R Workshop

Module 3: Data analysis with R (2)

Agenda

More on Data Analysis

More on Data Analysis

Working with Strings

stringr from tidyverse

Key stringr functions

Convert letter case

Trim whitespace

Concatenate strings into one

Detect a pattern

Get strings/positions matching a pattern

Extract and replace substrings

More on stringr

Regular expression (regex)

Regular expression in R

Character classes

Metacharacters

Groups

Anchors

Quantifiers

More on regular expression

Working with Dates/Datetimes

Dates/Datetimes basics

lubridate from tidyverse

Key lubridate functions

Convert to a date or date-time

Get/set time component

Parse date-time

Quickly parse date-time

More on lubridate

Importing/Exporting Data

Comma-separated values (.csv)

Using readr

Using data.table

Excel spreadsheets (.xlsx/.xls)

More on readxl

SPSS data files (.sav)

More on haven

A "fast-on-disk" data frame storage (.feather)

More on feather

Questions?

References