This page contains the notes for the second part of R Workshop Module 3: Data Analysis with R, which is part of the R Workshop series prepared by ICJIA Research Analyst Bobae Kang to enable and encourage ICJIA researchers to take advantage of R, a statistical programming language that is one of the most powerful modern research tools.

Data Analysis with R (2): More on data analysis

Source: Iconfinder.com

The second part of Module 3 introduces more tidyverse tools for working with data. Specifically, this part covers (1) working with string data using stringr, (2) working with date and date-time data using lubridate, and (3) importing and exporting data using various packages.


Working with Strings

Source: RStudio

In computer programming, a “string” often refers to a string of characters. Manipulating strings is an important skill for data analysis.

stringr from tidyverse

Source: RStudio

Key stringr functions

In this section, we will learn the following stringr functions:

  • str_to_upper() str_to_lower() str_to_title()
  • str_trim() str_squish()
  • str_c()
  • str_detect()
  • str_subset()
  • str_sub()

Many of these stringr functions also have base R alternatives. Accordingly, we will also see the base R functions that are roughly equivalent to each stringr function. Despite the equivalent base R functions, however, using stringr has an advantage of having a more organized API. That is, all stringr functions begin with str_ prefix. Also, their first input is almost always a string object, which facilitates the use of pipe (%>%) operator.

Please note that this section is not intended to be an exhuastive documentation and description of the listed functions–or stringr package for that matter. For more information, please check out the reference materials listed below.

Convert letter case

str_to_upper(string, locale = "en")
str_to_lower(string, locale = "en")
str_to_title(string, locale = "en")

One of the most basic string operations is to change cases of the character. stringr offers three functions for that task: str_to_upper(), str_to_lower(), and str_to_title().

All three functions take a character vector as the first argument input, string. In fact, if it is not a character vector but can be coerced into one, str_to_*() will automatically coerce, or convert, the input into a character vector.

str_to_upper() turns all characters into uppercase letters, and str_to_lower(), to lowercase letters. str_to_title() capitalizes only the first letter of each word, seperated by a whitespace.

Example

Using each of stringr’s case-changing functions is pretty straightforward:

str_to_upper("hello world")
## [1] "HELLO WORLD"
str_to_lower("HELLO WORLD")
## [1] "hello world"
str_to_title("hello WORLD")
## [1] "Hello World"

Base R alternative

# equivalent to str_to_upper()
toupper(string)

# equivalent to str_to_lower()
tolower(string)

Base R offers functions equivalent to str_to_upper() and str_to_lower().

Trim whitespace

str_trim(string, side = c("both", "left", "right"))
str_squish(string)

str_trim() and str_squish() are functions to trim, or remove, unwanted whitespaces in a character string. As before, the first input string is a character vector. The side argument in str_trim() determines which side of a string to trim: “both” trims whitespaces on both the beginning and the end, “left” trims whitespaces only on the beginning, and “right” trims whitespaces only on the end. str_squish() detects any execssive whitespace and removes it rom the input.

Example

The following code example shows how trimming whitespaces work. Notice that str_squish() also takes care of whitespaces within the string in addition to the left and right ends.

str_trim("  trim both  ", side = "both")
## [1] "trim both"
str_trim("  trim left only  ", side = "left")
## [1] "trim left only  "
str_trim("  trim right only  ", side = "right")
## [1] "  trim right only"
str_squish("  whitespaces all    over the  place   ")
## [1] "whitespaces all over the place"

Base R alternative

# equivalent to str_trim()
trimws(x, which = c("both", "left", "right"))

Base R also offers a function to remove leading and/or trailing whitespaces.

Concatenate strings into one

str_c(..., sep = "", collapse = NULL)

str_c() concatenates multiple strings into a single string. The first argument (...) is one more more character vectors for the concatenating operation. There are two ways to concatenate strings. First is an element-wise concatenation with sep, which a separator string between input vectors. The default value is none (""). Second way of concatenating strings use collapse, an optional string used to combined input vectors into a single string.

Example

The following examples illustrate how concatenating strings with sep and collapse works.

str_c(c("one", "two"), c("plus three", "minus four"), sep = " ")
## [1] "one plus three" "two minus four"

In the first case, we are doing element-wise concatenation between two character vectors. Here we see ethat the first element of the first vector and the first element of the second vector is concatenated with a single whitespace in between. The same thing goes for the second element of both vectors. The output is a character vector of length 2, whose elements are concatenated strings that result from the elements of the input vectors.

str_c(c("one", "two", "three"), "plus four", sep = " ")
## [1] "one plus four"   "two plus four"   "three plus four"

When the length of input vectors do not match, the shorter vector is recycled. Here, the first input vector has a length of 3 while the second input vector has a length of 1. Consequently, the second vector element, "plus four" is recycled and used for concatenation for all three elements of the first input vector. The output has a lenght of 3, which matches the lenght of the longest input vector.

str_c(c("one", "two", "three"), collapse = " plus ")
## [1] "one plus two plus three"

Here we see how collapse works. Note that the final output is a vector of lenght 1, whose only element is a string with all elements of the input vector concatenated with the collapse input in between.

str_c(c("one", "two", "three"), "plus four", sep = " ", collapse = " and ")
## [1] "one plus four and two plus four and three plus four"

Of course, we can combine sep and collapse to try a more complicated concatenation task.

Base R alternative

# equivalent to str_c()
paste (..., sep = " ", collapse = NULL)

Base R alternative to str_c() is paste(), which works pretty much the same.

Detect a pattern

str_detect(string, pattern)

str_detect() is used to detect the presence/absence of a speficied pattern in the input string. As before, string input is a character vector. pattern input is a character vector of length 1 that is a pattern to look for. Finally, output is a boolean (TRUE or FALSE) vector of the same length as the string input.

Note that the pattern input can include regualr expressions, which will be discussed later in the current note.

Example

In this example, we have an input string of length 3. The pattern we want to detect is "I like". As expected, the output is a boolean vecotr of length 3, whose element has a value of either TRUE or FALSE based on the presence of the pattern in the correspoding element in the input vector.

str = c("I like apple", "You like apple", "Apple, I like")
pat = "I like"
str_detect(str, pat)
## [1]  TRUE FALSE  TRUE

Base R alternative

# equivalent to str_detect()
grepl(pattern, x, ...)

The grepl() function of base R is equivalent to str_detect(), although the order of input argument is different: pattern input comes before the string input x.

Get strings/positions matching a pattern

str_subset(string, pattern)
str_which(string, pattern)

str_subset() and str_which() are similar functions that take a character vector as the string input and a pattern to look for as the pattern input. As in str_detect(), the pattern can be speficied using regular expression.

The key difference between str_subset() and str_which() is that the former returns the strings that match the pattern while the latter returns the index of the matching strings.

Example

The following example illustrates the different outputs of str_subset() and str_which():

str = c("I like apple", "You like apple", "Apple, I like")
pat = "I like"
str_subset(str, pat)
## [1] "I like apple"  "Apple, I like"
str_which(str, pat)
## [1] 1 3

Base R alternative

# equivalent to str_subset()
grep(pattern, x, value = TRUE, ...)

# equivalent to str_which()
grep(pattern, x, value = FALSE, ...)

Base R offers grep(), which can do the work of both str_subset() and str_which(), depending on the value argument input.

Extract and replace substrings

str_sub(string, start = 1L, end = -1L)
str_sub(string, start = 1L, end = -1L, omit_na = FALSE) <- value
  • string input is a character vector
  • start and end are integer vectors
    • start is the position of the first substring character; default is the first character
    • end is the position of the last substring character; default is the last character
  • Output is a character vector of substring from start to end.
  • str_sub() can be used to replace substrings when used with the assignment operator (<-)

Example

The following example code illustate using str_sub() to both get and set substring value for the input:

str <- "Hello world"
str_sub(str, start = 7)
## [1] "world"
str_sub(str, end = 5) <- "Hi"
str
## [1] "Hi world"

Base R alternative

# equivalent to str_sub()
substr(x, start, stop)

# equivalent to str_sub() <- value
substr(x, start, stop) <- value

The base R alternative str_sub() is substr(), which can also be used to both get and set the substring value.

More on stringr

What we have touched upon so far is only a small (though highly useful) part of stringr. I recommend you to check out the following resources:


Regular expression (regex)

“Regular expressions are a concise and flexible tool for describing patterns in strings.”
-stringr.tidyverse.org

Regular expression is a collection of special characters and syntax for character strings that allow us to use more complex and dynamic patterns for manipulating and working with strings. Regular expression can be a class of its own. In this section, we will briefly take a look at various elements of regular expression as used in R.

Regular expression in R

In the following, we will explore the following types of regular expression in R:

  • Character classes
  • Metacharacters
  • Groups
  • Anchors
  • Quantifiers

Character classes

Character classes provide a consice way to refer to certain sets of characters, such as alphebetical letters, digits, and various types of whitespaces. The following table offers a quick summary of character classes in R:

Class Description
[[:digit:]] or \d Any digits; i.e. [0-9]
\\D Non-digits; i.e. [^0-9]
[[:lower:]] Lower-case letters; i.e. [a-z]
[[:upper:]] Upper-case letters; i.e. [A-Z]
[[:alpha:]] Alphabetic characters; [A-z]
[[:alnum:]] Alphanumeric characters; i.e. [A-z0-9]
\\w Any Word characters; i.e. [A-z0-9_]
\\W Non-word characters
[[:blank:]] Space and tab
[[:space:]] or \s Space, tab, vertical tab, newline, form feed, carriage return
\\S Not space; i.e. [^[:space:]]

Example

The following code shows how regular expression character classes work:

str <- c("HELLO", "world", "123", "\n")
str_detect(str, "\\d") # has any digit
## [1] FALSE FALSE  TRUE FALSE
str_detect(str, "\\D") # has no digit
## [1]  TRUE  TRUE FALSE  TRUE
str_detect(str, "\\w") # has any alphanumetic character
## [1]  TRUE  TRUE  TRUE FALSE
str_detect(str, "\\s") # has any whitespate
## [1] FALSE FALSE FALSE  TRUE

Metacharacters

Metacharacters are characters with special meaning to the computer program. The following table lists a selection of metacharacters and their meanings:

Metacharacter Description
\n New line
\r Carriage return
\t Tab
\v Vertical tab
\f Form feed

Groups

Regular expression groups offer various ways to specify groups of characters. The following table provides a summary of groups in R:

Group Description
. Any character except \n
| Or, e.g. (a|b)
[...] List permitted characters, e.g. [abc]
[a-z] Specify character ranges
[^...] List excluded characters
(...) Grouping, enables back referencing using \\N where N is integer

Example

The following code example is to illustrate how regular expression groups work.

str <- c("HELLO", "world", "123", "\n")
str_detect(str, ".") # has any character except \n
## [1]  TRUE  TRUE  TRUE FALSE
str_detect(str, "(d|1)") # has d or 1
## [1] FALSE  TRUE  TRUE FALSE
str_detect(str, "[Oo]") # has O or o
## [1]  TRUE  TRUE FALSE FALSE
str_detect(str, "[^HELLO123]") # has characters other than...
## [1] FALSE  TRUE FALSE  TRUE

Anchors

Regular expression anchors make it possible to add locational information to a particular character, e.g., the beginning or end of the whole string. The following table provides a summary of anchors in R:

Anchor Description
^ Start of the string
$ End of the string
\\b Empty string at either edge of a word
\\B NOT the edge of a word
\\< Beginning of a word
\\> End of a word

Example

The following code shows how regular expression anchors can be used:

str <- c("apple", "apricot", "banana", "pineapple")
str_detect(str, "^(a|ba)")
## [1]  TRUE  TRUE  TRUE FALSE
str_detect(str, "apple$")
## [1]  TRUE FALSE FALSE  TRUE

Quantifiers

Quantifiers capture the pattenr of repetition for specified characters. The following table offers a summary of quantifiers in R:

Quantifier Description
* Matches at least 0 times
+ Matches at least 1 time
? Matches at most 1 time; optional string
{n} Matches extactly n times
{n, } Matches at least n times
{, n} Matches at most n times
{n, m} Matches between n and m times

Example

The example below illustrates the use of various quntifiers to specify repetition patterns:

str <- c("apple", "apricot", "banana", "pineapple")
str_detect(str, "p*")
## [1] TRUE TRUE TRUE TRUE
str_detect(str, "p+")
## [1]  TRUE  TRUE FALSE  TRUE
str_detect(str, "p{2,}")
## [1]  TRUE FALSE FALSE  TRUE

More on regular expression

As mentioned earlier, regular expression is a broad topic. To get a better hold of regular expression, I recommend you to check out the following resources:


Working with Datetimes

Source: Iconfinder.com

The work of data analysis often deals with dates. R has data types for date and datetime, which come with useful properties for understanding and analyzing temporal characteristics of data.

Dates/Datetimes basics

As just mentioned, R offers roughly two data types for date and datetime. First, Date is a class that represents calendar dates, which is represented as the number of days since 1970-01-01. Second, POSIXct and POSIXlt are two classes for representing calendar dates and times. Technically, the POSIXct class represents the specified datetime as a number of seconds since 1970-01-01, and POSIXlt represents it in a mixed text and character format. In practice, however, these two prints the identical value and can be used almost interchangeably.

Date and POSIX* classes have useful properties for working with date and datetime, one of which is the possibility of breaking down the data into the time components.

lubridate from tidyverse

type:section

Source: RStudio

Key lubridate functions

In this section, we will learn the following lubridate functions:

  • as_date() as_datetime()
  • year(), month(), day(), hour(), …
  • parse_date_time() fast_strptime()
  • ymd_hms(), ymd(), …

As in stringr, many of lubridate functions have base R alternatives, which we will see whenever relevant. Again, the benefit of using lubridate functions is the cleaner API in addition to some extra convenience.

Convert to a date or date-time

as_date(x, tz = NULL, origin = lubridate::origin)
as_datetime(x, tz = NULL, origin = lubridate::origin)

as_date() and as_datetime() take a vector of POSIXt, numeric or character objects as the input for the first argument, x, and returns a date/datetime object. tz input is a time zone name, such as "CST". Finally, origin is a Date object or something that can be coerced into a Date object, with the default value of "1970-01-01".

The difference between as_date() and as_datetime() is that the former results in a Date objet while the latter returns a POSIXct object.

Example

Let’s take a quick look at how as_date() and as_datetime() works. Note that the input for these functions can be in varying forms, including a numeric value.

as_date(17618)
## [1] "2018-03-28"
class(as_date("20180328"))
## [1] "Date"
as_datetime("2018/03/28")
## [1] "2018-03-28 UTC"
class(as_datetime("2018-03-28"))
## [1] "POSIXct" "POSIXt"

Base R alternative

# equivalent to as_date()
as.Date(x, ...)

# equivalent to as_datetime()
as.POSIXct(x, tz = "", ...)

Base R offers as.Date(), which corresponds to as_date(), and as.POSIXct(), which is equivalent to as_datetime().

Get/set time component

year(x)
year(x) <- value

lubridate offers a set of functiosn to easily get and set time component of a date/datetime object. The first input, x, is a date/datetime object. When used with the assignment operator, <-, we can give a new value to the specified component of the input object.

The following table provides a list of selected functions for getting/setting time component:

Function Description
year() Get/set year component of a date-time
month() Get/set months component of a date-time
week() Get/set weeks component of a date-time
day() Get/set days component of a date-time
hour() Get/set hours component of a date-time
minute() Get/set minutes component of a date-time
second() Get/set seconds component of a date-time
tz() Get/set time zone component of a date-time

Example

The following example illustrates how to get/set time components of a date object:

today <- as_date("2018-03-28")
year(today)
## [1] 2018
month(today) <- 4
today
## [1] "2018-04-28"

Parse date-time

parse_date_time(x, orders, tz = "UTC", truncated = 0, locale = Sys.getlocale("LC_TIME"), exact = FALSE, drop = FALSE, ...)
fast_strptime(x, format, tz = "UTC", lt = TRUE, cutoff_2000 = 68L)

In most cases, as_date() and as_datetime() may be sufficient to convert non-date/-datetime objects to date/datetime objects. However, sometimes a more flexible approach is needed to handle a variety of input formats. parse_date_time() and fast_strptime() are functions for just that.

Both parse_date_time() and fast_strptime() take a character or numeric vector of dates as the input for the first argument.

parse_date_time() then takes a character vector of datetime order format as orders input. For example, we can use "ymd" for various year-month-date formats. exact is a boolean value for using the “exact” match for the datetime format specificed by orders, and drop is a boolean value for dropping, or removing the values not matching the format. In

On the other hand, fast_strptime() takes a character string of formats as format input. The datetime format uses special symbols such as those listed in the following table:

Date format symbols

Symbol Description Example
%Y Year in 4 digits 2018
%y Year in 2 digits 18
%B Month in words March
%b Month in words, abbriviated Mar
%m Month in 2 digits 03
d Date in 2 digits 28

Example

Let us take a look at some examples of “parsing” datetimes using parse_date_time() and fast_strptime(). First we have a vector of three date strings as a sample input.

dates = c("2018-03-28", "2018/03/28", "20180328")

As shown below, parse_date_time() is capable of taking care of varying date formats as long as the order remains the same:

parse_date_time(dates, "ymd")
## [1] "2018-03-28 UTC" "2018-03-28 UTC" "2018-03-28 UTC"

In constrast, fast_strptime() has to specify the foramt for each input:

fast_strptime(dates[1], "%Y-%m-%d")
## [1] "2018-03-28 UTC"
fast_strptime(dates[2], "%Y/%m/%d")
## [1] "2018-03-28 UTC"
fast_strptime(dates[3], "%Y%m%d")
## [1] "2018-03-28 UTC"

Base R alternative

# equivalent to fast_strptime()
strptime(x, format = "", tz = "")

Base R offers strptime(), which is equivalent to fast_strptime(). On the other hand, there is no base R alternative for parse_date_time(). That said, being able to handle multiple date/datetime formats with the same order is one of the advantages of using parse_date_time().

Quickly parse date-time

ymd_hms(..., quiet = FALSE, tz = NULL, ...)
ymd(..., quiet = FALSE, tz = "UTC", ...)

lubridate also offers functions for quickly parse datetimes with a predefined order; under the hood, all these functions do the same work as parse_date_time().

The first ... argument is a character vector of dates in appropriate format. quiet is a boolean value for displaying messages, and tz is a character string speficiying time zone.

The following table lists a selection of functions to quickly parse datetimes.

Date-time Date only Time only
ymd_hms() ymd() hms()
ymd_hm() ydm() hm()
ymd_h() mdy() ms()
mdy_hms() myd()
mdy_hm() dmy()
mdy_h() dym()
dmy_hms()
dmy_hm()
dmy_h()
  • y is year
  • first m is month
  • d is date
  • h is hour
  • second m is minute
  • s is second

More on lubridate

As in the case of stringr, what we have seen is only part of what lubridate offers. I recommend you to check out the following resources to learn more about lubridate.


Importing/Exporting Data

Source: Wikimedia Commons

Importing and exporting datasets in different formats is one of the basic operations for any data analysis task. Here we will explore some options for this task with datasets in various file formats.

Comma-separated values (.csv)

Comma-separate values, with a .csv extension, is one of the most common format to store data. R ecosystem offers a few options to import/export .csv files. Here we will take a quick look at two pacakges as well as the base R solution for this task.

As for third-party packages, tidyverse offers readr pacakge, which has read_csv() and write_csv() functions. data.table package also provides functions to import and export .csv files: fread() and write().

Using readr

read_csv(file, col_names = TRUE, col_types = NULL, na = c("", "NA"), trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), ...)
write_csv(x, path, na = "NA", append = FALSE, col_names = !append)

In read_csv(), file is a path to a .csv file to import. col_names is a boolean value for using the first row values as column names. na input defines what values to consider missing values. skip input is a number of rows to skip before what is going to be the first row for the output data object. The output is a tibble object.

In write_csv(), x is a data object to export as a .csv file, and path is a path to the directory where the exported data will be created. append is a boolean value for “appending” to the existing data with the same name (path).

For more on the function arguments, refer to the relevant documentations.

Using data.table

fread(input, sep="auto", sep2="auto", nrows=-1L, header="auto", na.strings="NA", stringsAsFactors=FALSE, skip=0L, colClasses=NULL, col.names,
strip.white=TRUE, fill=FALSE, ...)
fwrite(x, file = "", append = FALSE, quote = "auto", sep = ",", na = "", row.names = FALSE, col.names = TRUE, ...)

In fread(), input is a path to the .csv file to import. The output of is a data.table object. sep input specifies the inter-column separator while sep2 input defines the intra-column separator. nrows is the number of rows to import, where the default value is all rows. header input determines whether to use the first row as the “header”, i.e. column names. stringAsFactors value specifies whether to convert character columns into factor columns.

In fwrite(), x is a data object to export, and file is a path to the directory where the exported data will be created. sep input is the inter-column separator, with “,” as the default. With this default option, the output file is a .csv file in a directory.

For more on the function arguments, refer to the relevant documentations.

It is noteworthy that fread and fwrite among of the fastest options to import and export data files. Therefore, using the data.table functions may be a preferable solution when dealing with large datasets (e.g., hundreds of megabytes, or even gigabytes).

Base R alternative

read.csv(file,  header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, ...)
write.csv(x, file = "", append = FALSE, quote = TRUE, sep = ",", row.names = TRUE, col.names = TRUE, ...)

Base R offers functions to import and export .csv files. The arguments are similar to those used in readr and data.table functions.

Excel spreadsheets (.xlsx/.xls)

An Excel spreadsheet is another common way to store a tabular data. While base R offers no functiosn to import and export Excel files (.xlsx or .xls), tidyverse has readxl package for this functionality.

read_excel(path, sheet = NULL, range = NULL, col_names = TRUE, col_types = NULL, na = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max))
read_xls(path, ...)
read_xlsx(path, ...)

In read_excel(), path is a path to an Excel file (.xls or .xlsx) to import, and sheet is the name of a sheet in the excel file to import. If sheet is NULL, the default is the first sheet. range input can take a string of Excel ranges, such as “B3:D87”, to import. This can be used with a specified sheet, such as in "sheet2!B2:D78", in which case any sheet input will be ignored. col_names is a boolean value for using the first row to import as column, and skip is a number of rows to skip. guess_max is a number of rows to use to guess the class of each column. Finally, the output is a tibble object.

read_xls and read_xlsx is shortcuts for importing Excel files with the correponding extension. When using these functions, the path input can omit the extension.

For more on the function arguments, refer to the relevant documentations.

More on readxl

SPSS data files (.sav)

SPSS is a popular commercial software for statistical analysis, and has its own data format (.sav). haven package of tidyverse offers functiosn to import and export SPSS data files.

Please note that haven also has functions to import/export the file formats of other statistical softwares, such as STATA and SAS.

read_sav(file, user_na = FALSE)
read_spss(file, user_na = FALSE)
write_sav(data, path)

In read_sav(), file is a path to the SPSS file (.sav) to import. read_spss() is a simple alias for read_sav(). In other case, the output is a tibble object.

In write_sav(), file is a path to export the data in write_sav(), and data is a data object to export. The output of write_sav() is an SPSS data file.

More on haven

A “fast-on-disk” data frame storage (.feather)

feather package is developed by Hadley Wickham, one of the key contributors and authors of many tidyverse packages, and Wes McKinney, the author of Python for Data Analysis and a creator of the popular data-wrangling package for Python called pandas. The package offers a new file format (.feather) that can be used across R and Python, two most popular languages for data analysis and data science. feather is also known for fast importing and exporting of files.

read_feather(path, columns = NULL)
write_feather(x, path)

In read_feather(), path is a path to the .feather file to import. The output of read_feather() is a tibble object.

In write_feather(), path is a path to export the data in write_feather(), and x is the data object to export. The output of write_feather() is a .feather file.

References