This page contains the notes for the second part of R Workshop Module 3: Data Analysis with R, which is part of the R Workshop series prepared by ICJIA Research Analyst Bobae Kang to enable and encourage ICJIA researchers to take advantage of R, a statistical programming language that is one of the most powerful modern research tools.
The second part of Module 3 introduces more tidyverse tools for working with data. Specifically, this part covers (1) working with string data using stringr, (2) working with date and date-time data using lubridate, and (3) importing and exporting data using various packages.
In computer programming, a “string” often refers to a string of characters. Manipulating strings is an important skill for data analysis.
stringr functions
In this section, we will learn the following stringr functions:
str_to_upper()
str_to_lower()
str_to_title()
str_trim()
str_squish()
str_c()
str_detect()
str_subset()
str_sub()
Many of these stringr functions have base R alternatives, so we will also look at the base R function that is roughly equivalent to each stringr function. Even so, stringr has the advantage of a more consistent API: all stringr functions begin with the str_ prefix, and their first input is almost always a string object, which makes them easy to use with the pipe operator (%>%).
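Because the string always comes first, stringr calls chain naturally with the pipe. The following is a minimal sketch of that idea; the sample names in `raw` are made up for illustration, and the functions used are introduced later in this section.

```r
library(stringr)  # loading stringr also makes the %>% pipe available

# Hypothetical messy input: stray spaces and inconsistent case
raw <- c("  alice SMITH ", "BOB   jones")

clean <- raw %>%
  str_squish() %>%   # trim the ends and collapse repeated spaces
  str_to_title()     # capitalize the first letter of each word

clean
# "Alice Smith" "Bob Jones"
```

Each step receives the output of the previous step as its first argument, which is why the consistent argument order matters.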
Please note that this section is not intended to be exhaustive documentation of the listed functions, or of the stringr package for that matter. For more information, please check out the reference materials listed below.
str_to_upper(string, locale = "en")
str_to_lower(string, locale = "en")
str_to_title(string, locale = "en")
One of the most basic string operations is changing the case of characters. stringr offers three functions for that task: str_to_upper(), str_to_lower(), and str_to_title().
All three functions take a character vector as the first argument input, string
. In fact, if it is not a character vector but can be coerced into one, str_to_*()
will automatically coerce, or convert, the input into a character vector.
str_to_upper() turns all characters into uppercase letters, and str_to_lower() turns them into lowercase letters. str_to_title() capitalizes only the first letter of each word, where words are separated by whitespace.
Using each of stringr
’s case-changing functions is pretty straightforward:
str_to_upper("hello world")
## [1] "HELLO WORLD"
str_to_lower("HELLO WORLD")
## [1] "hello world"
str_to_title("hello WORLD")
## [1] "Hello World"
# equivalent to str_to_upper()
toupper(string)
# equivalent to str_to_lower()
tolower(string)
Base R offers functions equivalent to str_to_upper()
and str_to_lower()
.
str_trim(string, side = c("both", "left", "right"))
str_squish(string)
str_trim() and str_squish() are functions to trim, or remove, unwanted whitespace in a character string. As before, the first input, string, is a character vector. The side argument in str_trim() determines which side of a string to trim: “both” trims whitespace at both the beginning and the end, “left” trims whitespace only at the beginning, and “right” trims whitespace only at the end. str_squish() removes leading and trailing whitespace and also reduces repeated whitespace inside the string to a single space.
The following code example shows how trimming whitespace works. Notice that str_squish() also takes care of whitespace within the string in addition to the left and right ends.
str_trim(" trim both ", side = "both")
## [1] "trim both"
str_trim(" trim left only ", side = "left")
## [1] "trim left only "
str_trim(" trim right only ", side = "right")
## [1] " trim right only"
str_squish(" whitespaces all over the place ")
## [1] "whitespaces all over the place"
# equivalent to str_trim()
trimws(x, which = c("both", "left", "right"))
Base R also offers a function to remove leading and/or trailing whitespaces.
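As far as I know, base R has no single function that mirrors str_squish(), but combining trimws() with gsub() gets close. The sketch below uses a made-up input string `x`.

```r
x <- "  too   many   spaces  "

# trimws() only touches the ends, like str_trim()
trimmed <- trimws(x)
trimmed
# "too   many   spaces"

# Collapse inner runs of whitespace as well, similar to str_squish()
squished <- gsub("[[:space:]]+", " ", trimmed)
squished
# "too many spaces"
```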
str_c(..., sep = "", collapse = NULL)
str_c() concatenates multiple strings into a single string. The first argument (...) is one or more character vectors to concatenate. There are two ways to concatenate strings. The first is element-wise concatenation with sep, which is a separator string inserted between input vectors; the default value is an empty string (""). The second way uses collapse, an optional string used to combine the elements of the result into a single string.
The following examples illustrate how concatenating strings with sep
and collapse
works.
str_c(c("one", "two"), c("plus three", "minus four"), sep = " ")
## [1] "one plus three" "two minus four"
In the first case, we are doing element-wise concatenation between two character vectors. Here we see that the first element of the first vector and the first element of the second vector are concatenated with a single space in between. The same goes for the second elements of both vectors. The output is a character vector of length 2, whose elements are the concatenated strings built from the elements of the input vectors.
str_c(c("one", "two", "three"), "plus four", sep = " ")
## [1] "one plus four" "two plus four" "three plus four"
When the lengths of the input vectors do not match, the shorter vector is recycled. Here, the first input vector has a length of 3 while the second input vector has a length of 1. Consequently, the only element of the second vector, "plus four", is recycled and used for concatenation with all three elements of the first input vector. The output has a length of 3, which matches the length of the longest input vector.
str_c(c("one", "two", "three"), collapse = " plus ")
## [1] "one plus two plus three"
Here we see how collapse works. Note that the final output is a vector of length 1, whose only element is a string with all elements of the input vector concatenated with the collapse input in between.
str_c(c("one", "two", "three"), "plus four", sep = " ", collapse = " and ")
## [1] "one plus four and two plus four and three plus four"
Of course, we can combine sep
and collapse
to try a more complicated concatenation task.
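# roughly equivalent to str_c()
paste(..., sep = " ", collapse = NULL)
The base R alternative to str_c() is paste(), which works in much the same way. Note, however, that the default separator of paste() is a single space (sep = " ") rather than an empty string.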
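The difference in default separators is easy to see in a short example. This sketch also uses paste0(), a base R shortcut not covered above that pastes with no separator.

```r
# paste() inserts a space by default
a <- paste("one", "two")
a
# "one two"

# paste0() uses no separator, which matches the str_c() default
b <- paste0("one", "two")
b
# "onetwo"

# sep and collapse behave as they do in str_c()
d <- paste(c("one", "two", "three"), "plus four", sep = " ", collapse = " and ")
d
# "one plus four and two plus four and three plus four"
```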
str_detect(string, pattern)
str_detect() is used to detect the presence or absence of a specified pattern in the input string. As before, the string input is a character vector. The pattern input is a character vector of length 1 that gives the pattern to look for. Finally, the output is a logical (TRUE or FALSE) vector of the same length as the string input.
Note that the pattern input can include regular expressions, which will be discussed later in these notes.
In this example, we have an input vector of length 3. The pattern we want to detect is "I like". As expected, the output is a logical vector of length 3, each element of which is either TRUE or FALSE depending on whether the pattern is present in the corresponding element of the input vector.
str = c("I like apple", "You like apple", "Apple, I like")
pat = "I like"
str_detect(str, pat)
## [1] TRUE FALSE TRUE
# equivalent to str_detect()
grepl(pattern, x, ...)
The grepl()
function of base R is equivalent to str_detect()
, although the order of input argument is different: pattern
input comes before the string input x
.
str_subset(string, pattern)
str_which(string, pattern)
str_subset() and str_which() are similar functions that take a character vector as the string input and a pattern to look for as the pattern input. As in str_detect(), the pattern can be specified using regular expressions.
The key difference between str_subset()
and str_which()
is that the former returns the strings that match the pattern while the latter returns the index of the matching strings.
The following example illustrates the different outputs of str_subset()
and str_which()
:
str = c("I like apple", "You like apple", "Apple, I like")
pat = "I like"
str_subset(str, pat)
## [1] "I like apple" "Apple, I like"
str_which(str, pat)
## [1] 1 3
# equivalent to str_subset()
grep(pattern, x, value = TRUE, ...)
# equivalent to str_which()
grep(pattern, x, value = FALSE, ...)
Base R offers grep()
, which can do the work of both str_subset()
and str_which()
, depending on the value
argument input.
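To make the correspondence concrete, here is the same "I like" example written with the base R functions only. Note the swapped argument order: the pattern comes first.

```r
x <- c("I like apple", "You like apple", "Apple, I like")

# like str_detect(): one TRUE/FALSE per element
hits <- grepl("I like", x)
hits
# TRUE FALSE TRUE

# like str_subset(): the matching strings themselves
matched <- grep("I like", x, value = TRUE)
matched
# "I like apple" "Apple, I like"

# like str_which(): the positions of the matching strings
positions <- grep("I like", x)
positions
# 1 3
```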
str_sub(string, start = 1L, end = -1L)
str_sub(string, start = 1L, end = -1L, omit_na = FALSE) <- value
The string input is a character vector.
start and end are integer vectors.
start is the position of the first substring character; the default is the first character.
end is the position of the last substring character; the default is the last character.
The substring runs from start to end.
str_sub() can be used to replace substrings when used with the assignment operator (<-).
The following example code illustrates using str_sub() to both get and set substring values for the input:
str <- "Hello world"
str_sub(str, start = 7)
## [1] "world"
str_sub(str, end = 5) <- "Hi"
str
## [1] "Hi world"
# equivalent to str_sub()
substr(x, start, stop)
# equivalent to str_sub() <- value
substr(x, start, stop) <- value
The base R alternative to str_sub() is substr(), which can also be used to both get and set the substring value.
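One difference is worth seeing in code: when setting a value, substr() keeps the length of the original string, so a shorter replacement only overwrites as many characters as it contains. The sketch below repeats the "Hello world" example with base R.

```r
x <- "Hello world"

# Getting a substring works like str_sub(), but start and stop are both required
tail_part <- substr(x, 7, 11)
tail_part
# "world"

# Setting a substring: only the first two characters are overwritten,
# because the replacement "Hi" has two characters
substr(x, 1, 5) <- "Hi"
x
# "Hillo world"
```

Compare this with the str_sub() result shown earlier, "Hi world", where the whole range was replaced.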
stringr
What we have touched upon so far is only a small (though highly useful) part of stringr. I recommend checking out the following resources:
stringr on tidyverse.org
stringr CRAN documentation
stringr GitHub repository
“Regular expressions are a concise and flexible tool for describing patterns in strings.”
-stringr.tidyverse.org
Regular expressions are a collection of special characters and syntax that allow us to use more complex and dynamic patterns for manipulating and working with strings. Regular expressions could fill a course of their own. In this section, we will briefly look at various elements of regular expressions as used in R.
We will explore the following elements of regular expressions in R: character classes, metacharacters, groups, anchors, and quantifiers.
Character classes provide a concise way to refer to certain sets of characters, such as alphabetical letters, digits, and various types of whitespace. The following table offers a quick summary of character classes in R:
Class | Description |
---|---|
[[:digit:]] or \\d | Any digit; i.e. [0-9] |
\\D | Non-digits; i.e. [^0-9] |
[[:lower:]] | Lower-case letters; i.e. [a-z] |
[[:upper:]] | Upper-case letters; i.e. [A-Z] |
[[:alpha:]] | Alphabetic characters; i.e. [a-zA-Z] |
[[:alnum:]] | Alphanumeric characters; i.e. [a-zA-Z0-9] |
\\w | Any word character; i.e. [a-zA-Z0-9_] |
\\W | Non-word characters |
[[:blank:]] | Space and tab |
[[:space:]] or \\s | Space, tab, vertical tab, newline, form feed, carriage return |
\\S | Not space; i.e. [^[:space:]] |
The following code shows how regular expression character classes work:
str <- c("HELLO", "world", "123", "\n")
str_detect(str, "\\d") # has any digit
## [1] FALSE FALSE TRUE FALSE
str_detect(str, "\\D") # has any non-digit character
## [1] TRUE TRUE FALSE TRUE
str_detect(str, "\\w") # has any word character
## [1] TRUE TRUE TRUE FALSE
str_detect(str, "\\s") # has any whitespace
## [1] FALSE FALSE FALSE TRUE
Metacharacters are characters with special meaning to the computer program. The following table lists a selection of metacharacters and their meanings:
Metacharacter | Description |
---|---|
\n | New line |
\r | Carriage return |
\t | Tab |
\v | Vertical tab |
\f | Form feed |
Regular expression groups offer various ways to specify groups of characters. The following table provides a summary of groups in R:
Group | Description |
---|---|
. | Any character except \n |
\| | Or, e.g. (a\|b) |
[...] | List permitted characters, e.g. [abc] |
[a-z] | Specify character ranges |
[^...] | List excluded characters |
(...) | Grouping; enables back-referencing using \\N, where N is an integer |
The following code example is to illustrate how regular expression groups work.
str <- c("HELLO", "world", "123", "\n")
str_detect(str, ".") # has any character except \n
## [1] TRUE TRUE TRUE FALSE
str_detect(str, "(d|1)") # has d or 1
## [1] FALSE TRUE TRUE FALSE
str_detect(str, "[Oo]") # has O or o
## [1] TRUE TRUE FALSE FALSE
str_detect(str, "[^HELLO123]") # has characters other than...
## [1] FALSE TRUE FALSE TRUE
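Back-referencing with \\N can be easier to grasp with a small example. The sketch below uses the base R functions grepl() and sub() so that it runs without extra packages; the word list is made up for illustration.

```r
words <- c("apple", "apricot", "banana", "pineapple")

# (.) captures any one character and \\1 requires the same character again,
# so this detects a doubled letter such as "pp"
doubled <- grepl("(.)\\1", words)
doubled
# TRUE FALSE FALSE TRUE

# Two captured groups can be reordered in the replacement string
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")
swapped
# "world hello"
```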
Regular expression anchors make it possible to add locational information to a particular character, e.g., the beginning or end of the whole string. The following table provides a summary of anchors in R:
Anchor | Description |
---|---|
^ | Start of the string |
$ | End of the string |
\\b | Empty string at either edge of a word |
\\B | NOT the edge of a word |
\\< | Beginning of a word |
\\> | End of a word |
The following code shows how regular expression anchors can be used:
str <- c("apple", "apricot", "banana", "pineapple")
str_detect(str, "^(a|ba)")
## [1] TRUE TRUE TRUE FALSE
str_detect(str, "apple$")
## [1] TRUE FALSE FALSE TRUE
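The word-edge anchor \\b is useful when a pattern should match a whole word only. Here is a minimal sketch using base R's grepl() on made-up phrases:

```r
x <- c("apple pie", "pineapple", "an apple")

# "apple" as a whole word: "pineapple" does not qualify
whole_word <- grepl("\\bapple\\b", x)
whole_word
# TRUE FALSE TRUE

# "apple" at the very start of the string
at_start <- grepl("^apple", x)
at_start
# TRUE FALSE FALSE
```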
Quantifiers capture the pattern of repetition for specified characters. The following table offers a summary of quantifiers in R:
Quantifier | Description |
---|---|
* | Matches at least 0 times |
+ | Matches at least 1 time |
? | Matches at most 1 time; optional string |
{n} | Matches exactly n times |
{n,} | Matches at least n times |
{,n} | Matches at most n times |
{n,m} | Matches between n and m times |
The example below illustrates the use of various quantifiers to specify repetition patterns:
str <- c("apple", "apricot", "banana", "pineapple")
str_detect(str, "p*")
## [1] TRUE TRUE TRUE TRUE
str_detect(str, "p+")
## [1] TRUE TRUE FALSE TRUE
str_detect(str, "p{2,}")
## [1] TRUE FALSE FALSE TRUE
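The ? and {n} quantifiers from the table are not covered by the example above, so here is a short sketch using base R's grepl(). The spelling variants and phone-like strings are hypothetical inputs.

```r
# "u?" makes the letter u optional
spelling <- grepl("colou?r", c("color", "colour", "colr"))
spelling
# TRUE TRUE FALSE

# Exactly three digits, a hyphen, then exactly four digits, and nothing else
phone <- grepl("^[0-9]{3}-[0-9]{4}$", c("555-1234", "55-1234", "555-12345"))
phone
# TRUE FALSE FALSE
```

Combining quantifiers with the ^ and $ anchors is what keeps "555-12345" from matching.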
As mentioned earlier, regular expressions are a broad topic. To get a better handle on regular expressions, I recommend checking out the following resources:
The work of data analysis often deals with dates. R has data types for date and datetime, which come with useful properties for understanding and analyzing temporal characteristics of data.
As just mentioned, R offers roughly two kinds of data types for dates and datetimes. First, Date is a class that represents calendar dates, stored as the number of days since 1970-01-01. Second, POSIXct and POSIXlt are two classes for representing calendar dates and times. Technically, the POSIXct class represents the specified datetime as the number of seconds since 1970-01-01, while POSIXlt stores it as a list of named components such as seconds, minutes, hours, and day of the month. In practice, however, the two print identical values and can be used almost interchangeably.
Date
and POSIX*
classes have useful properties for working with date and datetime, one of which is the possibility of breaking down the data into the time components.
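The underlying representations can be inspected directly in base R. This is a small sketch; the time zone is set to UTC explicitly so the numbers do not depend on the local machine.

```r
d <- as.Date("2018-03-28")
days_since_epoch <- as.numeric(d)   # days since 1970-01-01
days_since_epoch
# 17618

dt <- as.POSIXct("2018-03-28 12:00:00", tz = "UTC")
secs_since_epoch <- as.numeric(dt)  # seconds since 1970-01-01 00:00:00 UTC
secs_since_epoch
# 1522238400

# POSIXlt exposes the time components as list elements
lt <- as.POSIXlt(dt)
lt$hour          # 12
lt$mday          # 28
lt$mon + 1       # 3, because months are stored as 0-11
lt$year + 1900   # 2018, because years are stored as years since 1900
```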
lubridate functions
In this section, we will learn the following lubridate functions:
as_date()
as_datetime()
year(), month(), day(), hour(), …
parse_date_time()
fast_strptime()
ymd_hms(), ymd(), …
As with stringr, many lubridate functions have base R alternatives, which we will see whenever relevant. Again, the benefit of using lubridate functions is the cleaner API in addition to some extra convenience.
as_date(x, tz = NULL, origin = lubridate::origin)
as_datetime(x, tz = NULL, origin = lubridate::origin)
as_date() and as_datetime() take a vector of POSIXt, numeric, or character objects as the input for the first argument, x, and return a date/datetime object. The tz input is a time zone name, such as "America/Chicago". Finally, origin is a Date object or something that can be coerced into a Date object, with the default value of "1970-01-01".
The difference between as_date() and as_datetime() is that the former results in a Date object while the latter returns a POSIXct object.
Let’s take a quick look at how as_date() and as_datetime() work. Note that the input for these functions can be in varying forms, including a numeric value.
as_date(17618)
## [1] "2018-03-28"
class(as_date("20180328"))
## [1] "Date"
as_datetime("2018/03/28")
## [1] "2018-03-28 UTC"
class(as_datetime("2018-03-28"))
## [1] "POSIXct" "POSIXt"
# equivalent to as_date()
as.Date(x, ...)
# equivalent to as_datetime()
as.POSIXct(x, tz = "", ...)
Base R offers as.Date()
, which corresponds to as_date()
, and as.POSIXct()
, which is equivalent to as_datetime()
.
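The base R functions are a little stricter about their input than the lubridate versions. The sketch below shows two things to watch for: a numeric input is given an explicit origin here (older R versions require it), and a compact string such as "20180328" needs an explicit format.

```r
# Numeric input: days since the origin
from_number <- as.Date(17618, origin = "1970-01-01")
from_number
# "2018-03-28"

# A string without separators needs its format spelled out
from_string <- as.Date("20180328", format = "%Y%m%d")
from_string
# "2018-03-28"

# Datetime with an explicit time zone instead of the local default
stamp <- as.POSIXct("2018-03-28 14:30:00", tz = "UTC")
class(stamp)
# "POSIXct" "POSIXt"
```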
year(x)
year(x) <- value
lubridate offers a set of functions to easily get and set the time components of a date/datetime object. The first input, x, is a date/datetime object. When used with the assignment operator, <-, we can give a new value to the specified component of the input object.
The following table provides a list of selected functions for getting/setting time component:
Function | Description |
---|---|
year() | Get/set year component of a date-time |
month() | Get/set months component of a date-time |
week() | Get/set weeks component of a date-time |
day() | Get/set days component of a date-time |
hour() | Get/set hours component of a date-time |
minute() | Get/set minutes component of a date-time |
second() | Get/set seconds component of a date-time |
tz() | Get/set time zone component of a date-time |
The following example illustrates how to get/set time components of a date object:
today <- as_date("2018-03-28")
year(today)
## [1] 2018
month(today) <- 4
today
## [1] "2018-04-28"
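The same accessors work on datetime objects, including the time-of-day components. A brief sketch follows, using a made-up timestamp and ymd_hms(), which is covered later in this section:

```r
library(lubridate)

stamp <- ymd_hms("2018-03-28 14:30:15")  # parsed as UTC by default

h <- hour(stamp)
h
# 14
m <- minute(stamp)
m
# 30

# Setting a component changes only that component
day(stamp) <- 1
stamp
# "2018-03-01 14:30:15 UTC"
```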
parse_date_time(x, orders, tz = "UTC", truncated = 0, locale = Sys.getlocale("LC_TIME"), exact = FALSE, drop = FALSE, ...)
fast_strptime(x, format, tz = "UTC", lt = TRUE, cutoff_2000 = 68L)
In most cases, as_date()
and as_datetime()
may be sufficient to convert non-date/-datetime objects to date/datetime objects. However, sometimes a more flexible approach is needed to handle a variety of input formats. parse_date_time()
and fast_strptime()
are functions for just that.
Both parse_date_time()
and fast_strptime()
take a character or numeric vector of dates as the input for the first argument.
parse_date_time() then takes a character vector of datetime order formats as the orders input. For example, we can use "ymd" for various year-month-day formats. exact is a boolean value for using the “exact” match for the datetime format specified by orders, and drop is a boolean value for dropping, or removing, the values not matching the format.
On the other hand, fast_strptime() takes a character string of formats as the format input. The datetime format uses special symbols such as those listed in the following table:
Symbol | Description | Example |
---|---|---|
%Y | Year in 4 digits | 2018 |
%y | Year in 2 digits | 18 |
%B | Month in words | March |
%b | Month in words, abbreviated | Mar |
%m | Month in 2 digits | 03 |
%d | Day of the month in 2 digits | 28 |
Let us take a look at some examples of “parsing” datetimes using parse_date_time()
and fast_strptime()
. First we have a vector of three date strings as a sample input.
dates = c("2018-03-28", "2018/03/28", "20180328")
As shown below, parse_date_time()
is capable of taking care of varying date formats as long as the order remains the same:
parse_date_time(dates, "ymd")
## [1] "2018-03-28 UTC" "2018-03-28 UTC" "2018-03-28 UTC"
In contrast, fast_strptime() requires the format to be specified for each input:
fast_strptime(dates[1], "%Y-%m-%d")
## [1] "2018-03-28 UTC"
fast_strptime(dates[2], "%Y/%m/%d")
## [1] "2018-03-28 UTC"
fast_strptime(dates[3], "%Y%m%d")
## [1] "2018-03-28 UTC"
# equivalent to fast_strptime()
strptime(x, format = "", tz = "")
Base R offers strptime(), which is equivalent to fast_strptime(). On the other hand, there is no base R alternative for parse_date_time(); being able to handle multiple date/datetime formats that share the same order is one of its advantages.
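To appreciate what parse_date_time() saves us, consider handling mixed formats with base R alone. The following is one possible sketch, not the only approach: it tries a list of candidate formats in turn and fills in whatever is still unparsed. The `formats` vector is an assumption tailored to this sample input.

```r
dates <- c("2018-03-28", "2018/03/28", "20180328")

# Candidate formats to try, in order (chosen for this sample)
formats <- c("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")

# Start with all-missing dates, then fill the gaps one format at a time
parsed <- as.Date(rep(NA, length(dates)))
for (f in formats) {
  todo <- is.na(parsed)                           # elements not parsed yet
  parsed[todo] <- as.Date(dates[todo], format = f)
}

parsed
# "2018-03-28" "2018-03-28" "2018-03-28"
```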
ymd_hms(..., quiet = FALSE, tz = NULL, ...)
ymd(..., quiet = FALSE, tz = "UTC", ...)
lubridate also offers functions for quickly parsing datetimes with a predefined order; under the hood, all these functions do the same work as parse_date_time().
The first ... argument is a character vector of dates in the appropriate format. quiet is a boolean value that controls whether messages are displayed, and tz is a character string specifying the time zone.
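The function name spells out the order of the components in the input, as this short sketch with made-up inputs shows:

```r
library(lubridate)

a <- ymd("20180328")             # year, month, day
b <- mdy("03/28/2018")           # month, day, year
d <- dmy_hm("28-03-2018 14:30")  # day, month, year, then hour and minute

a
# "2018-03-28"
b
# "2018-03-28"
d
# "2018-03-28 14:30:00 UTC"
```

The date-only functions return Date objects, while the date-time functions return POSIXct objects.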
The following table lists a selection of functions to quickly parse datetimes.
Date-time | Date only | Time only |
---|---|---|
ymd_hms() | ymd() | hms() |
ymd_hm() | ydm() | hm() |
ymd_h() | mdy() | ms() |
mdy_hms() | myd() | |
mdy_hm() | dmy() | |
mdy_h() | dym() | |
dmy_hms() | | |
dmy_hm() | | |
dmy_h() | | |
y is year
m is month (in the date part)
d is day
h is hour
m is minute (in the time part)
s is second
As in the case of stringr, what we have seen is only part of what lubridate offers. I recommend checking out the following resources to learn more about lubridate.
lubridate on tidyverse.org
lubridate CRAN documentation
lubridate GitHub repository
Importing and exporting datasets in different formats is one of the basic operations for any data analysis task. Here we will explore some options for this task with datasets in various file formats.
Comma-separated values (CSV), with a .csv extension, is one of the most common formats for storing data. The R ecosystem offers a few options to import and export .csv files. Here we will take a quick look at two packages as well as the base R solution for this task.
As for third-party packages, tidyverse offers the readr package, which has the read_csv() and write_csv() functions. The data.table package also provides functions to import and export .csv files: fread() and fwrite().
readr
read_csv(file, col_names = TRUE, col_types = NULL, na = c("", "NA"), trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), ...)
write_csv(x, path, na = "NA", append = FALSE, col_names = !append)
In read_csv()
, file
is a path to a .csv file to import. col_names
is a boolean value for using the first row values as column names. na
input defines what values to consider missing values. skip
input is a number of rows to skip before what is going to be the first row for the output data object. The output is a tibble
object.
In write_csv(), x is a data object to export as a .csv file, and path is the path of the file to write. append is a boolean value for appending to an existing file at that path instead of overwriting it.
For more on the function arguments, refer to the relevant documentation.
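Here is a minimal sketch of how skip and na work together in read_csv(). The file contents are invented for illustration: a note on the first line, and -99 used as a missing-value code. A temporary file is used so that nothing is written to your project folder.

```r
library(readr)

# Write a small hypothetical .csv file with a note on the first line
in_path <- tempfile(fileext = ".csv")
writeLines(c("exported on 2018-03-28",
             "id,score",
             "1,10",
             "2,-99",
             "3,NA"),
           in_path)

# Skip the note, then treat "", "NA", and "-99" as missing values
dat <- read_csv(in_path, skip = 1, na = c("", "NA", "-99"))
dat$score
# 10 NA NA

# Export the cleaned data to a new file
out_path <- tempfile(fileext = ".csv")
write_csv(dat, out_path)
```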
data.table
fread(input, sep="auto", sep2="auto", nrows=-1L, header="auto", na.strings="NA", stringsAsFactors=FALSE, skip=0L, colClasses=NULL, col.names,
strip.white=TRUE, fill=FALSE, ...)
fwrite(x, file = "", append = FALSE, quote = "auto", sep = ",", na = "", row.names = FALSE, col.names = TRUE, ...)
In fread(), input is a path to the .csv file to import. The output is a data.table object. The sep input specifies the separator between columns, while sep2 specifies the separator within columns. nrows is the number of rows to import, where the default is all rows. The header input determines whether to use the first row as the “header”, i.e. column names. The stringsAsFactors value specifies whether to convert character columns into factor columns.
In fwrite(), x is a data object to export, and file is the path of the file to write. The sep input is the separator between columns, with “,” as the default. With this default option, the output is a .csv file.
For more on the function arguments, refer to the relevant documentation.
It is noteworthy that fread() and fwrite() are among the fastest options to import and export data files. Therefore, using the data.table functions may be a preferable solution when dealing with large datasets (e.g., hundreds of megabytes, or even gigabytes).
read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, ...)
write.csv(x, file = "", append = FALSE, quote = TRUE, sep = ",", row.names = TRUE, col.names = TRUE, ...)
Base R offers functions to import and export .csv files. The arguments are similar to those used in readr
and data.table
functions.
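A quick round trip shows the base R functions in action. This is a small sketch with a made-up data frame and a temporary file; row.names = FALSE is set because write.csv() otherwise adds the row names as an extra column.

```r
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")

# Export without the row-name column
write.csv(df, path, row.names = FALSE)

# Import again; keep text columns as character rather than factor
back <- read.csv(path, stringsAsFactors = FALSE)
back$name
# "a" "b" "c"
```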
An Excel spreadsheet is another common way to store tabular data. While base R offers no functions to import and export Excel files (.xlsx or .xls), tidyverse has the readxl package for importing them. Note that readxl only reads Excel files; it does not write them.
read_excel(path, sheet = NULL, range = NULL, col_names = TRUE, col_types = NULL, na = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max))
read_xls(path, ...)
read_xlsx(path, ...)
In read_excel(), path is a path to an Excel file (.xls or .xlsx) to import, and sheet is the name of a sheet in the Excel file to import. If sheet is NULL, the default is the first sheet. The range input can take a string describing an Excel range, such as “B3:D87”, to import. The range can also name a sheet, as in "sheet2!B2:D78", in which case any sheet input will be ignored. col_names is a boolean value for using the first imported row as column names, and skip is the number of rows to skip. guess_max is the number of rows used to guess the class of each column. Finally, the output is a tibble object.
read_xls() and read_xlsx() are shortcuts for importing Excel files with the corresponding extension. Calling them directly skips the file-format guessing that read_excel() performs.
For more on the function arguments, refer to the relevant documentation.
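The sketch below tries the sheet and range arguments on the example workbook that ships with readxl, located with readxl_example(). It assumes that this workbook contains a sheet named "mtcars", as in the package's own examples.

```r
library(readxl)

# Path to an example workbook bundled with the package
path <- readxl_example("datasets.xlsx")

# List the sheet names before choosing one
excel_sheets(path)

# Read only cells A1:C5 of the "mtcars" sheet:
# one header row plus four data rows, three columns
cars <- read_excel(path, sheet = "mtcars", range = "A1:C5")
dim(cars)
# 4 3
```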
readxl
readxl on tidyverse.org
readxl CRAN documentation
readxl GitHub repository
SPSS is a popular commercial software package for statistical analysis and has its own data format (.sav). The haven package of tidyverse offers functions to import and export SPSS data files.
Please note that haven also has functions to import and export the file formats of other statistical software, such as Stata and SAS.
read_sav(file, user_na = FALSE)
read_spss(file, user_na = FALSE)
write_sav(data, path)
In read_sav(), file is a path to the SPSS file (.sav) to import. read_spss() is a simple alias for read_sav(). In either case, the output is a tibble object.
In write_sav(), data is a data object to export, and path is the path of the file to write. The output of write_sav() is an SPSS data file.
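A round trip through a temporary .sav file gives a feel for these functions. The data frame here is made up for illustration.

```r
library(haven)

df <- data.frame(id = 1:3, score = c(10.5, 20, 30.25))

path <- tempfile(fileext = ".sav")

# Export to the SPSS format, then import it again
write_sav(df, path)
back <- read_sav(path)

back$score
# 10.50 20.00 30.25
```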
haven
haven on tidyverse.org
haven CRAN documentation
haven GitHub repository
The feather package was developed by Hadley Wickham, one of the key contributors and authors of many tidyverse packages, and Wes McKinney, the author of Python for Data Analysis and a creator of pandas, the popular data-wrangling package for Python. The package offers a file format (.feather) that can be used across R and Python, two of the most popular languages for data analysis and data science. feather is also known for fast importing and exporting of files.
read_feather(path, columns = NULL)
write_feather(x, path)
In read_feather()
, path
is a path to the .feather file to import. The output of read_feather()
is a tibble
object.
In write_feather(), x is the data object to export, and path is the path of the file to write. The output of write_feather() is a .feather file.
feather
feather
CRAN documentation