This page contains the notes for R Workshop Module 1: Introduction to R. The R Workshop series was prepared by ICJIA Research Analyst Bobae Kang to enable and encourage ICJIA researchers to take advantage of R, a statistical programming language that is one of the most powerful modern research tools.

Introduction to the Workshop

Instructor (Bobae Kang)

My name is Bobae Kang and I am a Research Analyst at ICJIA. At the time of writing, I have been using R for approximately a year and a half, and I believe that R is one of the most powerful modern computational tools for research and data analysis. I have developed this workshop to invite my colleagues and fellow researchers at ICJIA to start taking advantage of this amazing tool. See About Me to read more about me.

Workshop objectives

This workshop is developed to offer a gentle introduction to the R programming language for ICJIA researchers. More specifically, the goal of this workshop is to enable and empower its participants to:

  • Import and manipulate tabular data files using R;
  • Create simple data visualizations (scatterplot, histogram, bar chart, line chart, etc.) to extract insight from data using R;
  • Perform basic statistical analysis using R;
  • Generate a report on a simple data analysis task using R;
  • Understand the basic elements of the R programming language;
  • Employ the programmatic approach to research and data analysis projects; and
  • Leverage online resources to find solutions to specific questions on using R for a given task.

Of course, mastering any of the listed skills requires more than this workshop. This workshop alone will not make any of its participants an expert R programmer. However, it will help the participants get started and provide them with basic skills and techniques for using R in research and data analysis. Ultimately, this workshop seeks to help the participants gain the knowledge and confidence necessary to learn what they need to know for their own research projects.

A programming approach to research1

One of the core objectives of this workshop is to convince its participants to try a different approach to research: a programming (or programmatic) approach.

Many researchers and analysts rely on “graphical user interface” (GUI) tools, which involve a lot of pointing-and-clicking and dragging-and-dropping. This is understandable, since GUI tools offer a highly intuitive way of interacting with data. Also, learning a programming language often appears to be a daunting task to those with little or no prior programming experience. However, adopting a programming approach has strong advantages that can far outweigh the pain of learning it (which, by the way, can be fun as well!).

GUI workflow vs. programmatic workflow

Before discussing the specific advantages of the programmatic approach to research, let us compare two different workflows, one dependent on GUI tools and the other programmatic, with examples.

Here we introduce two researchers:

  • Sterling Malory Archer is an ICJIA research analyst who relies almost exclusively on GUI tools for his research projects.

  • Lana Anthony Kane, another ICJIA research analyst, has adopted a programming approach to research.

Let us now compare how these two work differently for a simple data analysis project.

Sterling: a typical GUI-oriented workflow

  1. Sterling finds out that he needs Uniform Crime Reports (UCR) data from the Illinois State Police (ISP) website for his project. In fact, he needs data for the five most recent years. He goes to the website and downloads each file separately, each of which lands in the default Downloads folder.
  2. Sterling browses to the Downloads folder and opens the files in MS Excel:
    • Sterling wants to have the data for all years in a single spreadsheet, so he copies and pastes multiple times to get everything in one place.
    • Sterling then realizes that he needs only a few variables for his analysis. He deletes all columns but the ones he needs and saves the result (mydata.xlsx) to a research directory that he creates while saving the combined data file.
  3. Sterling opens the combined data in SPSS:
    • Sterling runs a quick regression analysis and checks the results. The model fit doesn’t look good; oops, he should have included another variable in the model.
  4. So Sterling goes back to the Downloads folder for the original files, spends several minutes finding the downloaded files (there are so many files in the Downloads folder!), opens all the spreadsheets in Excel, and copies and pastes the desired column from them into the combined spreadsheet. He saves it as a separate file (mydata2.xlsx).
  5. Sterling opens the updated data file in SPSS and runs the model again with more pointing and clicking. Finally, the results look good.
  6. Sterling opens MS Word to write a report on his model output.
    • He copy-and-pastes the model outputs and graphs from the SPSS report to the Word document.
  7. Sterling sends the report to his supervisor. Done!
    • If his supervisor later comes to Sterling and asks him how he came to his conclusion, will Sterling remember all the steps he took?

Lana: a programmatic workflow

  1. Before anything else, Lana sets up a directory specifically for the project at hand and creates some sub-directories where different types of inputs and outputs will be stored. From past experience, Lana knows that she needs at least the following sub-directories:
    • data: to store data inputs
    • images: to store image outputs
    • reports: to store report outputs
  2. Lana goes to the ISP website to find where the UCR data files are located. Then she writes a program to download and store them in the data subdirectory, all at once! She saves the R script for this program in the main directory.
  3. Lana writes a separate program that imports the data files and combines them into a single table. She then takes a closer look at the table to identify which variables she needs, and writes a few more lines of code to select only the corresponding columns and save the modified data table in the data folder. Again, she saves the R script in the main directory. (A minimal sketch of these two steps appears after this list.)
  4. Lana creates another separate program, which imports the clean data file, fits a regression model, and generates some nice-looking plots. The plots are saved in the images folder.
    • Oops, the model fit does not look good. She should have included another variable. Lana goes back to the tab for the data cleaning script and, by tweaking a few lines in her code, quickly re-creates the data file.
    • Lana then comes back to the modeling script tab and, without changing a single line of code, simply re-runs it to get satisfying regression results and plots.
    • She saves the script in the main directory.
  5. Lana creates an R Markdown file for her report. Since R Markdown can easily incorporate R code, there is no need to copy and paste the results. Lana knits the report into a nice-looking html document, now saved in the reports subdirectory. The R Markdown script is also saved.
  6. Lana sends the report to her supervisor with a link to her project directory. Done!
    • If the supervisor later comes back to Lana and questions how she came to her conclusion, everything is there and she has nothing to hide!
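To make the contrast concrete, here is a minimal sketch of what Lana's download-and-combine scripts (steps 2 and 3) might look like, using base R only. The URLs, file names, and column names are hypothetical placeholders, not the actual ISP/UCR locations.

```r
# Hypothetical sketch of Lana's steps 2 and 3 (placeholder URLs and columns)

dir.create("data", showWarnings = FALSE)

years <- 2012:2016
urls  <- paste0("http://www.example.com/ucr_", years, ".csv")   # placeholder URLs
files <- file.path("data", paste0("ucr_", years, ".csv"))

# Step 2: download every file in one go
for (i in seq_along(urls)) {
  download.file(urls[i], destfile = files[i])
}

# Step 3: import, combine, and keep only the needed columns
ucr_list <- lapply(files, read.csv, stringsAsFactors = FALSE)
ucr_all  <- do.call(rbind, ucr_list)
ucr_all  <- ucr_all[, c("year", "county", "violent_crime")]     # hypothetical columns
write.csv(ucr_all, file.path("data", "ucr_combined.csv"), row.names = FALSE)
```

Because every step is written down, adding a forgotten variable later is a matter of editing one line and re-running the script, rather than another round of copying and pasting.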

Benefits of a programming approach

Automation

At its core, a programming approach to research seeks to automate the work of research as much as possible. In practice, automation means implementing the research work in programs (e.g., R script files), which can be run later to execute the work automatically. This is the opposite of the GUI-based approach, where each and every step of the research process takes place interactively. By automating tasks, a researcher can rest assured that she will get consistent results when she revisits her work or shares it with others.
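As a small, hypothetical illustration (the file names below are made up), once each stage of a project lives in its own script, the entire analysis can be re-run from top to bottom with a few source() calls:

```r
# Re-run the whole project in order; each script automates one stage
source("01_download_data.R")   # fetch the raw data files
source("02_clean_data.R")      # combine and tidy the data
source("03_fit_model.R")       # fit the model and save the plots
```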

Modularity

In software design, modularity refers to a logical partitioning of the “software design” that allows complex software to be manageable for the purpose of implementation and maintenance.
- “Modularity”, Wikipedia

A programming approach to research enables researchers to achieve a great degree of modularity, where a researcher can break down different stages or steps of research work into smaller but meaningful parts. In our example, this is shown by Lana writing separate programs for collecting, cleaning, and analyzing her data. The benefit of modularity is clear here. When Lana had to change her data cleaning, she only had to change the part of the program relevant to that change while leaving everything else intact, a great gain in efficiency.

In fact, a researcher may accomplish an even greater degree of modularity, for instance, by writing custom functions that can be reused in various steps of the project, or even in another project!
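For example (a hypothetical helper, not part of the workshop materials), a small cleaning step that comes up repeatedly can be wrapped in a function and reused across scripts or projects:

```r
# Reusable helper: convert a raw count to a rate per 100,000 population
rate_per_100k <- function(count, population) {
  (count / population) * 100000
}

# Reused wherever it is needed, e.g., in a data-cleaning script
# (ucr_all and its columns are hypothetical):
# ucr_all$violent_rate <- rate_per_100k(ucr_all$violent_crime, ucr_all$population)
```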

Reproducibility

Reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials and procedures as were used by the original investigator. […] Reproducibility is a minimum necessary condition for a finding to be believable and informative.
- U.S. NSF Subcommittee on Replicability in Science

Although it may be possible to ensure reproducibility with a GUI-oriented approach, doing so is an uphill battle, since each step in the procedure has to be documented thoroughly and separately. Also, there is little guarantee that researchers who come later will follow the directions exactly as intended by the original researcher. In contrast, the work of research stored in a program, or a set of programs, is highly reproducible. Replicating such work does not rely on human interpretation of a documented procedure; rather, the computer simply and faithfully follows the directions inscribed in the program.

A high degree of reproducibility can also greatly enhance productivity in a collaborative project. Co-researchers can simply share code that contains all the information about what steps to take, which leaves little room for miscommunication while allowing the researchers to focus on what is important.
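Two small R habits that support reproducibility, shown here as a brief illustration rather than a required workflow, are fixing the random seed whenever randomness is involved and recording the software environment along with the results:

```r
# Fix the random number generator so results involving randomness
# (sampling, simulation, bootstrapping) are identical on every run
set.seed(2018)
resampled_rows <- sample(1:1000, size = 100)

# Record the R version and loaded packages alongside the output
sessionInfo()
```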

Version control

The notion of version control refers to the practice of managing changes in a document or a program in a systematic fashion. One major benefit of version control is that it can protect the work of research from unwanted changes, such as mistakenly deleting files, as such changes can be easily identified and reverted.

Although the topic of version control is beyond the scope of this workshop, suffice it to note that a programming approach to research makes version control easy to implement, as many excellent version control systems are available free of charge. Git is the most popular such system; read its documentation to learn more.


Introducing R

## [1] "Hello World!"
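The greeting above is the output of a single line of R code:

```r
print("Hello World!")
## [1] "Hello World!"
```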

Source: r-project.org

What is R?

“R is a language and environment for statistical computing and graphics.” - The R Foundation

  • Built for data analysis and visualization
  • One of the most popular programming languages among academic researchers and data scientists
    • The following figure captures the growth of R in recent years compared to other programming languages. Although the data used to generate this figure does not provide the whole picture, it certainly reflects the reality that R has established itself as one of the most popular tools for researchers and data analysts.

Source: David Robinson, 2017, “The Impressive Growth of R”

Why R?

  • Open source (free!)
    • Being open source also means that its source code, much of which is written in C and Fortran as well as R itself, is open for anyone to review and, albeit rarely, contribute to.
  • Built for statistical analysis
    • R, at its very core, is designed for statistical computing. As such, a wide range of statistical and graphical techniques, which are often offered as premium add-ons in other commercial tools, are part of the language’s core distribution.
  • Reproducible and transparent
    • Saving one’s work as an R script makes it easy to reproduce any data manipulation and analysis, ensuring maximal clarity about which inputs are used and which steps are taken to generate the output. Collaborators can share their work in the R script format to quickly reach a shared understanding of the progress.
    • As stated above, R is open source, which gives the language a level of transparency that commercial software cannot afford. Most R packages are also open source, making it possible for any interested user to see precisely how things work under the hood.
  • Extensible through powerful third-party packages (a brief example follows this list)
  • Supported by its active user community
    • The strong R community is often considered one of the key advantages of using the language.2
  • Enabling researchers to tackle a variety of tasks using a single platform
    • Data manipulation and analysis
    • Basic and advanced statistical modeling
    • Publication-ready data visualization
    • Generating reports in various formats (html webpage, MS Word, PDF)
    • Slideshows, websites, interactive dashboards, etc.
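As a brief, hypothetical illustration of the extensibility point (using the dplyr package and the small mtcars dataset that ships with R, rather than workshop data), a third-party package is installed once from CRAN and then loaded in each session where it is needed:

```r
# Install once from CRAN, then load in each session
install.packages("dplyr")
library(dplyr)

# The same session can then summarize data without leaving R
# (mtcars is a small dataset that ships with R)
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n_cars = n())
```

The same session could go on to fit models, draw plots, and knit a report, all without switching tools.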

Comparisons

R vs MS Excel

  • License cost
    • Although often taken for granted, the MS Office Suite is in fact not free, whereas R is free. (Of course, this may be a moot point, since other MS Office products are well loved and widely used on a daily basis.)
  • Speed and scalability
    • Excel is slow, and its performance quickly deteriorates as the data size grows. R offers many tools highly optimized to work with large data.
    • Excel has limits on the number of rows and columns in a worksheet (1,048,576 rows by 16,384 columns). R has no such fixed limits; in practice, data size in R is bounded by the available memory.
  • Visualization
    • While Excel provides some basic visualization options, it is difficult in Excel to create highly customized, publication-quality plots. R can achieve this with only a few lines of code (see the example after this list).
  • Complex and advanced analysis
    • R’s data manipulation capabilities are simply superior to Excel’s.
    • Although Excel offers a surprising variety of built-in functions for data analysis, users are still limited to those built-in options. R not only comes with advanced data analysis functionality but can also be easily extended with add-on packages.
  • Reproducibility
    • Reproducibility can be achieved in Excel only to a limited extent; any marginally complex analysis becomes extremely difficult to reproduce.
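As a rough illustration of the visualization point (using the built-in mtcars data rather than workshop data), a labeled, publication-style chart takes only a few lines with the ggplot2 package:

```r
library(ggplot2)

# A customized scatterplot with a fitted line, labels, and a clean theme
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "steelblue", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "darkred") +
  labs(
    title = "Fuel efficiency by vehicle weight",
    x = "Weight (1,000 lbs)",
    y = "Miles per gallon"
  ) +
  theme_minimal()
```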

R vs IBM SPSS

  • License cost (again)
    • At the time of writing, the Base pricing option for SPSS is $99 per user per month. In addition, add-ons for other advanced functionalities must be purchased separately.3 R is free.
  • Syntax
    • SPSS provides Syntax to programmatically perform data cleaning, analysis, and statistical modeling tasks. However, it is often noted that the grammatical/syntactic structure of SPSS Syntax is difficult to use (especially for complex operations) and often at odds with general programming standards.4 R, on the other hand, follows much of the standard grammar of modern programming languages. In addition, many tutorials on the basics of the R language are readily available.
  • Visualization
    • Although SPSS offers basic visualization options for commonly used charts and graphs, just as in Excel, only minimal customization is available and the outputs are little better than Excel graphs.
  • Reporting
    • SPSS is capable of quickly generating a report on the data, but the output is merely functional. R, on the other hand, offers R Markdown documents that can be used to generate highly customized, publication-ready reports and articles in various formats, including html webpages, MS Word documents, and PDF documents (a brief example follows).
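For instance (the file path below is a hypothetical placeholder), an R Markdown report that embeds the analysis code can be rendered to an HTML, Word, or PDF document with a single call to the rmarkdown package:

```r
# Render the report; the .Rmd file contains the narrative plus R code chunks,
# so the numbers, tables, and plots are regenerated on every knit
library(rmarkdown)
render("reports/ucr_report.Rmd", output_format = "html_document")
```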

R vs Tableau

  • License cost (DUH!)
    • At the time of writing, the Professional Edition of Tableau Desktop software costs $70 per user per month. Also, to store data sources and workbooks securely, an additional cost of $35 per user per month for Tableau Server must be paid.5 R is free.
  • Reproducibility
    • Tableau is first and foremost a graphical user interface (GUI) tool. Although it is possible to share Tableau workbooks with others, communicating the process of making a workbook is much more difficult in Tableau. (See above for more on the benefits of reproducibility.)
  • Data manipulation
    • Although Tableau offers some useful data manipulation features, its focus is visualization. As a result, Tableau’s data manipulation features are often too limited to meet real-life needs, and its users often have to rely on another tool to prepare data for use in Tableau.6
    • It is telling that Tableau offers an R integration feature for creating complex parameters. If R is needed anyway, why not use it in its native environment?
  • Complex and advanced analysis
    • Tableau’s focus on quick and simple data visualization makes the tool largely unfit for conducting, and thereby visualizing, complex data analysis and statistical modeling. R, on the other hand, is fully capable of handling both high-quality data visualization and advanced analysis and modeling, either separately or jointly.

This is not to say that these tools (Excel, SPSS, and Tableau) are simply inferior to R in all respects. Each is designed for a specific set of tasks and is easy to use for those tasks. Furthermore, compared to these tools, R has a relatively steep learning curve, which makes it difficult for anyone without sufficient background knowledge to quickly master it and put it to work on real tasks.

Nonetheless, R is a highly performant, versatile, and flexible tool for research and data analysis that is also largely compatible with any of the tools above. As such, R is a great addition to any researcher’s toolbox, and it can easily become the preferred tool of choice.


RStudio

Source: RStudio

What is RStudio? Why use it?

  • Best Integrated Development Environment (IDE) for R
    • Although there are other IDEs that support R, RStudio is the de facto standard in the R community and comes with many great features that extend R’s power beyond statistical computing.
  • Powerful and convenient features
    • Advanced debugging tools
    • RMarkdown integration
    • Shiny integration
    • R Presentation integration
  • Interactive workflow
  • Open source (again!)
    • Although RStudio, Inc. offers paid versions of the IDE that come with additional support, the free open source version can meet practically all the needs of any researcher.
  • … and many more!

RStudio overview

Source: Wikimedia.org

Source editor pane

  • Write and edit R scripts

Console pane

  • Interactively run R commands

Environment/history pane

  • Environment: view objects in the global environment
  • History: search and view command history

Files/Plots/Packages/Help pane

  • Files: navigate directories
  • Plots: view generated plots
  • Packages: manage installed packages in the library
  • Help: view help documentation for any package or function

Customization

Panes

The size and position of panes can be customized. At the top right of each pane are buttons to adjust the pane size. You can also place your mouse pointer on the border between panes and, when the pointer changes shape, click and drag to adjust the size. For more options, go to View > Panes on the menu bar. Alternatively, try Tools > Global Options > Pane Layout.

Appearances

The overall appearance can be customized as well. Go to Tools > Global Options > Appearance on the menu bar to change themes, fonts, and more.


Basic Setup

Installing R

  • Visit https://cran.r-project.org/
  • Or simply google “download R” to find a link to the download page.
  • Installation requires the Administrator account; talk to DoIT!

  • Also, check out the “Install R” tutorial video by RStudio, Inc.

Installing RStudio


Workshop Overview

Source: Wikimedia.org

Module 2

R basics

Here, participants will learn the basic building blocks of the R language. Once participants gain some knowledge of these fundamental topics, they will be introduced to the popular tidyverse framework and given recommendations for a “good” R coding style.

This module consists of the following two parts:

Part 1: Fundamentals of R programming

Part 2: Gearing up for data analysis

Module 3

Data analysis with R

This module focuses on manipulating and transforming tabular data using R. Participants will learn how to import data into the R environment and use common tidyverse syntax to clean and analyze data. Specifically, core functions from the dplyr, tidyr, stringr, and lubridate packages will be covered.

This module consists of the following two parts:

Part 1: Getting started with tidyverse

Part 2: More on data analysis

Module 4

Data visualization with R

In this module, participants will get started with generating plots to visually present and communicate insights from data. The main focus of this module is the popular ggplot2 package. Participants will also be introduced to some options for generating maps and interactive plots.

This module consists of the following two parts:

Part 1: The Grammar of Graphics

Part 2: Advanced visualization

Module 5

Statistical modeling with R

In this module, workshop participants will learn how to conduct basic statistical analysis with R, including Student’s t-test, analysis of variance (ANOVA), basic linear regression model, and generalized linear models. Participants will also be provided with resources for more advanced statistical modeling.

This module consists of the following two parts:

Part 1: Basics of statistical modeling

Part 2: Options for advanced modeling

Module 6

“To Infinity and Beyond”

This module will introduce workshop participants to various ways to share their work by using R Markdown, R Presentation, and Shiny. Participants will also learn how to leverage the power of the Internet to facilitate search for answers to specific questions.

This module consists of the following two parts:

Part 1: Sharing your work

Part 2: Leveraging online resources

ICJIA Workshop on the web

All the workshop materials are available as a website:

And all the materials that constitute this website, including the workshop materials, are available as a GitHub repository:

I encourage you to visit these pages to find out more about this workshop and to review the workshop materials before or after each session.


References


  1. This section is inspired by Dr. Benjamin Soltoff’s Computation for Social Sciences course at the University of Chicago, and much of this section’s content is drawn from his work, albeit with modifications.

  2. See this article published on R-bloggers for more on the R community.

  3. See the official SPSS Pricing page for more information.

  4. See this Stack Overflow thread and this Reddit thread for more on this point.

  5. See the official Tableau Pricing page for more information.

  6. See this article from AIM Consulting for more on the limitation of Tableau’s data manipulation functionality with an example.