ICJIA R Workshop

R&A Meeting Presentation

2018-02-13
Bobae Kang
(Bobae.Kang@illinois.gov)

Agenda

1. What is R and what can it do?
2. Workshop objectives and structure

[Image. Source: pixabay.com]

A brief intro to ...

[Image. Source: r-project.org]

What is R?

“R is a language and environment for statistical computing and graphics.” - The R Foundation

Built for data analysis and visualization
One of the most popular programming languages among academic researchers and data scientists

Benefits of using R

Open source (free!)
Built for statistical analysis
Reproducible and transparent
Extensible through powerful third-party packages
Enabling researchers to tackle a variety of tasks using a single platform

With some practice …

[Image. Source: The R Graph Gallery, The Comprehensive R Archive Network]

[Image. Source: Time Magazine]

Data manipulation

[Image. Source: Wikimedia.org]

ISP UCR data (2011-2015)

# print the state police's crime data
ispcrime_tbl

# A tibble: 510 x 12
    year county violentCrime murder  rape robbery aggAssault propertyCrime
   <int> <fct>         <int>  <int> <int>   <int>      <int>         <int>
 1  2011 Adams           218      0    37      15        166          1555
 2  2011 Alexa~          119      0    14       4        101           290
 3  2011 Bond              6      1     0       0          5           211
 4  2011 Boone            59      0    24       8         27           733
 5  2011 Brown             7      0     1       0          6            38
 6  2011 Bureau           42      0     4       3         35           505
 7  2011 Calho~           13      0     0       0         13            56
 8  2011 Carro~            8      0     1       0          7           206
 9  2011 Cass             12      0     1       0         11           119
10  2011 Champ~         1210      5   127     208        870          5332
# ... with 500 more rows, and 4 more variables: burglary <int>,
#   larcenyTft <int>, MVTft <int>, arson <int>

# get a quick summary of violent crime and property crime
library(dplyr)  # provides %>%, select(), filter(), mutate(), ...
ispcrime_tbl %>%
  select(violentCrime, propertyCrime) %>%
  summary()

  violentCrime   propertyCrime   
 Min.   :    0   Min.   :     0  
 1st Qu.:   19   1st Qu.:   133  
 Median :   42   Median :   349  
 Mean   :  501   Mean   :  2913  
 3rd Qu.:  133   3rd Qu.:  1190  
 Max.   :33348   Max.   :178902  
 NA's   :7       NA's   :7

# filter to keep only counties starting with C for 2015
#   while creating and showing a new variable for total crime count
ispcrime_tbl %>%
  filter(substr(county, 1, 1) == "C", year == 2015) %>%
  mutate(totalCrime = violentCrime + propertyCrime) %>%
  select(year, county, totalCrime)

# A tibble: 12 x 3
    year county     totalCrime
   <int> <fct>           <int>
 1  2015 Calhoun            NA
 2  2015 Carroll           176
 3  2015 Cass              154
 4  2015 Champaign        6486
 5  2015 Christian         292
 6  2015 Clark             103
 7  2015 Clay              191
 8  2015 Clinton           423
 9  2015 Coles             805
10  2015 Cook           153575
11  2015 Crawford          282
12  2015 Cumberland         42
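
Calhoun's total is NA in the output above, presumably because at least one of its two counts is missing for 2015: `+` returns NA whenever either input is NA. The sketch below shows that behavior on made-up numbers (not the ISP figures) and one possible workaround with dplyr's coalesce(). Treating a missing count as 0 is an assumption for illustration only; whether it is appropriate depends on why the value is missing.

```r
library(dplyr)

# Hypothetical toy data, not the ISP figures: two rows are missing a count.
toy <- tibble(
  county        = c("A", "B", "C"),
  violentCrime  = c(10L, NA, 3L),
  propertyCrime = c(100L, 50L, NA)
)

toy_totals <- toy %>%
  mutate(
    # `+` returns NA whenever either input is NA
    totalCrime = violentCrime + propertyCrime,
    # coalesce() swaps NA for 0 first, so this is a total of the known counts
    totalKnown = coalesce(violentCrime, 0L) + coalesce(propertyCrime, 0L)
  )

toy_totals
```

Keeping the NA is often the safer default, since it flags an incomplete total instead of hiding it.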

# how about "D" counties in 2014 and 2015?
ispcrime_tbl %>%
  filter(substr(county, 1, 1) == "D", year %in% c(2014, 2015)) %>%
  mutate(totalCrime = violentCrime + propertyCrime) %>%
  select(year, county, totalCrime)

# A tibble: 8 x 3
   year county  totalCrime
  <int> <fct>        <int>
1  2014 De Kalb       2218
2  2014 De Witt        182
3  2014 Douglas        116
4  2014 Du Page      12576
5  2015 De Kalb       2173
6  2015 De Witt        140
7  2015 Douglas        173
8  2015 Du Page      12538

# get annual average crime count by county
ispcrime_tbl %>%
  group_by(county) %>%
  summarise(annualAvgCrime = sum(violentCrime, propertyCrime, na.rm = TRUE) / n())

# A tibble: 102 x 2
   county    annualAvgCrime
   <fct>              <dbl>
 1 Adams             1724  
 2 Alexander          385  
 3 Bond               190  
 4 Boone              426  
 5 Brown               39.0
 6 Bureau             480  
 7 Calhoun             13.8
 8 Carroll            196  
 9 Cass               109  
10 Champaign         6567  
# ... with 92 more rows
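
One detail of the calculation above: sum(..., na.rm = TRUE) / n() divides by the number of rows in each group, including rows whose counts are missing, so a year with no data pulls the average down as if it were zero. The sketch below compares that with mean(..., na.rm = TRUE) on made-up data; the column names crime, avgPerRow, and avgPerReported are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: county "B" has no report for 2012.
toy <- tibble(
  county = c("A", "A", "B", "B"),
  year   = c(2011L, 2012L, 2011L, 2012L),
  crime  = c(10, 20, 30, NA)
)

toy_avg <- toy %>%
  group_by(county) %>%
  summarise(
    avgPerRow      = sum(crime, na.rm = TRUE) / n(),  # missing year counts as 0
    avgPerReported = mean(crime, na.rm = TRUE)        # missing year is skipped
  )

toy_avg
```

Neither version is right in general; the choice depends on whether a missing year should be read as "no crime" or "no report".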

# sort by average crime count? 
ispcrime_tbl %>%
  group_by(county) %>%
  summarise(annualAvgCrime = sum(violentCrime, propertyCrime, na.rm = TRUE) / n()) %>%
  arrange(desc(annualAvgCrime))

# A tibble: 102 x 2
   county    annualAvgCrime
   <fct>              <dbl>
 1 Cook              182818
 2 Du Page            14316
 3 Lake               12779
 4 Winnebago          12275
 5 Will               11078
 6 St. Clair           9262
 7 Sangamon            8876
 8 Kane                8332
 9 Peoria              7229
10 Champaign           6567
# ... with 92 more rows

# merge the regions data and count by region
#   (this counts county-year rows, so each county appears 5 times)
ispcrime_tbl %>%
  left_join(regions) %>%
  group_by(region) %>%
  count()

# A tibble: 4 x 2
# Groups:   region [4]
  region       n
  <fct>    <int>
1 Central    230
2 Cook         5
3 Northern    85
4 Southern   190

# keep one row per county first, so each county is counted only once
ispcrime_tbl %>%
  select(county) %>%
  unique() %>%
  left_join(regions) %>%
  group_by(region) %>%
  count()

# A tibble: 4 x 2
# Groups:   region [4]
  region       n
  <fct>    <int>
1 Central     46
2 Cook         1
3 Northern    17
4 Southern    38
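
The two results above differ by a factor of five because the crime table has one row per county per year (2011-2015). A minimal sketch of the same effect on made-up data follows; crime and regions_toy are hypothetical stand-ins, and the join key is named explicitly with by = "county", which is an assumption about how the real tables line up.

```r
library(dplyr)

# Hypothetical toy data standing in for the crime table and the regions table.
crime <- tibble(
  year   = rep(2011:2012, times = 3),
  county = rep(c("A", "B", "C"), each = 2)
)
regions_toy <- tibble(
  county = c("A", "B", "C"),
  region = c("North", "North", "South")
)

# One row per county-year: each county is counted once per year.
rows_by_region <- crime %>%
  left_join(regions_toy, by = "county") %>%
  count(region)

# One row per county: distinct() drops the repeated years before the join.
counties_by_region <- crime %>%
  distinct(county) %>%
  left_join(regions_toy, by = "county") %>%
  count(region)

rows_by_region
counties_by_region
```

Here count(region) is shorthand for group_by(region) followed by count().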

Data visualization

[Image. Source: Wikimedia.org]

Example (1): Word cloud

Source: The R Graph Gallery

Example (2): Dendrogram

Source: The Comprehensive R Archive Network

Example (3): Network graph

Source: The Comprehensive R Archive Network

Example (4): Line graph

Example (5): Choropleth map

Quick demonstration

Bar plot
Histogram

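# bar plot of violent crime count in 2015 by region
# (ispcrime_tbl2 is not defined in these slides; it appears to be
#  ispcrime_tbl with the region column joined on)
library(ggplot2)
barplot <- ggplot(filter(ispcrime_tbl2, year == 2015),
                  aes(x = region, y = violentCrime, fill = region, group = region)) +
  stat_summary(geom = "bar", fun = "sum")  # `fun` replaced `fun.y` in ggplot2 3.3.0

barplot

[Plot: bar chart of summed violent crime counts in 2015, one bar per region]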
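
The bar plot above relies on stat_summary() to add up the y values within each region before drawing. A roughly equivalent approach, sketched here on made-up numbers instead of the ISP data, is to aggregate with dplyr first and draw the totals with geom_col(). The object names toy, region_totals, and p are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: two counties in one region, one in another.
toy <- tibble(
  region       = c("North", "North", "South"),
  violentCrime = c(5, 7, 3)
)

# Sum the counts per region explicitly ...
region_totals <- toy %>%
  group_by(region) %>%
  summarise(violentCrime = sum(violentCrime, na.rm = TRUE))

# ... then draw one bar per pre-computed total.
p <- ggplot(region_totals, aes(x = region, y = violentCrime, fill = region)) +
  geom_col()

p
```

Aggregating first keeps the summed table available for inspection, at the cost of one extra step.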

# add title and change appearance
barplot2 <- barplot +
  labs(title = "Violent crime count in 2015 by region") +
  theme_classic(base_size = 15)

barplot2

[Plot: the same bar chart with a title and the classic theme]

# remove the axis titles and legend, and change colors
barplot2 +
  labs(x = "", y = "") +
  theme(legend.position = "none") +
  scale_fill_brewer(palette = "Spectral")

[Plot: the bar chart without axis titles or legend, in the Spectral palette]

# histogram of burglary count by county
ggplot(ispcrime_tbl2, aes(x = burglary)) +
  geom_histogram() +
  facet_wrap(~ year) +
  labs(x = "Burglary count", y = "# counties") +
  theme_minimal(base_size = 15)

[Plot: histograms of burglary count by county, one panel per year]

# exclude Cook County from the histogram and color the panels by year
ggplot(filter(ispcrime_tbl2, county != "Cook"), aes(x = burglary, fill = factor(year))) +
  geom_histogram() +
  facet_wrap(~ year) +
  labs(x = "Burglary count", y = "# counties", fill = "Year") +
  theme_minimal(base_size = 15)

[Plot: burglary-count histograms by year, excluding Cook County]

Statistical modeling

[Image. Source: pixabay]

Example - Simple linear model

lm_fit <- lm(violentCrime ~ propertyCrime, ispcrime)
summary(lm_fit)


Call:
lm(formula = violentCrime ~ propertyCrime, data = ispcrime)

Residuals:
    Min      1Q  Median      3Q     Max 
-2239.5    -2.2    57.0    78.3  3992.9 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -79.768287  16.496961  -4.835 1.77e-06 ***
propertyCrime   0.199367   0.001059 188.303  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 363.5 on 501 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.9861,    Adjusted R-squared:  0.986 
F-statistic: 3.546e+04 on 1 and 501 DF,  p-value: < 2.2e-16

# put model fit results in a data frame format
library(broom)  # provides tidy()
tidy(lm_fit)

           term    estimate   std.error  statistic      p.value
1   (Intercept) -79.7682868 16.49696109  -4.835332 1.771126e-06
2 propertyCrime   0.1993675  0.00105876 188.302852 0.000000e+00

# get predictions and residuals for each data point
library(modelr)  # provides add_predictions() and add_residuals()
ispcrime %>%
  select(year, county, propertyCrime, violentCrime) %>%
  add_predictions(lm_fit) %>%
  add_residuals(lm_fit) %>%
  head()

  year    county propertyCrime violentCrime      pred      resid
1 2011     Adams          1555          218 230.24816 -12.248156
2 2011 Alexander           290          119 -21.95172 140.951715
3 2011      Bond           211            6 -37.70175  43.701747
4 2011     Boone           733           59  66.36808  -7.368081
5 2011     Brown            38            7 -72.19232  79.192322
6 2011    Bureau           505           42  20.91229  21.087706
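
The pred and resid columns above can be reproduced by hand from the fitted coefficients. As a worked check for the first row (Adams, 2011), using the estimates printed by tidy(lm_fit); the variable names below are only for illustration:

```r
# Coefficients copied from the tidy(lm_fit) output
intercept <- -79.7682868
slope     <- 0.1993675

# Adams County, 2011, from the first row of the table above
property_crime <- 1555
violent_crime  <- 218

pred  <- intercept + slope * property_crime  # fitted value
resid <- violent_crime - pred                # observed minus fitted

c(pred = pred, resid = resid)
```

The results agree with the table up to rounding of the printed coefficients. The same arithmetic also explains the negative predictions for small counties such as Alexander, Bond, and Brown: with a negative intercept, a low propertyCrime value yields a fitted count below zero.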

# plot the model fit
plot(violentCrime ~ propertyCrime, ispcrime)
abline(lm_fit)

[Plot: scatter plot of violentCrime against propertyCrime with the fitted line]

# show diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(lm_fit)

[Plot: 2 x 2 grid of diagnostic plots for lm_fit]

Generalized linear models

# examples of generalized linear models with glm()
# (mydata, binary, count, y, x1, and x2 are placeholder names)
logistic_reg <- glm(binary ~ x1 + x2, data = mydata, family = binomial())
poisson_reg <- glm(count ~ x1 + x2, data = mydata, family = poisson())
gamma_reg <- glm(y ~ x1 + x2, data = mydata, family = Gamma())
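
Since the calls above use placeholder names, here is a minimal runnable sketch on simulated data. The data-generating values (seed, sample size, true coefficients) are arbitrary assumptions chosen for illustration; only glm(), its family argument, and the variable names come from the slide.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Simulated data; variable names follow the placeholders above.
n <- 200
mydata <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
mydata$binary <- rbinom(n, size = 1, prob = plogis(0.5 + 1.0 * mydata$x1))
mydata$count  <- rpois(n, lambda = exp(0.2 + 0.5 * mydata$x2))

logistic_reg <- glm(binary ~ x1 + x2, data = mydata, family = binomial())
poisson_reg  <- glm(count ~ x1 + x2, data = mydata, family = poisson())

# Coefficients are on the link scale (log-odds and log-counts);
# exponentiate them to read them as odds ratios and rate ratios.
exp(coef(logistic_reg))
exp(coef(poisson_reg))

# type = "response" returns fitted probabilities instead of log-odds
head(predict(logistic_reg, type = "response"))
```

As with lm(), summary() on either object prints the coefficient table and fit statistics.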

Other advanced models

time series models (e.g. stats and forecast packages)

spatial regression models (e.g. spdep and spgwr packages)

survival analysis (e.g. survival package)

network analysis (e.g. network and igraph packages)

text analysis (e.g. tm and tidytext packages)

machine learning (e.g. caret and mlr packages)

And more!

Reports

HTML documents for web publishing
- create interactive workflow using R Notebook
- add interactive elements using htmlwidgets and/or shiny
PDF documents for printing
MS Word documents
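
As a rough illustration of how one source file can produce these formats, the sketch below writes a tiny R Markdown file to a temporary folder and renders it to HTML with the rmarkdown package. The file name and its contents are hypothetical, and the render step is skipped when rmarkdown or pandoc is not installed.

```r
# Write a minimal, hypothetical R Markdown source file to a temporary folder.
rmd_path <- file.path(tempdir(), "example-report.Rmd")
fence <- strrep("`", 3)  # chunk delimiter, built indirectly to keep this example readable
writeLines(c(
  "---",
  'title: "Example report"',
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",       # `cars` is a dataset that ships with R
  fence
), rmd_path)

# Render only if the tools are present; render() returns the output file path.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  html_path <- rmarkdown::render(rmd_path, output_format = "html_document", quiet = TRUE)
}
```

Passing "word_document" or "pdf_document" as output_format targets the other two formats; PDF output additionally needs a LaTeX installation.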

Example - R Notebook

Slideshow

Dashboard

Website

Objectives

Technical objectives

Import and manipulate tabular data files using R
Create simple data visualizations to extract insight from data using R
Perform basic statistical analysis using R
Generate a report on a simple data analysis task using R

Fundamental objectives

Understand the basic elements of the R programming language
Employ the programmatic approach to research and data analysis projects
Leverage online resources to find solutions to specific questions on using R for a given task

Structure

Overall setup

Six modules
One module per week
Each module consists of two parts
- except the first module on introduction
All workshop materials (slides and notes) will be available
I will also be available to answer questions

Modules

Introduction to R
R basics
Data analysis in R
Data visualization in R
Statistical modeling in R
Sharing your analysis and more

Questions?

[Image. Source: Giphy.com]
