# template elements
# presentation

ICJIA R Workshop

R&A Meeting Presentation

2018-02-13
Bobae Kang
(Bobae.Kang@illinois.gov)

plot of chunk unnamed-chunk-1

Agenda

  • 1. What is R and what can it do?
  • 2. Workshop objectives and structure

plot of chunk unnamed-chunk-2

Source: pixabay.com

A brief intro to ...

plot of chunk unnamed-chunk-3

Source: r-project.org

What is R?

“R is a language and environment for statistical computing and graphics.” - The R Foundation

  • Built for data analysis and visualization
  • One of the the most popular choices of programming language among academic researchers and data scientists

Benefits of using R

  • Open source (free!)
  • Built for statistical analysis
  • Reproducible and transparent
  • Extensible through powerful third-party packages
  • Enabling researchers to tackle a variety of tasks using a single platform

plot of chunk unnamed-chunk-5

Source: Time Magazine

Data manipulation

plot of chunk unnamed-chunk-6

Source: Wikimedia.org

ISP UCR data (2011-2015)

# print the state police's crime data
ispcrime_tbl
# A tibble: 510 x 12
    year county violentCrime murder  rape robbery aggAssault propertyCrime
   <int> <fct>         <int>  <int> <int>   <int>      <int>         <int>
 1  2011 Adams           218      0    37      15        166          1555
 2  2011 Alexa~          119      0    14       4        101           290
 3  2011 Bond              6      1     0       0          5           211
 4  2011 Boone            59      0    24       8         27           733
 5  2011 Brown             7      0     1       0          6            38
 6  2011 Bureau           42      0     4       3         35           505
 7  2011 Calho~           13      0     0       0         13            56
 8  2011 Carro~            8      0     1       0          7           206
 9  2011 Cass             12      0     1       0         11           119
10  2011 Champ~         1210      5   127     208        870          5332
# ... with 500 more rows, and 4 more variables: burglary <int>,
#   larcenyTft <int>, MVTft <int>, arson <int>
# get a quick summary of violent crime and property crime
ispcrime_tbl %>%
  select(violentCrime, propertyCrime) %>%
  summary()
  violentCrime   propertyCrime   
 Min.   :    0   Min.   :     0  
 1st Qu.:   19   1st Qu.:   133  
 Median :   42   Median :   349  
 Mean   :  501   Mean   :  2913  
 3rd Qu.:  133   3rd Qu.:  1190  
 Max.   :33348   Max.   :178902  
 NA's   :7       NA's   :7       
# filter to keep only counties starting with C for 2015
#   while creating and showing a new variable for total crime count
ispcrime_tbl %>%
  filter(substr(county, 1, 1) == "C", year == 2015) %>%
  mutate(totalCrime = violentCrime + propertyCrime) %>%
  select(year, county, totalCrime)
# A tibble: 12 x 3
    year county     totalCrime
   <int> <fct>           <int>
 1  2015 Calhoun            NA
 2  2015 Carroll           176
 3  2015 Cass              154
 4  2015 Champaign        6486
 5  2015 Christian         292
 6  2015 Clark             103
 7  2015 Clay              191
 8  2015 Clinton           423
 9  2015 Coles             805
10  2015 Cook           153575
11  2015 Crawford          282
12  2015 Cumberland         42
# how about "D" counties in 2014 and 2015?
ispcrime_tbl %>%
  filter(substr(county, 1, 1) == "D", year %in% c(2014, 2015)) %>%
  mutate(totalCrime = violentCrime + propertyCrime) %>%
  select(year, county, totalCrime)
# A tibble: 8 x 3
   year county  totalCrime
  <int> <fct>        <int>
1  2014 De Kalb       2218
2  2014 De Witt        182
3  2014 Douglas        116
4  2014 Du Page      12576
5  2015 De Kalb       2173
6  2015 De Witt        140
7  2015 Douglas        173
8  2015 Du Page      12538
# get annual average crime count by county
ispcrime_tbl %>%
  group_by(county) %>%
  summarise(annualAvgCrime = sum(violentCrime, propertyCrime, na.rm = TRUE) / n())
# A tibble: 102 x 2
   county    annualAvgCrime
   <fct>              <dbl>
 1 Adams             1724  
 2 Alexander          385  
 3 Bond               190  
 4 Boone              426  
 5 Brown               39.0
 6 Bureau             480  
 7 Calhoun             13.8
 8 Carroll            196  
 9 Cass               109  
10 Champaign         6567  
# ... with 92 more rows
# sort by average crime count? 
ispcrime_tbl %>%
  group_by(county) %>%
  summarise(annualAvgCrime = sum(violentCrime, propertyCrime, na.rm = TRUE) / n()) %>%
  arrange(desc(annualAvgCrime))
# A tibble: 102 x 2
   county    annualAvgCrime
   <fct>              <dbl>
 1 Cook              182818
 2 Du Page            14316
 3 Lake               12779
 4 Winnebago          12275
 5 Will               11078
 6 St. Clair           9262
 7 Sangamon            8876
 8 Kane                8332
 9 Peoria              7229
10 Champaign           6567
# ... with 92 more rows
# merging regions data and count the number of counties by region
ispcrime_tbl %>%
  left_join(regions) %>%
  group_by(region) %>%
  count()
# A tibble: 4 x 2
# Groups:   region [4]
  region       n
  <fct>    <int>
1 Central    230
2 Cook         5
3 Northern    85
4 Southern   190
# no duplicates!
ispcrime_tbl %>%
  select(county) %>%
  unique() %>%
  left_join(regions) %>%
  group_by(region) %>%
  count()
# A tibble: 4 x 2
# Groups:   region [4]
  region       n
  <fct>    <int>
1 Central     46
2 Cook         1
3 Northern    17
4 Southern    38

Data visualization

plot of chunk unnamed-chunk-16

Source: Wikimedia.org

Example (1): Word cloud plot of chunk unnamed-chunk-17

Source: The R Graph Gallery

Example (2): Dendrogram plot of chunk unnamed-chunk-18

Source: The Comprehensive R Archive Network

Example (3): Network graph plot of chunk unnamed-chunk-19

Source: The Comprehensive R Archive Network

Example (4): Line graph plot of chunk unnamed-chunk-20

Example (5): Choropleth map plot of chunk unnamed-chunk-21

plot of chunk unnamed-chunk-22

Quick demonstration

  • Bar plot
  • Histogram
# bar plot of crime count in 2015 by region
barplot <- ggplot(filter(ispcrime_tbl2, year == 2015), aes(x = region, y = violentCrime, fill = region, group = region)) +
  stat_summary(geom = "bar", fun.y = "sum")

barplot

plot of chunk unnamed-chunk-24

# add title and change appearance
barplot2 <- barplot +
  labs(title = "Violent crime count in 2015 by region") +
  theme_classic(base_size = 15)

barplot2

plot of chunk unnamed-chunk-25

# remove the axes names and legends, and change colors
barplot2 +
  labs(x = "", y = "") +
  theme(legend.position = "None") +
  scale_fill_brewer(palette="Spectral")

plot of chunk unnamed-chunk-26

# histogram of burglary count by county
ggplot(ispcrime_tbl2, aes(x = burglary)) +
  geom_histogram() +
  facet_wrap(~ year) +
  labs(x = "Burglary count", y = "# counties") +
  theme_minimal(base_size = 15)

plot of chunk unnamed-chunk-27

# exclude Cook county data from the histogram and add colors
ggplot(filter(ispcrime_tbl2, county != "Cook"), aes(x = burglary, fill = Year)) +
  geom_histogram() + facet_wrap(~ Year) +
  labs(x = "Burglary count", y = "# counties") +
  theme_minimal(base_size = 15)

plot of chunk unnamed-chunk-28

Statistical modeling

plot of chunk unnamed-chunk-29

Source: pixabay

Example - Simple linear model

lm_fit <- lm(violentCrime ~ propertyCrime, ispcrime)
summary(lm_fit)

Call:
lm(formula = violentCrime ~ propertyCrime, data = ispcrime)

Residuals:
    Min      1Q  Median      3Q     Max 
-2239.5    -2.2    57.0    78.3  3992.9 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -79.768287  16.496961  -4.835 1.77e-06 ***
propertyCrime   0.199367   0.001059 188.303  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 363.5 on 501 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.9861,    Adjusted R-squared:  0.986 
F-statistic: 3.546e+04 on 1 and 501 DF,  p-value: < 2.2e-16
# put model fit results in a data frame format
tidy(lm_fit)
           term    estimate   std.error  statistic      p.value
1   (Intercept) -79.7682868 16.49696109  -4.835332 1.771126e-06
2 propertyCrime   0.1993675  0.00105876 188.302852 0.000000e+00
# get predictions and residuals for each data point
ispcrime %>%
  select(year, county, propertyCrime, violentCrime) %>%
  add_predictions(lm_fit) %>%
  add_residuals(lm_fit) %>%
  head()
  year    county propertyCrime violentCrime      pred      resid
1 2011     Adams          1555          218 230.24816 -12.248156
2 2011 Alexander           290          119 -21.95172 140.951715
3 2011      Bond           211            6 -37.70175  43.701747
4 2011     Boone           733           59  66.36808  -7.368081
5 2011     Brown            38            7 -72.19232  79.192322
6 2011    Bureau           505           42  20.91229  21.087706
# plot the model fit
plot(violentCrime ~ propertyCrime, ispcrime)
abline(lm_fit)

plot of chunk unnamed-chunk-34

# show diagnostic plots
par(mfrow=c(2, 2))
plot(lm_fit)

plot of chunk unnamed-chunk-35

Generalized linear models

# examples of generalized linear models with glm()
logistic_reg <- glm(binary ~ x1 + x2, data = mydata, family = binomial())
poisson_reg <- glm(count ~ x1 + x2, data = mydata, family = poisson())
gamma_reg <- glm(y ~ x1 + x2, data = mydata, family = Gamma())

Other advanced models

  • time series models (e.g. stats and forecast packages)
  • spatial regression models (e.g. spdep and spgwr packages)
  • survival analysis (e.g. survival package)
  • network analysis (e.g. network and igraph packages)
  • text analysis (e.g. tm and tidytext packages)
  • machine learning (e.g. caret and mlr packages)

And more!

plot of chunk unnamed-chunk-37

Reports

  • HTML documents for web publishing
    • create interactive workflow using R Notebook
    • add interactive elements using htmlwidgets and/or shiny
  • PDF documents for printing
  • MS Word documents

Example - R Notebook

plot of chunk unnamed-chunk-38

Slideshow

plot of chunk unnamed-chunk-39

Dashboard

Website

Objectives

plot of chunk unnamed-chunk-42

Technical objectives

  • Import and manipulate tabular data files using R
  • Create simple data visualizations to extract insight from data using R
  • Perform basic statistical analysis using R
  • Generate a report on a simple data analysis task using R

Fundamental objectives

  • Understand the basic elements of the R programming language
  • Employ the programmatic approach to research and data analysis projects
  • Leverage online resources to find solutions to specific questions on using R for a given task

Structure

plot of chunk unnamed-chunk-43

Overall setup

  • Six modules
  • One module per week
  • Each module consists of two parts
    • except the first module on introduction
  • All workshop materials (slides and notes) will be available
  • I will be available, too, for answering questions

Modules

  1. Introduction to R
  2. R basics
  3. Data analysis in R
  4. Data visualization in R
  5. Statistical modeling in R
  6. Sharing your analysis and more

Questions?

plot of chunk unnamed-chunk-44

Source: Giphy.com

plot of chunk unnamed-chunk-45