# Statistical Analysis in R
2018-04-18
Bobae Kang
(Bobae.Kang@illinois.gov)
```r
summary(data)
```

The `summary()` function is a quick way to get descriptive statistics on each column of a tabular data object. For a `factor` column, we get the count of each level as well as the number of missing values (`NA`). `summary()` is also used to get the "summary" of a fitted model object (e.g. `lm`), as we will see shortly.

```r
summary(ispcrime)
```

```
      year          county      violentCrime       murder       
 Min.   :2011   Adams    :  5   Min.   :    0   Min.   :  0.000  
 1st Qu.:2012   Alexander:  5   1st Qu.:   19   1st Qu.:  0.000  
 Median :2013   Bond     :  5   Median :   42   Median :  0.000  
 Mean   :2013   Boone    :  5   Mean   :  501   Mean   :  7.026  
 3rd Qu.:2014   Brown    :  5   3rd Qu.:  133   3rd Qu.:  1.000  
 Max.   :2015   Bureau   :  5   Max.   :33348   Max.   :566.000  
                (Other)  :480   NA's   :7       NA's   :7        
      rape            robbery          aggAssault      propertyCrime   
 Min.   :   0.00   Min.   :    0.0   Min.   :    0.0   Min.   :     0  
 1st Qu.:   1.00   1st Qu.:    0.0   1st Qu.:   15.0   1st Qu.:   133  
 Median :   6.00   Median :    2.0   Median :   33.0   Median :   349  
 Mean   :  41.29   Mean   :  172.3   Mean   :  280.4   Mean   :  2913  
 3rd Qu.:  22.00   3rd Qu.:   13.0   3rd Qu.:  102.0   3rd Qu.:  1190  
 Max.   :1986.00   Max.   :16095.0   Max.   :15129.0   Max.   :178902  
 NA's   :7         NA's   :7         NA's   :7         NA's   :7       
    burglary         larcenyTft          MVTft             arson         
 Min.   :    0.0   Min.   :     0.0   Min.   :    0.0   Min.   :   0.00  
 1st Qu.:   35.5   1st Qu.:    85.5   1st Qu.:    3.0   1st Qu.:   1.00  
 Median :   79.0   Median :   258.0   Median :   10.0   Median :   2.00  
 Mean   :  589.3   Mean   :  2084.9   Mean   :  215.2   Mean   :  23.45  
 3rd Qu.:  268.0   3rd Qu.:   852.0   3rd Qu.:   30.0   3rd Qu.:   8.50  
 Max.   :38485.0   Max.   :116145.0   Max.   :22879.0   Max.   :1418.00  
 NA's   :7         NA's   :7          NA's   :7         NA's   :7        
```
Two other basic functions are `mean()` and `median()`:

```r
mean(ispcrime$violentCrime)
```
```
[1] NA
```
```r
median(ispcrime$violentCrime)
```
```
[1] NA
```

… wait, what?

```r
mean(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1] 500.9702
```
```r
median(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1] 42
```

`NA` is known to be "contagious" in R: any operation involving missing values returns `NA`. `na.rm` is an argument with a boolean input that controls whether `NA` values are removed/excluded from the calculation.

Other common functions for descriptive statistics include:

- `min()`, `max()`, `range()`
- `fivenum()`, `IQR()`, `quantile()`
- `var()`, `sd()`
## Range

```r
min(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1] 0
```
```r
max(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1] 33348
```
```r
range(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1]     0 33348
```
## Percentiles

```r
fivenum(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1]     0    19    42   133 33348
```
```r
IQR(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1] 114
```

`fivenum()` returns what is called Tukey's five number summary (minimum, 1st quartile, median, 3rd quartile, maximum) for the input data. `IQR()` returns the inter-quartile range, i.e. the difference between the 1st quartile (25%) and the 3rd quartile (75%).

```r
# default: equal to fivenum()
quantile(ispcrime$violentCrime, probs = seq(0, 1, 0.25), na.rm = TRUE)
```
```
   0%   25%   50%   75%  100% 
    0    19    42   133 33348 
```
```r
quantile(ispcrime$violentCrime, probs = seq(0, 1, 0.1), na.rm = TRUE)
```
```
     0%     10%     20%     30%     40%     50%     60%     70%     80%     90%    100% 
    0.0     5.0    13.0    24.0    32.0    42.0    57.0    90.0   246.2   668.4 33348.0 
```

The default `quantile()` output is equal to the `fivenum()` output; the `probs` argument makes `quantile()` more flexible.

## Variance
```r
var(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1] 9463013
```
```r
sd(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1] 3076.201
```

`var()` returns the sample variance of the given data: \[ s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \] `sd()` returns the standard deviation of the given data: \[ s = \sqrt{s^2} \]

```r
variables <- ispcrime %>% select(violentCrime, propertyCrime)
var(variables, na.rm = TRUE)
```
```
              violentCrime propertyCrime
violentCrime       9463013      46803865
propertyCrime     46803865     234761775
```

Given multiple columns, `var()` returns a covariance matrix. `cov()` can be used instead; however, `cov()` has no `na.rm` argument to remove missing values.

The `moments` package offers the following functions to measure the shape of a distribution (a short sketch follows below; see the `moments` package documentation for details):

- `skewness()`
- `kurtosis()`

The `outliers` package offers useful functions to detect and measure outliers in data; see the `outliers` package documentation for more.
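A minimal sketch of the two `moments` functions, applied to the column summarized above (this assumes the `moments` package is installed; outputs not shown):

```r
library(moments)

# positive skewness would indicate a right-skewed distribution
skewness(ispcrime$violentCrime, na.rm = TRUE)

# large kurtosis indicates heavy tails relative to a normal distribution
kurtosis(ispcrime$violentCrime, na.rm = TRUE)
```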
## Frequency tables

Base R offers several functions for frequency tables:

- `table()` for generating frequency tables
- `prop.table()` for tables of proportions
- `xtabs()` for creating frequency tables using a formula
- `ftable()` for creating "flat" contingency tables

```r
table(...)
prop.table(x, margin = NULL)
ftable(x)
xtabs(formula, data, ...)
```

`table()` takes one or more data vectors of the same length; use `as.data.frame()` to turn a `table` into a data frame. `prop.table()` and `ftable()` take a `table` object. `xtabs()` uses a formula to generate a frequency table; with a `data` input, its column names can be used directly in the formula.

```r
my_data <- ispcrime %>%
  left_join(regions) %>%
  select(
    region,
    viol = violentCrime,
    prop = propertyCrime
  ) %>%
  mutate(
    high_viol = ifelse(viol > mean(viol, na.rm = TRUE), 1, 0),
    high_prop = ifelse(prop > mean(prop, na.rm = TRUE), 1, 0)
  )
```
```r
my_tbl <- table(
  region = my_data$region,
  hviol = my_data$high_viol
)
my_tbl
```
```
          hviol
region       0   1
  Central  206  24
  Cook       0   5
  Northern  60  25
  Southern 175   8
```
```r
as.data.frame(my_tbl)
```
```
    region hviol Freq
1  Central     0  206
2     Cook     0    0
3 Northern     0   60
4 Southern     0  175
5  Central     1   24
6     Cook     1    5
7 Northern     1   25
8 Southern     1    8
```
```r
prop.table(my_tbl, 1) # each row adds up to 1
```
```
          hviol
region              0          1
  Central  0.89565217 0.10434783
  Cook     0.00000000 1.00000000
  Northern 0.70588235 0.29411765
  Southern 0.95628415 0.04371585
```
```r
prop.table(my_tbl, 2) # each column adds up to 1
```
```
          hviol
region              0          1
  Central  0.46712018 0.38709677
  Cook     0.00000000 0.08064516
  Northern 0.13605442 0.40322581
  Southern 0.39682540 0.12903226
```
```r
my_tbl2 <- table(
  region = my_data$region,
  hviol = my_data$high_viol,
  hprop = my_data$high_prop
)

# with ftable
ftable(my_tbl2)
```
```
                hprop   0   1
region   hviol               
Central  0            197   9
         1              1  23
Cook     0              0   0
         1              0   5
Northern 0             55   5
         1              0  25
Southern 0            173   2
         1              0   8
```
```r
# without ftable
my_tbl2
```
```
, , hprop = 0

          hviol
region       0   1
  Central  197   1
  Cook       0   0
  Northern  55   0
  Southern 173   0

, , hprop = 1

          hviol
region       0   1
  Central    9  23
  Cook       0   5
  Northern   5  25
  Southern   2   8
```
```r
# one-dimension
xtabs(~ region, my_data)
```
```
region
 Central     Cook Northern Southern 
     230        5       85      190 
```
```r
# two-dimension
xtabs(~ region + high_viol, my_data)
```
```
          high_viol
region       0   1
  Central  206  24
  Cook       0   5
  Northern  60  25
  Southern 175   8
```
## t-test

```r
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
       mu = 0, conf.level = 0.95, ...)
t.test(formula, data, subset, na.action, ...)
```

If only the `x` input is given, `t.test()` performs a one-sample t-test in which the mean of `x` is compared against the `mu` value (default 0). A two-sample t-test uses both `x` and `y` inputs, or a `formula` (`y ~ x`) with a `data` input.
inputispcrime2 <- left_join(ispcrime, regions)
viol_crime_north <- (filter(ispcrime2, region == "Northern"))$violentCrime
viol_crime_south <- (filter(ispcrime2, region == "Southern"))$violentCrime
t.test(viol_crime_north, viol_crime_south)
Welch Two Sample t-test
data: viol_crime_north and viol_crime_south
t = 4.4064, df = 105.49, p-value = 2.534e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
176.7132 465.8398
sample estimates:
mean of x mean of y
438.2000 116.9235
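A one-sample sketch for comparison, testing the Northern region mean against `mu = 100` (the value 100 is purely illustrative; output not shown):

```r
# one-sample t-test: is the mean of viol_crime_north different from 100?
t.test(viol_crime_north, mu = 100)
```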
## Wilcoxon test

```r
wilcox.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
            mu = 0, conf.level = 0.95, ...)
wilcox.test(formula, data, subset, na.action, ...)
```

If only the `x` input is provided, `wilcox.test()` performs a one-sample test. A two-sample test uses both `x` and `y` inputs, or a `formula` (`y ~ x`) with a `data` input.
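A minimal two-sample sketch, reusing the vectors from the t-test example above (output not shown):

```r
# rank-based alternative to the Welch t-test above
wilcox.test(viol_crime_north, viol_crime_south)
```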
## ANOVA

```r
aov(formula, data, qr = TRUE, ...)
```

The `formula` is of the following format: `value ~ subgroup`. `summary()` has a method for the `aov` class that returns a refined print result, and `model.tables()` has a method for the `aov` class for reporting a table of means or a table of effects.

```r
aov_by_region <- aov(formula = violentCrime ~ region, data = ispcrime2)
print(aov_by_region)
```
```
Call:
   aov(formula = violentCrime ~ region, data = ispcrime2)

Terms:
                    region  Residuals
Sum of Squares  4657351567   93080910
Deg. of Freedom          3        499

Residual standard error: 431.8969
Estimated effects may be unbalanced
7 observations deleted due to missingness
```
```r
summary(aov_by_region)
```
```
             Df    Sum Sq   Mean Sq F value Pr(>F)    
region        3 4.657e+09 1.552e+09    8323 <2e-16 ***
Residuals   499 9.308e+07 1.865e+05                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
7 observations deleted due to missingness
```
```r
model.tables(aov_by_region, type = "means")
```
```
Tables of means
Grand mean
         
500.9702 

 region 
    Central   Cook Northern Southern
        170  30848    438.2    116.9
rep     230      5     85.0    183.0
```
```r
model.tables(aov_by_region, type = "effects")
```
```
Tables of effects

 region 
    Central   Cook Northern Southern
       -331  30347   -62.77     -384
rep     230      5    85.00      183
```
## Other statistical tests

```r
# Test of equal proportions
prop.test(x, n, p, ...)

# Chi-square test of independence/goodness-of-fit
chisq.test(x, y = NULL, ...)

# Shapiro-Wilk test of normality
shapiro.test(x)

# One- or two-sample Kolmogorov-Smirnov test
ks.test(x, y, ..., alternative = "two.sided", exact = NULL)

# F-test to compare variances
var.test(x, y, ratio = 1, alternative = "two.sided",
         conf.level = 0.95, ...)
var.test(formula, data, subset, na.action, ...)

# Test of correlation between paired samples
cor.test(x, y, alternative = "two.sided", conf.level = 0.95,
         method = c("pearson", "kendall", "spearman"), ...)
```
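As one illustration, the chi-square test of independence can be applied directly to the contingency table built earlier (`my_tbl`); note that R may warn that the approximation is unreliable when cell counts are small (output not shown):

```r
# is high_viol independent of region?
chisq.test(my_tbl)
```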
## Linear regression

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

\[ \boldsymbol{\text{y}} = \boldsymbol{\text{X}}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \]

```r
lm(formula, data, subset, na.action, ...)
```

`lm()` is used to fit linear models, according to the `formula` input. `data` is an optional argument, which can take a data frame; if a `data` input is provided, its column names can be used directly in the formula.

```r
formula <- y ~ x1 + x2
```

The response variable goes on the left-hand side of the tilde (`~`) and the explanatory variables on the right-hand side, joined by the `+` symbol. Use `y ~ .` to include all other columns additively as explanatory variables.

```r
lmfit <- lm(formula, data)
print(lmfit) # not recommended
summary(lmfit)
```
The `lm()` output is an object of the `lm` class. Printing an `lm` object gives only minimal information on the fitted model; the `summary()` function with a model input (e.g. `lm`) prints a more comprehensive model summary.

```r
my_lmfit <- lm(violentCrime ~ propertyCrime, ispcrime)
class(my_lmfit)
```
```
[1] "lm"
```
```r
my_lmfit
```
```
Call:
lm(formula = violentCrime ~ propertyCrime, data = ispcrime)

Coefficients:
  (Intercept)  propertyCrime  
     -79.7683         0.1994  
```
```r
summary(my_lmfit)
```
```
Call:
lm(formula = violentCrime ~ propertyCrime, data = ispcrime)

Residuals:
    Min      1Q  Median      3Q     Max 
-2239.5    -2.2    57.0    78.3  3992.9 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -79.768287  16.496961  -4.835 1.77e-06 ***
propertyCrime   0.199367   0.001059 188.303  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 363.5 on 501 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.9861,    Adjusted R-squared:  0.986 
F-statistic: 3.546e+04 on 1 and 501 DF,  p-value: < 2.2e-16
```
```r
plot(lmfit, which = c(1:3, 5), ...)
```

The `plot()` function has a method for `lm` objects, which draws diagnostic plots for the fitted model.

```r
par(mfrow = c(2, 3))
plot(my_lmfit, which = 1:6)
```
```r
logLik(object, ...)
AIC(object, ..., k = 2)
BIC(object, ...) # equivalent to AIC with k = log(n)
```

- `logLik()` for the log-likelihood value
- `AIC()` for the Akaike Information Criterion (with default `k`)
- `BIC()` for the Bayesian Information Criterion
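Applied to the `lm` object fitted above (outputs not shown):

```r
logLik(my_lmfit) # log-likelihood of the fitted model
AIC(my_lmfit)    # penalizes the log-likelihood by 2 per parameter
BIC(my_lmfit)    # penalizes the log-likelihood by log(n) per parameter
```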
```r
scale(y) ~ log(x1) + sqrt(x2) + ...
```

Terms on either side of the `~` can be transformed by applying a function to each term:

- `log()`: natural log (cf. `log2()` and `log10()`)
- `exp()`: exponentiate
- `sqrt()`: square root
- `scale()`: rescale data such that the mean is 0 and the standard deviation is 1
```r
y ~ x1 + x2 + x1:x2
y ~ x1 * x2
```

`x1:x2` takes the interactions of all terms in `x1` and `x2`; `x1 * x2` is equivalent to `x1:x2` plus the original terms `x1` and `x2`, so the two formulas above specify the same model.
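A sketch of the equivalence using `ispcrime` columns (the choice of `burglary` and `arson` is arbitrary; outputs not shown):

```r
# these two calls fit the same model
lm(violentCrime ~ burglary + arson + burglary:arson, data = ispcrime)
lm(violentCrime ~ burglary * arson, data = ispcrime)
```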
```r
y ~ poly(x, degree = n)
y ~ x + I(x^2) + I(x^3) + ...
```

`poly()` generates the n-th degree polynomials of the input `x`; `poly(x, 3, raw = TRUE)` is equivalent to using `x + I(x^2) + I(x^3)` (by default, `poly()` generates orthogonal polynomials). `I()` ensures that `^` is used as an arithmetic operator rather than a formula operator.
```r
library(splines) # part of R "base packages"
y ~ ns(x, ...)
y ~ bs(x, degree = n, ...)
```

`ns()` is used for natural cubic splines; `bs()` is used for n-th degree polynomial splines, with `bs(x, degree = 3)` as the default.
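A minimal sketch fitting spline terms to the running example (`df = 3` is an arbitrary illustrative choice; outputs not shown):

```r
library(splines)

# natural cubic spline with 3 degrees of freedom
fit_ns <- lm(violentCrime ~ ns(propertyCrime, df = 3), data = ispcrime)

# cubic B-spline (degree = 3 is the default)
fit_bs <- lm(violentCrime ~ bs(propertyCrime, degree = 3), data = ispcrime)
```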
## broom

`broom` is a `tidyverse` package for "tidying up" statistical model outputs in R, i.e., converting them into tidy data frames. Key `broom` functions:

- `glance()`
- `tidy()`
- `augment()`

See the `broom` GitHub repository and the "`broom` and `dplyr`" vignette for more.

```r
glance(x, ...)
tidy(x, ...)
augment(x, ...)
```

`glance()` takes a model object and returns a concise one-row summary of the model. `tidy()` takes a model object and returns a `data.frame` with the coefficient estimation results. `augment()` takes a model object and returns a `data.frame` with a row for each observation and additional columns such as predictions, residuals, and cluster assignments.

```r
glance(my_lmfit)
```
```
  r.squared adj.r.squared    sigma statistic p.value df    logLik      AIC
1 0.9860675     0.9860396 363.4653  35457.96       0  2 -3678.253 7362.506
       BIC deviance df.residual
1 7375.168 66185630         501
```
```r
tidy(my_lmfit)
```
```
           term     estimate   std.error  statistic      p.value
1   (Intercept) -79.7682868 16.49696109  -4.835332 1.771126e-06
2 propertyCrime   0.1993675  0.00105876 188.302852 0.000000e+00
```
```r
# check the class of each output
c(class(glance(my_lmfit)), class(tidy(my_lmfit)))
```
```
[1] "data.frame" "data.frame"
```
```r
head(augment(my_lmfit))
```
```
  .rownames violentCrime propertyCrime    .fitted  .se.fit      .resid
1         1          218          1555  230.24816 16.26976  -12.248156
2         2          119           290  -21.95172 16.44233  140.951715
3         3            6           211  -37.70175 16.45666   43.701747
4         4           59           733   66.36808 16.36964   -7.368081
5         5            7            38  -72.19232 16.48949   79.192322
6         6           42           505   20.91229 16.40542   21.087706
         .hat   .sigma      .cooksd  .std.resid
1 0.002003718 363.8282 1.142258e-06 -0.03373209
2 0.002046448 363.7739 1.545129e-04  0.38819703
3 0.002050017 363.8234 1.487927e-05  0.12035979
4 0.002028394 363.8285 4.184743e-07 -0.02029235
5 0.002058203 363.8113 4.905554e-05  0.21810593
6 0.002037270 363.8274 3.442885e-06  0.05807767
```
```r
class(augment(my_lmfit))
```
```
[1] "data.frame"
```
## modelr

`modelr` is a `tidyverse` package that provides helper functions for statistical modeling in R. See the `modelr` GitHub repository and the "Model Basics" chapter of R for Data Science for more.

```r
resample(data, idx)
resample_partition(data, p)
bootstrap(data, n, id = ".id")
```

`resample()` takes a data frame and a vector of indices to create a pointer object, which can be either turned into a data frame or used directly in modeling functions (e.g. `lm()`). `resample_partition()` takes a data frame and a named numeric vector specifying the partitioning; its output is a list of random partitions. `bootstrap()` takes a data frame and an integer specifying the number of bootstrap replicates to generate; its output is a data frame with a column of bootstrap samples and an id column.

```r
set.seed(1) # for reproducibility
partitioned <- ispcrime %>%
  filter(year == "2015") %>%
  resample_partition(p = c(treatment = 0.5, control = 0.5))
partitioned
```
```
$treatment
<resample [50 x 12]> 1, 2, 5, 10, 11, 12, 14, 16, 19, 22, ...

$control
<resample [52 x 12]> 3, 4, 6, 7, 8, 9, 13, 15, 17, 18, ...
```
```r
head(partitioned$treatment$data)
```
```
  year    county violentCrime murder rape robbery aggAssault propertyCrime
1 2015     Adams          227      5   39       9        174          1268
2 2015 Alexander          107      0    3       3        101           194
3 2015      Bond            5      1    0       0          4           130
4 2015     Boone            0      0    0       0          0             1
5 2015     Brown            9      0    1       0          8            11
6 2015    Bureau           53      1   17       4         31           360
  burglary larcenyTft MVTft arson
1      274        944    43     7
2       58        119     9     8
3       39         84     7     0
4        0          0     0     1
5        4          5     2     0
6      110        244     6     0
```
```r
partitioned$treatment$idx
```
```
 [1]   1   2   5  10  11  12  14  16  19  22  24  25  26  27  28  30  32
[18]  34  36  38  39  40  42  44  45  48  50  53  54  55  56  57  58  61
[35]  64  69  71  75  77  82  84  86  87  91  92  95  97  98  99 100
```
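A minimal `bootstrap()` sketch; each element of the resulting `strap` column is a resample object that `lm()` accepts directly (only the first replicate is fitted here; output not shown):

```r
boot <- bootstrap(ispcrime, n = 100)

# fit the running model to the first bootstrap replicate
lm(violentCrime ~ propertyCrime, data = boot$strap[[1]])
```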
```r
rmse(model, data)
mae(model, data)
qae(model, data, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
rsquare(model, data)
```

`rmse()` returns the root mean squared error; `mae()` returns the mean absolute error; `qae()` returns quantiles of the absolute error; and `rsquare()` returns the \( R^2 \) value, i.e. the variance of the predictions divided by the variance of the response.
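For example, evaluating the model fitted earlier against the original data (outputs not shown):

```r
rmse(my_lmfit, ispcrime)    # root mean squared error
mae(my_lmfit, ispcrime)     # mean absolute error
rsquare(my_lmfit, ispcrime) # variance of predictions / variance of response
```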
```r
add_predictions(data, model, var = "pred")
add_residuals(data, model, var = "resid")
```

`add_predictions()` adds a new column to a data frame for predicted values based on a model; `add_residuals()` adds a new column to a data frame for residuals based on a model.

```r
ispcrime_subset <- select(ispcrime, year:violentCrime, propertyCrime)
ispcrime_subset %>%
  add_predictions(my_lmfit) %>%
  add_residuals(my_lmfit) %>%
  head()
```
```
  year    county violentCrime propertyCrime       pred       resid
1 2011     Adams          218          1555  230.24816  -12.248156
2 2011 Alexander          119           290  -21.95172  140.951715
3 2011      Bond            6           211  -37.70175   43.701747
4 2011     Boone           59           733   66.36808   -7.368081
5 2011     Brown            7            38  -72.19232   79.192322
6 2011    Bureau           42           505   20.91229   21.087706
```
## Generalized linear models

\[ \text{E}[\boldsymbol{\text{Y}}] = \boldsymbol{\mu} = g^{-1}(\boldsymbol{\text{X}}\boldsymbol{\beta}) \]

\[ \text{Var}[\boldsymbol{\text{Y}}] = \text{V}[\boldsymbol{\mu}] = \text{V}[g^{-1}(\boldsymbol{\text{X}}\boldsymbol{\beta})] \]

```r
glm(formula, family = gaussian, data, ...)
```

The `formula` is the same formula we have seen all along. The `family` input specifies the link function \( g \); use `gaussian(link = "identity")` for the linear model. The `glm()` function returns a `glm` class object; use `summary()` to inspect the model estimation results. R provides the following `family` objects:
| family | Default link function | Use |
|---|---|---|
| `gaussian` | `(link = "identity")` | Linear model for continuous outcome (ordinary linear regression; OLS) |
| `binomial` | `(link = "logit")` | Logit model for binary outcome (logistic regression) |
| `poisson` | `(link = "log")` | Poisson model for count outcome |
| `Gamma` | `(link = "inverse")` | … |
| `inverse.gaussian` | `(link = "1/mu^2")` | … |
| `quasi` | `(link = "identity")` | … |
| `quasibinomial` | `(link = "logit")` | … |
| `quasipoisson` | `(link = "log")` | … |
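Since `gaussian(link = "identity")` corresponds to the ordinary linear model, the following sketch should reproduce the `lm` fit from earlier (output not shown):

```r
# equivalent to lm(violentCrime ~ propertyCrime, data = ispcrime)
glm_fit <- glm(violentCrime ~ propertyCrime, family = gaussian, data = ispcrime)
summary(glm_fit)
```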
For logistic regression (binary outcome), use `glm()` with `family = binomial`:

```r
glm(formula, family = binomial, data, ...)
```

For Poisson regression (count outcome), use `glm()` with `family = poisson`:

```r
glm(formula, family = poisson, data, ...)
```
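A minimal logistic regression sketch, reusing the binary `high_viol` indicator created in the frequency table example (`my_data`); the fitted coefficients are on the log-odds scale (output not shown):

```r
# above-average violent crime as a function of property crime counts
logit_fit <- glm(high_viol ~ prop, family = binomial, data = my_data)
summary(logit_fit)
```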
## Other packages

There are third-party packages that provide additional functions and/or different interfaces to existing statistical analysis functions. Although I have little exposure to them, the following two might be worth checking out:

- `psych` package
- `jmv` package