# Statistical Analysis in R
2018-04-18
Bobae Kang
(Bobae.Kang@illinois.gov)
```r
summary(data)
```

The `summary()` function is a quick way to get descriptive statistics on each column of a tabular data object. For a `factor` column, we get the count of each level as well as the number of missing values (`NA`). `summary()` is also used to get the "summary" of a fitted model object (e.g. `lm`), as we will see shortly.

```r
summary(ispcrime)
```

```
      year          county      violentCrime       murder       
 Min.   :2011   Adams    :  5   Min.   :    0   Min.   :  0.000  
 1st Qu.:2012   Alexander:  5   1st Qu.:   19   1st Qu.:  0.000  
 Median :2013   Bond     :  5   Median :   42   Median :  0.000  
 Mean   :2013   Boone    :  5   Mean   :  501   Mean   :  7.026  
 3rd Qu.:2014   Brown    :  5   3rd Qu.:  133   3rd Qu.:  1.000  
 Max.   :2015   Bureau   :  5   Max.   :33348   Max.   :566.000  
                (Other)  :480   NA's   :7       NA's   :7        
      rape            robbery          aggAssault      propertyCrime   
 Min.   :   0.00   Min.   :    0.0   Min.   :    0.0   Min.   :     0  
 1st Qu.:   1.00   1st Qu.:    0.0   1st Qu.:   15.0   1st Qu.:   133  
 Median :   6.00   Median :    2.0   Median :   33.0   Median :   349  
 Mean   :  41.29   Mean   :  172.3   Mean   :  280.4   Mean   :  2913  
 3rd Qu.:  22.00   3rd Qu.:   13.0   3rd Qu.:  102.0   3rd Qu.:  1190  
 Max.   :1986.00   Max.   :16095.0   Max.   :15129.0   Max.   :178902  
 NA's   :7         NA's   :7         NA's   :7         NA's   :7       
    burglary         larcenyTft          MVTft             arson         
 Min.   :    0.0   Min.   :     0.0   Min.   :    0.0   Min.   :   0.00  
 1st Qu.:   35.5   1st Qu.:    85.5   1st Qu.:    3.0   1st Qu.:   1.00  
 Median :   79.0   Median :   258.0   Median :   10.0   Median :   2.00  
 Mean   :  589.3   Mean   :  2084.9   Mean   :  215.2   Mean   :  23.45  
 3rd Qu.:  268.0   3rd Qu.:   852.0   3rd Qu.:   30.0   3rd Qu.:   8.50  
 Max.   :38485.0   Max.   :116145.0   Max.   :22879.0   Max.   :1418.00  
 NA's   :7         NA's   :7          NA's   :7         NA's   :7        
```
Two other basic functions are `mean()` and `median()`:

```r
mean(ispcrime$violentCrime)
```
```
[1] NA
```
```r
median(ispcrime$violentCrime)
```
```
[1] NA
```

… wait, what?

```r
mean(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1] 500.9702
```
```r
median(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1] 42
```

`NA` is known to be "contagious" in R: any operation involving missing values returns `NA`. `na.rm` is an argument with a boolean input that controls whether `NA` values are removed/excluded from the calculation.

Other common functions for descriptive statistics include:

- `min()`, `max()`, `range()`
- `fivenum()`, `IQR()`, `quantile()`
- `var()`, `sd()`
## Range

```r
min(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1] 0
```
```r
max(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1] 33348
```
```r
range(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1]     0 33348
```
## Percentiles

```r
fivenum(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1]     0    19    42   133 33348
```
```r
IQR(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1] 114
```

`fivenum()` returns what is called Tukey's five number summary (minimum, 1st quartile, median, 3rd quartile, maximum) for the input data. `IQR()` returns the inter-quartile range, i.e. the difference between the 1st quartile (25%) and the 3rd quartile (75%).

```r
# default: equal to fivenum()
quantile(ispcrime$violentCrime, probs = seq(0, 1, 0.25), na.rm = TRUE)
```
```
   0%   25%   50%   75%  100% 
    0    19    42   133 33348 
```
```r
quantile(ispcrime$violentCrime, probs = seq(0, 1, 0.1), na.rm = TRUE)
```
```
     0%     10%     20%     30%     40%     50%     60%     70%     80%     90%    100% 
    0.0     5.0    13.0    24.0    32.0    42.0    57.0    90.0   246.2   668.4 33348.0 
```

The default `quantile()` output is equal to the `fivenum()` output; the `probs` argument makes `quantile()` more flexible.

## Variance
```r
var(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1] 9463013
```
```r
sd(ispcrime$violentCrime, na.rm = TRUE)
```
```
[1] 3076.201
```

`var()` returns the sample variance of the given data: \[ s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \] `sd()` returns the standard deviation of the given data: \[ s = \sqrt{s^2} \]

```r
variables <- ispcrime %>% select(violentCrime, propertyCrime)
var(variables, na.rm = TRUE)
```
```
              violentCrime propertyCrime
violentCrime       9463013      46803865
propertyCrime     46803865     234761775
```

Given multiple columns, `var()` returns a covariance matrix. `cov()` can be used instead; however, `cov()` has no `na.rm` argument to remove missing values.

The `moments` package offers the following functions to measure the shape of a distribution (a short sketch follows below; see the `moments` package documentation for details):

- `skewness()`
- `kurtosis()`

The `outliers` package offers useful functions to detect and measure outliers in data; see the `outliers` package documentation for more.
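A minimal sketch of the two `moments` functions, applied to the column summarized above (this assumes the `moments` package is installed; outputs not shown):

```r
library(moments)

# positive skewness would indicate a right-skewed distribution
skewness(ispcrime$violentCrime, na.rm = TRUE)

# large kurtosis indicates heavy tails relative to a normal distribution
kurtosis(ispcrime$violentCrime, na.rm = TRUE)
```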
## Frequency tables

Base R offers several functions for frequency tables:

- `table()` for generating frequency tables
- `prop.table()` for tables of proportions
- `xtabs()` for creating frequency tables using a formula
- `ftable()` for creating "flat" contingency tables

```r
table(...)
prop.table(x, margin = NULL)
ftable(x)
xtabs(formula, data, ...)
```

`table()` takes one or more data vectors of the same length; use `as.data.frame()` to turn a `table` into a data frame. `prop.table()` and `ftable()` take a `table` object. `xtabs()` uses a formula to generate a frequency table; with a `data` input, its column names can be used directly in the formula.

```r
my_data <- ispcrime %>%
  left_join(regions) %>%
  select(
    region,
    viol = violentCrime,
    prop = propertyCrime
  ) %>%
  mutate(
    high_viol = ifelse(viol > mean(viol, na.rm = TRUE), 1, 0),
    high_prop = ifelse(prop > mean(prop, na.rm = TRUE), 1, 0)
  )
```
```r
my_tbl <- table(
  region = my_data$region,
  hviol = my_data$high_viol
)
my_tbl
```
```
          hviol
region       0   1
  Central  206  24
  Cook       0   5
  Northern  60  25
  Southern 175   8
```
```r
as.data.frame(my_tbl)
```
```
    region hviol Freq
1  Central     0  206
2     Cook     0    0
3 Northern     0   60
4 Southern     0  175
5  Central     1   24
6     Cook     1    5
7 Northern     1   25
8 Southern     1    8
```
```r
prop.table(my_tbl, 1) # each row adds up to 1
```
```
          hviol
region              0          1
  Central  0.89565217 0.10434783
  Cook     0.00000000 1.00000000
  Northern 0.70588235 0.29411765
  Southern 0.95628415 0.04371585
```
```r
prop.table(my_tbl, 2) # each column adds up to 1
```
```
          hviol
region              0          1
  Central  0.46712018 0.38709677
  Cook     0.00000000 0.08064516
  Northern 0.13605442 0.40322581
  Southern 0.39682540 0.12903226
```
```r
my_tbl2 <- table(
  region = my_data$region,
  hviol = my_data$high_viol,
  hprop = my_data$high_prop
)

# with ftable
ftable(my_tbl2)
```
```
                hprop   0   1
region   hviol               
Central  0            197   9
         1              1  23
Cook     0              0   0
         1              0   5
Northern 0             55   5
         1              0  25
Southern 0            173   2
         1              0   8
```
```r
# without ftable
my_tbl2
```
```
, , hprop = 0

          hviol
region       0   1
  Central  197   1
  Cook       0   0
  Northern  55   0
  Southern 173   0

, , hprop = 1

          hviol
region       0   1
  Central    9  23
  Cook       0   5
  Northern   5  25
  Southern   2   8
```
```r
# one-dimension
xtabs(~ region, my_data)
```
```
region
 Central     Cook Northern Southern 
     230        5       85      190 
```
```r
# two-dimension
xtabs(~ region + high_viol, my_data)
```
```
          high_viol
region       0   1
  Central  206  24
  Cook       0   5
  Northern  60  25
  Southern 175   8
```
## t-test

```r
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
       mu = 0, conf.level = 0.95, ...)
t.test(formula, data, subset, na.action, ...)
```

If only the `x` input is given, `t.test()` performs a one-sample t-test in which the mean of `x` is compared against the `mu` value (default 0). A two-sample t-test uses both `x` and `y` inputs, or a `formula` (`y ~ x`) with a `data` input.
inputispcrime2 <- left_join(ispcrime, regions)
viol_crime_north <- (filter(ispcrime2, region == "Northern"))$violentCrime
viol_crime_south <- (filter(ispcrime2, region == "Southern"))$violentCrime
t.test(viol_crime_north, viol_crime_south)
Welch Two Sample t-test
data: viol_crime_north and viol_crime_south
t = 4.4064, df = 105.49, p-value = 2.534e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
176.7132 465.8398
sample estimates:
mean of x mean of y
438.2000 116.9235
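A one-sample sketch for comparison, testing the Northern region mean against `mu = 100` (the value 100 is purely illustrative; output not shown):

```r
# one-sample t-test: is the mean of viol_crime_north different from 100?
t.test(viol_crime_north, mu = 100)
```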
## Wilcoxon test

```r
wilcox.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
            mu = 0, conf.level = 0.95, ...)
wilcox.test(formula, data, subset, na.action, ...)
```

If only the `x` input is provided, `wilcox.test()` performs a one-sample test. A two-sample test uses both `x` and `y` inputs, or a `formula` (`y ~ x`) with a `data` input.
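A minimal two-sample sketch, reusing the vectors from the t-test example above (output not shown):

```r
# rank-based alternative to the Welch t-test above
wilcox.test(viol_crime_north, viol_crime_south)
```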
## ANOVA

```r
aov(formula, data, qr = TRUE, ...)
```

The `formula` is of the following format: `value ~ subgroup`. `summary()` has a method for the `aov` class that returns a refined print result, and `model.tables()` has a method for the `aov` class for reporting a table of means or a table of effects.

```r
aov_by_region <- aov(formula = violentCrime ~ region, data = ispcrime2)
print(aov_by_region)
```
```
Call:
   aov(formula = violentCrime ~ region, data = ispcrime2)

Terms:
                    region  Residuals
Sum of Squares  4657351567   93080910
Deg. of Freedom          3        499

Residual standard error: 431.8969
Estimated effects may be unbalanced
7 observations deleted due to missingness
```
```r
summary(aov_by_region)
```
```
             Df    Sum Sq   Mean Sq F value Pr(>F)    
region        3 4.657e+09 1.552e+09    8323 <2e-16 ***
Residuals   499 9.308e+07 1.865e+05                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
7 observations deleted due to missingness
```
```r
model.tables(aov_by_region, type = "means")
```
```
Tables of means
Grand mean
         
500.9702 

 region 
    Central   Cook Northern Southern
        170  30848    438.2    116.9
rep     230      5     85.0    183.0
```
```r
model.tables(aov_by_region, type = "effects")
```
```
Tables of effects

 region 
    Central   Cook Northern Southern
       -331  30347   -62.77     -384
rep     230      5    85.00      183
```
## Other statistical tests

```r
# Test of equal proportions
prop.test(x, n, p, ...)

# Chi-square test of independence/goodness-of-fit
chisq.test(x, y = NULL, ...)

# Shapiro-Wilk test of normality
shapiro.test(x)

# One- or two-sample Kolmogorov-Smirnov test
ks.test(x, y, ..., alternative = "two.sided", exact = NULL)

# F-test to compare variances
var.test(x, y, ratio = 1, alternative = "two.sided",
         conf.level = 0.95, ...)
var.test(formula, data, subset, na.action, ...)

# Test of correlation between paired samples
cor.test(x, y, alternative = "two.sided", conf.level = 0.95,
         method = c("pearson", "kendall", "spearman"), ...)
```
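As one illustration, the chi-square test of independence can be applied directly to the contingency table built earlier (`my_tbl`); note that R may warn that the approximation is unreliable when cell counts are small (output not shown):

```r
# is high_viol independent of region?
chisq.test(my_tbl)
```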
## Linear regression

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

\[ \boldsymbol{\text{y}} = \boldsymbol{\text{X}}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \]

```r
lm(formula, data, subset, na.action, ...)
```

`lm()` is used to fit linear models, according to the `formula` input. `data` is an optional argument, which can take a data frame; if a `data` input is provided, its column names can be used directly in the formula.

```r
formula <- y ~ x1 + x2
```

The response variable goes on the left-hand side of the tilde (`~`) and the explanatory variables on the right-hand side, joined by the `+` symbol. Use `y ~ .` to include all other columns additively as explanatory variables.

```r
lmfit <- lm(formula, data)
print(lmfit) # not recommended
summary(lmfit)
```
The `lm()` output is an object of the `lm` class. Printing an `lm` object gives only minimal information on the fitted model; the `summary()` function with a model input (e.g. `lm`) prints a more comprehensive model summary.

```r
my_lmfit <- lm(violentCrime ~ propertyCrime, ispcrime)
class(my_lmfit)
```
```
[1] "lm"
```
```r
my_lmfit
```
```
Call:
lm(formula = violentCrime ~ propertyCrime, data = ispcrime)

Coefficients:
  (Intercept)  propertyCrime  
     -79.7683         0.1994  
```
```r
summary(my_lmfit)
```
```
Call:
lm(formula = violentCrime ~ propertyCrime, data = ispcrime)

Residuals:
    Min      1Q  Median      3Q     Max 
-2239.5    -2.2    57.0    78.3  3992.9 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -79.768287  16.496961  -4.835 1.77e-06 ***
propertyCrime   0.199367   0.001059 188.303  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 363.5 on 501 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.9861,    Adjusted R-squared:  0.986 
F-statistic: 3.546e+04 on 1 and 501 DF,  p-value: < 2.2e-16
```
```r
plot(lmfit, which = c(1:3, 5), ...)
```

The `plot()` function has a method for `lm` objects, which draws diagnostic plots for the fitted model.

```r
par(mfrow = c(2, 3))
plot(my_lmfit, which = 1:6)
```
```r
logLik(object, ...)
AIC(object, ..., k = 2)
BIC(object, ...) # equivalent to AIC with k = log(n)
```

- `logLik()` for the log-likelihood value
- `AIC()` for the Akaike Information Criterion (with default `k`)
- `BIC()` for the Bayesian Information Criterion
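Applied to the `lm` object fitted above (outputs not shown):

```r
logLik(my_lmfit) # log-likelihood of the fitted model
AIC(my_lmfit)    # penalizes the log-likelihood by 2 per parameter
BIC(my_lmfit)    # penalizes the log-likelihood by log(n) per parameter
```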
```r
scale(y) ~ log(x1) + sqrt(x2) + ...
```

Terms on either side of the `~` can be transformed by applying a function to each term:

- `log()`: natural log (cf. `log2()` and `log10()`)
- `exp()`: exponentiate
- `sqrt()`: square root
- `scale()`: rescale data such that the mean is 0 and the standard deviation is 1
```r
y ~ x1 + x2 + x1:x2
y ~ x1 * x2
```

`x1:x2` takes the interactions of all terms in `x1` and `x2`; `x1 * x2` is equivalent to `x1:x2` plus the original terms `x1` and `x2`, so the two formulas above specify the same model.
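A sketch of the equivalence using `ispcrime` columns (the choice of `burglary` and `arson` is arbitrary; outputs not shown):

```r
# these two calls fit the same model
lm(violentCrime ~ burglary + arson + burglary:arson, data = ispcrime)
lm(violentCrime ~ burglary * arson, data = ispcrime)
```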
```r
y ~ poly(x, degree = n)
y ~ x + I(x^2) + I(x^3) + ...
```

`poly()` generates the n-th degree polynomials of the input `x`; `poly(x, 3, raw = TRUE)` is equivalent to using `x + I(x^2) + I(x^3)` (by default, `poly()` generates orthogonal polynomials). `I()` ensures that `^` is used as an arithmetic operator rather than a formula operator.
```r
library(splines) # part of R "base packages"
y ~ ns(x, ...)
y ~ bs(x, degree = n, ...)
```

`ns()` is used for natural cubic splines; `bs()` is used for n-th degree polynomial splines, with `bs(x, degree = 3)` as the default.
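A minimal sketch fitting spline terms to the running example (`df = 3` is an arbitrary illustrative choice; outputs not shown):

```r
library(splines)

# natural cubic spline with 3 degrees of freedom
fit_ns <- lm(violentCrime ~ ns(propertyCrime, df = 3), data = ispcrime)

# cubic B-spline (degree = 3 is the default)
fit_bs <- lm(violentCrime ~ bs(propertyCrime, degree = 3), data = ispcrime)
```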
## broom

`broom` is a `tidyverse` package for "tidying up" statistical model outputs in R, i.e., converting them into tidy data frames. Key `broom` functions:

- `glance()`
- `tidy()`
- `augment()`

See the `broom` GitHub repository and the "`broom` and `dplyr`" vignette for more.

```r
glance(x, ...)
tidy(x, ...)
augment(x, ...)
```

`glance()` takes a model object and returns a concise one-row summary of the model. `tidy()` takes a model object and returns a `data.frame` with the coefficient estimation results. `augment()` takes a model object and returns a `data.frame` with a row for each observation and additional columns such as predictions, residuals, and cluster assignments.

```r
glance(my_lmfit)
```
```
  r.squared adj.r.squared    sigma statistic p.value df    logLik      AIC
1 0.9860675     0.9860396 363.4653  35457.96       0  2 -3678.253 7362.506
       BIC deviance df.residual
1 7375.168 66185630         501
```
```r
tidy(my_lmfit)
```
```
           term     estimate   std.error  statistic      p.value
1   (Intercept) -79.7682868 16.49696109  -4.835332 1.771126e-06
2 propertyCrime   0.1993675  0.00105876 188.302852 0.000000e+00
```
```r
# check the class of each output
c(class(glance(my_lmfit)), class(tidy(my_lmfit)))
```
```
[1] "data.frame" "data.frame"
```
```r
head(augment(my_lmfit))
```
```
  .rownames violentCrime propertyCrime    .fitted  .se.fit      .resid
1         1          218          1555  230.24816 16.26976  -12.248156
2         2          119           290  -21.95172 16.44233  140.951715
3         3            6           211  -37.70175 16.45666   43.701747
4         4           59           733   66.36808 16.36964   -7.368081
5         5            7            38  -72.19232 16.48949   79.192322
6         6           42           505   20.91229 16.40542   21.087706
         .hat   .sigma      .cooksd  .std.resid
1 0.002003718 363.8282 1.142258e-06 -0.03373209
2 0.002046448 363.7739 1.545129e-04  0.38819703
3 0.002050017 363.8234 1.487927e-05  0.12035979
4 0.002028394 363.8285 4.184743e-07 -0.02029235
5 0.002058203 363.8113 4.905554e-05  0.21810593
6 0.002037270 363.8274 3.442885e-06  0.05807767
```
```r
class(augment(my_lmfit))
```
```
[1] "data.frame"
```
## modelr

`modelr` is a `tidyverse` package that provides helper functions for statistical modeling in R. See the `modelr` GitHub repository and the "Model Basics" chapter of R for Data Science for more.

```r
resample(data, idx)
resample_partition(data, p)
bootstrap(data, n, id = ".id")
```

`resample()` takes a data frame and a vector of indices to create a pointer object, which can be either turned into a data frame or used directly in modeling functions (e.g. `lm()`). `resample_partition()` takes a data frame and a named numeric vector specifying the partitioning; its output is a list of random partitions. `bootstrap()` takes a data frame and an integer specifying the number of bootstrap replicates to generate; its output is a data frame with a column of bootstrap samples and an id column.

```r
set.seed(1) # for reproducibility
partitioned <- ispcrime %>%
  filter(year == "2015") %>%
  resample_partition(p = c(treatment = 0.5, control = 0.5))
partitioned
```
```
$treatment
<resample [50 x 12]> 1, 2, 5, 10, 11, 12, 14, 16, 19, 22, ...

$control
<resample [52 x 12]> 3, 4, 6, 7, 8, 9, 13, 15, 17, 18, ...
```
```r
head(partitioned$treatment$data)
```
```
  year    county violentCrime murder rape robbery aggAssault propertyCrime
1 2015     Adams          227      5   39       9        174          1268
2 2015 Alexander          107      0    3       3        101           194
3 2015      Bond            5      1    0       0          4           130
4 2015     Boone            0      0    0       0          0             1
5 2015     Brown            9      0    1       0          8            11
6 2015    Bureau           53      1   17       4         31           360
  burglary larcenyTft MVTft arson
1      274        944    43     7
2       58        119     9     8
3       39         84     7     0
4        0          0     0     1
5        4          5     2     0
6      110        244     6     0
```
```r
partitioned$treatment$idx
```
```
 [1]   1   2   5  10  11  12  14  16  19  22  24  25  26  27  28  30  32
[18]  34  36  38  39  40  42  44  45  48  50  53  54  55  56  57  58  61
[35]  64  69  71  75  77  82  84  86  87  91  92  95  97  98  99 100
```
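A minimal `bootstrap()` sketch; each element of the resulting `strap` column is a resample object that `lm()` accepts directly (only the first replicate is fitted here; output not shown):

```r
boot <- bootstrap(ispcrime, n = 100)

# fit the running model to the first bootstrap replicate
lm(violentCrime ~ propertyCrime, data = boot$strap[[1]])
```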
```r
rmse(model, data)
mae(model, data)
qae(model, data, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
rsquare(model, data)
```

`rmse()` returns the root mean squared error; `mae()` returns the mean absolute error; `qae()` returns quantiles of the absolute error; and `rsquare()` returns the \( R^2 \) value, i.e. the variance of the predictions divided by the variance of the response.
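For example, evaluating the model fitted earlier against the original data (outputs not shown):

```r
rmse(my_lmfit, ispcrime)    # root mean squared error
mae(my_lmfit, ispcrime)     # mean absolute error
rsquare(my_lmfit, ispcrime) # variance of predictions / variance of response
```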
```r
add_predictions(data, model, var = "pred")
add_residuals(data, model, var = "resid")
```

`add_predictions()` adds a new column to a data frame for predicted values based on a model; `add_residuals()` adds a new column to a data frame for residuals based on a model.

```r
ispcrime_subset <- select(ispcrime, year:violentCrime, propertyCrime)
ispcrime_subset %>%
  add_predictions(my_lmfit) %>%
  add_residuals(my_lmfit) %>%
  head()
```
```
  year    county violentCrime propertyCrime       pred       resid
1 2011     Adams          218          1555  230.24816  -12.248156
2 2011 Alexander          119           290  -21.95172  140.951715
3 2011      Bond            6           211  -37.70175   43.701747
4 2011     Boone           59           733   66.36808   -7.368081
5 2011     Brown            7            38  -72.19232   79.192322
6 2011    Bureau           42           505   20.91229   21.087706
```
## Generalized linear models

\[ \text{E}[\boldsymbol{\text{Y}}] = \boldsymbol{\mu} = g^{-1}(\boldsymbol{\text{X}}\boldsymbol{\beta}) \]

\[ \text{Var}[\boldsymbol{\text{Y}}] = \text{V}[\boldsymbol{\mu}] = \text{V}[g^{-1}(\boldsymbol{\text{X}}\boldsymbol{\beta})] \]

```r
glm(formula, family = gaussian, data, ...)
```

The `formula` is the same formula we have seen all along. The `family` input specifies the link function \( g \); use `gaussian(link = "identity")` for the linear model. The `glm()` function returns a `glm` class object; use `summary()` to inspect the model estimation results. R provides the following `family` objects:
| family | Default link function | Use |
|---|---|---|
| `gaussian` | `(link = "identity")` | Linear model for continuous outcome (ordinary linear regression; OLS) |
| `binomial` | `(link = "logit")` | Logit model for binary outcome (logistic regression) |
| `poisson` | `(link = "log")` | Poisson model for count outcome |
| `Gamma` | `(link = "inverse")` | … |
| `inverse.gaussian` | `(link = "1/mu^2")` | … |
| `quasi` | `(link = "identity")` | … |
| `quasibinomial` | `(link = "logit")` | … |
| `quasipoisson` | `(link = "log")` | … |
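Since `gaussian(link = "identity")` corresponds to the ordinary linear model, the following sketch should reproduce the `lm` fit from earlier (output not shown):

```r
# equivalent to lm(violentCrime ~ propertyCrime, data = ispcrime)
glm_fit <- glm(violentCrime ~ propertyCrime, family = gaussian, data = ispcrime)
summary(glm_fit)
```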
For logistic regression (binary outcome), use `glm()` with `family = binomial`:

```r
glm(formula, family = binomial, data, ...)
```

For Poisson regression (count outcome), use `glm()` with `family = poisson`:

```r
glm(formula, family = poisson, data, ...)
```
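A minimal logistic regression sketch, reusing the binary `high_viol` indicator created in the frequency table example (`my_data`); the fitted coefficients are on the log-odds scale (output not shown):

```r
# above-average violent crime as a function of property crime counts
logit_fit <- glm(high_viol ~ prop, family = binomial, data = my_data)
summary(logit_fit)
```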
## Other packages

There are third-party packages that provide additional functions and/or different interfaces to existing statistical analysis functions. Although I have little exposure to them, the following two might be worth checking out:

- `psych` package
- `jmv` package