This page contains the notes for the second part of R Workshop Module 5: Statistical modeling with R, which is part of the R Workshop series prepared by ICJIA Research Analyst Bobae Kang to enable and encourage ICJIA researchers to take advantage of R, a statistical programming language that is one of the most powerful modern research tools.
In the previous Part of Module 5, we explored how to conduct basic statistical analysis with R. Now we would like to move toward the realm of advanced statistical modeling. To facilitate that progress, this Part offers a brief introduction to some advanced modeling topics (survival analysis, time series analysis, spatial regression analysis, and machine learning) as well as a list of relevant R packages.
Please note that the current Part barely scratches the surface of any of the introduced topics. Think of this Part as taking the very first step, and refer to the suggested resource materials to learn more about each topic.
Source: Wikimedia Commons
“Survival analysis is used to analyze data in which the time until the event is of interest. The response variable is the time until that event and is often called a failure time, survival time, or event time.”
- Harrell Jr. (2015).
In survival analysis, the response variable is a non-negative discrete/continuous random variable, which represents the time to event for each subject/observation.
We start this section by introducing some basic concepts in survival analysis.
Failure, or event, refers to the event of interest; the time, \(T\), from the beginning of observation to the occurrence of the event is measured and available for modeling.
Examples of events include death, disease relapse, and recidivism.
Observations are called censored when the information about their survival time is incomplete.
The most common form of censoring is right-censoring, where the final endpoint is only known to exceed a particular value. Right-censored data most commonly arise when the study/observation period ends before the event occurs.
The survival function \(S(t)\) is the probability that the time of event/failure is later than some specified time \(t\). Using mathematical notation:
\[S(t) = \text{Pr}(T > t),\quad 0 < t < \infty.\]
The hazard function \(\lambda(t)\) is the event rate at time \(t\), conditional on survival (i.e. no event) until time \(t\) or later:
\[\lambda(t) = \lim_{dt\to0} \frac{\text{Pr}(t \leq T < t + dt)}{S(t)\,dt} = -\frac{S'(t)}{S(t)}\]
The R ecosystem offers a number of packages providing tools and functionality pertaining to survival analysis.
Among many, the `survival` package is the most essential; it makes available the `Surv` class, which is fundamental to survival analysis tools in the R ecosystem. For more details, you may visit its GitHub repository and read the package vignettes available on the package CRAN page.
The `survminer` package helps with the visualization of survival models. Take a look at the package website or the GitHub repository for more information.
In the following, we will introduce two basic models for survival analysis: the Kaplan-Meier estimator and the Cox proportional hazards model.
The Kaplan-Meier (KM) estimator is perhaps the most widely used nonparametric estimator of the survival function. Mathematically, the KM estimator \(\hat{S}(t)\) is the product over the failure times of the conditional probabilities of surviving to the next failure time, expressed as follows:
\[\hat{S}(t) = \prod_{t_i \leq t} \Big( 1 - \frac{d_i}{n_i} \Big),\]
where \(d_i\) is the number of subjects who fail at time \(t_i\) and \(n_i\) is the number of subjects at risk at time \(t_i\).
#### `Surv` function and `Surv` class

```r
survival::Surv(time, time2, event, type, ...)
```
The `survival` package has the `Surv()` function to create objects of the `Surv` class, which are used in many survival analysis tools in R.
In `Surv()`, `time` is the follow-up time for right-censored data. For interval-censored data, `time` is the starting time and `time2` is the ending time.
`event` is the status indicator, where 0 = alive (i.e., the event has not occurred) and 1 = dead (i.e., the event has occurred).
`type` is a character string specifying the censoring type. Available inputs include "right" for right censoring, "left" for left censoring, and "interval" for interval data.
`Surv()` returns a `Surv` class object, which is used to fit survival models.
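For illustration, here is a minimal sketch using the `lung` dataset that ships with the `survival` package:

```r
library(survival)

# lung: survival of patients with advanced lung cancer;
# time is follow-up time in days, status is 1 = censored, 2 = dead
s <- Surv(lung$time, lung$status)
head(s)  # censored times are printed with a "+" suffix
```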
#### `survfit` function

```r
survival::survfit(formula/model, data, ...)
```
The `survfit()` function returns a survival curve. The function uses the KM estimator if given a `formula`; if a `model` is given instead, the survival curve is based on the input model (e.g., a Cox model).
When a `formula` is used for the KM estimator, it must have a `Surv` object as the response variable and might look like: `Surv(time, status) ~ x`.
`data` is optional, and if provided, the columns of the input data frame can be used in the formula.
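For example, a minimal sketch of fitting KM survival curves by group, again using the bundled `lung` dataset:

```r
library(survival)

# Kaplan-Meier survival curves, estimated separately for each sex
km_fit <- survfit(Surv(time, status) ~ sex, data = lung)
summary(km_fit, times = c(90, 180, 365))  # estimates at selected follow-up days
```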
The Cox proportional hazards model, or simply the Cox (regression) model, models the effect of a set of explanatory variables on the hazard, or risk of failure, over time.
The model assumes that the unique effect of a change in a covariate is multiplicative with respect to the hazard rate. The model is given as follows:
\[\lambda(t|\boldsymbol{\text{x}}_i) = \lambda_0(t)\psi_i = \lambda_0(t)\text{exp}(\boldsymbol{\text{x}}_i^{\text{T}}\beta),\] where \(\boldsymbol{\text{x}}_i\) denotes the covariate values for subject \(i\). The estimated coefficients are interpreted in relative terms, as log hazard ratios.
#### `coxph` function

```r
survival::coxph(formula, data, ...)
```
The `survival` package offers the `coxph()` function to fit Cox models. As in the case of `survfit()`, the `formula` for `coxph()` must have a `Surv` object as the response variable. Again, `data` is optional and, if provided, the columns of the input data frame can be used in the formula.
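A minimal sketch, once more using the bundled `lung` dataset:

```r
library(survival)

# Cox proportional hazards model with age and sex as covariates
cox_fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(cox_fit)  # coef is the log hazard ratio; exp(coef) is the hazard ratio
```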
#### Plotting survival curves

```r
plot(survfit)
survminer::ggsurvplot(survfit, data, ...)
```
There are two ways to plot survival curves.
First, we can simply use the base R plotting function, `plot()`, since the `survival` package provides a `plot` method for `survfit` objects.
Alternatively, we can use `ggsurvplot()` from the `survminer` package to plot survival curves. `ggsurvplot()` has a `ggplot2`-like API and makes `ggplot2` themes available, as illustrated in the sketch below.
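A minimal sketch of both approaches, reusing the KM fit from above:

```r
library(survival)
library(survminer)

km_fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Base R plot method for survfit objects
plot(km_fit, col = c("blue", "red"), xlab = "Days", ylab = "Survival probability")

# ggplot2-style survival plot with confidence intervals and a risk table
ggsurvplot(km_fit, data = lung, conf.int = TRUE, risk.table = TRUE)
```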
Obviously, what we have covered above is only the first step toward survival analysis in R. Please refer to the following resources. Also, for specific functions, remember to take advantage of the reference manuals using `?` or `help()`.
Source: Wikimedia Commons
A time series is a set of observations measured sequentially through time. Examples of time series include annual crime count for multiple years and fluctuations in stock prices.
Time series analysis then involves modeling such time series data, often for making forecasts, i.e., predicting what comes next based on the information from the historical records.
Before we take a look at specific time series analysis tools and models, let's take a brief look at some basic concepts.
Decomposition refers to the work of separating a time series into trend, seasonal effects, and remaining variability.
The following plot provides a visual illustration of decomposing a time series into its components:
The notion of seasonality, also called seasonal variation, pertains to similar and recurrent patterns of behavior in data at particular times of the given period. Seasonality is generally annual in period, but it could also be monthly, weekly, daily, or of any other period.
Stationarity is a common assumption underlying many time series analysis techniques. A stationary process is a stochastic process whose mean, covariance, and autocorrelation structure do not change over time.
The following figure illustrates the difference between a stationary time series and a non-stationary one:
Source: “Stationary process”, Wikipedia
Differencing refers to a time series analysis technique used to make a series stationary, de-trend it, and control the autocorrelations.
The `stats` package, which is part of the R "base packages", provides a wide range of useful tools for time series analysis.
There is, of course, no shortage of third-party packages offering additional functionality relevant to time series analysis. Among many, the `forecast` package is a popular choice. Visit the package website to find out more. There is also a full-fledged online textbook using the `forecast` package: Hyndman, R. & Athanasopoulos, G. (2018). Forecasting: Principles and Practice.
Let's take a look at some basic time series analysis techniques and how to implement them in R. Specifically, we will take a look at the following: time series decomposition, ACF and PACF plots, and basic time series models (AR, MA, ARMA, and ARIMA).
A time series \(y_t\) can be decomposed into three components: a trend component, a seasonal component, and a remainder component.
There are two models for decomposing a time series. One is additive and the other is multiplicative.
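Writing \(T_t\), \(S_t\), and \(R_t\) for the trend, seasonal, and remainder components (notation introduced here for illustration), the two models can be expressed as:

\[\text{Additive: } y_t = T_t + S_t + R_t\] \[\text{Multiplicative: } y_t = T_t \times S_t \times R_t\]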
#### `decompose` and `stl` functions

```r
decompose(x, type = c("additive", "multiplicative"), ...)
stl(x, s.window, ...)
```
The `stats` package offers two functions for time series decomposition, each using a different method.
`decompose()` uses moving averages to decompose a time series. Here, the `type` input specifies whether the seasonal component is additive or multiplicative.
`stl()` uses LOESS (local regression), which is a more modern technique for time series decomposition. For that reason, `stl()` is often recommended. The `s.window` input controls the span of the LOESS window for seasonal extraction (it must be odd and at least 7). The decomposition is additive by default, but taking the natural log of the data will account for a multiplicative decomposition.
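A minimal sketch using the built-in `AirPassengers` monthly series, whose seasonal variation is multiplicative:

```r
data("AirPassengers")

# Moving-average decomposition with multiplicative seasonality
dec <- decompose(AirPassengers, type = "multiplicative")
plot(dec)

# LOESS-based decomposition; log-transform to handle multiplicative seasonality
fit <- stl(log(AirPassengers), s.window = "periodic")
plot(fit)
```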
Autocorrelation function (ACF) and Partial autocorrelation function (PACF) plots of a differenced time series offer a heuristic way to choose specifications for ARIMA models, which we will explore shortly.
It is beyond the scope of the current Part to offer a full description of how each function is computed and how to use them as a guide to selecting a proper time series model. However, you can read this “Guide” for how to use ACF and PACF plots for selecting models to fit.
Roughly, an \(\text{AR}(p)\) process shows a PACF that cuts off after lag \(p\) while its ACF tails off gradually, and an \(\text{MA}(q)\) process shows an ACF that cuts off after lag \(q\) while its PACF tails off gradually.
#### `acf` and `pacf` functions

```r
acf(x, lag.max = NULL, plot = TRUE,
    type = c("correlation", "covariance", "partial"), ...)
pacf(x, lag.max = NULL, plot = TRUE, ...)
```
`acf()` and `pacf()` from the `stats` package can generate ACF and PACF plots from the input `x`, a univariate time series.
In `acf()`, `type = "correlation"` is the default and produces an ACF plot. Also, `pacf()` is equivalent to `acf()` with `type = "partial"`.
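For instance, a minimal sketch on the differenced log of `AirPassengers`:

```r
data("AirPassengers")

# Difference the log series to remove the trend and stabilize the variance
x <- diff(log(AirPassengers))

acf(x)   # autocorrelation function plot
pacf(x)  # partial autocorrelation function plot
```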
Here we will take a brief look at the following basic time series models: AR, MA, ARMA, and ARIMA.
\(\{X_t\}\) is an autoregressive process of order \(p\), \(\text{AR}(p)\):
\[X_t = Z_t + \sum_{i=1}^p \phi_i X_{t-i},\] \[\text{rewritten as }\phi(B)X_t = Z_t\] Here, \(Z_t\) is white noise (stationary) with mean 0 and constant variance, \(\phi_i\) is the autoregressive parameter for the \(i^{th}\) order, and \(B\) is the backshift operator. Finally, \(\phi(B)\) can be unpacked as: \(\phi(B) = 1 - \phi_1B - ... - \phi_pB^p\).
\(\{X_t\}\) is a moving average process of order \(q\), \(\text{MA}(q)\):
\[X_t = Z_t + \sum_{i=1}^q \theta_i Z_{t-i},\] \[\text{rewritten as }X_t =\theta(B)Z_t\] Again, \(Z_t\) is white noise. \(\theta(B)\) can be unpacked as: \(\theta(B) = 1 + \theta_1B + ... + \theta_qB^q\)
Combining \(\text{AR}(p)\) and \(\text{MA}(q)\), we get \(\text{ARMA}(p, q)\):
\[\phi(B)X_t =\theta(B)Z_t\]
We can also apply differencing to non-stationary data to achieve the stationarity needed to fit AR, MA, and ARMA models. With \(d^{th}\)-order differences, we get \(\text{ARIMA}(p, d, q)\):
\[\phi(B)(1-B)^dX_t =\theta(B)Z_t\]
#### `arima` function

```r
arima(x, order = c(0L, 0L, 0L), seasonal = list(order, period), ...)
```
`arima()` from the `stats` package is the function used to fit ARIMA univariate time series models. The first input of `arima()` is `x`, a univariate time series.
Then, the `order` input is used to specify \((p, d, q)\) for the ARIMA model, in that order. If \(d\) and \(q\) are both 0, it is an \(\text{AR}(p)\) model. If \(p\) and \(d\) are both 0, it is an \(\text{MA}(q)\) model. If \(d\) equals 0, it is an \(\text{ARMA}(p, q)\) model.
`seasonal` is a specification of the seasonal part of the ARIMA model, consisting of `order` and `period`.
Refer to the reference manual documentation using `?arima` for more details.
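A minimal sketch, fitting a seasonal ARIMA model to the log of `AirPassengers`:

```r
data("AirPassengers")

# ARIMA(1, 1, 1) with a seasonal (0, 1, 1) component for monthly data
fit <- arima(log(AirPassengers), order = c(1, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fit

# Forecast the next 12 months
predict(fit, n.ahead = 12)
```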
The following resource materials can help you dive deeper into time series analysis with R:

- The `stats` and `tseries` packages

Source: R Spatial
Spatial regression analysis is for modeling data with spatial dependence, which refers to the spatial relationship of variable values or locations. Spatial regression then seeks to capture the effect of spatial dependence in statistical models.
Neighbors of a spatial data point are other data points that are proximate to the focal data point. There are various ways to define neighbors:

- Contiguity-based: data points sharing a boundary with a data point `x` are neighbors of `x`.
- Distance-based: a distance criterion (e.g., a fixed radius around `x`, or the \(k\) nearest points) defines which data points count as the neighbors of `x`.
Try this Shiny application by Kyle E. Walker to see how the neighboring structure changes with different definitions.
Spatial weights represent the significance of neighbors. This reflects an intuition that not all neighbors are equally important. However, in practice, weights are often binary: 1 if neighbors and 0 otherwise.
For geospatial data expressed in polygons (e.g., counties), contiguity-based neighbors are used to construct spatial weights.
Currently, there are two major packages to create and manipulate spatial objects: `sp` and `sf`.
`sp` is a mature package, first released on CRAN in 2005. Currently, over 130 packages on CRAN depend on and are built on top of `sp`. However, its objects have un-tidy structures, which often makes them difficult to work with.
`sf` is a newer package, first released on CRAN in 2016. It is an R implementation of the "Simple Features Access" standard (ISO 19125) for geospatial data. Its spatial objects are better suited to the tidy framework, and its user base is growing. So far, however, `sp` remains the dominant framework for spatial data structure and analysis.
#### `spdep` package

The `spdep` package is, to my knowledge, the most comprehensive package for spatial regression analysis in R. Its functions use `sp` spatial objects. To use `spdep` with `sf` objects, they must first be coerced into `sp` spatial objects.
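For instance, a minimal sketch of the coercion, assuming a hypothetical shapefile `counties.shp`:

```r
library(sf)
library(sp)

# Read a (hypothetical) polygon shapefile as an sf object,
# then coerce it to sp's Spatial* classes for use with spdep
counties_sf <- st_read("counties.shp")
counties_sp <- as(counties_sf, "Spatial")
```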
Take a look at the following resource materials for `spdep` and spatial data analysis with R:

- The `spdep` reference manual

The first step of spatial regression analysis involves creating neighbors and a spatial weights matrix for the model specification.
`spdep` offers the `*2nb()` functions to create neighbors (`nb` objects):

- `poly2nb()` for contiguity-based neighbors
- `knn2nb()` for distance-based neighbors
- `tri2nb()` for neighbors based on Delaunay triangulation
- `cell2nb()` for grid neighbors

Then, the `nb2listw()` function is used to generate a list of spatial weights (a `listw` object) from an `nb` object.
#### `poly2nb` function

```r
spdep::poly2nb(pl, row.names = NULL, queen = TRUE, ...)
```
Let's take a look at the `poly2nb` function interface as an example of the `*2nb()` functions.
Here, `pl` is a list of polygons (e.g., of the `SpatialPolygons` class). If `queen` is `TRUE`, polygons with a single shared boundary point are considered neighbors; if `FALSE`, more than one shared point is needed.
#### `nb2listw` function

```r
spdep::nb2listw(neighbours, ...)
```

The `nb2listw` function takes a neighbours object of class `nb` and returns a spatial weights list (`listw`).
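A minimal sketch, reusing the hypothetical `counties_sp` object from above:

```r
library(spdep)

# Contiguity-based (queen) neighbors from the polygons
nb <- poly2nb(counties_sp, queen = TRUE)

# Row-standardized spatial weights list
lw <- nb2listw(nb, style = "W")
summary(nb)
```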
Now we will explore the basics of spatial modeling, including the following: Moran's I test, Lagrange multiplier tests, and the spatial lag and spatial error models.
Moran's I is one of the most commonly used statistics for spatial autocorrelation. It is given in the following form:
\[I = \frac{\boldsymbol{\text{e}}^\text{T}\boldsymbol{\text{W}}\boldsymbol{\text{e}}/S_0}{\boldsymbol{\text{e}}^\text{T}\boldsymbol{\text{e}}/n} = \frac{\boldsymbol{\text{e}}^\text{T}\boldsymbol{\text{W}}\boldsymbol{\text{e}}/S_0}{\hat{\sigma}^2_{ML}},\]
where \(S_0\) is the sum of the weights, and \(\boldsymbol{\text{e}}\) is a vector of OLS residuals.
It is a non-constructive test, which means that no specific model is implied as an alternative hypothesis. The test is formulated as follows:
\[I_z = \frac{I - \text{E}[I]}{\sqrt{\text{Var}[I]}} \sim N(0, 1)\]
#### `moran.test` and `lm.morantest`

```r
spdep::moran.test(x, listw, ...)
spdep::lm.morantest(model, listw, ...)
```
`spdep` offers two functions for conducting Moran's I tests.
`moran.test()` takes a numeric vector of data and a spatial weights list (`listw`) created by `nb2listw()`. On the other hand, `lm.morantest()` takes an `lm` object and a spatial weights list.
A small p-value for the test statistic suggests the presence of spatial autocorrelation that is not accounted for in the ordinary least squares (OLS) model.
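A minimal sketch, assuming hypothetical columns `crime_rate` and `poverty` in the `counties_sp` data:

```r
library(spdep)

# An ordinary least squares model on the (hypothetical) attribute data
ols <- lm(crime_rate ~ poverty, data = counties_sp@data)

moran.test(counties_sp$crime_rate, lw)  # Moran's I on the variable itself
lm.morantest(ols, lw)                   # Moran's I on the OLS residuals
```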
Unlike Moran's I test, Lagrange multiplier (LM) tests for the spatial lag model and the spatial error model have specific models implied in their alternative hypotheses.
In either type of LM test, the null hypothesis is no spatial autocorrelation.
There are also "robust" LM tests, which are used when the LM tests for both the lag model and the error model reject the null.
#### `lm.LMtests` function

```r
spdep::lm.LMtests(model, listw, test = "LMerr")
```
The `lm.LMtests()` function is used for conducting LM tests for various types of spatial models. `lm.LMtests()` takes an `lm` model and a spatial weights list (`listw`).
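A minimal sketch, reusing the hypothetical `ols` model and weights list `lw` from above (the `test` values are explained below):

```r
library(spdep)

# Run the standard and robust LM tests in one call
lm.LMtests(ols, lw, test = c("LMerr", "LMlag", "RLMerr", "RLMlag"))
```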
Then, the `test` input specifies which test to implement. Available inputs for `test` include:

- `"LMerr"` and `"LMlag"` for the spatial error and spatial lag models
- `"RLMerr"` and `"RLMlag"` for the robust LM tests
- `"SARMA"` for the spatial ARMA model

The spatial lag model models spatial autoregression of the response variable. The model is formulated as follows:
\[\boldsymbol{\text{y}} = \rho\boldsymbol{\text{W}}\boldsymbol{\text{y}} + \boldsymbol{\text{X}}\beta + \boldsymbol{\text{u}},\]
where \(\boldsymbol{\text{W}}\boldsymbol{\text{y}}\) is a spatial lag term, \(\rho\) is a spatial autoregressive parameter, and \(\boldsymbol{\text{u}}\) is an error term.
The spatial error model models spatial dependence in the error term, and is expressed in the following form:
\[\boldsymbol{\text{y}} = \boldsymbol{\text{X}}\beta + \boldsymbol{\text{u}}, \text{where}\] \[\boldsymbol{\text{u}} = \lambda\boldsymbol{\text{W}}\boldsymbol{\text{u}} + \varepsilon\] Here, \(\boldsymbol{\text{W}}\) is a weights matrix, \(\lambda\) is a spatial autoregressive parameter for the error term, and \(\varepsilon\) is an idiosyncratic error term.
#### `lagsarlm()` and `errorsarlm()`

```r
spdep::lagsarlm(formula, data, listw, ...)
spdep::errorsarlm(formula, data, listw, ...)
```
These two functions fit spatial lag and spatial error models, respectively. `formula` and `data` work as in `lm()`, and `listw` is a spatial weights list.
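A minimal sketch, again with the hypothetical variables and weights list from above:

```r
library(spdep)

# Spatial lag and spatial error models with the same specification
lag_fit   <- lagsarlm(crime_rate ~ poverty, data = counties_sp@data, listw = lw)
error_fit <- errorsarlm(crime_rate ~ poverty, data = counties_sp@data, listw = lw)
summary(lag_fit)
```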
To learn more about spatial regression analysis in general and its implementation in R, refer to the following materials:
Source: Wikimedia Commons
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.”
- Tom M. Mitchell
Machine learning (ML) is a key factor in the recent success of artificial intelligence (AI) algorithms and applications. ML can be roughly thought of as the automated optimization of model parameters through iteratively reducing the difference between the model outputs and the input data.
The field of machine learning is largely divided into three subfields: supervised learning, unsupervised learning, and reinforcement learning.
The basic machine learning workflow can be roughly divided into the following steps: preparing the data, training the model, evaluating its performance, and making predictions on new data.
Here is a brief list of commonly used machine learning algorithms: linear and logistic regression, k-nearest neighbors, decision trees, random forests, support vector machines, k-means clustering, and neural networks.
It is beyond the scope of the current Part to offer a full description of each algorithm and how it works. However, feel free to leverage the power of the Internet and search online for each algorithm. There is a wealth of resources (articles, books, written and video tutorials, online courses, etc.) available on the subject of machine learning.
R has many packages that offer machine learning algorithms and relevant tools. The following two are among the most comprehensive packages providing a wide range of machine learning functionalities.
#### `caret` package

`caret` is shorthand for "Classification And REgression Training" and is one of the most mature machine learning packages out there. Check out the package vignettes on the CRAN page and the package GitHub repository. There is also an online textbook on using `caret` for machine learning.
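A minimal sketch of the `caret` workflow, training a k-nearest neighbors classifier on the built-in `iris` data with 10-fold cross-validation:

```r
library(caret)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
fit <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl)
fit  # cross-validated accuracy across the tuning grid
```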
#### `mlr` package

`mlr` is an abbreviation of "Machine Learning in R". Take a look at the vignettes on the package CRAN page and its GitHub repository. Also visit the package website, which provides tutorials on the `mlr` API.
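A minimal sketch of the same kind of task in the `mlr` API (define a task, pick a learner, train):

```r
library(mlr)

task <- makeClassifTask(data = iris, target = "Species")  # define the task
learner <- makeLearner("classif.rpart")                   # decision tree learner
model <- train(learner, task)                             # train the model
predict(model, task = task)                               # predict on the task data
```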
To get started with machine learning, you may want to try the following materials. Also, don’t forget to search online for other machine learning materials.