---
title: "Simple example of using largeRCRF"
author: "Joel Therrien & Jiguo Cao"
output: rmarkdown::html_vignette
bibliography: refs.bib
vignette: >
  %\VignetteIndexEntry{Simple example of using largeRCRF}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
This is a quick example of running **largeRCRF** on a dataset, extracting some predictions from it, and calculating a measure of concordance error.

## Source
The dataset originally comes from the *Women's Interagency HIV Study* [@wihs], but was obtained through the **randomForestSRC** [@IshwaranRfsrc] package.

## Background
The *Women's Interagency HIV Study* is a dataset that followed HIV-positive women and recorded when one of three possible competing events occurred for each one:

* The woman began treatment for HIV.
* The woman developed AIDS or died.
* The woman was censored for administrative reasons.

There are four different predictors available (age, history of drug injections, race, and a count of a type of white blood cell).

## Getting the data
```{r}
data(wihs, package = "largeRCRF")
names(wihs)
```
`time` and `status` are the two columns in `wihs` that form the competing risks response, while `ageatfda`, `idu`, `black`, and `cd4nadir` are the predictors we wish to train on.
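
As a quick look at the response, we can tabulate the status codes (a small exploratory sketch; the codes themselves are discussed below, where event 2 corresponds to developing AIDS):

```{r}
# Number of observations recorded under each status code
table(wihs$status)
```
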
We train a forest by calling `train`.
```{r}
library("largeRCRF")
model <- train(CR_Response(status, time) ~ ageatfda + idu + black + cd4nadir,
               data = wihs, splitFinder = LogRankSplitFinder(1:2, 2),
               ntree = 100, numberOfSplits = 0, mtry = 2, nodeSize = 15,
               randomSeed = 15)
```
We specify `splitFinder = LogRankSplitFinder(1:2, 2)`, which indicates that we have event codes 1 to 2 to handle, but that we want to focus on optimizing splits for event 2 (which corresponds to when AIDS develops).

We also specify that we want a forest of 100 trees (`ntree = 100`), that we want to try all possible splits when splitting on a variable (`numberOfSplits = 0`), that we want to try splitting on two predictors at a time (`mtry = 2`), and that terminal nodes should have an average size of at least 15 (`nodeSize = 15`; accomplished by not splitting any node with fewer than 2 $\times$ `nodeSize` observations). `randomSeed = 15` sets a seed so that the results are deterministic; note that **largeRCRF** generates random numbers separately from R and so is not affected by `set.seed()`.
Printing `model` on its own doesn't really do much except print the different components and parameters that made the forest.
```{r}
model
```
Next we'll make predictions on the training data. Since we're using the training data, **largeRCRF** will by default only predict each observation using trees where that observation wasn't included in the bootstrap sample ('out-of-bag' predictions).
```{r}
predictions <- predict(model)
```
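
As a quick sanity check (a small sketch, assuming `predict` returns one prediction per row of the training data), we can confirm that the number of out-of-bag predictions matches the number of observations:

```{r}
# One out-of-bag prediction per training observation
length(predictions) == nrow(wihs)
```
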
Since our data is competing risks data, each prediction is a collection of estimated functions, which can't really be printed on screen. Instead, printing a prediction displays a message listing the functions we can use to extract the estimated survivor curve, the cause-specific cumulative incidence functions (CIFs), or the cause-specific cumulative hazard functions (CHFs).
```{r}
predictions[[1]]
```
Here we extract the cause-specific functions for the AIDS event, as well as the overall survivor curve.
```{r}
aids.cifs <- extractCIF(predictions, event = 2)
aids.chfs <- extractCHF(predictions, event = 2)
survivor.curves <- extractSurvivorCurve(predictions)
```
Now we plot some of the functions that we extracted.
```{r}
curve(aids.cifs[[3]](x), from = 0, to = 8, ylim = c(0, 1),
      type = "S", ylab = "CIF(t)", xlab = "Time (t)")
curve(aids.chfs[[3]](x), from = 0, to = 8,
      type = "S", ylab = "CHF(t)", xlab = "Time (t)")
```
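
Because each extracted element is a function of time, we can also evaluate it directly at chosen time points instead of plotting it. The sketch below assumes the returned step functions accept a vector of times; the observation index and time points are arbitrary:

```{r}
# Out-of-bag survivor curve for the third observation,
# evaluated at a few arbitrary time points
survivor.curves[[3]](c(1, 2, 5, 8))
```
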
Finally, we calculate the naive concordance error on the out-of-bag predictions. `extractMortalities` calculates a measure of mortality by integrating the specified event's cumulative incidence function from 0 to `time`, although users are free to substitute their own measures if desired. `naiveConcordance` then takes the true responses and compares them with the mortality predictions provided, estimating the proportion of wrong predictions for each event as described by @WolbersConcordanceCompetingRisks.
```{r}
mortalities1 <- extractMortalities(predictions, time = 8, event = 1)
mortalities2 <- extractMortalities(predictions, time = 8, event = 2)
naiveConcordance(CR_Response(wihs$status, wihs$time),
                 list(mortalities1, mortalities2))
```
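
As a rough illustration of the mortality measure described above, we can approximate the integral of the AIDS (event 2) CIF ourselves for a single observation and compare it to the value returned by `extractMortalities`. This is only a crude Riemann-sum sketch; it assumes the extracted CIF is the same function used internally and that it can be evaluated at a vector of times:

```{r}
# Crude left Riemann sum of the event-2 CIF from 0 to 8
# for the first observation
delta <- 0.01
grid <- seq(0, 8 - delta, by = delta)
c(approximated = sum(aids.cifs[[1]](grid)) * delta,
  extracted = mortalities2[1])
```
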
We could continue by trying another model to see if we could lower the concordance error, or by integrating the above steps into some tuning algorithm.
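
For example, a very simple tuning loop over `mtry` might look like the following sketch (not run here; it reuses only the functions shown above and assumes `naiveConcordance` returns one error value per event):

```{r, eval = FALSE}
# Refit the forest for each candidate mtry value and report
# the out-of-bag naive concordance errors
for (m in 1:4) {
  candidate <- train(CR_Response(status, time) ~ ageatfda + idu + black + cd4nadir,
                     data = wihs, splitFinder = LogRankSplitFinder(1:2, 2),
                     ntree = 100, numberOfSplits = 0, mtry = m, nodeSize = 15,
                     randomSeed = 15)
  candidate.predictions <- predict(candidate)
  candidate.mortalities <- list(
    extractMortalities(candidate.predictions, time = 8, event = 1),
    extractMortalities(candidate.predictions, time = 8, event = 2)
  )
  print(naiveConcordance(CR_Response(wihs$status, wihs$time),
                         candidate.mortalities))
}
```
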

## References