106 lines
4.7 KiB
Text
106 lines
4.7 KiB
Text
---
|
|
title: "Simple example of using largeRCRF"
|
|
author: "Joel Therrien & Jiguo Cao"
|
|
output: rmarkdown::html_vignette
|
|
bibliography: refs.bib
|
|
vignette: >
|
|
%\VignetteIndexEntry{Vignette Title}
|
|
%\VignetteEngine{knitr::rmarkdown}
|
|
%\VignetteEncoding{UTF-8}
|
|
---
|
|
|
|
```{r setup, include = FALSE}
|
|
knitr::opts_chunk$set(
|
|
collapse = TRUE,
|
|
comment = "#>"
|
|
)
|
|
```
|
|
|
|
This is a quick example of running **largeRCRF** on a dataset, extracting some predictions from it, and calculating a measure of concordance error.
|
|
|
|
## Source
|
|
|
|
The dataset originally comes from the *Women's Interagency HIV Study* [@wihs], but was obtained through the **randomForestSRC** [@IshwaranRfsrc] package.
|
|
|
|
## Background
|
|
|
|
The *Women's Interagency HIV Study* is a dataset that followed HIV positive women and recorded when one of three possible competing events occurred for each one:
|
|
|
|
* The woman began treatment for HIV.
|
|
* The woman developed AIDS or died.
|
|
* The woman was censored for administrative reasons.
|
|
|
|
There are four different predictors available (age, history of drug injections, race, and a blood count of a type of white blood cells).
|
|
|
|
## Getting the data
|
|
|
|
```{r}
|
|
data(wihs, package = "largeRCRF")
|
|
names(wihs)
|
|
```
|
|
|
|
`time` & `status` are two columns in `wihs` corresponding to the competing risks response, while `ageatfda`, `idu`, `black`, and `cd4nadir` are the different predictors we wish to train on.
|
|
|
|
We train a forest by calling `train`.
|
|
|
|
```{r}
|
|
library("largeRCRF")
|
|
model <- train(CR_Response(status, time) ~ ageatfda + idu + black + cd4nadir,
|
|
data = wihs, splitFinder = LogRankSplitFinder(1:2, 2),
|
|
ntree = 100, numberOfSplits = 0, mtry = 2, nodeSize = 15,
|
|
randomSeed = 15)
|
|
```
|
|
|
|
We specify `splitFinder = LogRankSplitFinder(1:2, 2)`, which indicates that we have event codes 1 to 2 to handle, but that we want to focus on optimizing splits for event 2 (which corresponds to when AIDS develops).
|
|
|
|
We specify that we want a forest of 100 trees (`ntree = 100`), that we want to try all possible splits when trying to split on a variable (`numberOfSplits = 0`), that we want to try splitting on two predictors at a time (`mtry = 2`), and that the terminal nodes should have an average size of at minimum 15 (`nodeSize = 15`; accomplished by not splitting any nodes with size less than 2 $\times$ `nodeSize`). `randomSeed = 15` specifies a seed so that the results are deterministic; note that **largeRCRF** generates random numbers separately from R and so is not affected by `set.seed()`.
|
|
|
|
Printing `model` on its own doesn't really do much except print the different components and parameters that made the forest.
|
|
|
|
```{r}
|
|
model
|
|
```
|
|
|
|
Next we'll make predictions on the training data. Since we're using the training data, **largeRCRF** will by default only predict each observation using trees where that observation wasn't included in the bootstrap sample ('out-of-bag' predictions).
|
|
|
|
```{r}
|
|
predictions <- predict(model)
|
|
```
|
|
|
|
Since our data is competing risks data, our responses are several functions which can't really be printed on screen. Instead a message lets us know of several functions which can let us extract the estimate of the survivor curve, the cause-specific cumulative incidence functions, or the cause-specific cumulative hazard functions (CHF).
|
|
|
|
```{r}
|
|
predictions[[1]]
|
|
```
|
|
|
|
|
|
Here we extract the cause-specific functions for the AIDS event, as well as the overall survivor curve.
|
|
|
|
```{r}
|
|
aids.cifs = extractCIF(predictions, event = 2)
|
|
aids.chfs = extractCHF(predictions, event = 2)
|
|
survivor.curves = extractSurvivorCurve(predictions)
|
|
```
|
|
|
|
Now we plot some of the functions that we extracted.
|
|
|
|
```{r}
|
|
curve(aids.cifs[[3]](x), from=0, to=8, ylim=c(0,1),
|
|
type="S", ylab="CIF(t)", xlab="Time (t)")
|
|
|
|
curve(aids.chfs[[3]](x), from=0, to=8,
|
|
type="S", ylab="CHF(t)", xlab="Time (t)")
|
|
```
|
|
|
|
Finally, we calculate the naive concordance error on the out-of-bag predictions. `extractMortalities` calculates a measure of mortality by integrating the specified event's cumulative incidence function from 0 to `time`, although users are free to substitute their own measures if desired. `naiveConcordance` then takes the true responses and compares them with the mortality predictions provided, estimating the proportion of wrong predictions for each event as described by @WolbersConcordanceCompetingRisks.
|
|
|
|
```{r}
|
|
mortalities1 <- extractMortalities(predictions, time = 8, event = 1)
|
|
mortalities2 <- extractMortalities(predictions, time = 8, event = 2)
|
|
naiveConcordance(CR_Response(wihs$status, wihs$time),
|
|
list(mortalities1, mortalities2))
|
|
```
|
|
|
|
We could continue by trying another model to see if we could lower the concordance error, or by integrating the above steps into some tuning algorithm.
|
|
|
|
## References
|