This is a quick example of running **largeRCRF** on a dataset, extracting some predictions from it, and calculating a measure of concordance error.
## Source
The dataset originally comes from the *Women's Interagency HIV Study* [@wihs], but was obtained through the **randomForestSRC** [@IshwaranRfsrc] package.
## Background
The *Women's Interagency HIV Study* is a dataset that followed HIV positive women and recorded when one of three possible competing events occurred for each one:
* The woman began treatment for HIV.
* The woman developed AIDS or died.
* The woman was censored for administrative reasons.
There are four different predictors available (age, history of drug injections, race, and a blood count of a type of white blood cells).
## Getting the data
```{r}
data(wihs, package = "largeRCRF")
names(wihs)
```
`time` & `status` are two columns in `wihs` corresponding to the competing risks response, while `ageatfda`, `idu`, `black`, and `cd4nadir` are the different predictors we wish to train on.
We train a forest by calling `train`.
```{r}
library("largeRCRF")
model <- train(CR_Response(status, time) ~ ageatfda + idu + black + cd4nadir,
data = wihs, splitFinder = LogRankSplitFinder(1:2, 2),
We specify `splitFinder = LogRankSplitFinder(1:2, 2)`, which indicates that we have event codes 1 to 2 to handle, but that we want to focus on optimizing splits for event 2 (which corresponds to when AIDS develops).
We specify that we want a forest of 100 trees (`ntree = 100`), that we want to try all possible splits when trying to split on a variable (`numberOfSplits = 0`), that we want to try splitting on two predictors at a time (`mtry = 2`), and that the terminal nodes should have an average size of at minimum 15 (`nodeSize = 15`; accomplished by not splitting any nodes with size less than 2 $\times$ `nodeSize`). `randomSeed = 15` specifies a seed so that the results are deterministic; note that **largeRCRF** generates random numbers separately from R and so is not affected by `set.seed()`.
Next we'll make predictions on the training data. Since we're using the training data, **largeRCRF** will by default only predict each observation using trees where that observation wasn't included in the bootstrap sample ('out-of-bag' predictions).
Since our data is competing risks data, our responses are several functions which can't be printed on screen. Instead a message lets us know of several functions which can let us extract the estimate of the survivor curve, the cause-specific cumulative incidence functions, or the cause-specific cumulative hazard functions (CHF).
Finally, we calculate the naive concordance error on the out-of-bag predictions. `extractMortalities` calculates a measure of mortality by integrating the specified event's cumulative incidence function from 0 to `time`, although users are free to substitute their own measures if desired. `naiveConcordance` then takes the true responses and compares them with the mortality predictions provided, estimating the proportion of wrong predictions for each event as described by @WolbersConcordanceCompetingRisks.
```{r}
mortalities1 <- extractMortalities(predictions, time = 8, event = 1)
mortalities2 <- extractMortalities(predictions, time = 8, event = 2)
We could continue by trying another model to see if we could lower the concordance error, or by integrating the above steps into some tuning algorithm.