largeRCRF/vignettes/simple-example.Rmd

---
title: "Simple example of using largeRCRF"
author: "Joel Therrien & Jiguo Cao"
output: rmarkdown::html_vignette
bibliography: refs.bib
vignette: >
  %\VignetteIndexEntry{Vignette Title}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This is a quick example of running **largeRCRF** on a dataset, extracting some predictions from it, and calculating a measure of concordance error.

## Source

The dataset originally comes from the *Women's Interagency HIV Study* [@wihs], but was obtained through the **randomForestSRC** [@IshwaranRfsrc] package.

## Background

The *Women's Interagency HIV Study* is a dataset that followed HIV positive women and recorded when one of three possible competing events occurred for each one:

* The woman began treatment for HIV.
* The woman developed AIDS or died.
* The woman was censored for administrative reasons.

There are four different predictors available (age, history of drug injections, race, and a blood count of a type of white blood cells).

## Getting the data

```{r}
data(wihs, package = "largeRCRF")
names(wihs)
```

`time` & `status` are two columns in `wihs` corresponding to the competing risks response, while `ageatfda`, `idu`, `black`, and `cd4nadir` are the different predictors we wish to train on.

We train a forest by calling `train`.

```{r}
library("largeRCRF")
model <- train(CR_Response(status, time) ~ ageatfda + idu + black + cd4nadir,
               data = wihs, splitFinder = LogRankSplitFinder(1:2, 2),
               ntree = 100, numberOfSplits = 0, mtry = 2, nodeSize = 15,
               randomSeed = 15)
```

We specify `splitFinder = LogRankSplitFinder(1:2, 2)`, which indicates that we have event codes 1 to 2 to handle, but that we want to focus on optimizing splits for event 2 (which corresponds to when AIDS develops).

We specify that we want a forest of 100 trees (`ntree = 100`), that we want to try all possible splits when trying to split on a variable (`numberOfSplits = 0`), that we want to try splitting on two predictors at a time (`mtry = 2`), and that the terminal nodes should have an average size of at minimum 15 (`nodeSize = 15`; accomplished by not splitting any nodes with size less than 2 $\times$ `nodeSize`). `randomSeed = 15` specifies a seed so that the results are deterministic; note that **largeRCRF** generates random numbers separately from R and so is not affected by `set.seed()`.

Printing `model` on its own doesn't do much except print the different components and parameters that made the forest.

```{r}
model
```

Next we'll make predictions on the training data. Since we're using the training data, **largeRCRF** will by default only predict each observation using trees where that observation wasn't included in the bootstrap sample ('out-of-bag' predictions).

```{r}
predictions <- predict(model)
```

Since our data is competing risks data, our responses are several functions which can't be printed on screen. Instead a message lets us know of several functions which can let us extract the estimate of the survivor curve, the cause-specific cumulative incidence functions, or the cause-specific cumulative hazard functions (CHF).

```{r}
predictions[[1]]
```


Here we extract the cause-specific functions for the AIDS event, as well as the overall survivor curve.

```{r}
aids.cifs = extractCIF(predictions, event = 2)
aids.chfs = extractCHF(predictions, event = 2)
survivor.curves = extractSurvivorCurve(predictions)
```

Now we plot some of the functions that we extracted.

```{r}
curve(aids.cifs[[3]](x), from=0, to=8, ylim=c(0,1),
       type="S", ylab="CIF(t)", xlab="Time (t)")

curve(aids.chfs[[3]](x), from=0, to=8,
       type="S", ylab="CHF(t)", xlab="Time (t)")
```

Finally, we calculate the naive concordance error on the out-of-bag predictions. `extractMortalities` calculates a measure of mortality by integrating the specified event's cumulative incidence function from 0 to `time`, although users are free to substitute their own measures if desired. `naiveConcordance` then takes the true responses and compares them with the mortality predictions provided, estimating the proportion of wrong predictions for each event as described by @WolbersConcordanceCompetingRisks.

```{r}
mortalities1 <- extractMortalities(predictions, time = 8, event = 1)
mortalities2 <- extractMortalities(predictions, time = 8, event = 2)
naiveConcordance(CR_Response(wihs$status, wihs$time),
               list(mortalities1, mortalities2))
```

We could continue by trying another model to see if we could lower the concordance error, or by integrating the above steps into some tuning algorithm.

## References