Include wihs data and vignette into package

Joel Therrien 2019-07-05 11:56:53 -07:00
parent ec4ef7ea44
commit 0cd20225ce
10 changed files with 400 additions and 5 deletions


@@ -1,3 +1,5 @@
^Meta$
^doc$
^.*\.Rproj$
^\.Rproj\.user$
copyJar

3
.gitignore vendored

@@ -1,3 +1,6 @@
Meta
doc
inst/doc
*.Rproj
.Rproj.user
copyJar


@@ -1,7 +1,7 @@
Package: largeRCRF
Type: Package
Title: Large Random Competing Risks Forests
Version: 1.0.2
Version: 1.0.3
Authors@R: c(
person("Joel", "Therrien", email = "joel_therrien@sfu.ca", role = c("aut", "cre", "cph")),
person("Jiguo", "Cao", email = "jiguo_cao@sfu.ca", role = c("aut", "dgs"))
@@ -18,7 +18,10 @@ Imports:
rJava (>= 0.9-9)
Suggests:
parallel,
testthat
testthat,
knitr,
rmarkdown
Depends: R (>= 3.4.0)
SystemRequirements: Java JDK 1.8 or higher
RoxygenNote: 6.1.1
VignetteBuilder: knitr

23
R/wihs.R Normal file

@@ -0,0 +1,23 @@
#' Women's Interagency HIV Study
#'
#' A dataset containing competing risks information for women with HIV,
#' recording the time to treatment, or the time to developing AIDS or death. The
#' time may also be censored.
#'
#' @format A data frame with 1164 rows and 6 variables: \describe{
#' \item{time}{time to the event} \item{status}{denotes which event occurred.
#' 0 denotes censoring, 1 denotes HIV treatment began, and 2 denotes AIDS
#' developed or the patient died} \item{ageatfda}{patient age at time first
#' treatment approved} \item{idu}{binary specifying if the patient has a
#' history of drug injections (1 if true)} \item{black}{binary specifying if
#' the patient is black (1 if true)} \item{cd4nadir}{blood count of CD4 cells}
#' }
#' @source The data was obtained from the randomForestSRC R package.
#'
#' @references Bacon MC, von Wyl V, Alden C, Sharp G, Robison E, Hessol N, Gange
#' S, Barranday Y, Holman S, Weber K, Young MA (2005). “The Women's
#' Interagency HIV Study: an Observational Cohort Brings Clinical Sciences to
#' the Bench.” Clinical and Vaccine Immunology, 12(9), 1013-1019.
#' doi:10.1128/CDLI.12.9.1013-1019.2005.
#'
"wihs"


@@ -3,12 +3,13 @@
This R package is used to train random competing risks forests, ideally for large data.
It's based heavily on [randomForestSRC](https://github.com/kogalur/randomForestSRC/), although there are some differences.
This package is still in a pre-release state and so is not yet available on CRAN.
To install it now, in R install the `devtools` package and run the following command:
This package is not yet on CRAN; in the meantime, install it by using the `devtools` package to run the following command:
```
R> devtools::install_git("https://github.com/jatherrien/largeRCRF.git")
```
If you want the vignettes and have the packages needed to build them, you can pass `build_vignettes = TRUE` as an additional argument to the command above.
## System Requirements
You need:

BIN
data/wihs.rda Normal file

Binary file not shown.

33
man/wihs.Rd Normal file

@@ -0,0 +1,33 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/wihs.R
\docType{data}
\name{wihs}
\alias{wihs}
\title{Women's Interagency HIV Study}
\format{A data frame with 1164 rows and 6 variables: \describe{
\item{time}{time to the event} \item{status}{denotes which event occurred.
0 denotes censoring, 1 denotes HIV treatment began, and 2 denotes AIDS
developed or the patient died} \item{ageatfda}{patient age at time first
treatment approved} \item{idu}{binary specifying if the patient has a
history of drug injections (1 if true)} \item{black}{binary specifying if
the patient is black (1 if true)} \item{cd4nadir}{blood count of CD4 cells}
}}
\source{
The data was obtained from the randomForestSRC R package.
}
\usage{
wihs
}
\description{
A dataset containing competing risks information for women with HIV,
recording the time to treatment, or the time to developing AIDS or death. The
time may also be censored.
}
\references{
Bacon MC, von Wyl V, Alden C, Sharp G, Robison E, Hessol N, Gange
S, Barranday Y, Holman S, Weber K, Young MA (2005). “The Women's
Interagency HIV Study: an Observational Cohort Brings Clinical Sciences to
the Bench.” Clinical and Vaccine Immunology, 12(9), 1013-1019.
doi:10.1128/CDLI.12.9.1013-1019.2005.
}
\keyword{datasets}

2
vignettes/.gitignore vendored Normal file

@@ -0,0 +1,2 @@
*.html
*.R

222
vignettes/refs.bib Normal file

@@ -0,0 +1,222 @@
% TODO - read me
@Article{AalenJohansenCIFs,
URL = {http://www.jstor.org/stable/4615704},
author = {Odd O. Aalen and Søren Johansen},
journal = {Scandinavian Journal of Statistics},
number = {3},
pages = {141--150},
title = {An Empirical Transition Matrix for Non-Homogeneous Markov Chains Based on Censored Observations},
volume = {5},
year = {1978}
}
@Article{Breiman2001,
author="Breiman, Leo",
title="Random Forests",
journal="Machine Learning",
year="2001",
month="Oct",
day="01",
volume="45",
number="1",
pages="5--32",
doi="10.1023/A:1010933404324"
}
@article{IshwaranCompetingRisks,
author = {Ishwaran, Hemant and Gerds, Thomas A. and Kogalur, Udaya B. and Moore, Richard D. and Gange, Stephen J. and Lau, Bryan M.},
title = {Random Survival Forests for Competing Risks},
journal = {Biostatistics},
volume = {15},
number = {4},
pages = {757--773},
year = {2014},
doi = {10.1093/biostatistics/kxu010}
}
% TODO - need DOI
@Article{IshwaranSurvivalR,
title = {Random Survival Forests for \proglang{R}},
author = {H. Ishwaran and Udaya B. Kogalur},
journal = {\proglang{R} News},
year = {2007},
volume = {7},
number = {2},
pages = {25--31},
month = {10},
url = {https://CRAN.R-project.org/doc/Rnews/},
pdf = {https://CRAN.R-project.org/doc/Rnews/Rnews_2007-2.pdf}
}
@Manual{IshwaranRfsrc,
title = {Random Forests for Survival, Regression, and Classification (RF-SRC)},
author = {H. Ishwaran and Udaya B. Kogalur},
publisher = {manual},
year = {2018},
note = {\proglang{R} package version 2.8.0},
url = {https://cran.r-project.org/package=randomForestSRC},
pdf = {https://cran.r-project.org/web/packages/randomForestSRC/randomForestSRC.pdf},
}
@Article{IshwaranSurvival,
author = "Ishwaran, Hemant and Kogalur, Udaya B. and Blackstone, Eugene H. and Lauer, Michael S.",
doi = "10.1214/08-AOAS169",
journal = "The Annals of Applied Statistics",
month = "09",
number = "3",
pages = "841--860",
publisher = "The Institute of Mathematical Statistics",
title = "Random Survival Forests",
volume = "2",
year = "2008"
}
% TODO - read me
@Article{KaplanMeierCurve,
author = {E. L. Kaplan and Paul Meier },
title = {Nonparametric Estimation from Incomplete Observations},
journal = {Journal of the American Statistical Association},
volume = {53},
number = {282},
pages = {457--481},
year = {1958},
publisher = {Taylor \& Francis},
doi = {10.1080/01621459.1958.10501452}
}
% Note; the exported citation did not include year.
@Article{FeiIshwaranMissingData,
author = {Tang, Fei and Ishwaran, Hemant},
title = {Random Forest Missing Data Algorithms},
journal = {Statistical Analysis and Data Mining: The ASA Data Science Journal},
volume = {10},
number = {6},
pages = {363--377},
year = {2017},
keywords = {correlation, imputation, machine learning, missingness, splitting (random, univariate, multivariate, unsupervised)},
doi = {10.1002/sam.11348}
}
@Manual{rJava,
title = {\pkg{rJava}: Low-Level \proglang{R} to \proglang{Java} Interface},
author = {Simon Urbanek},
year = {2018},
note = {\proglang{R} package version 0.9-10},
url = {https://CRAN.R-project.org/package=rJava},
}
@Book{ggplot2,
author = {Hadley Wickham},
title = {\pkg{ggplot2}: Elegant Graphics for Data Analysis},
publisher = {Springer-Verlag},
year = {2016},
isbn = {978-3-319-24277-4},
url = {http://ggplot2.org},
}
@Manual{RCitation,
title = {\proglang{R}: A Language and Environment for Statistical Computing},
author = {\proglang{R} Core Team},
organization = {\proglang{R} Foundation for Statistical Computing},
address = {Vienna, Austria},
year = {2018},
url = {https://www.R-project.org/},
}
% %author = {Wolbers, Marcel and Blanche, Paul and T Koller, Michael and C M Witteman, Jacqueline and Gerds, Thomas}, Adjusted field for better adjustments
% excluded eprint = {http://oup.prod.sis.lan/biostatistics/article-pdf/15/3/526/599536/kxt059.pdf}, as the URL doesn't work
% excluded url = {https://doi.org/10.1093/biostatistics/kxt059}, as it's redundant on doi
@article{WolbersConcordanceCompetingRisks,
author = {Wolbers, Marcel and Blanche, Paul and Koller, Michael T. and Witteman, Jacqueline C M and Gerds, Thomas A},
title = {Concordance for Prognostic Models with Competing Risks},
journal = {Biostatistics},
volume = {15},
number = {3},
pages = {526--539},
year = {2014},
month = {02},
doi = {10.1093/biostatistics/kxt059}
}
@article{NelsonAalenEstimator1,
author = {Wayne Nelson},
title = {Theory and Applications of Hazard Plotting for Censored Failure Data},
journal = {Technometrics},
volume = {14},
number = {4},
pages = {945--966},
year = {1972},
publisher = {Taylor \& Francis},
doi = {10.1080/00401706.1972.10488991}
}
@article{NelsonAalenEstimator2,
URL = {http://www.jstor.org/stable/2958850},
author = {Odd O. Aalen},
journal = {The Annals of Statistics},
number = {4},
pages = {701--726},
publisher = {Institute of Mathematical Statistics},
title = {Nonparametric Inference for a Family of Counting Processes},
volume = {6},
year = {1978}
}
% Not used
@article{BrierScore,
author = {Glenn W. Brier},
title = {Verification of Forecasts Expressed in Terms of Probability},
journal = {Monthly Weather Review},
volume = {78},
number = {1},
pages = {1--3},
year = {1950},
doi = {10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2}
}
@article {wihs,
author = {Bacon, Melanie C. and von Wyl, Viktor and Alden, Christine and Sharp, Gerald and Robison, Esther and Hessol, Nancy and Gange, Stephen and Barranday, Yvonne and Holman, Susan and Weber, Kathleen and Young, Mary A.},
title = {The Women{\textquoteright}s Interagency HIV Study: an Observational Cohort Brings Clinical Sciences to the Bench},
volume = {12},
number = {9},
pages = {1013--1019},
year = {2005},
doi = {10.1128/CDLI.12.9.1013-1019.2005},
publisher = {American Society for Microbiology Journals},
journal = {Clinical and Vaccine Immunology}
}
@article{FineAndGrayProportional,
journal = {Journal of the American Statistical Association},
pages = {496--509},
volume = {94},
publisher = {Taylor \& Francis Group},
number = {446},
year = {1999},
title = {A Proportional Hazards Model for the Subdistribution of a Competing Risk},
author = {Fine, Jason P. and Gray, Robert J.}
}
% Bibtex has a lot of difficulty with the UTF-8 character that \O produces; however if I let Bibtex try and format the author section it strips out the '\' from O, so I had to manually format this
@book{survival_event_history_book,
title={Survival and Event History Analysis: A Process Point of View},
publisher={Springer-Verlag},
author={{Aalen OO, Borgan \O, Gjessing HK}},
year={2008},
doi={10.1007/978-0-387-68560-1},
isbn={978-0-387-20287-7}
}
@book{methods_for_lifetime_data_book,
title={Statistical Models and Methods for Lifetime Data},
publisher={John Wiley \& Sons},
author={Jerald F. Lawless},
year={2002},
doi={10.1002/9781118033005},
isbn={978-0-471-37215-8}
}


@@ -0,0 +1,106 @@
---
title: "Simple example of using largeRCRF"
author: "Joel Therrien & Jiguo Cao"
output: rmarkdown::html_vignette
bibliography: refs.bib
vignette: >
  %\VignetteIndexEntry{Simple example of using largeRCRF}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
This is a quick example of running **largeRCRF** on a dataset, extracting some predictions from it, and calculating a measure of concordance error.
## Source
The dataset originally comes from the *Women's Interagency HIV Study* [@wihs], but was obtained through the **randomForestSRC** [@IshwaranRfsrc] package.
## Background
The *Women's Interagency HIV Study* followed HIV-positive women and recorded which of three possible outcomes occurred for each one:
* The woman began treatment for HIV.
* The woman developed AIDS or died.
* The woman was censored for administrative reasons.
There are four predictors available: age, history of drug injections, race, and a blood count of CD4 cells (a type of white blood cell).
## Getting the data
```{r}
data(wihs, package = "largeRCRF")
names(wihs)
```
`time` and `status` are the two columns in `wihs` that make up the competing risks response, while `ageatfda`, `idu`, `black`, and `cd4nadir` are the predictors we wish to train on.
We train a forest by calling `train`.
```{r}
library("largeRCRF")
model <- train(CR_Response(status, time) ~ ageatfda + idu + black + cd4nadir,
               data = wihs, splitFinder = LogRankSplitFinder(1:2, 2),
               ntree = 100, numberOfSplits = 0, mtry = 2, nodeSize = 15,
               randomSeed = 15)
```
We specify `splitFinder = LogRankSplitFinder(1:2, 2)`, which indicates that there are two event codes to handle (1 and 2), but that splits should be optimized for event 2 (which corresponds to AIDS developing or the patient dying).
We also specify that we want a forest of 100 trees (`ntree = 100`), that all possible split points should be tried when splitting on a variable (`numberOfSplits = 0`), that two predictors should be tried at each split (`mtry = 2`), and that terminal nodes should have an average size of at least 15 (`nodeSize = 15`; accomplished by not splitting any node with size less than 2 $\times$ `nodeSize`). `randomSeed = 15` sets a seed so that the results are reproducible; note that **largeRCRF** generates random numbers separately from R and so is not affected by `set.seed()`.
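The `nodeSize` rule mentioned above (a node is not split when it holds fewer than 2 $\times$ `nodeSize` observations) can be illustrated with a minimal sketch. The helper `can_split` is hypothetical and written only for this illustration; it is not a function in **largeRCRF**, and the package's internal check may differ in detail.

```r
# Hypothetical helper, not part of largeRCRF: a node is only eligible
# for splitting when it holds at least 2 * node_size observations, so
# that the two children can hold node_size observations on average.
can_split <- function(n_obs, node_size) {
  n_obs >= 2 * node_size
}

# Made-up node sizes, checked against nodeSize = 15
node_sizes <- c(29, 30, 45)
eligible <- can_split(node_sizes, node_size = 15)
print(eligible)
```

With `nodeSize = 15`, a node of 29 observations would be left as a terminal node, while nodes of 30 or 45 observations would still be candidates for splitting.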
Printing `model` on its own only lists the components and parameters that were used to build the forest.
```{r}
model
```
Next we'll make predictions on the training data. Since we're using the training data, **largeRCRF** will by default only predict each observation using trees where that observation wasn't included in the bootstrap sample ('out-of-bag' predictions).
```{r}
predictions <- predict(model)
```
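To make the out-of-bag idea concrete, here is a small base R sketch of the bookkeeping involved. It is an illustration under stated assumptions, not how **largeRCRF** is implemented: the variable names are made up, the sample sizes are tiny, and it uses R's own random number generator, whereas **largeRCRF** draws its random numbers separately.

```r
# Base R sketch of out-of-bag bookkeeping; illustrative only.
set.seed(1)
n_obs <- 10   # made-up number of training observations
n_tree <- 5   # made-up number of trees

# One bootstrap sample (row indices drawn with replacement) per tree
in_bag <- replicate(n_tree, sample.int(n_obs, n_obs, replace = TRUE),
                    simplify = FALSE)

# For each observation, the trees whose bootstrap sample did not contain it
oob_trees <- lapply(seq_len(n_obs), function(i) {
  which(!vapply(in_bag, function(rows) i %in% rows, logical(1)))
})

# An out-of-bag prediction for observation 1 would use only these trees
oob_trees[[1]]
```

Each observation is then predicted only by trees that never saw it during training, which is why out-of-bag predictions give a less optimistic picture of error than predicting the training data with the whole forest.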
Since our data is competing risks data, each prediction consists of several functions, which can't easily be printed on screen. Instead, printing a prediction shows a message listing the functions that extract the estimate of the survivor curve, the cause-specific cumulative incidence functions (CIF), or the cause-specific cumulative hazard functions (CHF).
```{r}
predictions[[1]]
```
Here we extract the cause-specific functions for the AIDS event, as well as the overall survivor curve.
```{r}
aids.cifs <- extractCIF(predictions, event = 2)
aids.chfs <- extractCHF(predictions, event = 2)
survivor.curves <- extractSurvivorCurve(predictions)
```
Now we plot some of the functions that we extracted.
```{r}
curve(aids.cifs[[3]](x), from=0, to=8, ylim=c(0,1),
      type="S", ylab="CIF(t)", xlab="Time (t)")
curve(aids.chfs[[3]](x), from=0, to=8,
      type="S", ylab="CHF(t)", xlab="Time (t)")
```
Finally, we calculate the naive concordance error on the out-of-bag predictions. `extractMortalities` calculates a measure of mortality by integrating the specified event's cumulative incidence function from 0 to `time`, although users are free to substitute their own measures if desired. `naiveConcordance` then takes the true responses and compares them with the mortality predictions provided, estimating the proportion of wrong predictions for each event as described by @WolbersConcordanceCompetingRisks.
```{r}
mortalities1 <- extractMortalities(predictions, time = 8, event = 1)
mortalities2 <- extractMortalities(predictions, time = 8, event = 2)
naiveConcordance(CR_Response(wihs$status, wihs$time),
                 list(mortalities1, mortalities2))
```
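The mortality measure described above, the integral of a cumulative incidence function from 0 to `time`, can be sketched in base R for a single made-up step function. This is only an illustration of the idea: `toy_cif` and `integrate_step` are hypothetical names, the CIF values are invented, and `extractMortalities` may compute the integral differently internally.

```r
# Toy cumulative incidence function: a right-continuous step function that
# jumps to 0.1, 0.3 and 0.4 at times 1, 3 and 5 (made-up values).
toy_cif <- stepfun(x = c(1, 3, 5), y = c(0, 0.1, 0.3, 0.4))

# Hypothetical helper: exact integral of a step function from 0 to `upper`
integrate_step <- function(f, upper) {
  jump_times <- knots(f)
  cuts <- c(0, jump_times[jump_times > 0 & jump_times < upper], upper)
  # f is constant on each [cuts[k], cuts[k + 1]), so evaluate at left ends
  sum(f(head(cuts, -1)) * diff(cuts))
}

# Mortality up to time 8: 0*1 + 0.1*2 + 0.3*2 + 0.4*3 = 2
toy_mortality <- integrate_step(toy_cif, upper = 8)
toy_mortality
```

An observation whose CIF rises earlier or higher accumulates more area and so receives a larger mortality, which is what allows the measure to be used for ranking observations in a concordance calculation.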
We could continue by trying another model to see if we could lower the concordance error, or by integrating the above steps into some tuning algorithm.
## References