diff --git a/.Rbuildignore b/.Rbuildignore
index 62639ed..8f98485 100644
--- a/.Rbuildignore
+++ b/.Rbuildignore
@@ -1,4 +1,6 @@
+^Meta$
+^doc$
 ^.*\.Rproj$
 ^\.Rproj\.user$
 copyJar
-.gitignore
\ No newline at end of file
+.gitignore
diff --git a/.gitignore b/.gitignore
index 861651b..6795a17 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,6 @@
+Meta
+doc
+inst/doc
 *.Rproj
 .Rproj.user
 copyJar
diff --git a/DESCRIPTION b/DESCRIPTION
index 48a339e..20d8762 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: largeRCRF
 Type: Package
 Title: Large Random Competing Risks Forests
-Version: 1.0.2
+Version: 1.0.3
 Authors@R: c(
  person("Joel", "Therrien", email = "joel_therrien@sfu.ca", role = c("aut", "cre", "cph")),
  person("Jiguo", "Cao", email = "jiguo_cao@sfu.ca", role = c("aut", "dgs"))
@@ -18,7 +18,10 @@ Imports:
  rJava (>= 0.9-9)
 Suggests:
  parallel,
- testthat
+ testthat,
+ knitr,
+ rmarkdown
 Depends: R (>= 3.4.0)
 SystemRequirements: Java JDK 1.8 or higher
 RoxygenNote: 6.1.1
+VignetteBuilder: knitr
diff --git a/R/wihs.R b/R/wihs.R
new file mode 100644
index 0000000..cd97b32
--- /dev/null
+++ b/R/wihs.R
@@ -0,0 +1,23 @@
+#' Women's Interagency HIV Study
+#'
+#' A dataset containing competing risks information for women with HIV;
+#' recording the time to treatment, or the time to developing AIDS or death. The
+#' time may also be censored.
+#'
+#' @format A data frame with 1164 rows and 6 variables: \describe{
+#' \item{time}{time to the event} \item{status}{denotes which event occurred.
+#' 0 denotes censoring, 1 denotes HIV treatment began, and 2 denotes AIDS
+#' developed or the patient died} \item{ageatfda}{patient age at time first
+#' treatment approved} \item{idu}{binary specifying if the patient has a
+#' history of drug injections (1 if true)} \item{black}{binary specifying if
+#' the patient is black (1 if true)} \item{cd4nadir}{blood count of CD4 cells}
+#' }
+#' @source The data was obtained from the randomForestSRC R package.
+#'
+#' @references Bacon MC, von Wyl V, Alden C, Sharp G, Robison E, Hessol N, Gange
+#' S, Barranday Y, Holman S, Weber K, Young MA (2005). “The Women’s
+#' Interagency HIV Study: an Observational Cohort Brings Clinical Sciences to
+#' the Bench.” Clinical and Vaccine Immunology, 12(9), 1013–1019.
+#' doi:10.1128/CDLI.12.9.1013-1019.2005.
+#'
+"wihs"
\ No newline at end of file
diff --git a/README.md b/README.md
index f959099..d6bbeb1 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,13 @@
 This R package is used to train random competing risks forests, ideally for large data.
 It's based heavily off of [randomForestSRC](https://github.com/kogalur/randomForestSRC/), although there are some differences.
 
-This package is still in a pre-release state and so it not yet available on CRAN.
-To install it now, in R install the `devtools` package and run the following command:
+This package is not yet on CRAN, so in the meantime to install it use the `devtools` package and run the following command:
 
 ```
 R> devtools::install_git("https://github.com/jatherrien/largeRCRF.git")
 ```
+If you care about vignettes and have the packages available to build them you can include `build_vignettes = TRUE` as a parameter in the command above.
+
 ## System Requirements
 
 You need:
diff --git a/data/wihs.rda b/data/wihs.rda
new file mode 100644
index 0000000..5ac00d8
Binary files /dev/null and b/data/wihs.rda differ
diff --git a/man/wihs.Rd b/man/wihs.Rd
new file mode 100644
index 0000000..fad12bc
--- /dev/null
+++ b/man/wihs.Rd
@@ -0,0 +1,33 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/wihs.R
+\docType{data}
+\name{wihs}
+\alias{wihs}
+\title{Women's Interagency HIV Study}
+\format{A data frame with 1164 rows and 6 variables: \describe{
+ \item{time}{time to the event} \item{status}{denotes which event occurred.
+ 0 denotes censoring, 1 denotes HIV treatment began, and 2 denotes AIDS
+ developed or the patient died} \item{ageatfda}{patient age at time first
+ treatment approved} \item{idu}{binary specifying if the patient has a
+ history of drug injections (1 if true)} \item{black}{binary specifying if
+ the patient is black (1 if true)} \item{cd4nadir}{blood count of CD4 cells}
+ }}
+\source{
+The data was obtained from the randomForestSRC R package.
+}
+\usage{
+wihs
+}
+\description{
+A dataset containing competing risks information for women with HIV;
+recording the time to treatment, or the time to developing AIDS or death. The
+time may also be censored.
+}
+\references{
+Bacon MC, von Wyl V, Alden C, Sharp G, Robison E, Hessol N, Gange
+ S, Barranday Y, Holman S, Weber K, Young MA (2005). “The Women’s
+ Interagency HIV Study: an Observational Cohort Brings Clinical Sciences to
+ the Bench.” Clinical and Vaccine Immunology, 12(9), 1013–1019.
+ doi:10.1128/CDLI.12.9.1013-1019.2005.
+}
+\keyword{datasets}
diff --git a/vignettes/.gitignore b/vignettes/.gitignore
new file mode 100644
index 0000000..097b241
--- /dev/null
+++ b/vignettes/.gitignore
@@ -0,0 +1,2 @@
+*.html
+*.R
diff --git a/vignettes/refs.bib b/vignettes/refs.bib
new file mode 100644
index 0000000..a557d4d
--- /dev/null
+++ b/vignettes/refs.bib
@@ -0,0 +1,222 @@
+% TODO - read me
+@Article{AalenJohansenCIFs,
+ URL = {http://www.jstor.org/stable/4615704},
+ author = {Odd O. Aalen and Søren Johansen},
+ journal = {Scandinavian Journal of Statistics},
+ number = {3},
+ pages = {141--150},
+ title = {An Empirical Transition Matrix for Non-Homogeneous Markov Chains Based on Censored Observations},
+ volume = {5},
+ year = {1978}
+}
+
+@Article{Breiman2001,
+ author="Breiman, Leo",
+ title="Random Forests",
+ journal="Machine Learning",
+ year="2001",
+ month="Oct",
+ day="01",
+ volume="45",
+ number="1",
+ pages="5--32",
+ doi="10.1023/A:1010933404324"
+}
+
+@article{IshwaranCompetingRisks,
+ author = {Ishwaran, Hemant and Gerds, Thomas A. and Kogalur, Udaya B. and Moore, Richard D. and Gange, Stephen J. and Lau, Bryan M.},
+ title = {Random Survival Forests for Competing Risks},
+ journal = {Biostatistics},
+ volume = {15},
+ number = {4},
+ pages = {757-773},
+ year = {2014},
+ doi = {10.1093/biostatistics/kxu010}
+}
+
+
+% TODO - need DOI
+@Article{IshwaranSurvivalR,
+ title = {Random Survival Forests for \proglang{R}},
+ author = {H. Ishwaran and Udaya B. Kogalur},
+ journal = {\proglang{R} News},
+ year = {2007},
+ volume = {7},
+ number = {2},
+ pages = {25--31},
+ month = {10},
+ url = {https://CRAN.R-project.org/doc/Rnews/},
+ pdf = {https://CRAN.R-project.org/doc/Rnews/Rnews_2007-2.pdf}
+}
+
+@Manual{IshwaranRfsrc,
+ title = {Random Forests for Survival, Regression, and Classification (RF-SRC)},
+ author = {H. Ishwaran and Udaya B. Kogalur},
+ publisher = {manual},
+ year = {2018},
+ note = {\proglang{R} package version 2.8.0},
+ url = {https://cran.r-project.org/package=randomForestSRC},
+ pdf = {https://cran.r-project.org/web/packages/randomForestSRC/randomForestSRC.pdf},
+}
+
+@Article{IshwaranSurvival,
+ author = "Ishwaran, Hemant and Kogalur, Udaya B. and Blackstone, Eugene H. and Lauer, Michael S.",
+ doi = "10.1214/08-AOAS169",
+ journal = "The Annals of Applied Statistics",
+ month = "09",
+ number = "3",
+ pages = "841--860",
+ publisher = "The Institute of Mathematical Statistics",
+ title = "Random Survival Forests",
+ volume = "2",
+ year = "2008"
+}
+
+
+% TODO - read me
+@Article{KaplanMeierCurve,
+ author = {E. L. Kaplan and Paul Meier },
+ title = {Nonparametric Estimation from Incomplete Observations},
+ journal = {Journal of the American Statistical Association},
+ volume = {53},
+ number = {282},
+ pages = {457-481},
+ year = {1958},
+ publisher = {Taylor \& Francis},
+ doi = {10.1080/01621459.1958.10501452}
+}
+
+
+% Note; the exported citation did not include year.
+@Article{FeiIshwaranMissingData,
+ author = {Tang, Fei and Ishwaran, Hemant},
+ title = {Random Forest Missing Data Algorithms},
+ journal = {Statistical Analysis and Data Mining: The ASA Data Science Journal},
+ volume = {10},
+ number = {6},
+ pages = {363-377},
+ year = {2017},
+ keywords = {correlation, imputation, machine learning, missingness, splitting (random, univariate, multivariate, unsupervised)},
+ doi = {10.1002/sam.11348}
+}
+
+@Manual{rJava,
+ title = {\pkg{rJava}: Low-Level \proglang{R} to \proglang{Java} Interface},
+ author = {Simon Urbanek},
+ year = {2018},
+ note = {\proglang{R} package version 0.9-10},
+ url = {https://CRAN.R-project.org/package=rJava},
+}
+
+@Book{ggplot2,
+ author = {Hadley Wickham},
+ title = {\pkg{ggplot2}: Elegant Graphics for Data Analysis},
+ publisher = {Springer-Verlag},
+ year = {2016},
+ isbn = {978-3-319-24277-4},
+ url = {http://ggplot2.org},
+}
+
+
+@Manual{RCitation,
+ title = {\proglang{R}: A Language and Environment for Statistical Computing},
+ author = {\proglang{R} Core Team},
+ organization = {\proglang{R} Foundation for Statistical Computing},
+ address = {Vienna, Austria},
+ year = {2018},
+ url = {https://www.R-project.org/},
+}
+
+% %author = {Wolbers, Marcel and Blanche, Paul and T Koller, Michael and C M Witteman, Jacqueline and Gerds, Thomas}, Adjusted field for better adjustments
+% excluded eprint = {http://oup.prod.sis.lan/biostatistics/article-pdf/15/3/526/599536/kxt059.pdf}, as the URL doesn't work
+% excluded url = {https://doi.org/10.1093/biostatistics/kxt059}, as it's redundant on doi
+@article{WolbersConcordanceCompetingRisks,
+ author = {Wolbers, Marcel and Blanche, Paul and Koller, Michael T. and Witteman, Jacqueline C M and Gerds, Thomas A},
+ title = {Concordance for Prognostic Models with Competing Risks},
+ journal = {Biostatistics},
+ volume = {15},
+ number = {3},
+ pages = {526-539},
+ year = {2014},
+ month = {02},
+ doi = {10.1093/biostatistics/kxt059}
+}
+
+@article{NelsonAalenEstimator1,
+ author = {Wayne Nelson},
+ title = {Theory and Applications of Hazard Plotting for Censored Failure Data},
+ journal = {Technometrics},
+ volume = {14},
+ number = {4},
+ pages = {945-966},
+ year = {1972},
+ publisher = {Taylor & Francis},
+ doi = {10.1080/00401706.1972.10488991}
+}
+
+@article{NelsonAalenEstimator2,
+ URL = {http://www.jstor.org/stable/2958850},
+ author = {Odd O. Aalen},
+ journal = {The Annals of Statistics},
+ number = {4},
+ pages = {701--726},
+ publisher = {Institute of Mathematical Statistics},
+ title = {Nonparametric Inference for a Family of Counting Processes},
+ volume = {6},
+ year = {1978}
+}
+
+% Not used
+@article{BrierScore,
+ author = {Glenn W. Brier},
+ title = {Verification of Forecasts Expressed in Terms of Probability},
+ journal = {Monthly Weather Review},
+ volume = {78},
+ number = {1},
+ pages = {1-3},
+ year = {1950},
+ doi = {10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2}
+}
+
+@article {wihs,
+ author = {Bacon, Melanie C. and von Wyl, Viktor and Alden, Christine and Sharp, Gerald and Robison, Esther and Hessol, Nancy and Gange, Stephen and Barranday, Yvonne and Holman, Susan and Weber, Kathleen and Young, Mary A.},
+ title = {The Women{\textquoteright}s Interagency HIV Study: an Observational Cohort Brings Clinical Sciences to the Bench},
+ volume = {12},
+ number = {9},
+ pages = {1013--1019},
+ year = {2005},
+ doi = {10.1128/CDLI.12.9.1013-1019.2005},
+ publisher = {American Society for Microbiology Journals},
+ journal = {Clinical and Vaccine Immunology}
+}
+
+
+@article{FineAndGrayProportional,
+ journal = {Journal of the American Statistical Association},
+ pages = {496--509},
+ volume = {94},
+ publisher = {Taylor & Francis Group},
+ number = {446},
+ year = {1999},
+ title = {A Proportional Hazards Model for the Subdistribution of a Competing Risk},
+ author = {Fine, Jason P. and Gray, Robert J.}
+}
+
+% Bibtex has a lot of difficulty with the UTF-8 character that \O produces; however if I let Bibtex try and format the author section it strips out the '\' from O, so I had to manually format this
+@book{survival_event_history_book,
+ title={Survival and Event History Analysis: A Process Point of View},
+ publisher={Springer-Verlag},
+ author={{Aalen OO, Borgan \O, Gjessing HK}},
+ year={2008},
+ doi={10.1007/978-0-387-68560-1},
+ isbn={978-0-387-20287-7}
+}
+
+@book{methods_for_lifetime_data_book,
+ title={Statistical Models and Methods for Lifetime Data},
+ publisher={John Wiley \& Sons},
+ author={Jerald F. Lawless},
+ year={2002},
+ doi={10.1002/9781118033005},
+ isbn={978-0-471-37215-8}
+}
diff --git a/vignettes/simple-example.Rmd b/vignettes/simple-example.Rmd
new file mode 100644
index 0000000..b640268
--- /dev/null
+++ b/vignettes/simple-example.Rmd
@@ -0,0 +1,106 @@
+---
+title: "Simple example of using largeRCRF"
+author: "Joel Therrien & Jiguo Cao"
+output: rmarkdown::html_vignette
+bibliography: refs.bib
+vignette: >
+ %\VignetteIndexEntry{Vignette Title}
+ %\VignetteEngine{knitr::rmarkdown}
+ %\VignetteEncoding{UTF-8}
+---
+
+```{r setup, include = FALSE}
+knitr::opts_chunk$set(
+ collapse = TRUE,
+ comment = "#>"
+)
+```
+
+This is a quick example of running **largeRCRF** on a dataset, extracting some predictions from it, and calculating a measure of concordance error.
+
+## Source
+
+The dataset originally comes from the *Women's Interagency HIV Study* [@wihs], but was obtained through the **randomForestSRC** [@IshwaranRfsrc] package.
+
+## Background
+
+The *Women's Interagency HIV Study* is a dataset that followed HIV positive women and recorded when one of three possible competing events occurred for each one:
+
+* The woman began treatment for HIV.
+* The woman developed AIDS or died.
+* The woman was censored for administrative reasons.
+
+There are four different predictors available (age, history of drug injections, race, and a blood count of a type of white blood cells).
+
+## Getting the data
+
+```{r}
+data(wihs, package = "largeRCRF")
+names(wihs)
+```
+
+`time` & `status` are two columns in `wihs` corresponding to the competing risks response, while `ageatfda`, `idu`, `black`, and `cd4nadir` are the different predictors we wish to train on.
+
+We train a forest by calling `train`.
+
+```{r}
+library("largeRCRF")
+model <- train(CR_Response(status, time) ~ ageatfda + idu + black + cd4nadir,
+ data = wihs, splitFinder = LogRankSplitFinder(1:2, 2),
+ ntree = 100, numberOfSplits = 0, mtry = 2, nodeSize = 15,
+ randomSeed = 15)
+```
+
+We specify `splitFinder = LogRankSplitFinder(1:2, 2)`, which indicates that we have event codes 1 to 2 to handle, but that we want to focus on optimizing splits for event 2 (which corresponds to when AIDS develops).
+
+We specify that we want a forest of 100 trees (`ntree = 100`), that we want to try all possible splits when trying to split on a variable (`numberOfSplits = 0`), that we want to try splitting on two predictors at a time (`mtry = 2`), and that the terminal nodes should have an average size of at minimum 15 (`nodeSize = 15`; accomplished by not splitting any nodes with size less than 2 $\times$ `nodeSize`). `randomSeed = 15` specifies a seed so that the results are deterministic; note that **largeRCRF** generates random numbers separately from R and so is not affected by `set.seed()`.
+
+Printing `model` on its own doesn't really do much except print the different components and parameters that made the forest.
+
+```{r}
+model
+```
+
+Next we'll make predictions on the training data. Since we're using the training data, **largeRCRF** will by default only predict each observation using trees where that observation wasn't included in the bootstrap sample ('out-of-bag' predictions).
+
+```{r}
+predictions <- predict(model)
+```
+
+Since our data is competing risks data, our responses are several functions which can't really be printed on screen. Instead a message lets us know of several functions which can let us extract the estimate of the survivor curve, the cause-specific cumulative incidence functions, or the cause-specific cumulative hazard functions (CHF).
+
+```{r}
+predictions[[1]]
+```
+
+
+Here we extract the cause-specific functions for the AIDS event, as well as the overall survivor curve.
+
+```{r}
+aids.cifs = extractCIF(predictions, event = 2)
+aids.chfs = extractCHF(predictions, event = 2)
+survivor.curves = extractSurvivorCurve(predictions)
+```
+
+Now we plot some of the functions that we extracted.
+
+```{r}
+curve(aids.cifs[[3]](x), from=0, to=8, ylim=c(0,1),
+ type="S", ylab="CIF(t)", xlab="Time (t)")
+
+curve(aids.chfs[[3]](x), from=0, to=8,
+ type="S", ylab="CHF(t)", xlab="Time (t)")
+```
+
+Finally, we calculate the naive concordance error on the out-of-bag predictions. `extractMortalities` calculates a measure of mortality by integrating the specified event's cumulative incidence function from 0 to `time`, although users are free to substitute their own measures if desired. `naiveConcordance` then takes the true responses and compares them with the mortality predictions provided, estimating the proportion of wrong predictions for each event as described by @WolbersConcordanceCompetingRisks.
+
+```{r}
+mortalities1 <- extractMortalities(predictions, time = 8, event = 1)
+mortalities2 <- extractMortalities(predictions, time = 8, event = 2)
+naiveConcordance(CR_Response(wihs$status, wihs$time),
+ list(mortalities1, mortalities2))
+```
+
+We could continue by trying another model to see if we could lower the concordance error, or by integrating the above steps into some tuning algorithm.
+
+## References