Include wihs data and vignette into package

2019-07-05 11:56:53 -07:00 · 2019-07-05 11:56:53 -07:00 · 0cd20225ce
commit 0cd20225ce
parent ec4ef7ea44
10 changed files with 400 additions and 5 deletions
--- a/.Rbuildignore
+++ b/.Rbuildignore
@@ -1,4 +1,6 @@
 ^Meta$
 ^doc$
 ^.*\.Rproj$
 ^\.Rproj\.user$
 copyJar
-.gitignore
+.gitignore
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,6 @@
 Meta
 doc
 inst/doc
 *.Rproj
 .Rproj.user
 copyJar
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: largeRCRF
 Type: Package
 Title: Large Random Competing Risks Forests
-Version: 1.0.2
+Version: 1.0.3
 Authors@R: c(
    person("Joel", "Therrien", email = "joel_therrien@sfu.ca", role = c("aut", "cre", "cph")),
    person("Jiguo", "Cao", email = "jiguo_cao@sfu.ca", role = c("aut", "dgs"))
@@ -18,7 +18,10 @@ Imports:
    rJava (>= 0.9-9)
 Suggests:
    parallel,
-    testthat
+    testthat,
    knitr,
    rmarkdown
 Depends: R (>= 3.4.0)
 SystemRequirements: Java JDK 1.8 or higher
 RoxygenNote: 6.1.1
 VignetteBuilder: knitr
--- a/R/wihs.R
+++ b/R/wihs.R
@@ -0,0 +1,23 @@
 #' Women's Interagency HIV Study
 #'
 #' A dataset containing competing risks information for women with HIV,
 #' recording the time to treatment or the time to developing AIDS or death. The
 #' time may also be censored.
 #'
 #' @format A data frame with 1164 rows and 6 variables: \describe{
 #'   \item{time}{time to the event} \item{status}{denotes which event occurred.
 #'   0 denotes censoring, 1 denotes HIV treatment began, and 2 denotes AIDS
 #'   developed or the patient died} \item{ageatfda}{patient age at time first
 #'   treatment approved} \item{idu}{binary specifying if the patient has a
 #'   history of drug injections (1 if true)} \item{black}{binary specifying if
 #'   the patient is black (1 if true)} \item{cd4nadir}{blood count of CD4 cells}
 #'   }
 #' @source The data was obtained from the randomForestSRC R package.
 #'
 #' @references Bacon MC, von Wyl V, Alden C, Sharp G, Robison E, Hessol N, Gange
 #'   S, Barranday Y, Holman S, Weber K, Young MA (2005). “The Women’s
 #'   Interagency HIV Study: an Observational Cohort Brings Clinical Sciences to
 #'   the Bench.” Clinical and Vaccine Immunology, 12(9), 1013–1019.
 #'   doi:10.1128/CDLI.12.9.1013-1019.2005.
 #'   
 "wihs"
--- a/README.md
+++ b/README.md
@@ -3,12 +3,13 @@
 This R package is used to train random competing risks forests, ideally for large data.
 It's based heavily off of [randomForestSRC](https://github.com/kogalur/randomForestSRC/), although there are some differences.
-This package is still in a pre-release state and so it not yet available on CRAN.
+This package is not yet available on CRAN.
 To install it now, in R install the `devtools` package and run the following command:
 ```
 R> devtools::install_git("https://github.com/jatherrien/largeRCRF.git")
 ```
 If you want the vignettes and have the packages needed to build them, you can pass `build_vignettes = TRUE` as an argument to the command above.
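 As a sketch of that variant (assuming `knitr` and `rmarkdown` are already installed; the vignette name `simple-example` is taken from the file under `vignettes/`), the install and a follow-up call to open the vignette might look like:
 ```
 R> devtools::install_git("https://github.com/jatherrien/largeRCRF.git", build_vignettes = TRUE)
 R> vignette("simple-example", package = "largeRCRF")
 ```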
 ## System Requirements
 You need:
--- a/data/wihs.rda
+++ b/data/wihs.rda
--- a/man/wihs.Rd
+++ b/man/wihs.Rd
@@ -0,0 +1,33 @@
 % Generated by roxygen2: do not edit by hand
 % Please edit documentation in R/wihs.R
 \docType{data}
 \name{wihs}
 \alias{wihs}
 \title{Women's Interagency HIV Study}
 \format{A data frame with 1164 rows and 6 variables: \describe{
  \item{time}{time to the event} \item{status}{denotes which event occurred.
  0 denotes censoring, 1 denotes HIV treatment began, and 2 denotes AIDS
  developed or the patient died} \item{ageatfda}{patient age at time first
  treatment approved} \item{idu}{binary specifying if the patient has a
  history of drug injections (1 if true)} \item{black}{binary specifying if
  the patient is black (1 if true)} \item{cd4nadir}{blood count of CD4 cells}
  }}
 \source{
 The data was obtained from the randomForestSRC R package.
 }
 \usage{
 wihs
 }
 \description{
 A dataset containing competing risks information for women with HIV,
 recording the time to treatment or the time to developing AIDS or death. The
 time may also be censored.
 }
 \references{
 Bacon MC, von Wyl V, Alden C, Sharp G, Robison E, Hessol N, Gange
  S, Barranday Y, Holman S, Weber K, Young MA (2005). “The Women’s
  Interagency HIV Study: an Observational Cohort Brings Clinical Sciences to
  the Bench.” Clinical and Vaccine Immunology, 12(9), 1013–1019.
  doi:10.1128/CDLI.12.9.1013-1019.2005.
 }
 \keyword{datasets}
--- a/vignettes/.gitignore
+++ b/vignettes/.gitignore
@@ -0,0 +1,2 @@
 *.html
 *.R
--- a/vignettes/refs.bib
+++ b/vignettes/refs.bib
@@ -0,0 +1,222 @@
 % TODO - read me
@Article{AalenJohansenCIFs,
    URL = {http://www.jstor.org/stable/4615704},
    author = {Odd O. Aalen and Søren Johansen},
    journal = {Scandinavian Journal of Statistics},
    number = {3},
    pages = {141--150},
    title = {An Empirical Transition Matrix for Non-Homogeneous Markov Chains Based on Censored Observations},
    volume = {5},
    year = {1978}
 }
@Article{Breiman2001,
    author="Breiman, Leo",
    title="Random Forests",
    journal="Machine Learning",
    year="2001",
    month="Oct",
    day="01",
    volume="45",
    number="1",
    pages="5--32",
    doi="10.1023/A:1010933404324"
 }
@article{IshwaranCompetingRisks,
    author = {Ishwaran, Hemant and Gerds, Thomas A. and Kogalur, Udaya B. and Moore, Richard D. and Gange, Stephen J. and Lau, Bryan M.},
    title = {Random Survival Forests for Competing Risks},
    journal = {Biostatistics},
    volume = {15},
    number = {4},
    pages = {757--773},
    year = {2014},
    doi = {10.1093/biostatistics/kxu010}
 }
 % TODO - need DOI
@Article{IshwaranSurvivalR,
    title = {Random Survival Forests for \proglang{R}},
    author = {H. Ishwaran and Udaya B. Kogalur},
    journal = {\proglang{R} News},
    year = {2007},
    volume = {7},
    number = {2},
    pages = {25--31},
    month = {10},
    url = {https://CRAN.R-project.org/doc/Rnews/},
    pdf = {https://CRAN.R-project.org/doc/Rnews/Rnews_2007-2.pdf}
 }
@Manual{IshwaranRfsrc,
    title = {Random Forests for Survival, Regression, and Classification (RF-SRC)},
    author = {H. Ishwaran and Udaya B. Kogalur},
    publisher = {manual},
    year = {2018},
    note = {\proglang{R} package version 2.8.0},
    url = {https://cran.r-project.org/package=randomForestSRC},
    pdf = {https://cran.r-project.org/web/packages/randomForestSRC/randomForestSRC.pdf},
 }
@Article{IshwaranSurvival,
    author = "Ishwaran, Hemant and Kogalur, Udaya B. and Blackstone, Eugene H. and Lauer, Michael S.",
    doi = "10.1214/08-AOAS169",
    journal = "The Annals of Applied Statistics",
    month = "09",
    number = "3",
    pages = "841--860",
    publisher = "The Institute of Mathematical Statistics",
    title = "Random Survival Forests",
    volume = "2",
    year = "2008"
 }
 % TODO - read me
@Article{KaplanMeierCurve,
    author = {E. L. Kaplan and Paul Meier},
    title = {Nonparametric Estimation from Incomplete Observations},
    journal = {Journal of the American Statistical Association},
    volume = {53},
    number = {282},
    pages = {457--481},
    year = {1958},
    publisher = {Taylor \& Francis},
    doi = {10.1080/01621459.1958.10501452}
 }
 % Note; the exported citation did not include year.  
@Article{FeiIshwaranMissingData,
    author = {Tang, Fei and Ishwaran, Hemant},
    title = {Random Forest Missing Data Algorithms},
    journal = {Statistical Analysis and Data Mining: The ASA Data Science Journal},
    volume = {10},
    number = {6},
    pages = {363--377},
    year = {2017},
    keywords = {correlation, imputation, machine learning, missingness, splitting (random, univariate, multivariate, unsupervised)},
    doi = {10.1002/sam.11348}
 }
@Manual{rJava,
    title = {\pkg{rJava}: Low-Level \proglang{R} to \proglang{Java} Interface},
    author = {Simon Urbanek},
    year = {2018},
    note = {\proglang{R} package version 0.9-10},
    url = {https://CRAN.R-project.org/package=rJava},
 }
@Book{ggplot2,
    author = {Hadley Wickham},
    title = {\pkg{ggplot2}: Elegant Graphics for Data Analysis},
    publisher = {Springer-Verlag},
    year = {2016},
    isbn = {978-3-319-24277-4},
    url = {http://ggplot2.org},
 }
@Manual{RCitation,
    title = {\proglang{R}: A Language and Environment for Statistical Computing},
    author = {\proglang{R} Core Team},
    organization = {\proglang{R} Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2018},
    url = {https://www.R-project.org/},
 }
 % %author = {Wolbers, Marcel and Blanche, Paul and T Koller, Michael and C M Witteman, Jacqueline and Gerds, Thomas}, Adjusted field for better adjustments
 % excluded     eprint = {http://oup.prod.sis.lan/biostatistics/article-pdf/15/3/526/599536/kxt059.pdf}, as the URL doesn't work
 % excluded     url = {https://doi.org/10.1093/biostatistics/kxt059}, as it's redundant on doi
@article{WolbersConcordanceCompetingRisks,
    author = {Wolbers, Marcel and Blanche, Paul and Koller, Michael T. and Witteman, Jacqueline C. M. and Gerds, Thomas A.},
    title = {Concordance for Prognostic Models with Competing Risks},
    journal = {Biostatistics},
    volume = {15},
    number = {3},
    pages = {526--539},
    year = {2014},
    month = {02},
    doi = {10.1093/biostatistics/kxt059}
 }
@article{NelsonAalenEstimator1,
    author = {Wayne Nelson},
    title = {Theory and Applications of Hazard Plotting for Censored Failure Data},
    journal = {Technometrics},
    volume = {14},
    number = {4},
    pages = {945--966},
    year = {1972},
    publisher = {Taylor \& Francis},
    doi = {10.1080/00401706.1972.10488991}
 }
@article{NelsonAalenEstimator2,
    URL = {http://www.jstor.org/stable/2958850},
    author = {Odd O. Aalen},
    journal = {The Annals of Statistics},
    number = {4},
    pages = {701--726},
    publisher = {Institute of Mathematical Statistics},
    title = {Nonparametric Inference for a Family of Counting Processes},
    volume = {6},
    year = {1978}
 }
 % Not used
@article{BrierScore,
    author = {Glenn W. Brier},
    title = {Verification of Forecasts Expressed in Terms of Probability},
    journal = {Monthly Weather Review},
    volume = {78},
    number = {1},
    pages = {1--3},
    year = {1950},
    doi = {10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2}
 }
@article{wihs,
    author = {Bacon, Melanie C. and von Wyl, Viktor and Alden, Christine and Sharp, Gerald and Robison, Esther and Hessol, Nancy and Gange, Stephen and Barranday, Yvonne and Holman, Susan and Weber, Kathleen and Young, Mary A.},
    title = {The Women{\textquoteright}s Interagency HIV Study: an Observational Cohort Brings Clinical Sciences to the Bench},
    volume = {12},
    number = {9},
    pages = {1013--1019},
    year = {2005},
    doi = {10.1128/CDLI.12.9.1013-1019.2005},
    publisher = {American Society for Microbiology Journals},
    journal = {Clinical and Vaccine Immunology}
 }
@article{FineAndGrayProportional,
    journal = {Journal of the American Statistical Association},
    pages = {496--509},
    volume = {94},
    publisher = {Taylor \& Francis Group},
    number = {446},
    year = {1999},
    title = {A Proportional Hazards Model for the Subdistribution of a Competing Risk},
    author = {Fine, Jason P. and Gray, Robert J.}
 }
 % Bibtex has a lot of difficulty with the UTF-8 character that \O produces; however if I let Bibtex try and format the author section it strips out the '\' from O, so I had to manually format this
@book{survival_event_history_book,
    title={Survival and Event History Analysis: A Process Point of View}, 
    publisher={Springer-Verlag}, 
    author={{Aalen OO, Borgan \O, Gjessing HK}}, 
    year={2008},
    doi={10.1007/978-0-387-68560-1},
    isbn={978-0-387-20287-7}
 } 
@book{methods_for_lifetime_data_book,
    title={Statistical Models and Methods for Lifetime Data}, 
    publisher={John Wiley \& Sons}, 
    author={Jerald F. Lawless},
    year={2002},
    doi={10.1002/9781118033005},
    isbn={978-0-471-37215-8}
 } 
--- a/vignettes/simple-example.Rmd
+++ b/vignettes/simple-example.Rmd
@@ -0,0 +1,106 @@
 ---
 title: "Simple example of using largeRCRF"
 author: "Joel Therrien & Jiguo Cao"
 output: rmarkdown::html_vignette
 bibliography: refs.bib
 vignette: >
  %\VignetteIndexEntry{Simple example of using largeRCRF}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
 ---
 ```{r setup, include = FALSE}
 knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
 )
 ```
 This is a quick example of running **largeRCRF** on a dataset, extracting some predictions from it, and calculating a measure of concordance error.
 ## Source
 The dataset originally comes from the *Women's Interagency HIV Study* [@wihs], but was obtained through the **randomForestSRC** [@IshwaranRfsrc] package.
 ## Background
 The *Women's Interagency HIV Study* followed HIV-positive women and recorded which of three possible outcomes occurred first for each one:
 * The woman began treatment for HIV.
 * The woman developed AIDS or died.
 * The woman was censored for administrative reasons.
 Four predictors are available: age, history of drug injections, race, and a blood count of CD4 cells (a type of white blood cell).
 ## Getting the data
 ```{r}
 data(wihs, package = "largeRCRF")
 names(wihs)
 ```
 `time` and `status` are the two columns in `wihs` that make up the competing risks response, while `ageatfda`, `idu`, `black`, and `cd4nadir` are the predictors we wish to train on.
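 As a small illustration of this layout, the sketch below builds a toy stand-in for the response columns (the values are made up, not taken from `wihs`) and labels the status codes as described in the dataset's documentation.
 ```{r}
 # Toy stand-in for a few rows of the response columns (made-up values).
 toy <- data.frame(time = c(0.5, 2.1, 3.0, 4.7, 6.2),
                   status = c(1, 0, 2, 1, 2))
 # Status coding: 0 = censored, 1 = HIV treatment began, 2 = AIDS developed or death.
 toy$outcome <- factor(toy$status, levels = 0:2,
                       labels = c("censored", "treatment", "AIDS or death"))
 # Count how often each outcome occurs.
 table(toy$outcome)
 ```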
 We train a forest by calling `train`.
 ```{r}
 library("largeRCRF")
 model <- train(CR_Response(status, time) ~ ageatfda + idu + black + cd4nadir,
               data = wihs, splitFinder = LogRankSplitFinder(1:2, 2), 
               ntree = 100, numberOfSplits = 0, mtry = 2, nodeSize = 15,
               randomSeed = 15)
 ```
 We specify `splitFinder = LogRankSplitFinder(1:2, 2)`, which indicates that we have event codes 1 to 2 to handle, but that we want to focus on optimizing splits for event 2 (AIDS developed or the patient died).
 We specify that we want a forest of 100 trees (`ntree = 100`), that we want to try all possible splits when splitting on a variable (`numberOfSplits = 0`), that two predictors should be tried at each split (`mtry = 2`), and that terminal nodes should have an average size of at least 15 (`nodeSize = 15`; accomplished by not splitting any node with fewer than 2 $\times$ `nodeSize` observations). `randomSeed = 15` specifies a seed so that the results are reproducible; note that **largeRCRF** generates random numbers separately from R and so is not affected by `set.seed()`.
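 The splitting rule implied by `nodeSize` can be sketched as follows; `canSplit` is a hypothetical helper written only for this illustration and is not part of **largeRCRF**.
 ```{r}
 # Hypothetical helper: a node is eligible for splitting only if it holds
 # at least 2 * nodeSize observations, so two children can average nodeSize.
 canSplit <- function(n, nodeSize) n >= 2 * nodeSize
 # Nodes with 29, 30 and 45 observations when nodeSize = 15.
 canSplit(c(29, 30, 45), nodeSize = 15)
 ```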
 Printing `model` on its own simply lists the components and parameters that were used to build the forest.
 ```{r}
 model
 ```
 Next we'll make predictions on the training data. Since we're using the training data, **largeRCRF** will by default only predict each observation using trees where that observation wasn't included in the bootstrap sample ('out-of-bag' predictions).
 ```{r}
 predictions <- predict(model)
 ```
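 To illustrate what 'out-of-bag' means, the following base R sketch draws one bootstrap sample from ten hypothetical observations and lists those left out of it. It uses R's own random number generator, so it only mimics the idea and does not reproduce the samples **largeRCRF** draws internally.
 ```{r}
 # Illustration only: largeRCRF uses its own random number generator.
 set.seed(1)
 n <- 10
 # Bootstrap sample of row indices for a single tree (drawn with replacement).
 inBag <- sample(n, size = n, replace = TRUE)
 # Rows never drawn are 'out-of-bag' for that tree.
 outOfBag <- setdiff(seq_len(n), inBag)
 outOfBag
 ```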
 Since our data is competing risks data, each predicted response is a set of functions that can't usefully be printed on screen. Instead, a message lists the functions we can use to extract the estimate of the survivor curve, the cause-specific cumulative incidence functions (CIF), or the cause-specific cumulative hazard functions (CHF).
 ```{r}
 predictions[[1]]
 ```
 Here we extract the cause-specific functions for event 2 (AIDS or death), as well as the overall survivor curve.
 ```{r}
 aids.cifs <- extractCIF(predictions, event = 2)
 aids.chfs <- extractCHF(predictions, event = 2)
 survivor.curves <- extractSurvivorCurve(predictions)
 ```
 Now we plot some of the functions that we extracted.
 ```{r}
 curve(aids.cifs[[3]](x), from=0, to=8, ylim=c(0,1),
       type="S", ylab="CIF(t)", xlab="Time (t)")
 curve(aids.chfs[[3]](x), from=0, to=8, 
       type="S", ylab="CHF(t)", xlab="Time (t)")
 ```
 Finally, we calculate the naive concordance error on the out-of-bag predictions. `extractMortalities` calculates a measure of mortality by integrating the specified event's cumulative incidence function from 0 to `time`, although users are free to substitute their own measures if desired. `naiveConcordance` then takes the true responses and compares them with the mortality predictions provided, estimating the proportion of wrong predictions for each event as described by @WolbersConcordanceCompetingRisks.
 ```{r}
 mortalities1 <- extractMortalities(predictions, time = 8, event = 1)
 mortalities2 <- extractMortalities(predictions, time = 8, event = 2)
 naiveConcordance(CR_Response(wihs$status, wihs$time), 
               list(mortalities1, mortalities2))
 ```
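 To make the mortality measure concrete, here is a minimal base R sketch that integrates a made-up step-function CIF from 0 to a chosen time. `toyCif` and `integrateStep` are hypothetical names for this illustration; how `extractMortalities` computes the integral internally is not shown here.
 ```{r}
 # Toy CIF as a right-continuous step function (jump times and values are made up).
 jumpTimes <- c(1, 3, 6)
 cifValues <- c(0, 0.1, 0.25, 0.4)  # value before the first jump, then after each jump
 toyCif <- stepfun(jumpTimes, cifValues)
 # Integrate a step function exactly from 0 to `time` by summing rectangles.
 integrateStep <- function(f, time) {
   inner <- knots(f)[knots(f) > 0 & knots(f) < time]
   breaks <- c(0, inner, time)
   sum(f(head(breaks, -1)) * diff(breaks))
 }
 # 0 * 1 + 0.1 * 2 + 0.25 * 3 + 0.4 * 2 = 1.75
 integrateStep(toyCif, time = 8)
 ```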
 We could continue by trying another model to see if we could lower the concordance error, or by integrating the above steps into some tuning algorithm.
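 As one possible next step, the sketch below retrains the forest with splits focused on event 1 and larger terminal nodes, then recomputes the concordance error. It reuses only the calls shown earlier; the parameter values are arbitrary choices for illustration and are not claimed to improve on the first model.
 ```{r}
 library("largeRCRF")
 data(wihs, package = "largeRCRF")
 # Same formula as before, but focus splits on event 1 and double nodeSize.
 model2 <- train(CR_Response(status, time) ~ ageatfda + idu + black + cd4nadir,
                 data = wihs, splitFinder = LogRankSplitFinder(1:2, 1),
                 ntree = 100, numberOfSplits = 0, mtry = 2, nodeSize = 30,
                 randomSeed = 15)
 # Out-of-bag predictions on the training data, as in the first model.
 predictions2 <- predict(model2)
 # Concordance error for each event, for comparison with the first model.
 naiveConcordance(CR_Response(wihs$status, wihs$time),
                  list(extractMortalities(predictions2, time = 8, event = 1),
                       extractMortalities(predictions2, time = 8, event = 2)))
 ```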
 ## References