Results are more variable than randomForestSRC #10

Closed
opened 2019-01-02 18:14:40 +00:00 by joel · 11 comments
Owner

I've run some simulations, and it's clear that my package's CIF estimates are more variable (with no reduction in bias) than the corresponding results from randomForestSRC (because I generated the data, I know the true CIF). Before this package can be published, this **must** be fixed.

joel self-assigned this 2019-01-02 18:14:49 +00:00
Author
Owner

Tested CompetingRiskResponseCombiner by growing, in randomForestSRC, a forest of one tree with a node size large enough that there would be only one terminal node (and bootstrapping turned off). A similar tree was grown in my package; the curves matched each other.

Therefore CompetingRiskResponseCombiner is cleared of error. Note that I only compared the CIF and CHF functions; I didn't look at the Kaplan-Meier estimate.
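
For reference, roughly the randomForestSRC side of that test, as a sketch (`simData`, with response columns `u` and `delta`, is a hypothetical stand-in for my simulated dataset):

```
library(randomForestSRC)

# One tree, no bootstrapping, and a node size too large for any split to be
# legal, so the tree is a single terminal node. simData (columns u, delta,
# plus covariates) stands in for the simulated data.
stump <- rfsrc(Surv(u, delta) ~ ., data = simData,
               ntree = 1, bootstrap = "none",
               nodesize = nrow(simData) + 1)

# With a status variable taking values 0/1/2 this fits as competing risks,
# and the object carries the CIF and CHF estimates I compared against.
str(stump$cif)
str(stump$chf)
```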

Remaining entries to check:

* [x] CompetingRiskResponseCombiner
* [ ] General splitting procedure
* [ ] Log-rank differentiator
* [ ] CompetingRiskFunctionCombiner
Author
Owner

I took the previous procedure, but didn't restrict the node size. I set `mtry=p` and `nsplit=0`, growing only one tree without bootstrapping. This time there was a large difference in results, suggesting that there is an error somewhere in the log-rank differentiator or the general splitting procedure. I could try repeating the problem on some regression data to see whether the error is present in just the general splitting procedure, but randomForestSRC's implementation may vary between methods.
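
A sketch of the corresponding randomForestSRC call (parameter values as described above; `simData` is the same hypothetical stand-in):

```
library(randomForestSRC)

# One deterministic tree: every covariate is a candidate at every node
# (mtry = p), all split points are evaluated (nsplit = 0), no bootstrapping.
p <- ncol(simData) - 2  # u and delta are the response columns
fit <- rfsrc(Surv(u, delta) ~ ., data = simData,
             ntree = 1, bootstrap = "none",
             mtry = p, nsplit = 0, splitrule = "logrankCR")
```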

Author
Owner

Repeated the above procedure using only a single log-rank group differentiator. If I restricted myself to the 2nd event there was no difference in the CIFs over 500 repeats! However, looking at the 1st event, differences quickly show up. One thing I discovered is that when censored values are moved between the left and right group, then depending on their values the log-rank value may not change at all. I'm not sure whether `randomForestSRC` prefers smaller or larger covariate values in this case; my package is currently largely random here, as I previously threw all my splits into a `HashSet`.

However, these slight differences don't explain why my package has worse results. Picking one of the equally best splits arbitrarily shouldn't produce worse results; plus, given that my overall simulation uses a small `nsplit`, I'm not sure how often this situation occurred.
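
A minimal illustration of that tie behaviour with hypothetical toy data (not from my simulations): an observation censored before the first event time never sits in a risk set at an event time, so moving it between daughters can't change the log-rank statistic.

```
library(survival)

# Observation 1 is censored at t = 0.5, before the first event at t = 1.
u     <- c(0.5, 1, 2, 3, 4, 5)
delta <- c(0,   1, 1, 1, 1, 1)  # 0 = censored, 1 = event

# Two candidate splits differing only in where observation 1 goes.
groupA <- c(TRUE,  TRUE, TRUE, FALSE, FALSE, FALSE)
groupB <- c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

survdiff(Surv(u, delta) ~ groupA)$chisq  # identical ...
survdiff(Surv(u, delta) ~ groupB)$chisq  # ... to this
```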

Author
Owner

I took the optimal composite splits on the `test_split_data.csv` file produced by `randomForestSRC` and my package, and I ran those splits through `survival::survdiff`, manually computing the composite scores for both packages' selected splits.

`randomForestSRC` produced a composite split score of 71.41135, while my package produced a larger score of 71.5354. It looks like the difference in splitting is due to some bug in `randomForestSRC`, since it didn't select the optimal split.

Unfortunately, `randomForestSRC` not selecting the optimal split doesn't explain why it does *better*, which I still need to determine.

Code for producing the scores:

```
library(survival)

data <- read.csv("test_split_data.csv")

# rfsrcGroupA and javaGroupA are assumed to already hold the row indices
# that each package sent to one daughter of its selected split.
newData <- data.frame(u = data$u, delta = data$delta,
                      rfsrcGroupA = 1:nrow(data) %in% rfsrcGroupA,
                      javaGroupA = 1:nrow(data) %in% javaGroupA)
newData$isEvent1 <- newData$delta == 1
newData$isEvent2 <- newData$delta == 2

# Per-event log-rank tests for the randomForestSRC split
rfsrc.test.1 <- survdiff(Surv(u, isEvent1) ~ rfsrcGroupA, newData)
rfsrc.test.2 <- survdiff(Surv(u, isEvent2) ~ rfsrcGroupA, newData)

# Composite score: sqrt(chisq) recovers the standardized |O - E| / sigma
# for each event, weighted here by the variances as in the documented rule.
rfsrc.composite <- (sqrt(rfsrc.test.1$chisq) * rfsrc.test.1$var[1,1] +
                    sqrt(rfsrc.test.2$chisq) * rfsrc.test.2$var[1,1]) /
  sqrt(rfsrc.test.1$var[1,1] + rfsrc.test.2$var[1,1])

# Same computation for my package's split
java.test.1 <- survdiff(Surv(u, isEvent1) ~ javaGroupA, newData)
java.test.2 <- survdiff(Surv(u, isEvent2) ~ javaGroupA, newData)

java.composite <- (sqrt(java.test.1$chisq) * java.test.1$var[1,1] +
                   sqrt(java.test.2$chisq) * java.test.2$var[1,1]) /
  sqrt(java.test.1$var[1,1] + java.test.2$var[1,1])
```

Author
Owner

Remaining entries to check:

* [x] CompetingRiskResponseCombiner
* [ ] General splitting procedure
* [x] Log-rank differentiator
* [ ] CompetingRiskFunctionCombiner
Author
Owner

In the `optimizations` branch I've changed how the general splitting procedures work to be a bit faster. I'm going to run some simulations where deterministic splitting is used everywhere, and see if I get different or worse results.

Author
Owner

I repeated the simulation where I produced a stump of a tree and compared the estimated CIFs, but this time I produced a whole forest of stumps where splitting never occurs. Because of bootstrapping the results wouldn't be identical, but they should be close, and they were. Therefore I'm ruling out the CompetingRiskFunctionCombiner as a source of the error; the bug must be somewhere in my general splitting procedure.

FYI, in this task my package was much, much faster than `randomForestSRC` (n=10000, ntree=1000), suggesting that perhaps `randomForestSRC`'s large-data inefficiencies are in the creation of the functions, not in the splitting.
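
The stump forest, sketched under the same assumptions as before (`simData` is the hypothetical stand-in; bootstrapping left at its default):

```
library(randomForestSRC)

# A forest of root-only trees: bootstrap on by default, nodesize too large
# for any split, so the trees differ only through their bootstrap samples.
stumpForest <- rfsrc(Surv(u, delta) ~ ., data = simData,
                     ntree = 1000, nodesize = nrow(simData) + 1)
```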

Remaining entries to check:

* [x] CompetingRiskResponseCombiner
* [ ] General splitting procedure
* [x] Log-rank differentiator
* [x] CompetingRiskFunctionCombiner
Author
Owner

Ran some simulations on the `optimizations` branch (which included some changes to how random splitting worked for numeric covariates), and my estimates of the relative (to `randomForestSRC`) error of the CIFs are now on average less than 1, which is a fantastic result. I'd still like to run a more formal simulation study to confirm these results, but it looks like this issue is now closed.

joel closed this issue 2019-01-14 19:06:50 +00:00
Author
Owner

Found the reason why `randomForestSRC` produced different results and does better. In the composite split rule formula provided in their documentation they use $\sigma_j^2$ in the numerator, when `randomForestSRC` in fact uses only $\sigma_j$. `randomForestSRC` produces optimal results using this other rule. As well, the other rule has the nice property that when $J=1$ the multiple log-rank rule reduces to the single log-rank rule, which isn't true of the documented rule.
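
For reference, here's my reconstruction of the two rules, in my own notation and inferred from the `survdiff` scoring code earlier in this thread rather than copied from either codebase. With $O_j - E_j$ the observed-minus-expected events for event $j$ in one daughter and $|L_j| = |O_j - E_j|/\sigma_j$ the standardized log-rank statistic:

$$
\text{documented: } \frac{\sum_{j=1}^{J} \sigma_j^2 \, |L_j|}{\sqrt{\sum_{j=1}^{J} \sigma_j^2}}
\qquad\qquad
\text{actual: } \frac{\sum_{j=1}^{J} \sigma_j \, |L_j|}{\sqrt{\sum_{j=1}^{J} \sigma_j^2}} = \frac{\sum_{j=1}^{J} |O_j - E_j|}{\sqrt{\sum_{j=1}^{J} \sigma_j^2}}
$$

At $J = 1$ the actual rule collapses to $|O_1 - E_1| / \sigma_1$, the single log-rank statistic; the documented rule gives $\sigma_1 |L_1| = |O_1 - E_1|$ instead.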

Going to try to modify the code to use the actual rule.

joel reopened this issue 2019-04-24 00:24:09 +00:00
Author
Owner

Fixed in the `experimental` branch.

Author
Owner

`experimental` has been merged into `master` for a while.

joel closed this issue 2019-05-29 22:09:07 +00:00
Reference: joel/largeRCRF-Java#10