Results are more variable than randomForestSRC #10
I've run some simulations, and it's clear that my package's CIF estimates are more variable (with no reduction in bias) than the corresponding results from randomForestSRC (because I generated the data, I know the true CIF). Before this package can be published, this must be fixed.
Tested `CompetingRiskResponseCombiner` by growing, in randomForestSRC, a forest of one tree with a node size large enough that there would be only one terminal node (and with bootstrapping turned off). One similar tree was made in my package; the curves matched each other.
Therefore `CompetingRiskResponseCombiner` is cleared of error. Note that I only compared the CIF and CHF functions; I didn't look at the Kaplan-Meier estimate.
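For reference, a sketch of the randomForestSRC side of that test (the data frame `train` and its column names are placeholders, not the actual simulation code):

```r
library(randomForestSRC)

# Sketch (placeholder data/column names): one tree, no bootstrapping,
# node size large enough that the root is the only terminal node.
# status: 0 = censored, 1/2 = the competing events.
stump <- rfsrc(Surv(time, status) ~ ., data = train,
               ntree = 1, nodesize = nrow(train),
               bootstrap = "none")

# Competing-risk forests carry CIF and CHF arrays [subject, time, event];
# with a root-only tree every subject gets the same curves.
stump$cif[1, , ]
stump$chf[1, , ]
```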
Remaining entries to check:
I repeated the previous procedure, but without restricting node size: I set mtry=p and nsplit=0, growing only one tree without bootstrapping. This time there was a large difference in results, suggesting that there is an error somewhere in the log-rank differentiator or in the general splitting procedure. I could repeat the experiment on some regression data to see whether the error is present in just the general splitting procedure, but there's a chance that randomForestSRC's implementation may vary between methods.
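A sketch of this configuration (same placeholder names as above):

```r
library(randomForestSRC)

# Sketch: a single deterministic tree -- every covariate tried
# (mtry = p) and every split value evaluated (nsplit = 0), with
# bootstrapping off. Data set and column names are placeholders.
p <- length(setdiff(names(train), c("time", "status")))
tree <- rfsrc(Surv(time, status) ~ ., data = train,
              ntree = 1, mtry = p, nsplit = 0,
              bootstrap = "none")
tree$cif[1, , ]  # compare subject-level CIFs against my package's tree
```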
Repeated the above procedure using only a single log-rank group differentiator. Restricting myself to the 2nd event, there was no difference in the CIFs over 500 repeats! Looking at the 1st event, however, differences show up quickly. One thing I discovered is that if censored values are moved between the left and right group, then depending on their values the log-rank value may not change at all. I'm not sure whether `randomForestSRC` prefers smaller or larger covariate values in this case; my package is currently largely random, as I previously threw all my splits into a HashSet. However, these slight differences don't explain why my package has worse results: picking one of the equally best splits arbitrarily shouldn't hurt, and given that my overall simulation uses a small nsplit, I'm not sure how often this situation even occurred.
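A toy illustration of that tie behaviour (made-up data): a subject censored before the first event time is never in any risk set, so swapping it between the two sides leaves the `survival::survdiff` statistic unchanged:

```r
library(survival)

# Subject 1 is censored at t = 0.5, before the first event at t = 1,
# so it never enters a risk set and its group cannot matter.
time   <- c(0.5, 1, 2, 3, 4, 5, 6)
status <- c(0,   1, 1, 1, 1, 1, 1)

groupA <- c(1, 1, 1, 2, 2, 2, 2)  # early-censored subject on the left
groupB <- c(2, 1, 1, 2, 2, 2, 2)  # ... and on the right

survdiff(Surv(time, status) ~ groupA)$chisq
survdiff(Surv(time, status) ~ groupB)$chisq  # identical
```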
I took the optimal composite splits on the `test_split_data.csv` file produced by `randomForestSRC` and my package, and I ran those splits through `survival::survdiff`. I then manually produced the composite scores for both packages on their selected splits. `randomForestSRC` produced a composite split score of 71.41135, while my package produced a larger score of 71.5354. In this case it looks like the difference in splitting is due to some bug in `randomForestSRC`, since it didn't select the optimal split.

Unfortunately, randomForestSRC not selecting the optimal split doesn't explain why it does better, which I still need to determine.
Code for producing scores:
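Roughly along these lines (a sketch: the column names `time`, `status`, and `group`, and the exact composite weighting, are my assumptions):

```r
library(survival)

# Assumed layout of test_split_data.csv: time, status (0 = censored,
# 1/2 = the two competing events), group (which side of the split).
d <- read.csv("test_split_data.csv")

# Per-event log-rank numerator and variance, treating the other
# event type as censoring.
logrank_parts <- function(j) {
  fit <- survdiff(Surv(time, status == j) ~ group, data = d)
  c(num = fit$obs[1] - fit$exp[1], var = fit$var[1, 1])
}
parts <- sapply(1:2, logrank_parts)

num    <- parts["num", ]
sigma2 <- parts["var", ]
L      <- num / sqrt(sigma2)  # standardized per-event statistics

# Composite score as documented (sigma_j^2 weights in the numerator;
# see the later comment about sigma_j vs sigma_j^2).
abs(sum(sigma2 * L)) / sqrt(sum(sigma2))
```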
Remaining entries to check:
In the optimizations branch I've changed how the general splitting procedures work to be a bit faster. I'm going to run some simulations where deterministic splitting is used everywhere, and see if I get different or worse results.
I repeated the simulation where I produced a stump of a tree and compared the estimated CIFs, but this time I produced a whole forest of stumps where splitting never occurs. Because of bootstrapping the results wouldn't be identical, but they should be close, and they were. Therefore I'm ruling out the CompetingRiskFunctionCombiner as a source of the error; the bug must be somewhere in my general splitting procedure.
FYI, in this task my package was much, much faster than randomForestSRC (n=10000, ntree=1000), indicating that perhaps the large-data inefficiencies in randomForestSRC are in the creation of the functions, not in the splitting.
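A sketch of the stump-forest side of this test (placeholder names again; bootstrapping is left at its default, unlike the earlier single-stump run):

```r
library(randomForestSRC)

# Many bootstrapped stumps: nodesize = n still forces root-only trees,
# so the ensemble CIF should be close to (not identical to) the
# single-tree, no-bootstrap version.
stumps <- rfsrc(Surv(time, status) ~ ., data = train,
                ntree = 1000, nodesize = nrow(train))
stumps$cif[1, , ]  # compare with the ensemble from my package
```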
Remaining entries to check:
Ran some simulations on the `optimizations` branch (which included some changes to how random splitting works for numeric covariates), and my estimates of the relative (to `randomForestSRC`) error of the CIFs are now on average less than 1, which is a fantastic result. I'd still like to run a more formal simulation study to confirm these results, but it looks like this issue is now closed.

Found the reason why `randomForestSRC` produced different results and is doing better. In the composite split rule formula they provide in their documentation, they use \sigma_j^2 in the numerator, when `randomForestSRC` in fact uses only \sigma_j. `randomForestSRC` produces optimal results using this other rule. As well, this other rule has the nice property that when J=1 the multiple log-rank rule reduces down to the single log-rank rule, which isn't true of the documented rule. Going to try and modify the code to use the actual rule.
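Concretely, one consistent reading of the two rules (the notation is mine: $L_j$ is the standardized log-rank statistic for event $j$ and $\sigma_j^2$ its estimated variance):

$$
L_{\text{doc}} = \frac{\sum_{j=1}^{J} \sigma_j^2\, L_j}{\sqrt{\sum_{j=1}^{J} \sigma_j^2}},
\qquad
L_{\text{impl}} = \frac{\sum_{j=1}^{J} \sigma_j\, L_j}{\sqrt{\sum_{j=1}^{J} \sigma_j^2}}.
$$

For $J = 1$ the implemented rule gives $\sigma_1 L_1 / \sigma_1 = L_1$, exactly the single log-rank statistic, while the documented rule gives $\sigma_1 L_1$.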
Fixed in the `experimental` branch.

`experimental` has been merged into `master` for a while.