Made Forest into an abstract class with OnlineForest (in memory; same as previous)
and OfflineForest (reads individual trees only as needed). Many methods were changed.
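
The split described above can be pictured roughly as follows. This is a minimal sketch, not the project's actual code: only the class names Forest, OnlineForest and OfflineForest come from this log, while getNumberOfTrees, the IntFunction loader and the ForestSketch wrapper are hypothetical names chosen for illustration.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

final class ForestSketch {

    // Callers iterate over trees without knowing where they are stored.
    abstract static class Forest<T> implements Iterable<T> {
        abstract int getNumberOfTrees(); // hypothetical method name
    }

    // All trees are held in memory.
    static final class OnlineForest<T> extends Forest<T> {
        private final List<T> trees;

        OnlineForest(List<T> trees) { this.trees = trees; }

        @Override int getNumberOfTrees() { return trees.size(); }

        @Override public Iterator<T> iterator() { return trees.iterator(); }
    }

    // Trees are loaded one at a time, only when the iterator reaches them.
    static final class OfflineForest<T> extends Forest<T> {
        private final int ntree;
        private final IntFunction<T> loader; // hypothetical: e.g. reads tree i from disk

        OfflineForest(int ntree, IntFunction<T> loader) {
            this.ntree = ntree;
            this.loader = loader;
        }

        @Override int getNumberOfTrees() { return ntree; }

        @Override public Iterator<T> iterator() {
            return new Iterator<T>() {
                private int index = 0;

                @Override public boolean hasNext() { return index < ntree; }

                @Override public T next() {
                    if (!hasNext()) throw new NoSuchElementException();
                    return loader.apply(index++); // the load happens here, not earlier
                }
            };
        }
    }
}
```

In this sketch, code that only needs to walk the trees can accept a Forest and work with either subclass.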
Fixed some tests that weren't running.
Fixed a bug where training crashed if FactorCovariates had any NA values.
Fixed a bug where FactorCovariates were ignored in splitting if nsplit==0.
Added a covariate-specific option for whether splitting on a variable with NAs should incur a penalty.
The penalty works by first calculating the split score and best split for a covariate
without NAs, as was done before. Then the NAs are randomly assigned, and the split score is
recalculated on that best split. The final score is the lower of the recalculated score and the original.
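
The penalty steps above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the score function (absolute difference in mean response), the fair coin flip used to assign NAs, and every name in the block are hypothetical.

```java
import java.util.Random;

final class NaPenaltySketch {

    // Hypothetical split score: absolute difference between the mean response
    // on each side. Rows with goLeft[i] == null (unassigned NAs) are skipped.
    static double score(double[] y, Boolean[] goLeft) {
        double sumL = 0, sumR = 0;
        int nL = 0, nR = 0;
        for (int i = 0; i < y.length; i++) {
            if (goLeft[i] == null) continue;
            if (goLeft[i]) { sumL += y[i]; nL++; } else { sumR += y[i]; nR++; }
        }
        if (nL == 0 || nR == 0) return 0.0;
        return Math.abs(sumL / nL - sumR / nR);
    }

    // Assigns each row to a side. NaN stands in for NA; with rng == null the
    // NA rows stay unassigned, otherwise they are assigned by a coin flip
    // (assumption: the real assignment probabilities may differ).
    static Boolean[] assign(double[] x, double threshold, Random rng) {
        Boolean[] goLeft = new Boolean[x.length];
        for (int i = 0; i < x.length; i++) {
            if (Double.isNaN(x[i])) {
                goLeft[i] = (rng == null) ? null : rng.nextBoolean();
            } else {
                goLeft[i] = x[i] <= threshold;
            }
        }
        return goLeft;
    }

    // Score at an already chosen best split, penalized for NAs: the lower of
    // the NA-free score and the score after random NA assignment.
    static double penalizedScore(double[] x, double[] y, double threshold, Random rng) {
        double original = score(y, assign(x, threshold, null));
        double withNas = score(y, assign(x, threshold, rng));
        return Math.min(original, withNas);
    }
}
```

Because the result is a minimum, the penalty can only lower a covariate's score, and a covariate with no NAs keeps its original score.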
Specifically, the integration returned NaN if the integration was
*up to* an NaN (real integrals are robust); and the results were negative
if integrating from a to b where a > b.
This project is now purely a library; the code for running directly from the command line will be
put into a new project. This was important because we were pulling large dependencies into the R code
that weren't needed and that created some minor licensing inconveniences.
Apparently older versions of Jackson contain a security vulnerability
(not really important for this project, given that users only ever
run Jackson on their own settings files).
The benefit shows up when we restart a previously parallel task
in which, say, trees 1, 2, and 4 were completed but tree 3
never finished. Under the previous implementation we'd start
at tree 4 (we just counted how many trees were done). Fixing this
would have required some additional effort. Since the order of trees
is irrelevant, it made sense to just stop ordering them.
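
A restart without tree ordering might look roughly like the sketch below. It is an assumption-laden illustration: the method name planNewTrees, the "tree-<uuid>.tree" file naming and the use of random UUIDs are all hypothetical and are not taken from the project.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.UUID;

final class RestartSketch {

    // Given the target forest size and the tree files already on disk, returns
    // file names for the trees that still need to be trained. Names carry no
    // index, so a gap left by an unfinished tree cannot be skipped or overwritten.
    static List<String> planNewTrees(int ntree, Collection<String> completedFiles) {
        int remaining = Math.max(0, ntree - completedFiles.size());
        List<String> names = new ArrayList<>();
        for (int i = 0; i < remaining; i++) {
            names.add("tree-" + UUID.randomUUID() + ".tree"); // hypothetical naming scheme
        }
        return names;
    }
}
```

With three finished trees out of four, this plans exactly one more tree, whichever index the missing one would have had under the old scheme.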
This was done so that when we serialize trees (and thus SplitRules) we don't also serialize ntree copies of the Covariates,
which are really awkward to handle when deserializing.