Description
Is your feature request related to a problem? Please describe.
Honest sampling was first outlined by Breiman et al. (1984) in Classification and Regression Trees, where it was suggested that trees perform better when the feature space is partitioned using one subset of the samples and the posteriors are estimated on a disjoint subset. The idea was revived by Denil et al., who distinguished structure points from estimation points, and was then clarified and implemented by Wager and Athey. They identified several benefits of honest sampling: reduced bias, centered confidence intervals, reduced mean squared error, and the possibility of building causal forests.
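To make the idea concrete, here is a minimal sketch of honest posterior estimation built on a single scikit-learn tree. The dataset, the 50/50 structure/estimation split, and the depth limit are illustrative assumptions, not part of any proposed implementation: the splits are learned on one subset and the leaf posteriors are re-estimated from the disjoint subset.

```python
# Minimal sketch of honest posterior estimation with a single scikit-learn tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Disjoint subsets: one learns the partition, the other estimates the posteriors.
X_struct, X_est, y_struct, y_est = train_test_split(X, y, test_size=0.5, random_state=0)

# The tree structure (splits) is learned only from the structure subset.
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_struct, y_struct)

# Re-estimate the class posteriors of every leaf from the held-out estimation subset.
est_leaves = tree.apply(X_est)
leaf_posteriors = {}
for leaf in np.unique(est_leaves):
    counts = np.bincount(y_est[est_leaves == leaf], minlength=tree.n_classes_)
    leaf_posteriors[leaf] = counts / counts.sum()

def predict_proba_honest(X_new):
    """Route samples through the learned structure, then read the honest leaf posteriors."""
    leaves = tree.apply(X_new)
    # Fall back to the structure-sample posterior when a leaf received no estimation points.
    fallback = lambda l: tree.tree_.value[l, 0] / tree.tree_.value[l, 0].sum()
    return np.array([leaf_posteriors.get(l, fallback(l)) for l in leaves])

print(predict_proba_honest(X[:5]))
```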
From Wager and Athey 2018 (section 2.4: Honest Trees and Forests):
In our discussion so far, we have emphasized the flexible nature of our results: for a wide variety of causal forests that can be tailored to the application area, we achieve both consistency and centered asymptotic normality, provided the subsample size s scales at an appropriate rate. Our results do, however, require the individual trees to satisfy a fairly strong condition, which we call honesty: a tree is honest if, for each training example i, it only uses the response Yi to estimate the within-leaf treatment effect τ using (5) or to decide where to place the splits, but not both.
In addition, Wager and Athey note that subsampling does not "waste" training data:
However, in our case, the forest subsampling mechanism enables us to achieve honesty without wasting any data in this sense, because we rerandomize the I/J-data splits over each subsample. Thus, although no data point can be used for split selection and leaf estimation in a single tree, each data point will participate in both I and J samples of some trees, and so will be used for both specifying the structure and treatment effect estimates of the forest. Although our original motivation for considering double-sample trees was to eliminate bias and thus enable centered confidence intervals, we find that in practice, double-sample trees can improve upon standard random forests in terms of mean-squared error as well.
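As a rough illustration of the double-sample mechanism described in the passage above, the sketch below rerandomizes the I/J split inside every subsample, so each point contributes to tree structure in some trees and to leaf estimates in others. The subsample size, number of trees, and leaf size are illustrative assumptions rather than values from the paper, and a plain regression tree stands in for the causal trees of Wager and Athey.

```python
# Hedged sketch of double-sample trees: each tree draws a subsample, then
# rerandomizes its split into I (structure) and J (estimation) points.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def fit_double_sample_forest(X, y, n_trees=100, s=None, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    s = s if s is not None else n // 2  # subsample size; illustrative default
    forest = []
    for _ in range(n_trees):
        sub = rng.choice(n, size=s, replace=False)
        I, J = sub[: s // 2], sub[s // 2:]  # I/J split rerandomized per tree
        tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X[I], y[I])
        # Honest leaf means, re-estimated from the J points only.
        leaves_J = tree.apply(X[J])
        leaf_means = {leaf: y[J][leaves_J == leaf].mean() for leaf in np.unique(leaves_J)}
        forest.append((tree, leaf_means))
    return forest

def predict_double_sample_forest(forest, X_new):
    preds = []
    for tree, leaf_means in forest:
        leaves = tree.apply(X_new)
        # Fall back to the I-sample leaf mean if a leaf received no J points.
        preds.append([leaf_means.get(l, tree.tree_.value[l, 0, 0]) for l in leaves])
    return np.mean(preds, axis=0)

X, y = make_regression(n_samples=500, random_state=0)
forest = fit_double_sample_forest(X, y, n_trees=50)
print(predict_double_sample_forest(forest, X[:5]))
```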
Describe the solution you’d like
EconML maintains a fork of scikit-learn that implements honest trees and generalized random forests for causal questions. We propose instead to merge honest trees back into scikit-learn, building on EconML's insights for regression trees, extending them to classification trees, and accepting both dense and sparse data.
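For concreteness, here is a purely hypothetical sketch of what the requested interface could look like. The honest_fraction parameter does not exist in scikit-learn (or in EconML) and is only meant to illustrate the shape of the feature: classification as well as regression, and dense as well as sparse input.

```python
# Hypothetical interface sketch only; honest_fraction is not a real parameter.
from scipy.sparse import csr_matrix
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
X_sparse = csr_matrix(X)  # honest trees should accept dense and sparse data alike

# Hypothetical: learn the splits on one fraction of the samples and estimate
# the leaf posteriors on the disjoint remainder.
# clf = DecisionTreeClassifier(honest_fraction=0.5).fit(X_sparse, y)
# proba = clf.predict_proba(X_sparse)
```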
Key references:
Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Wadsworth; 1984.
Denil M, Matheson D, De Freitas N. Narrowing the Gap: Random Forests In Theory and In Practice. Proc 31st Int Conf Mach Learn. 2014; 665–673. Available: http://jmlr.org/proceedings/papers/v32/denil14.html
Wager S, Athey S. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. J Am Stat Assoc. 2018;113: 1228–1242. doi:10.1080/01621459.2017.1319839
Guo R, Mehta R, Arroyo J, Helm H, Shen C, Vogelstein JT. Estimating Information-Theoretic Quantities with Uncertainty Forests. 2019; 1–19. Available: http://arxiv.org/abs/1907.00325