Simple hypotheses testing for the number of trees in a random forest

  • Received : 2010.02.12
  • Accepted : 2010.03.17
  • Published : 2010.03.31

Abstract

In this study, we propose two informal hypothesis tests that may be useful in determining the number of trees in a random forest used for classification. The first test declares a case 'easy' if the hypothesis that the two most popular classes are equally probable is rejected. The second test declares a case 'hard' if the hypothesis that the margin of victory between the probabilities of the two most popular classes is greater than or equal to some small number, say 0.05, is rejected. We propose to continue generating trees until all (or all but a small fraction) of the training cases have been declared easy or hard. The advantage of combining the second test with the first is that far fewer trees are required before stopping than with the first test alone, under which all (or all but a small fraction) of the training cases must be declared easy.
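The two tests can be illustrated with a short sketch. This is not the authors' exact procedure; it is a minimal illustration assuming normal approximations for both tests, where `n1` and `n2` are the vote counts of the two most popular classes among `n` trees, and `alpha` and `delta` (the 0.05 margin mentioned above) are chosen by the user.

```python
import math

def _phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def classify_case(n1, n2, n, alpha=0.05, delta=0.05):
    """Classify a training case as 'easy', 'hard', or 'undecided'
    from the vote counts n1 >= n2 of its two most popular classes
    among n trees. Illustrative sketch only, with hypothetical
    parameter names alpha (test level) and delta (margin).
    """
    # Test 1 ('easy'): reject H0: p1 = p2. Conditional on the
    # n1 + n2 votes split between the two top classes, each vote
    # favours class 1 with probability 1/2 under H0, so apply a
    # normal approximation to this binomial count.
    m = n1 + n2
    if m > 0:
        z1 = (n1 - m / 2) / math.sqrt(m / 4)
        if 2 * (1 - _phi(abs(z1))) < alpha:  # two-sided p-value
            return "easy"
    # Test 2 ('hard'): reject H0: p1 - p2 >= delta in favour of
    # p1 - p2 < delta, using a normal approximation to the
    # estimated margin of victory.
    d_hat = (n1 - n2) / n
    se = math.sqrt((n1 / n + n2 / n - d_hat ** 2) / n)
    if se > 0:
        z2 = (d_hat - delta) / se
        if _phi(z2) < alpha:  # one-sided p-value
            return "hard"
    return "undecided"
```

Under the combined rule, trees are generated until all (or all but a small fraction) of the training cases return 'easy' or 'hard'; a case with a clear winner stops early as 'easy', while a case whose top two classes remain nearly tied is eventually settled as 'hard' rather than forcing more trees.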
