Simple hypotheses testing for the number of trees in a random forest

  • Received : 2010.02.12
  • Accepted : 2010.03.17
  • Published : 2010.03.31

Abstract

In this study, we propose two informal hypothesis tests that may be useful in determining the number of trees in a random forest used for classification. The first test declares a case 'easy' if the hypothesis that the two most popular classes are equally probable is rejected. The second test declares a case 'hard' if the hypothesis that the margin of victory between the probabilities of the two most popular classes is greater than or equal to some small number, say 0.05, is rejected. We propose to continue generating trees until all (or all but a small fraction) of the training cases have been declared easy or hard. The advantage of combining the second test with the first is that far fewer trees are required before stopping than with the first test alone, under which all (or all but a small fraction) of the training cases must be declared easy.
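The two tests can be illustrated with a short sketch. This is not the authors' exact procedure; it is a minimal illustration assuming normal approximations for both tests, where `n1` and `n2` are the vote counts of the two most popular classes among `n` trees, and `alpha` and `delta` (the 0.05 margin mentioned above) are chosen by the user.

```python
import math

def _phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def classify_case(n1, n2, n, alpha=0.05, delta=0.05):
    """Classify a training case as 'easy', 'hard', or 'undecided'
    from the vote counts n1 >= n2 of its two most popular classes
    among n trees. Illustrative sketch only, with hypothetical
    parameter names alpha (test level) and delta (margin).
    """
    # Test 1 ('easy'): reject H0: p1 = p2. Conditional on the
    # n1 + n2 votes split between the two top classes, each vote
    # favours class 1 with probability 1/2 under H0, so apply a
    # normal approximation to this binomial count.
    m = n1 + n2
    if m > 0:
        z1 = (n1 - m / 2) / math.sqrt(m / 4)
        if 2 * (1 - _phi(abs(z1))) < alpha:  # two-sided p-value
            return "easy"
    # Test 2 ('hard'): reject H0: p1 - p2 >= delta in favour of
    # p1 - p2 < delta, using a normal approximation to the
    # estimated margin of victory.
    d_hat = (n1 - n2) / n
    se = math.sqrt((n1 / n + n2 / n - d_hat ** 2) / n)
    if se > 0:
        z2 = (d_hat - delta) / se
        if _phi(z2) < alpha:  # one-sided p-value
            return "hard"
    return "undecided"
```

Under the combined rule, trees are generated until all (or all but a small fraction) of the training cases return 'easy' or 'hard'; a case with a clear winner stops early as 'easy', while a case whose top two classes remain nearly tied is eventually settled as 'hard' rather than forcing more trees.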
