Tree size determination for classification ensemble

  • Received : 2015.11.19
  • Accepted : 2016.01.13
  • Published : 2016.01.31

Abstract

Classification is predictive modeling for a categorical target variable. Classification ensemble methods, which achieve better accuracy by combining multiple classifiers, have become a powerful machine learning and data mining paradigm; well-known examples are boosting, bagging and random forest. In this article we assume that decision trees are used as the base classifiers in the ensemble and hypothesize that tree size affects classification accuracy. To study how tree size influences accuracy, we performed experiments on twenty-eight data sets, comparing the performance of bagging, double-bagging, boosting and random forest with different tree sizes.
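For illustration only, the sketch below shows one way such a tree-size comparison can be set up. It is not the authors' experimental code (the article provides none): it assumes scikit-learn 1.2 or later, a built-in example data set, and hypothetical max_depth / max_leaf_nodes values as stand-ins for "tree size"; double-bagging is omitted because scikit-learn does not implement it.

```python
# Sketch only: varying tree size inside bagging, boosting and random forest.
# Assumes scikit-learn >= 1.2; the depth grid and data set are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for depth in (1, 2, 4, 8):                       # candidate tree sizes
    base = DecisionTreeClassifier(max_depth=depth)
    models = {
        "bagging": BaggingClassifier(estimator=base, n_estimators=100),
        "boosting": AdaBoostClassifier(estimator=base, n_estimators=100),
        # random forest limits tree size via max_leaf_nodes instead
        "random forest": RandomForestClassifier(n_estimators=100,
                                                max_leaf_nodes=2 ** depth),
    }
    for name, model in models.items():
        acc = cross_val_score(model, X, y, cv=5).mean()
        print(f"depth={depth:2d}  {name:13s}  accuracy={acc:.3f}")
```

In such a setup the base-tree depth (or leaf count) is the only factor varied, so accuracy differences across rows can be attributed to tree size rather than to the choice of ensemble method alone.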

Cited by

  1. A simple diagnostic statistic for determining the size of random forest, vol. 27, no. 4, 2016, https://doi.org/10.7465/jkdi.2016.27.4.855
  2. A disagreement measure based on the margin of victory useful for determining the size of random forest, vol. 28, no. 3, 2017, https://doi.org/10.7465/jkdi.2017.28.3.515
  3. A study on the automatic classification of Korean journal articles using random forest, vol. 36, no. 2, 2019, https://doi.org/10.3743/kosim.2019.36.2.057