Tree size determination for classification ensemble

  • Received : 2015.11.19
  • Accepted : 2016.01.13
  • Published : 2016.01.31

Abstract

Classification is predictive modeling for a categorical target variable. Classification ensemble methods, which achieve better accuracy by combining multiple classifiers, have become a powerful machine learning and data mining paradigm; well-known examples are boosting, bagging and random forest. In this article we assume that decision trees are used as the base classifiers in the ensemble and hypothesize that tree size affects classification accuracy. To study how tree size influences accuracy, we performed experiments on twenty-eight data sets, comparing the performance of bagging, double-bagging, boosting and random forest with different tree sizes.
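For illustration only, the sketch below shows one way such a tree-size comparison can be set up. It is not the authors' experimental code (the article provides none): it assumes scikit-learn 1.2 or later, a built-in example data set, and hypothetical max_depth / max_leaf_nodes values as stand-ins for "tree size"; double-bagging is omitted because scikit-learn does not implement it.

```python
# Sketch only: varying tree size inside bagging, boosting and random forest.
# Assumes scikit-learn >= 1.2; the depth grid and data set are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for depth in (1, 2, 4, 8):                       # candidate tree sizes
    base = DecisionTreeClassifier(max_depth=depth)
    models = {
        "bagging": BaggingClassifier(estimator=base, n_estimators=100),
        "boosting": AdaBoostClassifier(estimator=base, n_estimators=100),
        # random forest limits tree size via max_leaf_nodes instead
        "random forest": RandomForestClassifier(n_estimators=100,
                                                max_leaf_nodes=2 ** depth),
    }
    for name, model in models.items():
        acc = cross_val_score(model, X, y, cv=5).mean()
        print(f"depth={depth:2d}  {name:13s}  accuracy={acc:.3f}")
```

In such a setup the base-tree depth (or leaf count) is the only factor varied, so accuracy differences across rows can be attributed to tree size rather than to the choice of ensemble method alone.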

Cited by

  1. A simple diagnostic statistic for determining the size of random forest, vol. 27, no. 4, 2016, https://doi.org/10.7465/jkdi.2016.27.4.855
  2. A disagreement measure based on the margin of victory useful for determining the size of random forest, vol. 28, no. 3, 2017, https://doi.org/10.7465/jkdi.2017.28.3.515
  3. A study on the automatic classification of Korean journal articles using random forest, vol. 36, no. 2, 2019, https://doi.org/10.3743/kosim.2019.36.2.057