A Comparison of Ensemble Methods Combining Resampling Techniques for Class Imbalanced Data

Korean title: A Comparative Study of Classification Models for Imbalanced Data Using Data Preprocessing and Ensemble Methods

  • Lee, Hee-Jae (Department of Applied Statistics, Dankook University) ;
  • Lee, Sungim (Department of Applied Statistics, Dankook University)
  • Received : 2013.08.26
  • Accepted : 2013.12.16
  • Published : 2014.06.30

Abstract

There are many studies on imbalanced data, in which the class distribution is highly skewed. To address this problem, previous studies have relied on resampling techniques that correct the skewness of the class distribution in each sampled subset by under-sampling, over-sampling, or hybrid sampling such as SMOTE. Ensemble methods have also been shown to alleviate the class imbalance problem. In this paper, we compare about a dozen algorithms that combine ensemble methods with resampling techniques, based on simulated data sets generated by the Backbone model, which allows the imbalance rate to be controlled. Results on various real imbalanced data sets are also presented to compare the effectiveness of the algorithms. Based on these results, we highly recommend combining resampling techniques with ensemble methods for imbalanced data in which the proportion of the minority class is less than 10%. We also find that each ensemble method has a well-matched sampling technique: algorithms that combine bagging or random forest with random under-sampling tend to perform well, whereas boosting appears to perform better with over-sampling, and all ensemble methods combined with SMOTE perform well in most situations.
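
(Illustration only, not the authors' code.) The pairing the abstract reports to work well for bagging-type ensembles, random under-sampling within each bootstrap iteration, can be sketched as follows; the function names, the 0/1 label convention with 1 as the minority class, and the use of scikit-learn decision trees are assumptions made for this example.

```python
# Minimal sketch of "under-bagging": each bagging iteration trains a tree on a
# balanced subset obtained by randomly under-sampling the majority class.
# X and y are assumed to be NumPy arrays, with label 1 for the minority class.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def under_bagging_fit(X, y, n_estimators=50, random_state=0):
    """Fit one decision tree per balanced bootstrap subset."""
    rng = np.random.default_rng(random_state)
    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]
    models = []
    for _ in range(n_estimators):
        # Bootstrap the minority class; under-sample the majority class to the same size.
        min_idx = rng.choice(minority, size=minority.size, replace=True)
        maj_idx = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([min_idx, maj_idx])
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def under_bagging_predict(models, X):
    """Majority vote over the ensemble."""
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)
```

Because every tree sees its own balanced subset, the ensemble as a whole still draws on most of the majority class even though each individual tree discards much of it.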

Recently, the problem of class imbalance in the target variable has received much attention in data mining classification. To address this problem, previous studies have applied data preprocessing to the original data. Such preprocessing includes under-sampling, which reduces the majority class to match the proportion of the minority class; over-sampling, which resamples the minority class with replacement to match the proportion of the majority class; and hybrid techniques, which first over-sample the minority class, for example using K-nearest neighbor methods, and then under-sample the majority class. Ensemble methods are also known to improve classification performance on such imbalanced data; therefore, in this paper we compare and evaluate the classification performance of several models that combine data preprocessing with ensemble methods on imbalanced data.
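
(Illustration only.) The SMOTE-style over-sampling mentioned in both abstracts generates synthetic minority examples by interpolating between a minority point and one of its K nearest minority-class neighbours (Chawla et al., 2002). A simplified sketch, with illustrative function and parameter names, is:

```python
# Simplified sketch of SMOTE-style synthetic over-sampling for the minority class.
# X_min is assumed to be a NumPy array containing only minority-class rows.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_synthetic, k=5, random_state=0):
    """Create n_synthetic points by interpolating between minority samples."""
    rng = np.random.default_rng(random_state)
    # k + 1 neighbours because each point is returned as its own nearest neighbour.
    neigh = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, nn_idx = neigh.kneighbors(X_min)
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(len(X_min))            # a random minority sample
        j = nn_idx[i, rng.integers(1, k + 1)]   # one of its k nearest neighbours
        gap = rng.random()                      # interpolation weight in [0, 1)
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic
```

The synthetic rows would then be appended to the minority class before the chosen ensemble method is trained.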


References

  1. Batista, G. E. A. P. A., Prati, R. C. and Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explorations Newsletter, 6, 20-29.
  2. Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees, Wadsworth International Group, Belmont, CA.
  3. Breiman, L. (1996). Bagging predictors, Machine Learning, 24, 123-140.
  4. Breiman, L. (2001). Random forests, Machine Learning, 45, 5-32. https://doi.org/10.1023/A:1010933404324
  5. Chawla, N. V., Bowyer, K. W., Hall, L. O. and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16, 321-357.
  6. Chawla, N. V., Cieslak, D., Hall, L. and Joshi, A. (2008). Automatically countering imbalance and its empirical relationship to cost, Data Mining and Knowledge Discovery, 17, 225-252. https://doi.org/10.1007/s10618-008-0087-0
  7. Chawla, N. V., Japkowicz, N. and Kolcz, A. (2004). Editorial: Special issue on learning from imbalanced data sets, ACM SIGKDD Explorations Newsletter, 6, 1-6.
  8. Chawla, N. V., Lazarevic, A., Hall, L. O. and Bowyer, K. W. (2003). SMOTEBoost: Improving prediction of the minority class in boosting, Seventh European Conference on Principles and Practice of Knowledge Discovery in Databases, 107-119.
  9. Culp, M., Johnson, K. and Michailidis, G. (2006). ada: An R package for stochastic boosting, Journal of Statistical Software, 17, 1-27.
  10. Freitas, A., Costa-Pereira, A. and Brazdil, P. (2007). Cost-sensitive decision trees applied to medical data, Data Warehousing and Knowledge Discovery, 4654, 302-312.
  11. Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences, 55, 119-139. https://doi.org/10.1006/jcss.1997.1504
  12. Galar, M., Fernandez, A., Barrenechea, E., Bustince, H. and Herrera, F. (2012). A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42, 463-484.
  13. Hido, S., Kashima, H. and Takahashi, Y. (2009). Roughly balanced bagging for imbalanced data, Statistical Analysis and Data Mining, 2, 412-426. https://doi.org/10.1002/sam.10061
  14. Huang, J. and Ling, C. (2005). Using AUC and accuracy in evaluating learning algorithms, IEEE Transactions on Knowledge and Data Engineering, 17, 299-310.
  15. Hur, J. and Kim, J. (2007). Decision tree induction with imbalanced data set: A case of health insurance bill audit in a general hospital, Information Systems Review, 9, 45-65.
  16. Kang, P. and Cho, S. (2006). EUS SVMs: Ensemble of under-sampled SVMs for data imbalance problems, Lecture Notes in Computer Science, 4232, 837-846.
  17. Khreich, W., Granger, E., Miri, A. and Sabourin, R. (2010). Iterative Boolean combination of classifiers in the ROC space: An application to anomaly detection with HMMs, Pattern Recognition, 43, 2732-2752. https://doi.org/10.1016/j.patcog.2010.03.006
  18. Kim, J. and Jeong, J. (2004). Classification of class-imbalanced data: Effect of over-sampling and undersampling of training data, The Korean Journal of Applied Statistics, 17, 445-457. https://doi.org/10.5351/KJAS.2004.17.3.445
  19. Kim, J. and Park, H. (2012). Imbalanced data analysis using sampling methods, Inha University.
  20. Kubat, M., Holte, R. C. and Matwin, S. (1998). Machine learning for the detection of oil spills in satellite radar images, Machine Learning, 30, 195-215. https://doi.org/10.1023/A:1007452223027
  21. Kubat, M. and Matwin, S. (1997). Addressing the curse of imbalanced data sets: One-sided sampling, Proceedings of the Fourteenth International Conference on Machine Learning, 179-186.
  22. Ling, C. X. and Li, C. (1998). Data mining for direct marketing: Problems and solutions, Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98).
  23. Liu, X., Wu, J. and Zhou, Z. (2009). Exploratory undersampling for class-imbalance learning, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39, 539-550.
  24. Mazurowski, M. A., Habas, P. A., Zurada, J. M., Lo, J. Y., Baker, J. A. and Tourassi, G. D. (2008). Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance, Neural Networks, 21, 427-436. https://doi.org/10.1016/j.neunet.2007.12.031
  25. Japkowicz, N. and Stephen, S. (2002). The class imbalance problem: A systematic study, Intelligent Data Analysis, 6, 429-449.
  26. Newman, D., Hettich, S., Blake, C. and Merz, C. (1998). UCI repository of machine learning databases, University of California, Irvine, Department of Information and Computer Science.
  27. Oh, J. and Zhang, B. (2001). Kernel perceptron boosting for effective learning of imbalanced data, Proceedings of the Korean Information Science Society Conference, 304-306.
  28. Sayyad, S. J. and Menzies, T. J. (2005). The promise repository of software engineering databases, http://promise.site.uottawa.ca/SERepository.
  29. Seiffert, C., Khoshgoftaar, T., Van Hulse, J. and Napolitano, A. (2010). RUSBoost: A hybrid approach to alleviating class imbalance, IEEE Transactions on Systems, Man, and Cybernetics, Part A (Systems and Humans), 40, 185-197.
  30. Wu, G. and Chang, E. (2005). KBA: Kernel boundary alignment considering imbalanced data distribution, IEEE Transactions on Knowledge and Data Engineering, 17, 786-795.
  31. Yang, Q. and Wu, X. (2006). 10 challenging problems in data mining research, International Journal of Information Technology and Decision Making, 5, 597-604. https://doi.org/10.1142/S0219622006002258
  32. Zadrozny, B. and Elkan, C. (2001). Learning and making decisions when costs and probabilities are both unknown, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), 204-213.
  33. Zhu, Z. and Song, Z. (2010). Fault diagnosis based on imbalance modified kernel fisher discriminant analysis, Chemical Engineering Research and Design, 88, 936-951. https://doi.org/10.1016/j.cherd.2010.01.005

Cited by

  1. Weighted L1-Norm Support Vector Machine for the Classification of Highly Imbalanced Data vol.28, pp.1, 2015, https://doi.org/10.5351/KJAS.2015.28.1.009