Boosting Algorithms for Large-Scale Data and Data Batch Stream


  • Received : 2009.12
  • Accepted : 2010.01
  • Published : 2010.02.28

Abstract

In this paper, we propose boosting algorithms for data that are very large or arrive sequentially in batches over time. In such situations, the ordinary boosting algorithm may be inappropriate because it requires the entire training set to be available at once. To handle large-scale data and data batch streams, we modify AdaBoost and Arc-x4. The modified algorithms perform well on both large-scale data and data batch streams, with or without concept drift, in experiments on simulated and real data sets.

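The abstract describes modifying AdaBoost and Arc-x4 so that weak learners can be fit batch by batch rather than on the whole training set at once. The paper's exact modification is not reproduced here; the following is a minimal Python sketch of that general idea only, assuming +1/-1 labels, hand-rolled decision stumps as weak learners, and AdaBoost-style exponential reweighting applied within each incoming batch. All names (`BatchStreamBoost`, `train_stump`) are illustrative, not from the paper.

```python
# Hypothetical sketch of batch-stream boosting (NOT the paper's algorithm):
# each incoming batch trains one weak learner, which joins a weighted ensemble.
import math

def train_stump(X, y, w):
    """Fit a weighted decision stump (threshold test on one feature)."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] >= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best  # (weighted error, feature index, threshold, sign)

def stump_predict(stump, x):
    _, j, thr, sign = stump
    return sign if x[j] >= thr else -sign

class BatchStreamBoost:
    """Ensemble grown one weak learner per data batch; labels are +1/-1."""

    def __init__(self):
        self.learners = []  # list of (alpha, stump) pairs

    def update(self, X, y):
        # Weight each point of the new batch by how badly the current
        # ensemble misclassifies it (AdaBoost's exponential weights),
        # so only the batch in hand is ever needed in memory.
        w = [math.exp(-yi * self.decision(xi)) for xi, yi in zip(X, y)]
        s = sum(w)
        w = [wi / s for wi in w] if s > 0 else [1.0 / len(X)] * len(X)
        err, j, thr, sign = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # AdaBoost learner weight
        self.learners.append((alpha, (err, j, thr, sign)))

    def decision(self, x):
        return sum(a * stump_predict(st, x) for a, st in self.learners)

    def predict(self, x):
        return 1 if self.decision(x) >= 0 else -1
```

Each call to `update` consumes one batch and adds one stump, so the full training set never has to be held at once; the `1e-10` clamp and the uniform-weight fallback are ad hoc safeguards for this sketch, not part of the published method.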


References

  1. Asuncion, A. and Newman, D. J. (2007). UCI Machine Learning Repository [http://www.ics.uci.edu/mlearn/MLRepository.html]. Irvine, CA: University of California, School of Information and Computer Science.
  2. Breiman, L. (1998). Arcing classifiers (with discussion), Annals of Statistics, 26, 801-849. https://doi.org/10.1214/aos/1024691079
  3. Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees, Chapman & Hall, New York.
  4. Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences, 55, 119-139. https://doi.org/10.1006/jcss.1997.1504
  5. Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning, Springer-Verlag, New York.
  6. Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 202-207.
  7. Kuncheva, L. I. (2004). Classifier ensembles for changing environments, Proceedings of the 5th International Workshop on Multiple Classifier Systems, 1-15.
  8. Quinlan, J. R. (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.
  9. Rudin, C., Daubechies, I. and Schapire, R. E. (2004). The dynamics of AdaBoost: cyclic behavior and convergence of margins, Journal of Machine Learning Research, 5, 1557-1595.
  10. Street, W. N. and Kim, Y. S. (2001). A streaming ensemble algorithm (SEA) for large scale classification, Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 377-382.
  11. Wang, H., Fan, W., Yu, P. S. and Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers, Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 226-235.
  12. Yeon, K., Choi, H., Yoon, Y. J. and Song, M. S. (2005). Model based ensemble learning for tracking concept drift, Proceedings of 55th Session of the International Statistical Institute.

Cited by

  1. Classification of large-scale data and data batch stream with forward stagewise algorithm vol.25, pp.6, 2014, https://doi.org/10.7465/jkdi.2014.25.6.1283