A Data Extension Technique for Handling Incomplete Data

  • Lee, Jong Chan (이종찬) (Dept. of Computer Engineering, Chungwoon University)
  • Received : 2020.12.01
  • Accepted : 2021.02.20
  • Published : 2021.02.28

Abstract

This paper introduces an algorithm for incomplete training data, that is, training data containing missing values. The data is first converted into a format that can represent probabilities, and the missing values are then compensated for. In the earlier method based on this data conversion, incomplete data was handled by assigning equal probability to every value that the missing variable can take. That method was applied to many problems with good results, but it was criticized for losing information, because it ignores all information remaining about the missing variable and assigns a new value. In contrast, the proposed method feeds only the complete information, which contains no missing values, into the well-known classification algorithm C4.5, and a decision tree is built during learning. The probability of each missing value is then obtained from this decision tree and assigned as the estimate for the missing variable. In other words, the lost portion of the information is recovered using the large amount of information in the incomplete training data that was not lost.
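
The contrast between the two approaches can be illustrated with a minimal Python sketch. It is not the paper's implementation, and everything in it is an assumption made for illustration: the toy weather records, the helper names (`build_tree`, `tree_expression`, `extend_record`), the use of a small information-gain tree as a simplified stand-in for C4.5 (no gain ratio, no pruning), and the choice to treat the missing attribute as the tree's target with the record's known attributes as inputs. The abstract does not specify how the probability is read from the tree, so the leaf class frequencies used here are only one plausible reading.

```python
import math
from collections import Counter

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def build_tree(rows, target, features):
    """Grow an information-gain tree (simplified stand-in for C4.5)."""
    dist = Counter(r[target] for r in rows)
    if len(dist) == 1 or not features:
        return {"dist": dist}

    def gain(f):
        groups = Counter(r[f] for r in rows)
        rem = sum(n / len(rows) * entropy([r for r in rows if r[f] == v], target)
                  for v, n in groups.items())
        return entropy(rows, target) - rem

    best = max(features, key=gain)
    if gain(best) < 1e-12:          # no informative split left
        return {"dist": dist}
    rest = [f for f in features if f != best]
    children = {v: build_tree([r for r in rows if r[best] == v], target, rest)
                for v in {r[best] for r in rows}}
    return {"dist": dist, "split": best, "children": children}

def leaf_distribution(tree, record):
    """Follow the record's known values down the tree; return the counts found there."""
    while "split" in tree:
        child = tree["children"].get(record.get(tree["split"]))
        if child is None:           # value never seen in training: stop here
            break
        tree = child
    return tree["dist"]

def uniform_expression(domain):
    """Earlier method: every possible value gets the same probability."""
    return {v: 1 / len(domain) for v in domain}

def tree_expression(record, attr, complete_rows, domain):
    """Sketch of the proposed idea: estimate the missing attribute from a tree
    trained only on complete records (assumed reading of the abstract)."""
    features = [f for f in record if f != attr and record[f] is not None]
    dist = leaf_distribution(build_tree(complete_rows, attr, features), record)
    total = sum(dist.values())
    return {v: dist.get(v, 0) / total for v in domain}

def extend_record(record, domains, complete_rows=None):
    """Probability format: one-hot for known values, a distribution for missing ones."""
    out = {}
    for attr, domain in domains.items():
        value = record[attr]
        if value is not None:
            out[attr] = {v: float(v == value) for v in domain}
        elif complete_rows is None:
            out[attr] = uniform_expression(domain)
        else:
            out[attr] = tree_expression(record, attr, complete_rows, domain)
    return out

# Hypothetical toy data, not taken from the paper.
DOMAINS = {"outlook": ["sunny", "rain"], "windy": ["no", "yes"], "play": ["yes", "no"]}
COMPLETE = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rain",  "windy": "yes", "play": "no"},
    {"outlook": "rain",  "windy": "no",  "play": "no"},
    {"outlook": "rain",  "windy": "yes", "play": "yes"},
]
INCOMPLETE = {"outlook": None, "windy": "no", "play": "no"}

print(extend_record(INCOMPLETE, DOMAINS)["outlook"])            # uniform assignment
print(extend_record(INCOMPLETE, DOMAINS, COMPLETE)["outlook"])  # tree-based estimate
```

In this toy data every complete record with `play = "no"` has `outlook = "rain"`, so the tree-based estimate concentrates its probability on `"rain"`, while the uniform assignment gives both values the same weight regardless of the other attributes.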
