
Predictive Optimization Adjusted With Pseudo Data From A Missing Data Imputation Technique


  • Received : 2018.10.12
  • Accepted : 2019.02.01
  • Published : 2019.02.28

Abstract

When forecasting future values, a model estimated by minimizing the training error can yield test errors higher than the training error. This is the over-fitting problem, which arises from increased model complexity when the model focuses only on the given dataset. Some regularization and resampling methods have been introduced to alleviate this problem and reduce test errors, but they are designed to work only with the given dataset. In this paper, we propose a new optimization approach that reduces test errors by transforming the test error minimization problem into a training error minimization problem. To carry out this transformation, we needed data in addition to the given dataset, termed pseudo data. To make proper use of the pseudo data, we used three types of missing data imputation techniques. As the optimization tool, we chose the least squares method and combined it with an extra pseudo data instance. Furthermore, we present numerical results supporting the proposed approach, which yielded lower test errors than the ordinary least squares method.

When forecasting future values, a model estimated by minimizing the training error can still produce large test errors. This is the over-fitting problem caused by increased model complexity when the estimated model concentrates only on the given dataset. Some regularization and resampling methods have been introduced to alleviate this problem and reduce test errors, but these methods are also designed to stay within the given dataset. In this paper, we propose a new optimization method that reduces test errors by transforming the test error minimization problem into a training error minimization problem. To carry out this transformation, we added new data, called pseudo data, to the given dataset. To construct suitable pseudo data, we used three types of missing data imputation techniques. As prediction models, we used linear regression, autoregressive, and ridge regression models, and applied the pseudo data method to each of them. We also present case studies in which the optimization method adjusted with pseudo data is applied to environmental and financial data. The results show that the method presented in this paper reduces test errors compared with the original prediction models.
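As a rough, self-contained illustration of the adjustment described above, the sketch below appends a single pseudo data instance to the training set and refits ordinary least squares. Column-mean imputation is used here only as a plausible stand-in, since the abstract does not name the three imputation techniques, and all data and variable names are synthetic and illustrative rather than taken from the paper.

```python
import numpy as np

# Minimal sketch: augment the training set with one pseudo data instance and
# refit ordinary least squares. Column-mean imputation is an illustrative
# stand-in for the paper's (unspecified) missing data imputation techniques.

rng = np.random.default_rng(0)

# Synthetic linear data: y = 2*x1 - 1*x2 + noise
n = 60
X = rng.normal(size=(n, 2))
y = X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=n)

X_train, y_train = X[:40], y[:40]
X_test, y_test = X[40:], y[40:]

def ols_fit(X, y):
    """Least squares fit with an intercept column."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def ols_predict(beta, X):
    A = np.column_stack([np.ones(len(X)), X])
    return A @ beta

def test_mse(beta):
    resid = y_test - ols_predict(beta, X_test)
    return float(np.mean(resid ** 2))

# Baseline: ordinary least squares on the given training data only.
beta_ols = ols_fit(X_train, y_train)

# Pseudo data instance: treat the next (unseen) observation as missing and
# impute both its inputs and its response with training-set column means.
x_pseudo = X_train.mean(axis=0, keepdims=True)   # imputed input
y_pseudo = np.array([y_train.mean()])            # imputed response

X_aug = np.vstack([X_train, x_pseudo])
y_aug = np.concatenate([y_train, y_pseudo])

# Adjusted fit: training error minimization on the augmented (pseudo) dataset.
beta_pseudo = ols_fit(X_aug, y_aug)

print("test MSE, OLS only     :", test_mse(beta_ols))
print("test MSE, OLS + pseudo :", test_mse(beta_pseudo))
```

A mean-imputed instance merely nudges the fit toward the training-set centroid; the point of the proposed approach is that choosing the pseudo instance through a missing data imputation technique turns the test error minimization problem into a training error minimization problem on the augmented dataset.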


Fig. 1. (a) Missing data imputation and (b) the proposed approach with pseudo data.

Table 1. Comparison of imputation methods

Table 2. Comparison of prediction models

Table 3. Test Error Comparison: Model 1

Table 4. Test Error Comparison: Model 2

Table 5. Test Error Comparison: Model 3

Table 6. Test Error Comparison: AR(1)

Table 7. Test Error Comparison: AR(2)

Table 8. Test Error Comparison: AR(3)

Table 9. Test Error Comparison: Model 1

Table 10. Test Error Comparison: Model 2

Table 11. Test Error Comparison: Model 3

Table 12. Results of the Test Error Comparison: CO2 (ppm)

Table 13. Results of the Test Error Comparison: DJIA
