Analysis of Missing Data Using an Empirical Bayesian Method

경험적 베이지안 방법을 이용한 결측자료 연구

  • Yoon, Yong Hwa (Department of Computer Science and Statistics, Daegu University)
  • Choi, Boseung (Department of Computer Science and Statistics, Daegu University)
  • Received : 2014.08.14
  • Accepted : 2014.10.20
  • Published : 2014.12.31

Abstract

Proper imputation of missing values is an important step toward accurate analysis of survey data. This paper deals with a model-based imputation method for missing data and with the estimation of the underlying model. In particular, we apply a Bayesian method to resolve the boundary solution problem that can arise under maximum likelihood estimation. We also address the selection of a missing-data mechanism model by forecasting from the fitted models and comparing prediction accuracy across mechanisms. Prediction accuracy is measured with the MWPE (modified within precinct error) proposed by Bautista et al. (2007). We applied the proposed ML and Bayesian methods to exit poll data collected on the day of the 18th Korean presidential election in 2012. In this analysis, predictions under the missing at random mechanism were more accurate than those under the missing not at random mechanism.
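As a rough illustration of model-based imputation by maximum likelihood, the sketch below runs an EM algorithm on a small two-way table in which the outcome Y is missing for some respondents. It is not the authors' model: it assumes the simpler missing at random case (missingness may depend on the observed X only), and the function name `em_mar` and all counts are hypothetical, chosen only to make the example runnable.

```python
# Minimal EM sketch for a two-way table where Y is missing for some respondents.
# Assumption: missing at random (MAR), i.e. missingness may depend on X only.
# All names and counts here are hypothetical, for illustration only.

def em_mar(full, x_only, n_iter=200):
    """Estimate cell probabilities P(X=i, Y=j).

    full[i][j]: respondents with X=i and Y=j both observed.
    x_only[i]:  respondents with X=i observed and Y missing.
    """
    rows, cols = len(full), len(full[0])
    total = sum(map(sum, full)) + sum(x_only)
    # Start from a uniform table of cell probabilities.
    p = [[1.0 / (rows * cols)] * cols for _ in range(rows)]
    for _ in range(n_iter):
        new_p = []
        for i in range(rows):
            row_p = sum(p[i])
            # E-step: split the X-only count across Y in proportion to P(Y=j | X=i).
            # M-step: divide the completed counts by the grand total.
            new_p.append([(full[i][j] + x_only[i] * p[i][j] / row_p) / total
                          for j in range(cols)])
        p = new_p
    return p

# Hypothetical counts (not the exit poll data).
full = [[30, 10], [20, 40]]
x_only = [20, 10]
p_hat = em_mar(full, x_only)
```

Under this MAR assumption the estimate of P(Y | X) converges to the complete-case conditional proportions, while P(X) uses all respondents. Under a missing not at random mechanism the likelihood can instead be maximized on the boundary of the parameter space, which is the situation the Bayesian method in the abstract is meant to address.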

References

  1. Agresti, A. (2002). Categorical Data Analysis, Second edition, John Wiley & Sons Inc., New Jersey.
  2. Baker, S. G. and Laird, N. M. (1988). Regression analysis for categorical variables with outcome subject to nonignorable nonresponse, Journal of the American Statistical Association, 83, 62-69. https://doi.org/10.1080/01621459.1988.10478565
  3. Baker, S. G., Rosenberger, W. F. and DerSimonian, R. (1992). Closed-form estimates for missing counts in two-way contingency tables, Statistics in Medicine, 11, 643-657. https://doi.org/10.1002/sim.4780110509
  4. Bautista, R., Callegaro, M., Vera, J. A. and Abundis, F. (2007). Studying nonresponse in Mexican exit polls, International Journal of Public Opinion Research, 19, 492-503. https://doi.org/10.1093/ijpor/edm013
  5. Chib, S. (1995). Marginal likelihood from the Gibbs output, Journal of the American Statistical Association, 90, 1313-1321. https://doi.org/10.1080/01621459.1995.10476635
  6. Chib, S. and Jeliazkov, I. (2001). Marginal likelihood from the Metropolis-Hastings output, Journal of the American Statistical Association, 96, 270-281. https://doi.org/10.1198/016214501750332848
  7. Choi, B., Choi, J. W. and Park, Y. S. (2009). Bayesian methods for an incomplete two-way contingency table with application to the Ohio (Buckeye) State Polls, Survey Methodology, 35, 37-51.
  8. Choi, B. and Kim, G. M. (2012). A model selection method using EM algorithm for missing data, Journal of the Korean Data Analysis Society, 14, 767-779.
  9. Choi, B., Kim, D. Y., Kim, K. W. and Park, Y. S. (2008). Nonignorable nonresponse imputation and rotation group bias estimation on the rotation sample survey, The Korean Journal of Applied Statistics, 21, 361-375. https://doi.org/10.5351/KJAS.2008.21.3.361
  10. Choi, B., Park, Y. S. and Lee, D. H. (2007). Election forecasting using pre-election survey data with nonignorable nonresponse, Journal of the Korean Data Analysis Society, 9, 2321-2333.
  11. Clarke, P. S. (2002). On boundary solutions and identifiability in categorical regression with non-ignorable non-response, Biometrical Journal, 44, 701-717. https://doi.org/10.1002/1521-4036(200209)44:6<701::AID-BIMJ701>3.0.CO;2-1
  12. Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B, 39, 1-38.
  13. Forster, J. J. and Smith, P. W. (1998). Model-based inference for categorical survey data subject to nonignorable non-response, Journal of the Royal Statistical Society, Series B, 60, 57-70. https://doi.org/10.1111/1467-9868.00108
  14. Green, P. E. and Park, T. (2003). A Bayesian hierarchical model for categorical data with nonignorable nonresponse, Biometrics, 59, 886-896. https://doi.org/10.1111/j.0006-341X.2003.00103.x
  15. Ibrahim, J. G., Zhu, H. and Tang, N. (2008). Model selection criteria for missing-data problems using the EM algorithm, Journal of the American Statistical Association, 103, 1648-1658. https://doi.org/10.1198/016214508000001057
  16. Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data, Second edition, Wiley, New York.
  17. Kim, S. Y. and Kwon, S. P. (2009). The effect of survey refusal and noncontact on nonresponse error: For economically active population survey, The Korean Journal of Applied Statistics, 22, 667-676. https://doi.org/10.5351/KJAS.2009.22.3.667
  18. Kim, Y. W. and Nam, S. J. (2009). Forming weighting adjustment cells for unit-nonresponse in sample surveys, Communications for Statistical Applications and Methods, 16, 103-113. https://doi.org/10.5351/CKSS.2009.16.1.103
  19. Kwak, J. and Choi, B. (2014). A comparison study for accuracy of exit poll based on nonresponse model, Journal of the Korean Data & Information Science Society, 25, 53-64. https://doi.org/10.7465/jkdi.2014.25.1.53
  20. Pak, G. D. and Shin, K. I. (2010). Non-response imputation for panel data, Communications for Statistical Applications and Methods, 17, 899-907. https://doi.org/10.5351/CKSS.2010.17.6.899
  21. Park, J. S., Kang, C. and Kim, K. K. (2013). A simulation study of imputation methods for transportation corporation's survey data, Journal of the Korean Data Analysis Society, 15, 1903-1912.
  22. Park, T. and Brown, M. B. (1994). Models for categorical data with nonignorable nonresponse, Journal of the American Statistical Association, 89, 44-52. https://doi.org/10.1080/01621459.1994.10476444
  23. Park, T. (1998). An approach to categorical data with nonignorable nonresponse, Biometrics, 54, 1579-1590. https://doi.org/10.2307/2533682
  24. Park, T. S. and Lee, S. Y. (1998). Analysis of categorical data with nonresponses, The Korean Journal of Applied Statistics, 11, 83-95.
  25. Park, Y. S., Kim, K. H. and Choi, B. (2013). Dynamic Bayesian analysis for irregularly and incompletely observed contingency tables, Journal of the Korean Statistical Society, 42, 277-289. https://doi.org/10.1016/j.jkss.2012.08.008
  26. Park, Y. S. and Choi, B. (2010). Bayesian analysis for incomplete multi-way contingency tables with nonignorable nonresponse, Journal of Applied Statistics, 37, 1439-1453. https://doi.org/10.1080/02664760903046078
  27. Rubin, D. B., Stern, H. S. and Vehovar, V. (1995). Handling "Don't know" survey responses: The case of the Slovenian Plebiscite, Journal of the American Statistical Association, 90, 822-828.
  28. Song, J. (2011). Selection of variables to form imputation classes in Hotdeck imputation, Journal of the Korean Data Analysis Society, 13, 1321-1329.
  29. Song, J. (2014). A comparison of imputation methods for multiple response questions, Journal of the Korean Data Analysis Society, 16, 691-701.
  30. Yoon, Y. H. and Choi, B. (2012). Model selection method for categorical data with non-response, Journal of the Korean Data & Information Science Society, 23, 627-641. https://doi.org/10.7465/jkdi.2012.23.4.627

Cited by

  1. An estimation method for non-response model using Monte-Carlo expectation-maximization algorithm, vol. 27, no. 3, 2016. https://doi.org/10.7465/jkdi.2016.27.3.587