Big data and statistics

  • Received : 2013.06.30
  • Accepted : 2013.08.08
  • Published : 2013.09.30

Abstract

We examine the roles of statistics and statisticians in the big data era. We review the definition and application areas of big data, and discuss the statistical characteristics of big data and their implications. We illustrate statistical methodologies that are useful for big data analysis, and describe two real big data projects, one from abroad and one from Korea.
