Analysis of big data using Rhipe

  • Ko, Youngjun (Digital Convergence Center, Jeju Technopark) ;
  • Kim, Jinseog (Department of Statistics and Information Science, Dongguk University)
  • Received : 2013.06.26
  • Accepted : 2013.08.03
  • Published : 2013.09.30

Abstract

Hadoop was developed by the Apache Software Foundation based on Google's GFS and MapReduce technologies. Many modern systems for managing and processing big data have been built on Hadoop because it was designed for scalability and distributed computing. R has been considered a well-suited analytic tool for Hadoop-based systems because it interoperates easily with other languages and has many libraries for complex analyses. We introduced Rhipe, an R package that makes MapReduce programming under Hadoop easy, and implemented a MapReduce program for multiple regression using Rhipe. In addition, we compared the computing speed of our program with that of two other packages for processing large data (ff and bigmemory). The simulation results showed that our program was faster than ff and bigmemory as the size of the data increased.

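Hadoop has recently come to be regarded as the standard system for storing, processing, and analyzing big data, and many big data systems are being built on it. R, in turn, is regarded as a common analytics platform for big data in the Hadoop environment because it integrates easily with other software and ships with a wide range of analysis libraries. In this paper, we introduce Rhipe, an R package for distributed data processing in the Hadoop environment, and illustrate how to write a MapReduce program for parallel multiple regression on big data. We also compared its computing speed, by simulation, with that of ff and bigmemory, existing R packages for handling large data, and confirmed that the MapReduce program written with Rhipe outperforms ff and bigmemory in computing speed as the data size grows.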
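The abstract does not spell out how multiple regression is split into map and reduce steps. One common decomposition, shown here as a minimal sketch and not as the authors' Rhipe program, is to have each map task compute the sufficient statistics X'X and X'y for its block of rows, have the reduce step add them up, and then solve the normal equations once. The sketch is plain Python with no Hadoop or Rhipe calls; the function names `map_chunk`, `reduce_stats`, `solve`, and `fit` are hypothetical and chosen only for illustration.

```python
from functools import reduce


def map_chunk(rows):
    """Map step (illustrative): return (X'X, X'y) for one block of (x, y) rows."""
    p = len(rows[0][0]) + 1  # +1 for the intercept column
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for x, y in rows:
        xi = [1.0] + list(x)  # prepend the intercept term
        for a in range(p):
            xty[a] += xi[a] * y
            for b in range(p):
                xtx[a][b] += xi[a] * xi[b]
    return xtx, xty


def reduce_stats(s1, s2):
    """Reduce step (illustrative): element-wise sum of two (X'X, X'y) pairs."""
    xtx = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(s1[0], s2[0])]
    xty = [a + b for a, b in zip(s1[1], s2[1])]
    return xtx, xty


def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta


def fit(chunks):
    """Combine per-chunk statistics, then solve the normal equations once."""
    xtx, xty = reduce(reduce_stats, map(map_chunk, chunks))
    return solve(xtx, xty)


# Toy data generated from y = 1 + 2*x1 - 3*x2, split into two blocks
# that stand in for two distributed data partitions.
chunks = [
    [((0, 0), 1), ((1, 0), 3), ((0, 1), -2)],
    [((1, 1), 0), ((2, 1), 2), ((1, 2), -3)],
]
beta = fit(chunks)  # intercept and slopes, recovered up to rounding error
```

Because X'X and X'y are plain sums over rows, the blocks can be processed in any order and on any number of workers, and only a (p+1) x (p+1) matrix and a vector of length p+1 need to be sent to the reducer regardless of how many rows each block holds.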
