A Big Data Preprocessing using Statistical Text Mining

  • Jun, Sunghae (Department of Statistics, Cheongju University)
  • Received : 2015.08.28
  • Accepted : 2015.09.19
  • Published : 2015.10.25

Abstract

Big data is used in diverse areas. Computer science and sociology, for example, approach big data differently, but both analyze it and put the results of the analysis to use. Meaningful analysis and application of big data are therefore needed in most areas. Statistics and machine learning provide various methods for big data analysis. In this paper, we study the process of big data analysis and propose an efficient methodology covering the entire process, from collecting big data to applying the analysis results. Because patent documents have the characteristics of big data, we also propose an approach that applies big data analysis to patent data and uses the results to build R&D strategy. To illustrate how the proposed methodology can be used on a real problem, we perform a case study using the applied and registered patent documents of a real company, retrieved from patent databases around the world.

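As a rough illustration of the kind of text-mining preprocessing the abstract refers to, the sketch below turns raw patent text into a document-term matrix, a structured form that statistical methods can analyze. It is a minimal sketch under stated assumptions, not the paper's procedure: the stop-word list, the function names `tokenize` and `document_term_matrix`, and the two sample snippets are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "is", "on", "with"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def document_term_matrix(documents):
    """Return (sorted vocabulary, one row of term counts per document)."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Made-up patent-like snippets, not data from the paper.
patents = [
    "A method for printing an image on a printer.",
    "A printer with an ink cartridge and a method of printing.",
]
vocab, dtm = document_term_matrix(patents)
```

Each row of `dtm` corresponds to one document and each column to one term in `vocab`, so the matrix can be passed to clustering, regression, or other statistical analysis.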
