Mapping Categories of Heterogeneous Sources Using Text Analytics

Korean title (translated): A Methodology for Multi-Mapping Categories across Heterogeneous Media through Text Analysis

  • Kim, Dasom (Graduate School of Business IT, Kookmin University) ;
  • Kim, Namgyu (School of Management Information Systems, Kookmin University)
  • 김다솜 (국민대학교 비즈니스IT전문대학원) ;
  • 김남규 (국민대학교 비즈니스IT전문대학원)
  • Received : 2016.08.17
  • Accepted : 2016.12.28
  • Published : 2016.12.31

Abstract

In recent years, the proliferation of social networking services has led users to use several media at once, depending on their purposes and tastes. Likewise, when collecting information on a particular theme, they typically draw on a mix of media such as social networking services, Internet news, and blogs. However, documents circulated through different media are placed in different categories according to each source's policies and standards, which hinders any attempt to study a specific category across different kinds of sources. For example, documents about applications for overseas travel may be classified under "Information Technology," "Travel," or "Life and Culture," depending on each source's own standards. Moreover, because sources differ in how they define categories and how finely they divide them, similar categories can be named and structured differently from source to source. To overcome these limitations, this study proposes a method for mapping categories between heterogeneous sources while leaving each medium's existing category system intact. Specifically, by reclassifying each document from the viewpoint of every source and storing the results as extra attributes, the study provides a logical layer through which users can search documents from multiple heterogeneous sources with different category names as if they belonged to a single source. In addition, experiments on 6,000 news articles collected from two Internet news portals compared classification accuracy across sources, between supervised and semi-supervised learning, and between homogeneous and heterogeneous training data.
Notably, in some categories the classification accuracy of semi-supervised learning with heterogeneous training data was higher than that of both supervised learning and semi-supervised learning with homogeneous training data.

This study makes the following contributions. First, it proposes a logical scheme for integrating and managing heterogeneous media with different classification systems while leaving each existing physical classification system intact. Its results, in particular, show that classification accuracy varies considerably with the heterogeneity of the training data, which is expected to spur further work on improving the proposed methodology by analyzing the characteristics of each category. In addition, as demand grows for searching, collecting, and analyzing documents from diverse media, Internet search is no longer confined to a single medium. Yet because each medium has its own category structure and names, searching for a specific category across heterogeneous media is difficult in practice. The proposed methodology is also significant in that, once a user selects a site, all documents can be queried according to that site's category scheme, while each existing site's characteristics and structure are preserved.

The proposed methodology needs to be complemented in the following respects. First, its performance was evaluated only through indirect comparison, so future studies should test its accuracy more directly. That is, after the documents of a target source are reclassified under the category system of an existing source, the accuracy of that classification should be verified through evaluation by actual users.
In addition, classification accuracy needs to be improved by refining the methodology. Furthermore, understanding the characteristics of the categories in which heterogeneous semi-supervised learning achieved higher accuracy than supervised learning may help in acquiring heterogeneous documents from diverse media and in devising ways to use them to improve document classification accuracy.
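
The logical layer described in the abstract can be illustrated with a minimal sketch. It assumes a small multinomial naive Bayes classifier as a stand-in for whatever classifier the study actually used, and the portal names (`portal_a`, `portal_b`), the "Digital" category, the toy documents, and the helper names `build_label_layer` and `search` are all hypothetical. Each source keeps its own category system; every document simply receives one extra label per source.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(docs, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def build_label_layer(corpora):
    """corpora maps source -> [(text, native_category), ...].

    Returns one record per document holding a category under every
    source's scheme; the native category is kept unchanged.
    """
    models = {
        src: NaiveBayes().fit([t for t, _ in docs], [c for _, c in docs])
        for src, docs in corpora.items()
    }
    layer = []
    for origin, docs in corpora.items():
        for text, native in docs:
            labels = {
                src: native if src == origin else model.predict(text)
                for src, model in models.items()
            }
            layer.append({"text": text, "origin": origin, "labels": labels})
    return layer


def search(layer, source, category):
    """Return all documents, from any source, filed under `category` in `source`'s scheme."""
    return [rec["text"] for rec in layer if rec["labels"][source] == category]


# Hypothetical toy corpora standing in for two news portals.
corpora = {
    "portal_a": [
        ("new smartphone app released for android", "IT"),
        ("chip maker unveils faster processor", "IT"),
        ("cheap flights and hotel deals for summer travel", "Travel"),
        ("top beach resorts for a family trip", "Travel"),
    ],
    "portal_b": [
        ("smartphone app helps plan a trip abroad", "Life and Culture"),
        ("family festival at beach resorts", "Life and Culture"),
        ("processor shortage hits android phone makers", "Digital"),
        ("software update fixes app security bug", "Digital"),
    ],
}

layer = build_label_layer(corpora)
for text in search(layer, "portal_a", "IT"):
    print(text)
```

Searching with `portal_a`'s category names returns documents from both portals, while the stored native labels leave each portal's own classification untouched.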

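With the recent growth of diverse social networking services, users tend to use several media simultaneously according to their own purposes and tastes. It is also common to draw on several media at once, such as social networking services, Internet news, and blogs, when collecting information on a particular topic. However, documents circulated through different media are managed under different categories according to each medium's policies and standards, even when they cover similar topics or even identical content, and this hinders attempts to explore a specific category across heterogeneous media. To overcome this limitation, this study presents a method for mapping categories between heterogeneous media while leaving each medium's own category system intact. Specifically, by reclassifying each document from the viewpoint of several media and storing the results on the document as a two-dimensional label, the study proposes a logical device that lets documents belonging to heterogeneous media be explored under the same category criteria, as if they belonged to a single medium. In an experiment applying the proposed methodology to 6,000 news articles from two Korean Internet news portals, each article was assigned a two-dimensional label composed of medium and category information, and accuracy was compared between media, between supervised and semi-supervised learning, and between homogeneous and heterogeneous training data. Notably, in some categories the classification accuracy of semi-supervised learning with heterogeneous training data was higher than that of both supervised learning and semi-supervised learning with homogeneous training data.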
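
The semi-supervised setting with heterogeneous training data can be sketched as follows. The abstract does not state which algorithm was used, so this example assumes a simple self-training loop around a multinomial naive Bayes classifier: a model trained on labeled documents from one source assigns pseudo-labels to unlabeled documents from another source and is then retrained on both. The function names, categories, and toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Fit multinomial naive Bayes counts from (text, label) pairs."""
    word_counts, class_counts, vocab = defaultdict(Counter), Counter(), set()
    for text, label in docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the most probable label under add-one smoothing."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())

    def score(label):
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        return math.log(class_counts[label] / total) + sum(
            math.log((counts[tok] + 1) / denom) for tok in text.lower().split()
        )

    return max(class_counts, key=score)


def self_train(labeled, unlabeled, max_rounds=10):
    """Pseudo-label the unlabeled texts and retrain until the labels stop changing."""
    model = train(labeled)
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, text) for text in unlabeled]
        if new_pseudo == pseudo:
            break
        pseudo = new_pseudo
        model = train(labeled + list(zip(unlabeled, pseudo)))
    return model, pseudo


# Labeled documents from one source (hypothetical).
labeled = [
    ("stock market rallies on earnings", "Economy"),
    ("team wins championship final match", "Sports"),
]
# Unlabeled documents from a different source: the heterogeneous data (hypothetical).
unlabeled = [
    "investors cheer market rebound",
    "investors buy shares after market rebound",
    "striker scores in final match",
    "striker injured before championship",
]

model, pseudo = self_train(labeled, unlabeled)
print(pseudo)
print(predict(model, "striker injured"))
```

Words such as "striker" never occur in the labeled documents, so the labeled-only model has no evidence about them. After self-training, the pseudo-labeled documents from the other source supply that vocabulary, which is one plausible way heterogeneous data could help in some categories.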

Cited by

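  1. An Efficient Topic Modeling Method through Local Mapping of Global Topics vol.23, pp.3, 2017, https://doi.org/10.13088/jiis.2017.23.3.069
  2. A Machine Learning-Based Framework for Deriving Equipment Error Occurrence Patterns for Smart Manufacturing vol.23, pp.2, 2018, https://doi.org/10.7838/jsebs.2018.23.2.097
  3. A Technique for Improving Document Classification Accuracy through Heterogeneity Learning vol.24, pp.3, 2018, https://doi.org/10.13088/jiis.2018.24.3.021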