A Semantic Text Model with Wikipedia-based Concept Space

  • Kim, Han-Joon (School of Electrical and Computer Engineering, University of Seoul)
  • Chang, Jae-Young (Department of Computer Engineering, Hansung University)
  • Received : 2014.07.18
  • Accepted : 2014.08.19
  • Published : 2014.08.31

Abstract

Current text mining techniques are limited because conventional text representation models cannot express the semantic or conceptual information in documents written in natural language. These conventional models, which include the vector space model, the Boolean model, the statistical model, and the tensor space model, represent documents as bags of words. They describe a document only by its index terms and the frequency-based weights of those terms; that is, they ignore the semantic, sequential, and structural information of terms. Most text mining techniques have been developed on the assumption that the given documents are represented with such bag-of-words models. In the big data era, however, a new text representation paradigm is required that can analyse huge amounts of textual documents more precisely. Our text model regards the 'concept' as an independent space on an equal footing with the 'term' and 'document' spaces used in the vector space model, and it expresses the relatedness among the three spaces. To build the concept space, we use Wikipedia pages, each of which defines a single concept. Consequently, a document collection is represented as a 3-order tensor carrying semantic information, and we call the proposed model the text cuboid model. Through experiments on the popular 20NewsGroup document corpus, we show that the proposed text model is superior to the vector space model in document clustering and concept clustering.

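The fundamental difficulty in text mining research stems from the fact that conventional text representation models do not express the semantic or conceptual information in text written in natural language. The conventional models, including the vector space model, the Boolean model, the statistical model, and the tensor space model, are based on the 'bag-of-words' approach. Because these models represent a text only by the words it contains and their occurrence counts, they cannot express the implied meaning of words, the order of words, or the structure of the text. Most text mining techniques have been developed on the premise that the target documents are represented with a bag-of-words text model. Today, however, the big data era calls for a new paradigm of representation model that can analyze massive volumes of text data more precisely. The text representation model proposed in this paper posits the concept space as a mapping space on an equal footing with the document and term spaces, and it expresses all of the relationships among the three spaces. Wikipedia data is used to construct the concept space, with each concept defined from a single Wikipedia page. As a result, a given collection of text documents is represented as a semantically interpretable 3-order tensor, and the proposed model is therefore named the text cuboid model. By evaluating clustering accuracy at the document level and the concept level on the 20Newsgroup document collection, we show that the proposed model outperforms the vector space model, the representative bag-of-words model.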
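The idea of a document collection held as a 3-order tensor over documents, terms, and concepts can be illustrated with a minimal sketch. The abstract does not say how the cells of the tensor are weighted, so the weighting used here (term frequency in the document multiplied by term frequency in the concept page) is purely an assumption for illustration, as are the function names `tokenize`, `build_cuboid`, and `document_by_concept`, the toy documents, and the two stand-in "Wikipedia pages". It is not the authors' construction.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_cuboid(documents, concept_pages):
    """Build a toy document x term x concept tensor as nested lists.

    ASSUMPTION (not from the paper): the cell weight is
    tf(term in document) * tf(term in concept page).
    """
    doc_counts = [Counter(tokenize(text)) for text in documents]
    concept_names = sorted(concept_pages)
    concept_counts = [Counter(tokenize(concept_pages[name]))
                      for name in concept_names]
    # Vocabulary = every term that occurs in at least one document.
    vocabulary = sorted(set().union(*doc_counts))
    tensor = [
        [[dc[term] * cc[term] for cc in concept_counts] for term in vocabulary]
        for dc in doc_counts
    ]
    return tensor, vocabulary, concept_names


def document_by_concept(tensor):
    """Collapse the term axis: sum over terms for each (document, concept)."""
    return [[sum(column) for column in zip(*doc_slice)] for doc_slice in tensor]


# Hypothetical toy data; real concept pages would come from Wikipedia.
documents = [
    "the bank approved the loan",
    "the river bank flooded",
]
concept_pages = {
    "Finance": "bank loan interest money",
    "River": "river bank water flood",
}

tensor, vocabulary, concept_names = build_cuboid(documents, concept_pages)
doc_concept = document_by_concept(tensor)
print(concept_names)  # ['Finance', 'River']
print(doc_concept)    # [[2, 1], [1, 2]]
```

In this toy example the ambiguous term "bank" contributes to both concepts, while "loan" and "river" pull the first and second documents toward Finance and River respectively. A plain term-by-document matrix would record only that both documents contain "bank".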

Cited by

  1. A Tensor Space Model based Semantic Search Technique, vol.21, no.4, 2016, https://doi.org/10.7838/jsebs.2016.21.4.001
  2. Automated Development of Rank-Based Concept Hierarchical Structures using Wikipedia Links, vol.20, no.4, 2015, https://doi.org/10.7838/jsebs.2015.20.4.061
  3. Multidimensional Text Warehousing for Automated Text Classification, vol.11, no.2, 2018, https://doi.org/10.4018/JITR.2018040110
  4. A Study on an Automatic Document Classification Model Based on the Korean Standard Industrial Classification, vol.24, no.3, 2018, https://doi.org/10.13088/jiis.2018.24.3.221