Automated Development of Rank-Based Concept Hierarchical Structures using Wikipedia Links

  • Lee, Ga-hee (School of Electrical and Computer Engineering, University of Seoul)
  • Kim, Han-joon (School of Electrical and Computer Engineering, University of Seoul)
  • Received : 2015.08.18
  • Accepted : 2015.10.17
  • Published : 2015.11.30

Abstract

Hierarchical concept trees are widely used as a data structure for indexing large volumes of textual data. This paper proposes a generality rank-based method that automatically builds hierarchical concept structures from Wikipedia data. The method treats each Wikipedia article as a concept and generates hierarchical relationships among these concepts. To estimate the generality of a concept, we devise a ranking function that relies mainly on the number of hyperlinks among Wikipedia articles. The ranking function is then used to compute probabilistic subsumption between concepts, which makes it possible to generate relatively stable hierarchical structures. Finally, the resulting set of concept pairs with hierarchical relationships is visualized as a directed acyclic graph (DAG). An empirical analysis against the concept hierarchy of the Open Directory Project shows that the proposed method outperforms a representative baseline method and can automatically extract concept hierarchies with high accuracy.
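The pipeline outlined in the abstract (rank concepts by generality, test probabilistic subsumption, emit a DAG) can be illustrated with a minimal sketch. The abstract does not give the actual ranking function, so everything below is an assumption made for illustration: generality is approximated by the raw backlink count, subsumption by a conditional probability over shared backlinks, and `build_hierarchy`, `toy_links`, and the threshold of 0.8 are hypothetical names and values, not the authors' method.

```python
from collections import defaultdict


def build_hierarchy(links, threshold=0.8):
    """Sketch: derive parent -> child concept pairs from article links.

    links maps an article title to the set of titles it links to.
    Both the generality score and the threshold are illustrative
    assumptions, not the ranking function used in the paper.
    """
    # Invert the link graph: backlinks[c] holds the articles linking to c.
    backlinks = defaultdict(set)
    for src, targets in links.items():
        for dst in targets:
            if dst != src:
                backlinks[dst].add(src)

    concepts = set(links) | set(backlinks)

    # Assumed generality score: the number of incoming links.
    generality = {c: len(backlinks[c]) for c in concepts}

    edges = set()
    for parent in concepts:
        for child in concepts:
            # A parent must be strictly more general than its child.
            # This strict ordering also guarantees the result is acyclic.
            if generality[parent] <= generality[child]:
                continue
            if not backlinks[child]:
                continue  # no evidence to estimate the probability
            # Estimate P(parent | child) over the sets of linking articles.
            shared = backlinks[parent] & backlinks[child]
            if len(shared) / len(backlinks[child]) >= threshold:
                edges.add((parent, child))
    return generality, edges


# Hypothetical toy link graph, not real Wikipedia data.
toy_links = {
    "Science": set(),
    "Physics": {"Science"},
    "Chemistry": {"Science"},
    "Quantum mechanics": {"Physics", "Science"},
    "Optics": {"Physics", "Science"},
    "Organic chemistry": {"Chemistry", "Science"},
}

generality, edges = build_hierarchy(toy_links)
for parent, child in sorted(edges):
    print(f"{parent} -> {child}")
```

In this sketch, concepts without any backlinks are left unplaced because there is no evidence from which to estimate the conditional probability; a fuller treatment would need an additional signal for such leaf concepts.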
