A Semantic-Based Feature Expansion Approach for Improving the Effectiveness of Text Categorization by Using WordNet

Chung, Eun-Kyung

doi:10.3743/KOSIM.2009.26.3.261

Journal of the Korean Society for Information Management (정보관리학회지)

Volume 26 Issue 3 / Pages 261-278 / 2009 / 1013-0799 (pISSN) / 2586-2073 (eISSN)

Korean Society for Information Management (한국정보관리학회)

문서범주화 성능 향상을 위한 의미기반 자질확장에 관한 연구

Chung, Eun-Kyung (Library and Information Science, College of Social Sciences, Ewha Womans University)

Published : 2009.09.30

https://doi.org/10.3743/KOSIM.2009.26.3.261

Abstract

Identifying optimal feature sets is crucial to improving the effectiveness of Text Categorization (TC). In this study, feature expansion experiments were conducted using author-provided keyword sets and article titles from typical scientific journal articles. The tool used to expand the feature sets was WordNet, a lexical database of English words. Given this data set and lexical tool, the study showed that feature expansion with synonym relationships was significantly effective in improving TC results. The experimental results indicated that when feature sets were expanded with synonyms based on classifier names, the effectiveness of TC improved considerably, regardless of whether word sense disambiguation was applied.

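In machine-learning-based text categorization, constructing an optimal feature set is important for improving performance. This study conducted feature expansion experiments on author-provided keywords and article titles, which are essential components of journal articles. Feature expansion generally starts from the selected features and draws on a semantic lexical tool such as WordNet. This study expanded features from keywords and article titles using WordNet synonym-relation terms, and the experiments showed that text categorization performance improved markedly compared with results obtained without feature expansion. The factors identified as contributing positively to this improvement were a refined feature base and synonym expansion based on classifier names. Performance improved both with and without word sense disambiguation. The results suggest that synonym feature expansion based on classifier names, using keywords and article titles, contributes positively to text categorization performance.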
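
To make the idea of synonym-based feature expansion concrete, the following is a minimal sketch and not the study's implementation. It assumes a small hand-written dictionary, `TOY_SYNSETS`, as a hypothetical stand-in for WordNet synsets, and a hypothetical helper, `expand_features`, that appends synonyms to a list of terms such as keywords, title words, or classifier names. The optional `senses` argument illustrates the difference between expanding with and without word sense disambiguation.

```python
from typing import Dict, List, Optional, Set

# Hypothetical stand-in for WordNet: word -> sense label -> synonyms.
# These entries are illustrative only and are not taken from WordNet.
TOY_SYNSETS: Dict[str, Dict[str, Set[str]]] = {
    "retrieval": {"information": {"recovery"}, "computing": {"fetch"}},
    "categorization": {"grouping": {"classification", "sorting"}},
}


def expand_features(
    terms: List[str],
    synsets: Dict[str, Dict[str, Set[str]]],
    senses: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return the original terms followed by their synonyms, without duplicates.

    If `senses` names a sense for a term, only that sense's synonyms are
    added (disambiguated expansion). Otherwise synonyms from every sense
    of the term are added (expansion without disambiguation).
    """
    expanded: List[str] = []
    seen: Set[str] = set()

    def add(term: str) -> None:
        if term not in seen:
            seen.add(term)
            expanded.append(term)

    for term in terms:
        add(term)
        by_sense = synsets.get(term, {})
        if senses and term in senses:
            chosen = [senses[term]]
        else:
            chosen = sorted(by_sense)  # deterministic order over all senses
        for sense in chosen:
            for synonym in sorted(by_sense.get(sense, set())):
                add(synonym)
    return expanded


if __name__ == "__main__":
    # Expansion without disambiguation: every sense contributes synonyms.
    print(expand_features(["retrieval", "categorization"], TOY_SYNSETS))
    # Expansion with disambiguation: only the chosen sense contributes.
    print(expand_features(["retrieval"], TOY_SYNSETS, senses={"retrieval": "information"}))
```

In practice the toy dictionary would be replaced by lookups against WordNet itself, and the expanded term list would feed whatever feature weighting and classifier the experiment uses; those choices are outside the scope of this sketch.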

Cited by

Web Document Classification Based on Hangeul Morpheme and Keyword Analyses vol.19D, pp.4, 2012, https://doi.org/10.3745/KIPSTD.2012.19D.4.263
An Experimental Study on the Performance Improvement of Automatic Classification for the Articles of Korean Journals Based on Controlled Keywords in International Database vol.48, pp.3, 2014, https://doi.org/10.4275/KSLIS.2014.48.3.491
