Improving Hypertext Classification Systems through WordNet-based Feature Abstraction

Improving the Performance of Automatic Web Document Classification Systems through WordNet-based Feature Abstraction

  • Roh, Jun-Ho (School of Electrical and Computer Engineering, University of Seoul) ;
  • Kim, Han-Joon (School of Electrical and Computer Engineering, University of Seoul) ;
  • Chang, Jae-Young (Department of Computer Engineering, Hansung University)
  • Received : 2013.03.18
  • Accepted : 2013.04.22
  • Published : 2013.05.31

Abstract

This paper presents a novel feature engineering technique that can improve conventional machine learning-based text classification systems. To categorize hypertext web documents effectively, the proposed method extends the initial set of features by using hyperlink relationships. Web documents are connected to each other through hyperlinks, and in many cases hyperlinks exist among highly related documents. Such hyperlink relationships can be used to enhance the quality of the features that make up classification models. The basic idea of the proposed method is to generate abstracted concept features, each of which consists of a few raw feature words; to do so, the method computes the semantic similarity between a target document and its neighbor documents by using hierarchical relationships in the WordNet ontology. When classification models are built, the abstracted concept features are treated the same as raw features, and they contribute to more accurate classification models. Through extensive experiments with the Web-KB test collection, we show that the proposed method outperforms conventional ones.

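This paper proposes a new feature engineering technique for improving the performance of machine learning-based automatic web document classification systems. To classify hypertext web documents effectively, the proposed technique expands the feature set by exploiting hyperlink relationships. Web documents are connected to one another through hyperlinks, and in many cases these links exist between highly related documents. Such link information can play an important role in raising the quality of the feature set, which is a key factor of the classification model. The basic idea of the proposed technique is to generate abstracted concept features, each composed of multiple features, by evaluating the semantic similarity between the words (features) contained in the target document and in its neighboring documents on the basis of the WordNet ontology. Here, the similarity function quantifies the hypernym/hyponym relationships between features within WordNet. When the classification model is built, the abstracted concept features are treated the same as ordinary features and contribute to a more accurate classification model. Experiments on the Web-KB document collection show that the proposed technique outperforms existing techniques.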
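
The feature abstraction idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy hypernym table stands in for WordNet, the Wu-Palmer-style score stands in for the paper's similarity function (which the abstract does not specify), and the names `similarity`, `abstract_features`, `feature_vector`, the `CONCEPT:` label prefix, and the 0.7 threshold are hypothetical.

```python
from collections import Counter

# Toy hypernym (is-a) table standing in for WordNet: child -> parent.
# These entries are illustrative assumptions, not real WordNet data.
HYPERNYM = {
    "professor": "educator",
    "lecturer": "educator",
    "educator": "person",
    "student": "person",
    "person": "entity",
    "seminar": "course",
    "course": "activity",
    "activity": "entity",
}

def path_to_root(word):
    """Return [word, parent, ..., root] by following hypernym links."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def common_subsumer(a, b):
    """Deepest node shared by both hypernym paths, or None if there is none."""
    ancestors_b = set(path_to_root(b))
    return next((n for n in path_to_root(a) if n in ancestors_b), None)

def similarity(a, b):
    """Wu-Palmer-style score: 2 * depth(subsumer) / (depth(a) + depth(b))."""
    subsumer = common_subsumer(a, b)
    if subsumer is None:
        return 0.0
    depth = lambda w: len(path_to_root(w))  # the root has depth 1
    return 2.0 * depth(subsumer) / (depth(a) + depth(b))

def abstract_features(target_words, neighbor_words, threshold=0.7):
    """Group similar target/neighbor words under a shared concept label."""
    concepts = {}
    for t in target_words:
        for n in neighbor_words:
            if t != n and similarity(t, n) >= threshold:
                label = "CONCEPT:" + common_subsumer(t, n)
                concepts.setdefault(label, set()).update({t, n})
    return concepts

def feature_vector(target_words, neighbor_words, threshold=0.7):
    """Raw word counts plus concept features, all treated as ordinary features."""
    counts = Counter(target_words)
    concepts = abstract_features(target_words, neighbor_words, threshold)
    for label, members in concepts.items():
        # Assumed weighting: count the target words covered by the concept.
        counts[label] += sum(1 for w in target_words if w in members)
    return counts

if __name__ == "__main__":
    target = ["professor", "course", "homepage"]   # words in the target page
    neighbors = ["lecturer", "seminar", "student"]  # words in linked pages
    print(feature_vector(target, neighbors))
```

In this sketch, a concept feature is added only when a word in the target document and a word in a hyperlinked neighbor share a sufficiently deep common hypernym; words with no close counterpart remain plain raw features. The resulting counts could then be passed to any standard text classifier alongside the raw features.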
