Document Classification Model Using Web Documents for Balancing Training Corpus Size per Category

  • Park, So-Young (Department of Game Design and Development, Sangmyung University)
  • Chang, Juno (Department of Game Design and Development, Sangmyung University)
  • Kihl, Taesuk (Department of Game Design and Development, Sangmyung University)
  • Received: 2013.07.08
  • Accepted: 2013.09.27
  • Published: 2013.12.31

Abstract

In this paper, we propose a document classification model that uses Web documents as part of the training corpus in order to resolve the imbalance in training corpus size across categories. To retrieve Web documents closely related to each category, the model calculates a matching score between each word feature and each category, and generates Web search queries by combining the higher-ranked word features with the category title. The model then sends each combined query to the open application programming interface (API) of a Web search engine and receives the retrieved snippets. Finally, it adds these snippets to the training corpus as Web documents. Experimental results show that a method that considers the balance of the training corpus size per category exhibits better performance in some categories with small training sets.
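
The pipeline described in the abstract can be illustrated with a minimal Python sketch. Several parts are assumptions rather than details taken from the paper: the matching score here is a simple smoothed frequency ratio (the abstract does not give the formula), each query pairs the category title with one top-ranked word, the balancing target is the size of the largest category, and `search` is a hypothetical callable standing in for the search engine's open API.

```python
from collections import Counter
from typing import Callable, Dict, List

Corpus = Dict[str, List[str]]  # category title -> training documents


def matching_scores(corpus: Corpus, category: str) -> Dict[str, float]:
    """Score words by how much more often they occur inside `category`
    than outside it. Assumed stand-in for the paper's matching score."""
    inside = Counter(w for doc in corpus[category] for w in doc.lower().split())
    outside = Counter(
        w
        for cat, docs in corpus.items()
        if cat != category
        for doc in docs
        for w in doc.lower().split()
    )
    n_in = sum(inside.values()) or 1
    n_out = sum(outside.values())
    # Add-one smoothing keeps the denominator positive for unseen words.
    return {
        w: (c / n_in) / ((outside[w] + 1) / (n_out + 1))
        for w, c in inside.items()
    }


def build_queries(corpus: Corpus, category: str, top_k: int = 3) -> List[str]:
    """Combine the category title with each of the top-k scored words."""
    scores = matching_scores(corpus, category)
    # Sort by descending score; break ties alphabetically for determinism.
    top_words = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    return [f"{category} {word}" for word in top_words]


def balance_corpus(
    corpus: Corpus,
    search: Callable[[str], List[str]],  # hypothetical: query -> snippets
    top_k: int = 3,
) -> Corpus:
    """Add retrieved snippets to small categories until each one reaches
    the size of the largest category, or its queries are exhausted."""
    target = max(len(docs) for docs in corpus.values())
    balanced = {cat: list(docs) for cat, docs in corpus.items()}
    for cat in corpus:
        for query in build_queries(corpus, cat, top_k):
            needed = target - len(balanced[cat])
            if needed <= 0:
                break
            balanced[cat].extend(search(query)[:needed])
    return balanced
```

In practice `search` would wrap a real search engine API client; any function that maps a query string to a list of snippet strings can be passed in, which also makes the balancing step easy to try offline with canned results.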

Cited by

  1. The Sensitivity Analysis for Customer Feedback on Social Media vol.19, no.4, 2015, https://doi.org/10.6109/jkiice.2015.19.4.780
  2. Mobile Application Categorization Based on Word Clustering vol.19, no.3, 2014, https://doi.org/10.9708/jksci.2014.19.3.017
  3. A Two-Step Clustering Method Considering Mobile App Trends vol.20, no.4, 2015, https://doi.org/10.9708/jksci.2015.20.4.017
  4. A Study on the Evaluation of Travel Agencies Using Social Big Data vol.19, no.10, 2015, https://doi.org/10.6109/jkiice.2015.19.10.2241
  5. Keyword Analysis Based Document Compression System vol.16, no.1, 2018, https://doi.org/10.6109/jicce.2018.16.1.48