Guiding Practical Text Classification Framework to Optimal State in Multiple Domains

  • Choi, Sung-Pil (Department of Information Technology Research, KISTI) ;
  • Myaeng, Sung-Hyon (School of Engineering, Information and Communications University) ;
  • Cho, Hyun-Yang (Library & Information Science Department, Kyonggi University)
  • Published : 2009.06.25

Abstract

This paper introduces DICE, a Domain-Independent text Classification Engine. DICE is robust, efficient, and domain-independent in both its software and its architecture. Each module of the system is clearly separated and encapsulated for extensibility; this modular architecture allows simple and continuous verification and facilitates changes over multiple cycles, even after the major development period is complete. Users of DICE can easily implement their ideas on this test bed and optimize it for a particular domain simply by adjusting the configuration file. Unlike other publicly available toolkits or development environments targeted at general-purpose classification models, DICE specializes in text classification and provides a number of functions specific to that task. This paper focuses on ways to locate the optimal state of a practical text classification framework by using the various adaptation methods the system provides, such as feature selection, lemmatization, and classification models.
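The configuration-driven adaptation described above can be illustrated with a small sketch. This is not the actual DICE API; the config keys, `select_features`, and the `NaiveBayes` class are hypothetical names standing in for the engine's pluggable feature-selection and classification modules, here wired to document-frequency selection and a multinomial naive Bayes model.

```python
# Illustrative sketch of a config-driven text classification pipeline
# in the spirit of DICE. All names below are assumptions, not DICE's API.
from collections import Counter, defaultdict
import math

CONFIG = {
    "feature_selection": "doc_freq",  # selection metric to apply
    "top_k": 4,                       # keep the k highest-scoring terms
    "classifier": "naive_bayes",      # model to train
}

def tokenize(text):
    return text.lower().split()

def select_features(docs, config):
    """Rank terms by document frequency; keep the top_k as the vocabulary."""
    df = Counter()
    for text, _label in docs:
        df.update(set(tokenize(text)))
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return set(ranked[: config["top_k"]])

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""
    def fit(self, docs, vocab):
        self.vocab = vocab
        self.prior = Counter(label for _t, label in docs)
        self.counts = defaultdict(Counter)
        for text, label in docs:
            for tok in tokenize(text):
                if tok in vocab:
                    self.counts[label][tok] += 1
        self.total = {c: sum(self.counts[c].values()) for c in self.prior}
        self.n_docs = len(docs)
        return self

    def predict(self, text):
        def score(c):  # log P(c) + sum of log P(token | c)
            s = math.log(self.prior[c] / self.n_docs)
            for tok in tokenize(text):
                if tok in self.vocab:
                    s += math.log((self.counts[c][tok] + 1) /
                                  (self.total[c] + len(self.vocab)))
            return s
        return max(self.prior, key=score)

train = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("monday project meeting notes", "ham"),
]
vocab = select_features(train, CONFIG)
model = NaiveBayes().fit(train, vocab)
print(model.predict("buy pills now"))       # -> spam
print(model.predict("notes for meeting"))   # -> ham
```

Swapping the feature-selection metric or the classifier then amounts to changing `CONFIG`, which mirrors how the paper searches for an optimal state per domain by varying these options rather than rewriting the system.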

References

  1. F. Sebastiani, “Machine learning in automated text categorization,” ACM Computing Surveys, Vol. 34, No. 1, pp. 1–47, 2002. https://doi.org/10.1145/505282.505283
  2. J. Dörre, P. Gerstl, and R. Seiffert, “Text mining: finding nuggets in mountains of textual data,” in Proceedings of KDD-99, 5th ACM International Conference on Knowledge Discovery and Data Mining. San Diego, US: ACM Press, New York, US, 1999, pp. 398–401.
  3. D. D. Lewis, D. L. Stern, and A. Singhal, “ATTICS: a software platform for on-line text classification,” in Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, M. A. Hearst, F. Gey, and R. Tong, Eds. Berkeley, US: ACM Press, New York, US, 1999, pp. 267–268.
  4. G. Forman, “An extensive empirical study of feature selection metrics for text classification,” Journal of Machine Learning Research, Vol. 3, pp. 1289–1305, March 2003.
  5. I. Moulinier, “Feature selection: a useful preprocessing step,” in Proceedings of BCSIRSG-97, the 19th Annual Colloquium of the British Computer Society Information Retrieval Specialist Group, ser. Electronic Workshops in Computing, J. Furner and D. Harper, Eds. Aberdeen, UK: Springer Verlag, Heidelberg, DE, 1997.
  6. Y. Yang and J. O. Pedersen, “A comparative study on feature selection in text categorization,” in Proceedings of ICML-97, 14th International Conference on Machine Learning, D. H. Fisher, Ed. Nashville, US: Morgan Kaufmann Publishers, San Francisco, US, 1997, pp. 412–420.
  7. A. Moschitti, “A study on optimal parameter tuning for Rocchio text classifier,” in Proceedings of ECIR-03, 25th European Conference on Information Retrieval, F. Sebastiani, Ed. Pisa, IT: Springer Verlag, pp. 420–435, 2003.
  8. H. Paijmans, “Text categorization as an information retrieval task,” The South African Computer Journal, Vol. 21, pp. 4–15, 1999.
  9. D. Mladenic, “Feature subset selection in text learning,” in Proceedings of ECML-98, 10th European Conference on Machine Learning, C. Nédellec and C. Rouveirol, Eds. Chemnitz, DE: Springer Verlag, Heidelberg, DE, 1998, pp. 95–100, published in the “Lecture Notes in Computer Science” Series Number 1398.
  10. M. Radovanovic and M. Ivanovic, “Interactions between document representation and feature selection in text categorization,” in Proceedings of DEXA-06, 17th International Conference on Database and Expert Systems Applications, ser. Lecture Notes in Computer Science, Vol. 4080. Krakow, Poland: Springer-Verlag, pp. 489–498, 2006.
  11. M. Makrehchi and M. S. Kamel, “Text classification using small number of features,” in Proceedings of MLDM-05, 4th International Conference on Machine Learning and Data Mining in Pattern Recognition, ser. Lecture Notes in Artificial Intelligence, Vol. 3587. Leipzig, Germany: Springer-Verlag, pp. 580–589, 2005.
  12. D. Mladenic and M. Grobelnik, “Feature selection on hierarchy of web documents,” Decision Support Systems, Vol. 35, No. 1, pp. 45–87, 2003.
  13. A. Kolcz, V. Prabakarmurthi, and J. K. Kalita, “String match and text extraction: Summarization as feature selection for text categorization,” in Proceedings of CIKM-01, 10th ACM International Conference on Information and Knowledge Management, H. Paques, L. Liu, and D. Grossman, Eds. Atlanta, US: ACM Press, New York, US, pp. 365–370, 2001.
  14. G. Forman, “A pitfall and solution in multi-class feature selection for text classification,” in Proceedings of ICML-04, 21st International Conference on Machine Learning, C. E. Brodley, Ed. Banff, CA: Morgan Kaufmann Publishers, San Francisco, US, 2004.
  15. E. Gabrilovich and S. Markovitch, “Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5,” in Proceedings of ICML-04, 21st International Conference on Machine Learning, C. E. Brodley, Ed. Banff, CA: Morgan Kaufmann Publishers, San Francisco, US, 2004.
  16. L. Galavotti, F. Sebastiani, and M. Simi, “Experiments on the use of feature selection and negative evidence in automated text categorization,” in Proceedings of ECDL-00, 4th European Conference on Research and Advanced Technology for Digital Libraries, J. L. Borbinha and T. Baker, Eds. Lisbon, PT: Springer Verlag, Heidelberg, DE, 2000, pp. 59–68, published in the “Lecture Notes in Computer Science” Series Number 1923.
  17. E. Montañés, I. Díaz, J. Ranilla, E. F. Combarro, and J. Fernández, “Scoring and selecting terms for text categorization,” IEEE Intelligent Systems, Vol. 20, No. 3, pp. 40–47, 2005.
  18. H. Taira and M. Haruno, “Feature selection in SVM text categorization,” in Proceedings of AAAI-99, 16th Conference of the American Association for Artificial Intelligence. Orlando, US: AAAI Press, Menlo Park, US, pp. 480–486, 1999.
  19. Y.-S. Lai and C.-H. Wu, “Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology,” ACM Transactions on Asian Language Information Processing, Vol. 1, No. 1, pp. 34–64, 2002. https://doi.org/10.1145/595576.595579
  20. C. Lee and G. G. Lee, “Information gain and divergence-based feature selection for machine learning-based text categorization,” Information Processing and Management, Vol. 42, No. 1, pp. 155–165, 2006. https://doi.org/10.1016/j.ipm.2004.08.006
  21. P. Soucy and G. W. Mineau, “A simple feature selection method for text classification,” in Proceeding of IJCAI-01, 17th International Joint Conference on Artificial Intelligence, B. Nebel, Ed., Seattle, US, pp. 897–902, 2001.
  22. S. Cohen, E. Ruppin, and G. Dror, “Feature selection based on the Shapley value,” in Proceedings of the 19th International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, August 2005, pp. 665–670.
  23. A. Kolcz and A. Chowdhury, “Avoidance of model re-induction in SVM-based feature selection for text categorization,” in Proceedings of the International Joint Conference on Artificial Intelligence, Hyderabad, India, pp. 889–894, 2007.
  24. J. Yan, N. Liu, B. Zhang, S. Yan, Z. Chen, Q. Cheng, W. Fan, and W.-Y. Ma, “OCFS: optimal orthogonal centroid feature selection for text categorization,” in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, August 2005, pp. 122–129.
  25. Z. Zheng, X. Wu, and R. Srihari, “Feature selection for text categorization on imbalanced data,” SIGKDD Explorations, Vol. 6, No. 1, pp. 80–89, 2004. https://doi.org/10.1145/1007730.1007741
  26. R. Basili, A. Moschitti, and M. T. Pazienza, “An hybrid approach to optimize feature selection process in text classification,” in Proceedings of AI*IA-01, 7th Congress of the Italian Association for Artificial Intelligence, F. Esposito, Ed. Bari, IT: Springer Verlag, Heidelberg, DE, 2001, pp. 320–325, published in the “Lecture Notes in Computer Science” Series Number 2175.
  27. W. Wibowo and H. E. Williams, “Simple and accurate feature selection for hierarchical categorisation,” in Proceedings of the 2002 ACM Symposium on Document Engineering. McLean, US: ACM Press, New York, US, pp. 111–118, 2002.
  28. I. S. Dhillon, S. Mallela, and R. Kumar, “A divisive information-theoretic feature clustering algorithm for text classification,” Journal of Machine Learning Research, Vol. 3, pp. 1265–1287, March 2003. https://doi.org/10.1162/153244303322753661
  29. G. Wang and F. H. Lochovsky, “Feature selection with conditional mutual information maximin in text categorization,” in Proceedings of CIKM-04, 13th ACM International Conference on Information and Knowledge Management, D. A. Evans, L. Gravano, O. Herzog, C. Zhai, and M. Ronthaler, Eds. Washington, US: ACM Press, New York, US, pp. 342–349, 2004.
  30. A. Anagnostopoulos, A. Broder, and K. Punera, “Effective and efficient classification on a search-engine model,” in Proceedings of CIKM-06, 15th ACM International Conference on Information and Knowledge Management, 2006.
  31. (2001) The Online Plain Text English Dictionary. [Online]. Available: http://www.mso.anu.edu.au/ralph/OPTED/
  32. F. Sebastiani, “A tutorial on automated text categorisation,” in Proceedings of ASAI-99, 1st Argentinean Symposium on Artificial Intelligence, A. Amandi and R. Zunino, Eds., Buenos Aires, AR, 1999, pp. 7–35. An extended version appears as [1].
  33. (2008) Korean Indexing Engine. [Online]. Available: www.kristalinfo.com/K-Lab/idx.
  34. (2008) Korean Morphological Analyzer. [Online]. Available: www.kristalinfo.com/K-Lab/ma.

Cited by

  1. Development of a Grid-Based High-Performance Science and Technology Knowledge Processing Framework, vol.9, pp.12, 2009, https://doi.org/10.5392/jkca.2009.9.12.877