Document Clustering using Non-negative Matrix Factorization and Fuzzy Relationship

  • 박선 (BK21 Jeonbuk Electronics and Information Advanced Human Resources Development Project Group)
  • 김경준 (Department of Computer Science, KAIST)
  • Received : 2009.12.28
  • Accepted : 2010.04.30
  • Published : 2010.04.30

Abstract

This paper proposes a new document clustering method that uses non-negative matrix factorization (NMF) and fuzzy relationships. The method selects cluster labels and representative cluster terms from the semantic features produced by NMF, so it can better represent the inherent structure of the document set. It then clusters documents using fuzzy relation values between semantic features and terms, which better separates dissimilar documents from each cluster and improves clustering quality. Experimental results show that the proposed method performs better than other document clustering methods.
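The NMF step described in the abstract can be illustrated with a minimal sketch. It assumes a term-by-document count matrix `A`, factors it as `A ≈ W H` with Lee and Seung's multiplicative updates, reads the columns of `W` as semantic features, and assigns each document to the feature with the largest weight in `H`. The function names, the toy vocabulary, and the toy matrix are all hypothetical, and the fuzzy relation step is not reproduced because the abstract does not define it.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def nmf(A, rank, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix A (terms x documents) as W H
    using multiplicative updates for the squared-error objective."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n)]
             for k in range(rank)]
        # W <- W * (A H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(m)]
    return W, H

def cluster_documents(H):
    """Assign each document (column of H) to its highest-weighted feature."""
    return [max(range(len(H)), key=lambda k: H[k][j]) for j in range(len(H[0]))]

def label_terms(W, vocab, top=3):
    """Pick the highest-weighted terms of each semantic feature (column of W)."""
    labels = []
    for k in range(len(W[0])):
        ranked = sorted(range(len(W)), key=lambda i: W[i][k], reverse=True)
        labels.append([vocab[i] for i in ranked[:top]])
    return labels

# Hypothetical toy data: rows are terms, columns are four documents.
vocab = ["goal", "match", "team", "stock", "market", "price"]
A = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
]

W, H = nmf(A, rank=2)
clusters = cluster_documents(H)
labels = label_terms(W, vocab)
print("cluster of each document:", clusters)
print("label terms of each cluster:", labels)
```

Which cluster index each topic receives depends on the random initialization, so only the grouping of documents is meaningful, not the index values themselves.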
