Refining Initial Seeds using Max Average Distance for K-Means Clustering

K-Means 클러스터링 성능 향상을 위한 최대평균거리 기반 초기값 설정

  • 이신원 (중원대학교 IT공학부) ;
  • 이원휘 (전북대학교 대학원 컴퓨터공학과)
  • Received : 2010.11.30
  • Accepted : 2011.03.09
  • Published : 2011.04.30

Abstract

Clustering methods is divided into hierarchical clustering, partitioning clustering, and more. If the amount of documents is huge, it takes too much time to cluster them in hierarchical clustering. In this paper we deal with K-Means algorithm that is one of partitioning clustering and is adequate to cluster so many documents rapidly and easily. We propose the new method of selecting initial seeds in K-Means algorithm. In this method, the initial seeds have been selected that are positioned as far away from each other as possible.

대규모 데이터에 대한 특성에 따라 몇 개의 클러스터로 군집화하는 클러스터링 기법은 계층적 클러스터링이나 분할 클러스터링 등 다양한 기법이 있는데 그 중에서 K-Means 알고리즘은 구현이 쉬우나 할당-재계산에 소요되는 시간이 증가하게 된다. 본 논문에서는 초기 클러스터 중심들 간의 거리가 최대가 되도록 하여 초기 클러스터 중심들이 고르게 분포되도록 함으로써 할당-재계산 횟수를 줄이고 전체 클러스터링 시간을 감소시키고자 한다.

Keywords

References

  1. Giordano Adami, Paolo Avesani, and Diego Sona, "Clustering documents in a web directory", Proceedings of the 5th ACM international workshop on Web information and data management, pp.66-73, 2003.
  2. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, "Introduction to Information Retrieval", Cambridge University Press, pp.331-338, 2008.
  3. Jain, A. K. and Dubes, R. C., "Algorithms for Clustering Data". Prentice-Hall advanced reference series. Prentice-Hall, Inc., Upper Saddle River, NJ. 1988.
  4. S. P. Lloyd, "Least squares quantization in PCM", Special issue on quantization, IEEE Trans. Inform. Theory, 28, pp.129-137, 1982. https://doi.org/10.1109/TIT.1982.1056489
  5. McQueen, J. "Some methods for classification and analysis of multivariate observations", In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp.281-297, 1967.
  6. D.A.Meedeniya, and A.S.Perera, "Evaluation of Partition-Based Text Clustering Techniques to Categorize Indic Language Documents", IEEE International Advance Computing Conference (IACC 2009), pp.1497-1500, 2009.
  7. Paul Bunn, and Rafail Ostrovsky, "Secure Two-Party k-Means Clustering", Proceedings of the 14th ACM conference on Computer and communications security, Alexandria, Virginia, USA, pp.486-497, 2007.
  8. Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman and Chaitanya Swamy, "The Effectiveness of Lloyd-Type Methods for then k-Means Problem", Proceedings of the 47th Annual IEEE Symposium on Foundaions of Computer Science, pp.165-176, 2006.
  9. Nachiketa Sahoo, Jamie Callan, Ramayya Krishnan, George Duncan, and Rema Padman, "Incremental hierarchical clustering of text documents", Proceedings of the 15th ACM international conference on Information and knowledge management, pp.357-366, 2006.
  10. Yu Yonghong, and Bai Wenyang, "Text clustering based on term weights automatic partition", Computer and Automation Engineering (ICCAE), 2010 The 2nd International Conference, pp.373-377, 2010.
  11. 이신원 "정보검색을 위한 개선된 K-Means 알고리즘을 이용한 계층적 클러스터링에 관한 연구", 박사학위 논문, 전북대학교, 2005.