Comparison of Initial Seeds Methods for K-Means Clustering

Korean title (translated): Comparison of Initial Center Selection Methods in K-Means Clustering

  • Lee, Shinwon (이신원) (Department of Computer System Engineering, Jungwon University)
  • Received : 2012.09.19
  • Accepted : 2012.10.15
  • Published : 2012.12.31

Abstract

Clustering methods include hierarchical clustering, partitioning clustering, and others. The K-Means algorithm is a partitioning method and is well suited to clustering large numbers of documents quickly and easily. Its disadvantage is that randomly chosen initial centers lead to different results from run to run, so a better choice is to place the initial centers as far from each other as possible. We propose a new method of selecting initial centers for K-Means clustering that uses triangle height. With this method the centers are distributed evenly, and the result is more accurate than with randomly selected initial centers. Selecting the centers this way takes time, but it reduces the total clustering time by minimizing the number of allocation and recalculation steps. Compared with the standard algorithm, the average time consumed is reduced by 38.4%.

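Clustering techniques group data into several clusters according to the characteristics of the data, and include hierarchical clustering, partitioning clustering, and various other methods. Among them, the K-Means algorithm is easy to implement, but the time spent on allocation and recalculation grows. In addition, because the initial cluster centers are set arbitrarily, the clustering results vary widely. In this paper, to reduce the time spent on clustering and to obtain stable clustering, we propose a method of selecting initial cluster centers that uses triangle height, and we compare it experimentally with other methods, aiming to reduce the number of allocation and recalculation rounds and the total clustering time. In the experimental results, the average total elapsed time fell by 17.9% relative to the existing method when the maximum average distance method was used, and by 38.4% with the proposed method.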
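The effect the abstract describes, where spread-out initial centers need fewer allocation and recalculation rounds than random ones, can be illustrated with a minimal sketch. This is not the paper's triangle-height method, whose details the abstract does not give. As a stand-in it uses farthest-first traversal, a generic way to place centers far apart. All function names (`random_seeds`, `farthest_first_seeds`, `kmeans`) and the toy data in the demo are assumptions made for illustration.

```python
import math
import random


def random_seeds(points, k, rng):
    """Standard K-Means seeding: k distinct points chosen uniformly at random."""
    return rng.sample(points, k)


def farthest_first_seeds(points, k, rng):
    """Spread-out seeding via farthest-first traversal.

    Illustrative stand-in only; NOT the triangle-height method of the paper.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Pick the point whose nearest already-chosen center is farthest away.
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


def kmeans(points, centers, max_rounds=100):
    """Lloyd-style K-Means.

    Returns (centers, labels, rounds), where rounds is the number of
    allocation passes run, including the final pass that detects convergence.
    """
    centers = [tuple(c) for c in centers]
    labels = None
    for rounds in range(1, max_rounds + 1):
        # Allocation: assign each point to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            return centers, labels, rounds
        labels = new_labels
        # Recalculation: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels, max_rounds


if __name__ == "__main__":
    # Hypothetical toy data: three Gaussian blobs in the plane.
    rng = random.Random(42)
    blobs = [(0.0, 0.0), (8.0, 0.0), (4.0, 7.0)]
    points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
              for cx, cy in blobs for _ in range(100)]
    for name, seeder in [("random", random_seeds),
                         ("farthest-first", farthest_first_seeds)]:
        _, _, rounds = kmeans(points, seeder(points, 3, rng))
        print(f"{name}: {rounds} allocation rounds")
```

Running the demo on several random seeds and comparing the round counts gives a rough feel for the trade-off in the abstract: the seeding step costs extra distance computations up front, which may be repaid by fewer rounds afterwards. The sketch makes no claim to reproduce the 17.9% or 38.4% figures reported in the paper.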
