A study on the ordering of PIM family similarity measures without marginal probability

Ordering of similarity measures in the probabilistic interestingness measure family that do not consider marginal probability

  • Received : 2015.02.10
  • Accepted : 2015.03.18
  • Published : 2015.03.31


Big data has become a prominent keyword; it may be defined as a collection of data sets so large and complex that they are difficult to process with traditional methods. Clustering identifies information in a large database by assigning a set of objects to clusters so that objects in the same cluster are more similar to each other than to objects in other clusters. The similarity measures used in cluster analysis may be classified into various types depending on the nature of the data. In this paper, we compute upper and lower limits for similarity measures based on the probabilistic interestingness measure (PIM) that do not involve marginal probabilities, namely the Yule I and II, Michael, Digby, Baulieu, and Dispersion measures. We also compare these measures using real data and a simulation experiment. According to Warrens (2008), coefficients that have the same quantities in the numerator and denominator, that are bounded, and that are close to each other in the ordering are likely to be more similar. Thus, results on bounds provide a means of classifying various measures. Also, knowing which coefficients are similar provides insight into the stability of a given algorithm.

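Cluster analysis, one of the data mining techniques, groups observations with diverse characteristics into homogeneous clusters on the basis of similarity and is then used to examine the characteristics shared within each cluster. In this paper, we establish the ordering relations among the Yule I and II, Michael, Digby, Baulieu, and Dispersion measures, which are probabilistic interestingness measure based similarity measures that do not consider marginal probabilities, by setting upper and lower bounds for them. As a result, we confirmed that three types of ordering relations hold, not only through proofs of the formulas but also with real data and simulated data. Because these measures take values even closer to those of the measures at each bound, the upper and lower bounds of each measure serve as a tool for classifying various measures, and understanding the relationships among the measures in terms of their actual values may help stabilize a given algorithm.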
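To make the measures named in the abstract concrete, the following is a minimal sketch that evaluates several of them on a single 2x2 presence/absence table. It assumes the commonly cited textbook forms of Yule's Q (Yule I), Yule's Y (Yule II), Digby's H, Michael's coefficient, and the Dispersion coefficient, with cell counts a (both present), b and c (one present), and d (both absent). The function name `pim_measures` and the example counts are hypothetical, and the exact definitions used in the paper may differ. The Baulieu measure is left out because its form is not stated here.

```python
def pim_measures(a, b, c, d):
    """Evaluate several 2x2 similarity measures (assumed textbook forms).

    a: both attributes present, b and c: exactly one present,
    d: both absent. All counts are non-negative numbers.
    """
    ad, bc = a * d, b * c
    if ad + bc == 0:
        # Every ratio below would divide by zero in this degenerate case.
        raise ValueError("ad + bc must be positive")
    n = a + b + c + d

    def power_ratio(p):
        # Shared form (ad^p - bc^p) / (ad^p + bc^p); p selects the measure.
        return (ad ** p - bc ** p) / (ad ** p + bc ** p)

    return {
        "yule_q": power_ratio(1.0),    # Yule I (Q)
        "yule_y": power_ratio(0.5),    # Yule II (Y)
        "digby_h": power_ratio(0.75),  # Digby (H)
        "michael": 4 * (ad - bc) / ((a + d) ** 2 + (b + c) ** 2),
        "dispersion": (ad - bc) / n ** 2,
    }


# Hypothetical counts, chosen only for illustration.
for name, value in pim_measures(40, 10, 20, 30).items():
    print(f"{name}: {value:.4f}")
```

Under these assumed forms, Yule's Y, Digby's H, and Yule's Q differ only in the exponent applied to ad and bc, so their absolute values are ordered by that exponent. This is an elementary property of the shared form and is offered as an illustration of the kind of ordering the abstract describes, not as a restatement of the paper's results.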



References

  1. Baulieu, F. B. (1989). A classification of presence/absence based dissimilarity coefficients. Journal of Classification, 6, 233-246.
  2. Choi, S. S., Cha, S. H. and Tappert, C. (2010). A survey of binary similarity and distance measures. Journal on Systemics, Cybernetics and Informatics, 8, 43-48.
  3. Gordon, A. D. (1999). Classification, Chapman & Hall, London-New York.
  4. Kim, M., Jeon, J., Woo, K. and Kim, M. (2010). A new similarity measure for categorical attribute-based clustering. Journal of Korean Institute of Information Scientists and Engineers : Databases, 37, 71-81.
  5. Lee, J. H. (2013). Big data, data mining and temporary reproduction. The Journal of Intellectual Property, 8, 93-125.
  6. Lee, K. A. and Kim, J. H. (2011). Comparison of clustering with yeast microarray gene expression data. Journal of the Korean Data & Information Science Society, 22, 741-753.
  7. Lim, J. S. and Lim, D. H. (2012). Comparison of clustering methods of microarray gene expression data. Journal of the Korean Data & Information Science Society, 23, 39-51.
  8. Michael, E. L. (1920). Marine ecology and the coefficient of association. Journal of Ecology, 8, 54-59.
  9. Park, H. C. (2012). Exploration of PIM based similarity measures as association rule thresholds. Journal of the Korean Data & Information Science Society, 23, 1127-1135.
  10. Park, H. C. (2014). Comparison of cosine family similarity measures in the aspect of association rule. Journal of the Korean Data Analysis Society, 16, 729-737.
  11. Park, H. J. and Kim, J. T. (2013). Classification of universities in Daegu·Gyungpook by support vector cluster analysis. Journal of the Korean Data & Information Science Society, 24, 783-791.
  12. Ryu, J. Y. and Park, H. C. (2013). A study on Jaccard dissimilarity measures for negative association rule generation. Journal of the Korean Data Analysis Society, 15, 3111-3121.
  13. Stanfill, C. and Waltz, D. (1986). Toward memory-based reasoning. Communications of the ACM, 29, 1213-1228.
  14. Warrens, M. J. (2008). Bounds of resemblance measures for binary (presence/absence) variables. Journal of Classification, 25, 195-208.
  15. Yeo, I. K. (2011). Clustering analysis of Korea's meteorological data. Journal of the Korean Data & Information Science Society, 22, 941-949.
  16. Yule, G. U. (1900). On the association of attributes in statistics. Philosophical Transactions of the Royal Society, 194, 257-319.
  17. Yule, G. U. (1912). On the methods of measuring the association between two attributes. Journal of the Royal Statistical Society, 75, 579-652.

Cited by

  1. Bounds of PIM-based similarity measures with partially marginal proportion, vol.26, pp.4, 2015
  2. Generally non-linear regression model containing standardized lift for association number estimation, vol.27, pp.3, 2016
  3. Signed Hellinger measure for directional association, vol.27, pp.2, 2016