Comparison of Cosine Family Similarity Measures in the Aspect of Association Rule

A Comparative Study of Cosine Family Similarity Measures in Association Rules

  • Received : 2014.03.07
  • Accepted : 2014.04.10
  • Published : 2014.04.30

Abstract

Today, governments and private organizations actively use data mining techniques to extract meaningful information from big data. The most widely used of these techniques is association rule mining, which finds relationships among items using association thresholds such as support, confidence, lift, and other interestingness measures. It is applied in shopping mall analysis, the health and medical fields, and social network analysis. Among these interestingness measures, confidence is the most frequently used, but it has the drawback that it cannot show the direction of an association. In this paper, we compare confidence with several cosine family similarity measures and check whether each measure meets the criteria for an association threshold. We then compare the cosine family similarity measures using a numerical example. The results show that the Forbes II coefficient is the best association threshold.

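Today, public institutions and private organizations want to obtain meaningful information from big data. The rapid spread of mobile devices has caused the volume of data to grow exponentially, which has made big data analysis necessary. Data mining techniques are also being actively used to obtain valuable information from big data. Among these techniques, association rule mining finds relationships among the items in big data using interestingness measures such as support, confidence, and lift, and it is widely used in shopping mall analysis, the health and medical fields, and social network analysis. The interestingness measures used to generate association rules are broadly divided into objective and subjective measures. In this paper, we consider the cosine family similarity measures, a group of objective measures widely used in cluster analysis and multidimensional analysis. We examine their properties from the viewpoint of association thresholds, check whether they satisfy the criteria for interestingness measures, and assess their usefulness with an example. As a result, we found that the Forbes II coefficient can serve as the most desirable association threshold.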
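The measures named in the abstract can be illustrated with a minimal sketch. It computes support, confidence, lift, and the ordinary cosine (Ochiai) similarity for a rule X -> Y from co-occurrence counts. The function name `rule_measures` and the counts are hypothetical and chosen only for illustration; they are not taken from the paper's numerical example, and the paper's specific cosine family measures (including the Forbes II coefficient) are not reproduced here.

```python
from math import sqrt


def rule_measures(n, n_x, n_y, n_xy):
    """Basic measures for the rule X -> Y.

    n    : total number of transactions
    n_x  : transactions containing X
    n_y  : transactions containing Y
    n_xy : transactions containing both X and Y
    """
    support = n_xy / n                # P(X and Y)
    confidence = n_xy / n_x           # P(Y | X)
    lift = confidence / (n_y / n)     # P(Y | X) / P(Y)
    cosine = n_xy / sqrt(n_x * n_y)   # P(X and Y) / sqrt(P(X) P(Y))
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "cosine": cosine,
    }


# Hypothetical counts: Y is very common, so X actually lowers the chance of Y.
m = rule_measures(n=100, n_x=40, n_y=80, n_xy=20)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these assumed counts the confidence is 0.5, yet the lift is below 1 because Y occurs in 80% of all transactions. The confidence value alone therefore does not reveal that the association is negative, which is the drawback the abstract describes.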
