A Proposition of Association Rule Thresholds Considering Relative Occurrence/Nonoccurrence Rates

Development of Association Evaluation Criteria Models Considering Relative Occurrence/Nonoccurrence Rates

  • Received : 2013.07.04
  • Accepted : 2013.08.04
  • Published : 2013.08.31

Abstract

Big data is a collection of data sets so large that they become difficult to process using traditional data processing applications. Data mining, which is drawing attention in the big data era, is a method for finding useful information in the huge amounts of data held in databases. Data mining techniques include association rules, decision trees, clustering, neural networks, and so on. The association rule technique searches for interesting relationships among items in a given large data set, and has been applied in various fields such as internet shopping malls, finance, health and medical science, insurance, image analysis, and manufacturing control. There are three primary quality measures for meaningful association rules: support, confidence, and lift. In this paper, we propose several association thresholds that consider relative occurrence and nonoccurrence rates, for exploring association rules in rare cases. Comparative studies with several kinds of supports and confidences are presented through a numerical example. The results show that the proposed measures increase with the simultaneous occurrence frequency, and that the supports and confidences considering relative occurrence and nonoccurrence rates are greater than the existing supports and confidences.
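
For reference, the three existing measures named above are usually defined as follows. These are the standard textbook forms, written for a rule X ⇒ Y with P estimated by relative frequency over the transactions; the notation is an assumption made here, and the thresholds proposed in the paper itself are not reproduced.

```latex
\[
\begin{aligned}
\mathrm{supp}(X \Rightarrow Y) &= P(X \cap Y), \\
\mathrm{conf}(X \Rightarrow Y) &= P(Y \mid X) = \frac{P(X \cap Y)}{P(X)}, \\
\mathrm{lift}(X \Rightarrow Y) &= \frac{P(Y \mid X)}{P(Y)} = \frac{P(X \cap Y)}{P(X)\,P(Y)}.
\end{aligned}
\]
```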

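In today's big data era, data mining techniques that find useful information in enormous databases are drawing attention. Association rules, the most widely studied of the data mining techniques, are used to explore the relationships among items latent in huge databases on the basis of support, confidence, and lift. To generate association rules, one first generates the frequent itemsets that satisfy a user-specified minimum support, and then adopts as association rules those rules from the frequent itemsets that satisfy the minimum confidence. When the occurrence rate of an item is very small, the item is very unlikely to be included in a frequent itemset, so a rule involving it is not adopted as an association rule even if its confidence is very large. To address this, this paper proposes association rule evaluation models that simultaneously consider the relative occurrence and nonoccurrence rates of items, and compares them with the existing association evaluation criteria using an example. As a result, all of the association evaluation criteria proposed in this paper were found to increase as the co-occurrence frequency increases. In addition, their support and confidence values are larger than the existing ones, so the rules were found to be the most likely to be adopted as rules satisfying the user-specified minimum support and confidence.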
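
The two-step procedure and the rare-item problem described above can be sketched as follows. This is a minimal illustration using the conventional support and confidence only, not the measures proposed in the paper; the toy transactions, the function names (`support`, `confidence`, `mine_pair_rules`), and the threshold values are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy transactions (not taken from the paper).
transactions = [
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread"},
    {"milk"},
    {"bread", "milk"},
    {"bread"},
    {"milk"},
    {"caviar", "champagne"},  # a rare pair that always occurs together
    {"bread", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Relative-frequency estimate of P(consequent | antecedent)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)


def mine_pair_rules(transactions, min_support, min_confidence):
    """Step 1: keep item pairs meeting min_support (assumed > 0).
    Step 2: keep rules from those pairs meeting min_confidence."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        s = support({a, b}, transactions)
        if s < min_support:
            continue  # pair is not frequent, so no rule is generated from it
        for x, y in ((a, b), (b, a)):
            c = confidence({x}, {y}, transactions)
            if c >= min_confidence:
                rules.append((x, y, s, c))
    return rules


if __name__ == "__main__":
    for x, y, s, c in mine_pair_rules(transactions, 0.3, 0.7):
        print(f"{x} => {y}: support={s:.2f}, confidence={c:.2f}")
    rare = confidence({"caviar"}, {"champagne"}, transactions)
    print(f"caviar => champagne: confidence={rare:.2f}, but the pair is filtered out in step 1")
```

In this toy data the rare pair is discarded at the support step even though its confidence is at its maximum, which is the situation that motivates thresholds adjusted for relative occurrence and nonoccurrence rates.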
