The Effect of Bias in Data Set for Conceptual Clustering Algorithms

  • Received : 2019.07.04
  • Accepted : 2019.07.13
  • Published : 2019.09.30

Abstract

When a partitioned structure is derived from a data set by a clustering algorithm, it is not unusual to obtain a different set of clusters when the algorithm runs on the same data in a different order. This problem is known as the order bias problem. Many machine learning algorithms try to produce an optimized result from the available training and test data, where the optimization is guided by an evaluation function that itself has a tendency toward a certain goal. Such a tendency is unavoidable if the evaluation function is to be efficient and to yield consistent results, but its preference for a specific goal can sometimes lead to unfavorable outcomes in the final clustering. To overcome this bias problem, a first clustering pass constructs an initial partition, which is expected to indicate the possible range of the number of final clusters. Data-centric sorting is then applied to the data objects in the clusters of this partition to rearrange them in a new order, and the same clustering procedure is reapplied to the rearranged data set to build a new partition. We have developed an algorithm that reduces the bias caused by the order in which data are fed into the algorithm, and experimental results show that it helps minimize the order bias effect. We also show that the evaluation measure currently used by the clustering algorithm is biased toward producing a smaller number of larger clusters.
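The two-pass procedure is only outlined in the abstract; the following is a minimal sketch of that idea, assuming a generic order-sensitive (COBWEB-style) incremental clusterer. The names `incremental_cluster` and `data_centric_sort` are hypothetical placeholders for the paper's actual clustering procedure and data-centric sorting criterion.

```python
# Sketch of the order-bias mitigation loop described in the abstract.
# `incremental_cluster` stands in for any order-sensitive conceptual
# clusterer; `data_centric_sort` is a placeholder for the paper's
# data-centric sorting step. Both names are illustrative only.

from typing import Callable, List, Sequence

DataObject = dict            # e.g. attribute/value pairs of one instance
Partition = List[List[DataObject]]


def reorder_and_recluster(
    data: Sequence[DataObject],
    incremental_cluster: Callable[[Sequence[DataObject]], Partition],
    data_centric_sort: Callable[[Partition], List[DataObject]],
) -> Partition:
    """Cluster once, rearrange the objects, then cluster again."""
    # 1. First pass: build an initial partition; its size hints at the
    #    plausible range for the number of final clusters.
    initial_partition = incremental_cluster(data)

    # 2. Rearrange the objects cluster by cluster so that the presentation
    #    order no longer reflects the arbitrary original input order.
    reordered_data = data_centric_sort(initial_partition)

    # 3. Second pass: rerun the same clustering procedure on the
    #    rearranged data to obtain the final partition.
    return incremental_cluster(reordered_data)
```

The key design point, as described in the abstract, is that the second pass uses the same clustering procedure as the first; only the presentation order of the data changes between passes.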
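The abstract does not name the evaluation measure whose bias is examined. In COBWEB-style conceptual clustering the standard measure is category utility, so, as an assumption about which measure is meant, it would take the form

$$ CU(\{C_1,\dots,C_K\}) = \frac{1}{K}\sum_{k=1}^{K} P(C_k)\sum_i\sum_j\left[\,P(A_i = V_{ij}\mid C_k)^2 - P(A_i = V_{ij})^2\,\right] $$

where $A_i$ ranges over the attributes and $V_{ij}$ over the values of $A_i$. If this is indeed the measure in use, the $P(C_k)$ weighting gives larger clusters proportionally more influence on the score, which is consistent with the bias toward fewer, larger clusters reported above.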

Keywords
