A Novel Similarity Measure for Sequence Data

  • Pandi, Mohammad H. (School of Computer Engineering, Iran University of Science and Technology)
  • Kashefi, Omid (School of Computer Engineering, Iran University of Science and Technology)
  • Minaei, Behrouz (School of Computer Engineering, Iran University of Science and Technology)
  • Received : 2011.01.05
  • Accepted : 2011.06.21
  • Published : 2011.09.30

Abstract

A variety of metrics have been introduced to measure the similarity of two given sequences. These metrics are widely used in applications ranging from spell correctors and categorizers to new sequence mining applications. Different metrics consider different aspects of sequences, but the essence of any sequence lies in the ordering of its elements. In this paper, we propose a novel sequence similarity measure based on all ordered pairs of one sequence and a Hasse diagram built for the other sequence. In contrast with existing approaches, the proposed metric extracts all ordering features to capture sequence properties. We designed a clustering problem to evaluate the metric. Experimental results show that it achieves higher clustering purity than metrics such as d2, Smith-Waterman, Levenshtein, and Needleman-Wunsch. The limitation of those methods originates from sequence features that they neglect and that the proposed metric takes into account.
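
To make the two ideas in the abstract concrete, the following is a minimal sketch and not the authors' method. The abstract does not define the measure, so `ordered_pair_similarity` is a hypothetical simplification: it compares the sets of ordered element pairs of two sequences with a Jaccard ratio and ignores the Hasse diagram construction entirely. The `purity` function follows the standard definition of clustering purity (the share of items that fall in the majority class of their cluster). All function names are assumptions introduced for illustration.

```python
from collections import Counter
from itertools import combinations


def ordered_pairs(seq):
    """Return the set of (a, b) pairs where a occurs before b in seq."""
    return {(a, b) for a, b in combinations(seq, 2)}


def ordered_pair_similarity(s, t):
    """Jaccard overlap of the ordered-pair sets of two sequences.

    Illustrative assumption only: the paper's measure also involves a
    Hasse diagram, which this sketch does not model.
    """
    p, q = ordered_pairs(s), ordered_pairs(t)
    if not p and not q:
        return 1.0  # two sequences with no pairs are treated as identical
    return len(p & q) / len(p | q)


def purity(cluster_labels, class_labels):
    """Fraction of items that belong to the majority class of their cluster."""
    clusters = {}
    for cluster, klass in zip(cluster_labels, class_labels):
        clusters.setdefault(cluster, Counter())[klass] += 1
    majority = sum(max(counts.values()) for counts in clusters.values())
    return majority / len(class_labels)


if __name__ == "__main__":
    print(ordered_pair_similarity("abc", "acb"))
    print(purity([0, 0, 1, 1], ["x", "x", "x", "y"]))
```

Unlike an edit distance, this toy score depends only on which elements precede which, so a full reversal of a sequence with distinct elements shares no ordered pairs with the original.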
