Word Similarity Calculation by Using the Edit Distance Metrics with Consonant Normalization

  • Received : 2014.03.24
  • Accepted : 2014.08.01
  • Published : 2015.12.31

Abstract

Edit distance metrics are widely used in applications such as string comparison and spelling error correction. Hamming distance is a metric for two strings of equal length, and Damerau-Levenshtein distance is a well-known metric for spelling correction through string-to-string comparison. These existing distance metrics seem appropriate for alphabetic languages such as English and other European languages. However, the conventional edit distance criterion is not the best method for agglutinative languages like Korean. The reason is that a Korean character, which is called a syllable, is composed of two or more letter units. This syllable-based word construction makes a conventional edit distance calculation inefficient for Korean. We have therefore explored a new edit distance method that uses consonant normalization and a normalization factor.
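The point about syllables built from several letter units can be illustrated with a minimal sketch. It is not the method proposed in the paper: the abstract does not define consonant normalization or the normalization factor, so neither is implemented here. The sketch only compares a plain Levenshtein distance computed over whole syllables with the same distance computed over letter units (jamo). The helper names `levenshtein` and `to_jamo` are hypothetical, and the decomposition relies on Unicode NFD normalization as an assumed stand-in for whatever letter-unit splitting the authors use.

```python
import unicodedata


def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x from a
                            curr[j - 1] + 1,      # insert y into a
                            prev[j - 1] + cost))  # substitute, or match
        prev = curr
    return prev[-1]


def to_jamo(word):
    """Assumption: NFD splits each precomposed Hangul syllable into its jamo."""
    return unicodedata.normalize("NFD", word)


# Both candidates differ from the target in exactly one syllable,
# but by a different number of letter units.
target = "한국"
for candidate in ("한굴", "한말"):
    by_syllable = levenshtein(target, candidate)
    by_jamo = levenshtein(to_jamo(target), to_jamo(candidate))
    print(candidate, by_syllable, by_jamo)
# 한굴 1 1   (only the final consonant differs)
# 한말 1 3   (initial consonant, vowel, and final consonant all differ)
```

At the syllable level both candidates look equally close to the target, whereas the letter-unit comparison separates a one-letter slip from a completely different syllable. That loss of resolution is the kind of shortcoming the abstract attributes to conventional edit distance on Korean text.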
