A Phoneme-based Approximate String Searching System for Restricted Korean Character Input Environments

제한된 한글 입력환경을 위한 음소기반 근사 문자열 검색 시스템

  • 윤태진 (부산대학교 정보컴퓨터공학부) ;
  • 조환규 (부산대학교 정보컴퓨터공학부) ;
  • 정우근 (부산대학교 정보컴퓨터공학부)
  • Received : 2010.06.01
  • Accepted : 2010.08.04
  • Published : 2010.10.15

Abstract

Advancing of mobile device is remarkable, so the research on mobile input device is getting more important issue. There are lots of input devices such as keypad, QWERTY keypad, touch and speech recognizer, but they are not as convenient as typical keyboard-based desktop input devices so input strings usually contain many typing errors. These input errors are not trouble with communication among person, but it has very critical problem with searching in database, such as dictionary and address book, we can not obtain correct results. Especially, Hangeul has more than 10,000 different characters because one Hangeul character is made by combination of consonants and vowels, frequency of error is higher than English. Generally, suffix tree is the most widely used data structure to deal with errors of query, but it is not enough for variety errors. In this paper, we propose fast approximate Korean word searching system, which allows variety typing errors. This system includes several algorithms for applying general approximate string searching to Hangeul. And we present profanity filters by using proposed system. This system filters over than 90% of coined profanities.

모바일 기기가 발전함에 따라 입력 수단에 대한 연구는 중요한 이슈이다 키패드, 쿼티키패드, 터치, 음성인식 등 다양한 입력장치가 사용되고 있으나 아직 데스크톱 입력장치에 비해 편의성이 떨어져서 입력 시의 오타나 탈자 등의 오류가 포함되는 경우가 많다. 이러한 입력 오류는 문자 메시지 등 사람과의 의사소통에는 문제를 일으키지 않으나 사전, 주소록 등의 데이터베이스 검색에는 치명적인 오류로서 원하는 검색 결과를 얻지 못하게 된다. 특히 한글의 경우 자음과 모음의 조합을 통해 글자를 생성하는 특성상 1만자가 넘는 글자의 조합이 가능하여 영문에 비하여 오류의 빈도가 높다. 기존의 검색 시스템은 Suffix Tree등을 이용하여 입력 오류를 처리하지만 다양한 오류에 대응하기에는 한계가 있다. 본 논문에서는 오자, 탈자 등의 입력 오류를 허용하면서 빠른 검색이 가능한 근사 한글 단어 검색 시스템을 제안하고자 한다. 이 시스템은 기존의 알파벳에 적용된 근사 문자열 검색(Approximate String Searching)을 한글에 효과적으로 적용할 수 있는 여러 가지 알고리즘과 기법이 포함되어 있다. 그리고 제안된 시스템을 이용한 변형 욕설 필터링 시스템의 개발에 대해 이야기하고자 한다. 이 시스템은 유저의 각종 변형 욕설 입력에 대해 90% 이상의 필터링 성능을 보였다.

Keywords

Acknowledgement

Supported by : 연구재단

References

  1. A. Apostolico. The myriad virtues of subword trees. Combinatorial Algorithms on Words, pp.85-96, 1985.
  2. W. A. Burkhard and R. M. Keller. Some approaches to best-match file searching. Commun. ACM, vol.16, no.4, pp.230-236, 1973. https://doi.org/10.1145/362003.362025
  3. Chang-Keon Ryu, Hyong-Jun Kim, Seung-Hyun Ji, Gyun Woo, and Hwan-Gue Cho. Detecting and tracing plagiarized documents by reconstruction plagiarism-evolution tree. Computer and Information Technology, 2008. CIT 2008. 8th IEEE International Conference on, pp.119-124, July 2008.
  4. Hyong-Jun Kim, Chang-Keon Ryu, and Hwan-Gue Cho. A detecting and tracing algorithm for unauthorized internet-news plagiarism using spatiotemporal document evolution model. In SAC '09: Proceedings of the 2009 ACM symposium on Applied Computing, pp.863-868, New York, NY, USA, 2009. ACM.
  5. Chang-Keon Ryu, Hyong-Jun Kim, and Hwan-Gue Cho. Reconstructing evolution process of documents in spatio-temporal analysis. In ICCIT '08: Proceedings of the 2008 Third International Conference on Convergence and Hybrid Information Technology, pp.136-142, Washington, DC, USA, 2008. IEEE Computer Society.
  6. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, and D.J. Basic local alignment search tool. Journal of Molecular Biology., 215, 1990.
  7. K. M. Chao and L. Zhang, Sequence Comparison Theory and Methods, Springer, 2009.
  8. Sreenivas Gollapudi and Rina Panigrahy. A dictionary for approximate string search and longest prefix search. In CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management, pp.768-775, New York, NY, USA, 2006. ACM.
  9. Trinh N. D. Huynh, Wing-Kai Hon, Tak-Wah Lam, and Wing-Kin Sung. Approximate string matching using compressed suffix arrays. Theoretical Computer Science, vol.352, no.1, pp.240-249, 2006. https://doi.org/10.1016/j.tcs.2005.11.022
  10. Marios Hadjieleftheriou, Nick Koudas, and Divesh Srivastava. Incremental maintenance of length normalized indexes for approximate string matching. In SIGMOD '09: Proceedings of the 35th SIGMOD international conference on Management of data, pp.429-440, New York, NY, USA, 2009. ACM.
  11. Gonzalo Navarro and Edgar Chavez. A metric index for approximate string matching. Theoretical Computer Science, vol.352, no.1, pp.266-279, 2006. https://doi.org/10.1016/j.tcs.2005.11.037
  12. Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The r*-tree: an efficient and robust access method for points and rectangles. In SIGMOD '90: Proceedings of the 1990 ACM SIGMOD international conference on Management of data, pp.322-331, New York, NY, USA, 1990. ACM.
  13. Antonin Guttman. R-trees: a dynamic index structure for spatial searching. Readings in database systems, pp.599-609, 1988.
  14. http://en.wikipedia.org/wiki/swearfilter.
  15. Korea Game Industry Agency, Sound Game language guide research, 2008.
  16. Shekhar Dhupelia. Designing a vulgarity filtering system. In Game Programming Gems 5. Charles River Media, 2005.
  17. Lai C. An empirical study of three machine learning methods for spam filtering. Know-Based System, vol.20, no.3, pp.249-254, 2007. https://doi.org/10.1016/j.knosys.2006.05.016
  18. Kyo Hyeon Park and Jee Hyong Lee. Developing a vulgarity filtering system for online games using svm. In Proceedings of the Korean Institute of Information Scientists and Engineers Autumn, 2006.
  19. Ramachandran A Feamster N and Vempala S. Filtering spam with behavioral blacklisting. In Proceedings of the 14th ACM Conference on Computer and Communications Security (Alexandria, Virginia), pp.342-351, 2001.
  20. Imoxion, Lewdness/Profanity Filtering System Using Syllable Information, patent, In 2001-0067853, 2001.
  21. http://en.wikipedia.org/wiki/scunthorpe problem.