The Utilization of Local Document Information to Improve Statistical Context-Sensitive Spelling Error Correction

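  • Jung-Hun Lee (Department of Electrical, Electronics and Computer Engineering, Pusan National University)
  • Minho Kim (Department of Electrical, Electronics and Computer Engineering, Pusan National University)
  • Hyuk-chul Kwon (School of Electrical and Computer Engineering, Pusan National University)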
  • Received : 2017.03.02
  • Accepted : 2017.05.20
  • Published : 2017.07.15

Abstract

The statistical context-sensitive spelling error correction technique in this paper is based on Shannon's noisy channel model. Interpolation is used to improve the proposed correction technique. Conventional interpolation fills in missing probability values: frequencies that do not exist in the N-gram are obtained from the (N-1)-gram, the (N-2)-gram, and so on. That approach computes everything from the same statistical corpus, whereas the proposed method interpolates using frequency information from both the statistical corpus and the document being corrected. Using frequencies from the document being corrected has two advantages. First, a probability can be obtained for a newly coined word that appears only in that document and not in the statistical corpus. Second, even when two correction candidates have ambiguous probability values, referring to the document being corrected resolves the ambiguity. The proposed method achieved better precision and recall than the existing correction model.
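
The abstract does not give the exact interpolation formula, so the following is only a minimal sketch of one plausible reading. It assumes a bigram language model, a simple linear mix between the corpus estimate and the estimate from the document being corrected, and a fixed weight `lam`. All names (`count_ngrams`, `cond_prob`, `interpolated_prob`, `best_candidate`), the value of `lam`, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def count_ngrams(tokens):
    """Return (unigram, bigram) frequency tables for a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def cond_prob(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    unigrams, bigrams = counts
    denom = unigrams[prev]
    return bigrams[(prev, word)] / denom if denom else 0.0


def interpolated_prob(prev, word, corpus_counts, doc_counts, lam=0.8):
    """Assumed form: linear mix of the corpus and document estimates."""
    return (lam * cond_prob(corpus_counts, prev, word)
            + (1 - lam) * cond_prob(doc_counts, prev, word))


def best_candidate(prev, candidates, channel_prob, corpus_counts, doc_counts, lam=0.8):
    """Noisy-channel choice: maximize channel probability times language-model probability."""
    return max(
        candidates,
        key=lambda c: channel_prob[c]
        * interpolated_prob(prev, c, corpus_counts, doc_counts, lam),
    )


# Toy data (hypothetical): the corpus cannot separate "form" from "from" after "the".
corpus_counts = count_ngrams(["the", "form", "the", "from"])
# The document being corrected uses "the form" repeatedly and contains a coined word.
doc_counts = count_ngrams(
    ["sign", "the", "form", "send", "the", "form", "post", "a", "selfie"]
)
channel_prob = {"form": 0.5, "from": 0.5}  # equal error-model scores

choice = best_candidate("the", ["from", "form"], channel_prob, corpus_counts, doc_counts)
coined = interpolated_prob("a", "selfie", corpus_counts, doc_counts)
```

In this toy setup the corpus alone gives both candidates the same probability after "the", and the document frequencies break the tie in favor of "form". The word "selfie" is absent from the corpus but still receives a nonzero probability from the document term, which mirrors the two advantages described in the abstract.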

Acknowledgement

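Grant : (Exobrain, Sub-project 3) Development of Core Technology for Context-Aware Deep-Symbolic Hybrid Intelligence and Construction of Language Knowledge Resources

Supported by : Institute for Information & Communications Technology Promotion (IITP)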
