Context-Weighted Metrics for Example Matching

A Sentence Similarity Measure Reflecting Context Weights

  • 김동주 (Department of Computer Engineering, Hanyang University)
  • 김한우 (Department of Computer Engineering, Hanyang University)
  • Published: 2006.11.25

Abstract

This paper proposes a metric for example matching in example-based machine translation for English-Korean translation. The metric serves as a similarity measure based on the edit-distance algorithm and is used to retrieve the example sentences most similar to a given query. For words whose surface forms do not match, it uses simple information such as the lemma and part of speech. The edit-distance algorithm, however, cannot fully reflect the context of matched word units: as long as the matched units appear in the same order, a fully contiguous match contributes exactly as much to similarity as a partial match in which mismatched word units intervene. To overcome this drawback, we propose a context-weighting scheme that uses the contiguity of matched word units to capture the full context. Because edit distance measures dissimilarity, we convert it into a similarity measure and normalize the context-weighted metric so that it can be applied to example matching and used to rank examples by similarity. In addition, we generalize previous methods that use linguistic information into one representative system. To verify the effectiveness of the proposed context-weighted metric, we compare it experimentally with these generalized previous methods.
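The abstract gives no formulas, so the following is only a minimal sketch of the general idea: a word-level edit distance in which a surface mismatch is penalized less when the lemma or the part of speech agrees, normalized into a similarity in [0, 1]. The `Token` type, the cost values 0.25 and 0.5, and the normalization by the longer sentence length are all assumptions made for illustration and are not taken from the paper.

```python
from typing import NamedTuple, Sequence


class Token(NamedTuple):
    surface: str
    lemma: str
    pos: str


def sub_cost(a: Token, b: Token) -> float:
    """Substitution cost; the 0.25 and 0.5 values are illustrative assumptions."""
    if a.surface == b.surface:
        return 0.0
    if a.lemma == b.lemma:
        return 0.25  # same lemma, different inflection
    if a.pos == b.pos:
        return 0.5   # different word, same part of speech
    return 1.0


def edit_distance(query: Sequence[Token], example: Sequence[Token]) -> float:
    """Word-level edit distance with unit insert/delete cost."""
    prev = [float(j) for j in range(len(example) + 1)]
    for i, q in enumerate(query, 1):
        cur = [float(i)]
        for j, e in enumerate(example, 1):
            cur.append(min(prev[j] + 1.0,                 # delete q
                           cur[j - 1] + 1.0,              # insert e
                           prev[j - 1] + sub_cost(q, e))) # substitute
        prev = cur
    return prev[-1]


def similarity(query: Sequence[Token], example: Sequence[Token]) -> float:
    """Turn the distance into a similarity in [0, 1] (assumed normalization)."""
    longest = max(len(query), len(example))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(query, example) / longest


def rank(query: Sequence[Token], examples: Sequence[Sequence[Token]]):
    """Return the examples sorted from most to least similar."""
    return sorted(examples, key=lambda ex: similarity(query, ex), reverse=True)
```

Since every operation costs at most 1, the distance never exceeds the length of the longer sentence, which keeps the normalized value inside [0, 1] and makes scores comparable across examples of different lengths.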

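This paper concerns a measure for comparing example sentences in example-based machine translation for English-Korean translation, and proposes a similarity measure used to find the example sentence most similar to a given query sentence. The proposed measure is based on the edit-distance algorithm; for words whose surface forms do not match, it computes similarity using, at a basic level, the lemma and part-of-speech information of the words. Although the edit-distance measure depends on the order of the compared units, it judges units to contribute equally to similarity as long as their order matches, so it does not reflect the full context. This paper therefore proposes, in addition to this information, a context weight that reflects contiguity for consecutive words carrying matching unit information. Furthermore, the edit-distance measure, which expresses a degree of dissimilarity, is converted into a similarity measure, and normalization is performed so that the context-weighted measure can be applied to sentence comparison and a ranking by similarity can be determined. We also attempt to generalize existing families of methods that use linguistic information, and we carry out comparative experiments against these generalized methods to demonstrate the superiority of the context-weighted measure.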
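The effect of contiguity can be illustrated with a small sketch. This is not the paper's weighting scheme, whose formula the abstract does not state. It assumes a hypothetical bonus `alpha` for each adjacent pair of matched words inside a contiguous run, and it uses Python's `difflib.SequenceMatcher` to find the matching runs, which is a stand-in for an edit-distance alignment rather than the same algorithm.

```python
from difflib import SequenceMatcher
from typing import Sequence


def context_weighted_similarity(query: Sequence[str],
                                example: Sequence[str],
                                alpha: float = 0.5) -> float:
    """Similarity in [0, 1] that rewards contiguous runs of matched words.

    A run of k matched words scores k + alpha * (k - 1); alpha is an
    assumed parameter, not a value from the paper.
    """
    longest = max(len(query), len(example))
    if longest == 0:
        return 1.0
    blocks = SequenceMatcher(None, query, example,
                             autojunk=False).get_matching_blocks()
    score = sum(b.size + alpha * (b.size - 1) for b in blocks if b.size > 0)
    # Best possible score: the longer sentence matched as one unbroken run.
    best = longest + alpha * (longest - 1)
    return score / best


query = "i read the book".split()
contiguous = "yesterday evening i read the book".split()
interrupted = "i often read the old book".split()

# Both examples contain the same four query words in the same order and
# need two insertions each, so plain edit distance cannot separate them.
print(context_weighted_similarity(query, contiguous))
print(context_weighted_similarity(query, interrupted))
```

With `alpha = 0` the score reduces to the fraction of matched words and the two examples tie; any positive `alpha` ranks the uninterrupted match higher, which is the behavior the context weight is meant to provide.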
