O-JMeSH: creating a bilingual English-Japanese controlled vocabulary of MeSH UIDs through machine translation and mutual information

  • Soares, Felipe (Computer Science Department, The University of Sheffield) ;
  • Tateisi, Yuka (National Bioscience Database Center, Japan Science and Technology Agency) ;
  • Takatsuki, Terue (Database Center for Life Science, Research Organization of Information and Systems) ;
  • Yamaguchi, Atsuko (Database Center for Life Science, Research Organization of Information and Systems)
  • Received : 2021.03.17
  • Accepted : 2021.09.10
  • Published : 2021.09.30

Abstract

Previous approaches to creating a controlled vocabulary for Japanese have relied on existing bilingual dictionaries and transformation rules to build such mappings. However, given the new terms that may have been introduced by coronavirus disease 2019 (COVID-19) and the emphasis on respiratory and infection-related terms, coverage is not guaranteed. In this work, we propose creating a bilingual English-Japanese controlled vocabulary based on the MeSH terms assigned to COVID-19-related publications. To this end, we combined manual curation of several bilingual dictionaries with a computational approach that machine-translates sentences containing these terms and ranks the candidate translations of each individual term by mutual information. Our results show that we achieved nearly 99% occurrence coverage in LitCovid, while our computational approach reached an average accuracy of 63.33% for all terms and 84.51% for drugs and chemicals.
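The ranking step can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so what follows is an assumption: pointwise mutual information (PMI) computed over sentence-level co-occurrence between an English term and the tokens of the machine-translated Japanese sentence. The function name `rank_by_pmi`, the `min_count` parameter, and the toy corpus are hypothetical and serve illustration only; the Japanese text is assumed to be tokenized already.

```python
import math
from collections import Counter


def rank_by_pmi(pairs, term, min_count=1):
    """Rank target-language tokens by PMI with a source-language term.

    pairs: list of (set of source terms, list of target tokens), one entry
           per sentence and its machine translation (hypothetical layout).
    term: the source term whose translation candidates are ranked.
    min_count: minimum number of co-occurrences required for a candidate.
    """
    n = len(pairs)
    target_counts = Counter()  # sentences containing each target token
    joint_counts = Counter()   # sentences containing both term and token
    term_count = 0             # sentences containing the source term

    for source_terms, target_tokens in pairs:
        unique_tokens = set(target_tokens)
        target_counts.update(unique_tokens)
        if term in source_terms:
            term_count += 1
            joint_counts.update(unique_tokens)

    if term_count == 0:
        return []

    scores = []
    for token, joint in joint_counts.items():
        if joint < min_count:
            continue
        # PMI = log2( P(term, token) / (P(term) * P(token)) )
        pmi = math.log2((joint * n) / (term_count * target_counts[token]))
        scores.append((token, pmi))

    # Highest PMI first; ties are broken alphabetically for determinism.
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores


# Toy corpus (invented for illustration, not taken from the paper).
corpus = [
    ({"fever"}, ["患者", "は", "発熱", "を", "示した"]),
    ({"fever", "cough"}, ["発熱", "と", "咳", "が", "ある"]),
    ({"cough"}, ["咳", "が", "続く"]),
    ({"pneumonia"}, ["患者", "は", "肺炎", "を", "発症した"]),
]

ranked = rank_by_pmi(corpus, "fever", min_count=2)
for token, score in ranked:
    print(token, round(score, 3))
```

PMI tends to overrate tokens that occur only once, so the sketch exposes a `min_count` threshold. Whether the original work filtered candidates in this way is not stated in the abstract.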

Acknowledgement

Felipe Soares would like to acknowledge Google's TensorFlow Research Cloud (TFRC) program and the AWS Diagnostic Development Initiative (DDI) for providing computational resources. We would also like to acknowledge DeepL for providing access to their API for automatic translation.
