GNI Corpus Version 1.0: Annotated Full-Text Corpus of Genomics & Informatics to Support Biomedical Information Extraction

  • Oh, So-Yeon (Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University)
  • Kim, Ji-Hyeon (Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University)
  • Kim, Seo-Jin (Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University)
  • Nam, Hee-Jo (Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University)
  • Park, Hyun-Seok (Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University)
  • Received : 2018.07.31
  • Accepted : 2018.08.23
  • Published : 2018.09.30

Abstract

Genomics & Informatics (NLM title abbreviation: Genomics Inform) is the official journal of the Korea Genome Organization. A text corpus of this journal, annotated with various levels of linguistic information, would be a valuable resource, because information extraction requires syntactic, semantic, and higher levels of natural language processing. In this study, we publish a new corpus, GNI Corpus version 1.0, extracted and annotated from the full texts of Genomics & Informatics using an NLTK (Natural Language Toolkit)-based text mining script. This preliminary version of the corpus could be used as a training and testing set for systems that serve a variety of functions in future biomedical text mining.
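To make the idea of layered annotation concrete, the following is a minimal sketch of the two lowest layers (sentence boundaries and tokens with character offsets). It is not the authors' script: it uses only the Python standard library as a stand-in for NLTK's tokenizers, and the regular expressions and the names `Token`, `split_sentences`, `tokenize`, and `annotate` are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical, simplified patterns (assumptions, not the GNI Corpus rules).
# A sentence ends at ., ! or ? followed by whitespace and a capital letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")
# A token is a word (optionally hyphenated) or a single punctuation mark.
TOKEN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


@dataclass
class Token:
    text: str
    start: int  # character offset of the token within its sentence
    end: int    # offset one past the last character


def split_sentences(text: str) -> List[str]:
    """Split raw text into sentences with a naive boundary rule."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def tokenize(sentence: str) -> List[Token]:
    """Return the tokens of one sentence together with their offsets."""
    return [Token(m.group(), m.start(), m.end()) for m in TOKEN.finditer(sentence)]


def annotate(text: str) -> List[List[Token]]:
    """Produce one list of tokens per sentence."""
    return [tokenize(s) for s in split_sentences(text)]
```

Calling `annotate` on a paragraph returns a list of sentences, each a list of `Token` records. Higher layers such as part-of-speech tags or chunk labels could be attached to the same records. The naive boundary rule will missplit abbreviations such as "et al.", which is one reason a real pipeline would rely on a trained tokenizer.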
