A Study on the Semiautomatic Construction of Domain-Specific Relation Extraction Datasets from Biomedical Abstracts - Mainly Focusing on a Genic Interaction Dataset in Alzheimer's Disease Domain -

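  • Choi, Sung-Pil (최성필), Department of Library and Information Science, Kyonggi University
  • Yu, Seok-Jong (유석종), Biomedical Convergence Technology Laboratory, Korea Institute of Science and Technology Information (KISTI)
  • Cho, Hyun-Yang (조현양), Department of Library and Information Science, Kyonggi University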
  • Received : 2011.11.20
  • Accepted : 2016.12.22
  • Published : 2016.12.31

Abstract

This paper introduces a software system and process model for semi-automatically constructing relation extraction datasets (training corpora) specialized for particular subfields of biomedicine. The system takes as input a set of domain terms, such as gene, protein, and disease names. It first consults a large-scale biological interaction database to generate a set of related term pairs, and then uses those pairs as queries against a scientific literature database to retrieve the sentences that contain them. To assess the usefulness of the system, this paper applies it to the construction of a genic interaction dataset for the Alzheimer's disease domain: from an input set of 140 gene names, the system extracted 3,510 sentences containing gene pairs and their interactions. The results of this case study indicate that the proposed system and process can improve the efficiency of constructing relation extraction training corpora, a task previously carried out entirely by hand, and can support the construction of corpora suited to various subfields of biomedical research.
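The two-stage process described in the abstract (term set to interacting pairs, then pairs to sentences) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names `candidate_pairs`, `mentions`, and `retrieve_sentences` are hypothetical, plain Python lists stand in for the interaction database and the literature database, and the sample sentences are invented toy data. Simple token-boundary matching is assumed for retrieval; the abstract does not say how the actual system matches terms.

```python
import re


def candidate_pairs(terms, interactions):
    """Stage 1: keep known interactions whose two members are both input terms.

    `interactions` is a stand-in for an interaction database: an iterable
    of (name_a, name_b) tuples. Pairs are normalized to upper case and
    sorted so that (A, B) and (B, A) collapse into one entry.
    """
    term_set = {t.upper() for t in terms}
    pairs = set()
    for a, b in interactions:
        a, b = a.upper(), b.upper()
        if a != b and a in term_set and b in term_set:
            pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


def mentions(term, sentence):
    """Return True if `term` occurs in `sentence` as a whole token.

    The lookarounds reject matches embedded in longer alphanumeric
    strings, so "APP" does not match inside "Happy".
    """
    pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
    return re.search(pattern, sentence, re.IGNORECASE) is not None


def retrieve_sentences(pairs, sentences):
    """Stage 2: collect (term_a, term_b, sentence) for sentences naming both terms."""
    hits = []
    for a, b in pairs:
        for s in sentences:
            if mentions(a, s) and mentions(b, s):
                hits.append((a, b, s))
    return hits


if __name__ == "__main__":
    # Toy inputs, invented for illustration only (not data from the paper).
    terms = ["APP", "PSEN1", "APOE"]
    interactions = [("PSEN1", "APP"), ("APP", "PSEN1"), ("APOE", "TREM2")]
    sentences = [
        "PSEN1 cleaves APP within its transmembrane domain.",
        "APOE genotype was recorded for each participant.",
        "Happy results were seen with PSEN1 inhibitors.",
    ]
    for a, b, s in retrieve_sentences(candidate_pairs(terms, interactions), sentences):
        print(a, b, s, sep="\t")
```

In this sketch the interaction database acts as a filter: only pairs already recorded as interacting are turned into queries, which is what keeps the retrieved sentences focused on likely interaction statements. The retrieved sentences would still need manual review, consistent with the semi-automatic framing of the abstract.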

Acknowledgement

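Grant: Development of Big Data Technology for a Healthy Aging Society Based on Ultra-High-Performance Computing

Supported by: Korea Institute of Science and Technology Information (KISTI)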

Cited by

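  1. An Integrated Study on Semi-automatic Training Set Construction and Automatic Classification for Scholarly Literature in the Technical Sciences, vol. 35, no. 4, 2018, https://doi.org/10.3743/kosim.2018.35.4.141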