A Large-scale Test Set for Author Disambiguation

Korean title: 저자 식별을 위한 대용량 평가셋 구축 (Construction of a Large-scale Test Set for Author Disambiguation)

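  • In-Su Kang (강인수; School of Computer and Information Science, Kyungsung University);
  • Pyung Kim (김평; Information Technology Research Lab, Korea Institute of Science and Technology Information);
  • Seung-Woo Lee (이승우; Information Technology Research Lab, Korea Institute of Science and Technology Information);
  • Han-Min Jung (정한민; Information Technology Research Lab, Korea Institute of Science and Technology Information);
  • Beom-Jong Ryu (류범종; Information Technology Research Lab, Korea Institute of Science and Technology Information)
  • Published: 2009.11.28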

Abstract

To move beyond article-oriented search and offer author-oriented search, the namesake problem in author names must be solved. Author disambiguation, proposed as a solution, assigns the identifier of a real-world individual to each author name entity that appears in a paper. Although recent state-of-the-art approaches report disambiguation performance above 90%, few academic information services have actually adopted author-resolution functions. This paper describes a large-scale test set for author disambiguation, newly built by KISTI (Korea Institute of Science and Technology Information) to foster broader author disambiguation research whose results can contribute more directly to academic information services. The test set was constructed for high-frequency author names in DBLP data, through manual identification supported by web searches. It currently consists of 41,673 author name entity records collected for 881 author names, covering 6,921 real-world person identifiers.
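To make the reported sizes concrete, the following is a minimal sketch of how such a test set might be represented and summarized. The record schema (`AuthorNameEntity` and its fields), the `summarize` helper, and the toy records are all assumptions made for illustration; the abstract does not describe the actual file format. The sketch also assumes person identifiers are counted per author name. Only the three totals come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorNameEntity:
    """One occurrence of an author name on one paper (hypothetical schema)."""
    author_name: str  # ambiguous surface form shared by namesakes
    paper_id: str     # illustrative paper key
    person_id: int    # gold identifier of the real individual


def summarize(entities):
    """Count author names, entity records, and person identifiers."""
    persons_by_name = defaultdict(set)
    for e in entities:
        persons_by_name[e.author_name].add(e.person_id)
    return {
        "author_names": len(persons_by_name),
        "entities": len(entities),
        # Assumption: identifiers are counted within each author name.
        "person_ids": sum(len(p) for p in persons_by_name.values()),
    }


# Toy records, invented for illustration only.
toy = [
    AuthorNameEntity("J. Kim", "p1", 1),
    AuthorNameEntity("J. Kim", "p2", 1),
    AuthorNameEntity("J. Kim", "p3", 2),
    AuthorNameEntity("S. Lee", "p4", 3),
]
toy_summary = summarize(toy)

# Averages implied by the totals reported in the abstract.
entities_per_name = 41_673 / 881  # roughly 47 records per author name
persons_per_name = 6_921 / 881    # roughly 8 individuals per author name
```

Under these assumptions, each ambiguous name in the test set corresponds on average to about 47 entity records that must be split among about 8 real individuals, which gives a sense of how hard the namesake problem is for high-frequency names.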
