Web Crawler Service Implementation for Information Retrieval based on Big Data Analysis

Kim, Hye-Suk; Han, Na; Lim, Suk-Ja

doi:10.9728/dcs.2017.18.5.933

Journal of Digital Contents Society (디지털콘텐츠학회 논문지)

Volume 18, Issue 5 / Pages 933-942 / 2017 / 1598-2009 (pISSN) / 2287-738X (eISSN)

Digital Contents Society (한국디지털콘텐츠학회)

Web Crawler Service Implementation for Information Retrieval based on Big Data Analysis

빅데이터 분석 기반의 정보 검색을 위한 웹 크롤러 서비스 구현

Kim, Hye-Suk (Department of Electronics and Computer Engineering, Chonnam National University) ;
Han, Na (Department of Electronics and Computer Engineering, Chonnam National University) ;
Lim, Suk-Ja (Department of Advertisement Design, Gwangju Campus of Korea Polytechnic)

김희숙 (전남대학교 공과대학 전자컴퓨터공학부) ;
한나 (전남대학교 공과대학 전자컴퓨터공학부) ;
임숙자 (한국폴리텍대학 광주캠퍼스 광고디자인학과)

Received : 2017.08.14
Accepted : 2017.08.31
Published : 2017.08.31

https://doi.org/10.9728/dcs.2017.18.5.933

Abstract

In this paper, we propose a web crawler service for efficiently collecting information on extracurricular activities, competitions, and scholarships for college students and job seekers. To crawl at high speed while avoiding duplicate crawling, the proposed service uses Jsoup tree analysis and JSON-format data transmission. After collecting relevant information for 24 hours, we confirmed that the web crawler service was running with 100% accuracy. In future work, extending the range of web pages the service covers, so that it can be applied to various websites simultaneously, is expected to increase the quantity of content the service provides.
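The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it describes: parse each page's tag tree, skip URLs that have already been seen, and emit the collected records as JSON. It is written in Python with the standard library's `html.parser` standing in for Jsoup, and it is not the authors' implementation. The names `LinkTitleParser`, `crawl`, `fetch`, and `max_pages` are hypothetical, and `fetch` is injected so the sketch runs without network access.

```python
import json
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkTitleParser(HTMLParser):
    """Walks the HTML tags and collects the page title and outgoing links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links and drop #fragments so that
                # different spellings of the same page compare equal.
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                self.links.append(absolute)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl that enqueues each URL at most once.

    `fetch` is a hypothetical callable mapping a URL to an HTML string,
    or to None when the page is unavailable (an assumption of this sketch).
    Returns the collected records serialized as a JSON array.
    """
    seen = {start_url}
    queue = deque([start_url])
    records = []
    while queue and len(records) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkTitleParser(url)
        parser.feed(html)
        records.append({"url": url, "title": parser.title})
        for link in parser.links:
            if link not in seen:  # duplicate check before enqueueing
                seen.add(link)
                queue.append(link)
    return json.dumps(records, ensure_ascii=False)
```

Keeping the `seen` set keyed on the resolved, fragment-free URL is what prevents the same page from being fetched twice; `ensure_ascii=False` keeps Korean titles readable in the JSON output.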

Cited by

Analysis of Research Trends in North Korean Forest Science Using Data Mining (1962~2016), vol.109, pp.1, 2020, https://doi.org/10.14578/jkfs.2020.109.1.81
