Implementation of a Web Robot and Statistics on the Korean Web

Kim, Sung-Jin; Lee, Sang-Ho

doi:10.3745/KIPSTC.2003.10C.4.509

The KIPS Transactions:PartC (정보처리학회논문지C)

Volume 10C, Issue 4, pp. 509-518, 2003, pISSN 1598-2858

Korea Information Processing Society (한국정보처리학회)

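Kim, Sung-Jin (Department of Computer Science, Graduate School, Soongsil University);
Lee, Sang-Ho (School of Computing, Soongsil University)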

Published: 2003-08-01

https://doi.org/10.3745/KIPSTC.2003.10C.4.509

Abstract

A web robot is a program that downloads and stores web pages. Implementation issues in developing web robots have been studied widely, and various web statistics have been reported in the literature. This paper makes three contributions. First, it describes the overall architecture of a newly developed robot and the implementation decisions made on several important issues. Second, it reports empirical statistics on approximately 74 million Korean web pages. Third, it reports on the continuous monitoring of 1,424 Korean web sites to observe how their pages change, and identifies which factors of a web page could affect its changes. These factors may be used to select the web pages to be updated incrementally.
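
The abstract does not describe the robot's internals, so the following is only a minimal illustrative sketch of the two ideas it names: downloading and storing pages, and noticing on a later visit whether a page has changed. Every name here (`fetch`, `PageStore`, `visit`) is hypothetical, and comparing SHA-256 digests of the raw page body is an assumption made for illustration, not the authors' method.

```python
import hashlib
import urllib.request
from typing import Callable, Dict, Iterable


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download one page body over HTTP using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


class PageStore:
    """Hypothetical in-memory store mapping each URL to its body and digest."""

    def __init__(self) -> None:
        self.pages: Dict[str, bytes] = {}
        self.digests: Dict[str, str] = {}

    def record(self, url: str, body: bytes) -> str:
        """Store a body and report 'new', 'changed', or 'unchanged'."""
        digest = hashlib.sha256(body).hexdigest()
        previous = self.digests.get(url)
        # Always keep the latest copy, whatever the comparison says.
        self.pages[url] = body
        self.digests[url] = digest
        if previous is None:
            return "new"
        return "unchanged" if previous == digest else "changed"


def visit(urls: Iterable[str], store: PageStore,
          fetcher: Callable[[str], bytes] = fetch) -> Dict[str, str]:
    """Download each URL once and return the status the store reports."""
    return {url: store.record(url, fetcher(url)) for url in urls}
```

Calling `visit` repeatedly over the same URL list yields a per-page status for each round. A byte-level digest treats any difference as a change, so pages that embed timestamps or counters would be reported as changed on every visit; a real monitor would likely need a more tolerant comparison.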

