Extracting Specific Information in Web Pages Using Machine Learning

  • Lee, Joung-Yun (Industrial and Management Engineering, Incheon National University) ;
  • Kim, Jae-Gon (Industrial and Management Engineering, Incheon National University)
  • Received : 2018.11.26
  • Accepted : 2018.12.13
  • Published : 2018.12.31

Abstract

With the advent of the digital age, the production and distribution of web pages have exploded. Internet users frequently need to extract specific information from this vast collection of pages, but finding a specific piece of information across many web pages takes considerable time and effort. Commonly used search engines return web pages that contain the information a user is looking for, yet additional time and effort are still required to locate that information within extensive search results. It is therefore necessary to develop algorithms that can automatically extract specific information from web pages. Every year, thousands of international conferences are held all over the world. Each conference has a website that provides general information such as the event dates, the venue, a greeting, the abstract submission deadline, and the registration dates. It is not easy for researchers to find the abstract submission deadline quickly because it is displayed in different formats from conference to conference and is frequently updated. This study focuses on extracting abstract submission deadlines from international conference websites. We use three machine learning models, namely support vector machines (SVM), decision trees, and artificial neural networks, to develop algorithms that extract the abstract submission deadline from an international conference website. The performance of the proposed algorithms is evaluated using 2,200 conference websites.
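The abstract does not state which features the models use, so the following is only a minimal sketch of one plausible preprocessing step: finding candidate dates in page text and describing each one by nearby cue words. The date pattern, the cue word list, and the `candidate_features` helper are all assumptions made for illustration, not the authors' method. Vectors of this kind are the sort of input that a classifier such as an SVM, a decision tree, or a neural network could be trained on.

```python
import re

# Hypothetical pattern for dates such as "March 15, 2019" or "15 March 2019".
MONTHS = (
    "January|February|March|April|May|June|July|August|"
    "September|October|November|December"
)
DATE_RE = re.compile(
    rf"\b(?:(?:{MONTHS})\s+\d{{1,2}},?\s+\d{{4}}"
    rf"|\d{{1,2}}\s+(?:{MONTHS})\s+\d{{4}})\b"
)

# Hypothetical cue words; the source does not list its features.
CUES = ("abstract", "submission", "deadline", "registration", "notification")


def candidate_features(text, window=40):
    """Return (date_string, feature_vector) for each date found in text.

    Each feature is 1 if the cue word occurs within `window` characters
    before the date, else 0.
    """
    results = []
    for match in DATE_RE.finditer(text):
        start = max(0, match.start() - window)
        context = text[start:match.start()].lower()
        features = [1 if cue in context else 0 for cue in CUES]
        results.append((match.group(0), features))
    return results


# Made-up page text, used only to exercise the helper.
page_text = (
    "Important dates. Abstract submission deadline: March 15, 2019. "
    "Early registration closes on 30 April 2019. "
    "The conference takes place on June 10, 2019 in Incheon."
)
candidates = candidate_features(page_text)
for date_string, features in candidates:
    print(date_string, features)
```

On this made-up text, only the first date is preceded by the cues "abstract", "submission", and "deadline", which is the kind of signal a trained model could use to separate the abstract deadline from registration or event dates.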
