A Proposal of Join Vector for Semantic Factor Reflection in TF-IDF Based Keyword Extraction

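  • Daeseo Park (Department of Computer and Communications Engineering, Kangwon National University) ;
  • Hwajong Kim (Department of Computer and Communications Engineering, Kangwon National University)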
  • Received : 2017.12.22
  • Accepted : 2018.01.13
  • Published : 2018.02.28

Abstract

Technologies for handling big data have been developing rapidly in recent years. In the past, the central challenge was finding accurate information when little information was available; today it is finding accurate information within a large volume of information. Doing so can improve user satisfaction and work efficiency, among other benefits. This paper proposes a new keyword extraction method that combines the established statistics-based approach with a semantics-based approach. The method uses TF-IDF and Word2vec: TF-IDF vectorizes each news article statistically from word frequencies, and Word2vec vectorizes it semantically by taking, for each word in the article, the average of its Word2vec similarity scores with the other words in the article. Finally, keywords are extracted from a combined vector that joins the two vectors, and the performance of the method is evaluated.
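
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not say how the two vectors are combined, so the weighted sum and its `alpha` parameter are hypothetical. The toy corpus, the hand-written 2-D vectors standing in for trained Word2vec embeddings, and all function names are likewise made up for the example.

```python
import math
from collections import Counter

# Toy tokenized "articles" (hypothetical data, not from the paper).
CORPUS = [
    ["bank", "rate", "economy", "bank"],
    ["soccer", "goal", "match"],
    ["bank", "goal", "rate"],
]

# Hand-written 2-D vectors standing in for trained Word2vec embeddings
# (assumption: a real system would load vectors from a trained model).
EMBEDDINGS = {
    "bank": [0.9, 0.1],
    "rate": [0.8, 0.2],
    "economy": [0.85, 0.15],
    "soccer": [0.1, 0.9],
    "goal": [0.2, 0.8],
    "match": [0.15, 0.85],
}


def tf_idf(doc, corpus):
    """Statistical score: {word: tf * idf}. Assumes doc is a member of corpus."""
    counts = Counter(doc)
    scores = {}
    for word, count in counts.items():
        tf = count / len(doc)
        df = sum(1 for d in corpus if word in d)  # documents containing word
        scores[word] = tf * math.log(len(corpus) / df)
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_similarity(doc, embeddings):
    """Semantic score: average similarity of each word to the other distinct words."""
    words = sorted(set(doc))
    scores = {}
    for word in words:
        sims = [cosine(embeddings[word], embeddings[o]) for o in words if o != word]
        scores[word] = sum(sims) / len(sims) if sims else 0.0
    return scores


def combined_scores(doc, corpus, embeddings, alpha=0.5):
    """Hypothetical combination: weighted sum of the two scores per word."""
    stat = tf_idf(doc, corpus)
    sem = mean_similarity(doc, embeddings)
    return {w: alpha * stat[w] + (1 - alpha) * sem[w] for w in stat}


def top_keywords(doc, corpus, embeddings, k=2, alpha=0.5):
    """Return the k highest-scoring words, ties broken alphabetically."""
    scores = combined_scores(doc, corpus, embeddings, alpha)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


if __name__ == "__main__":
    print(top_keywords(CORPUS[0], CORPUS, EMBEDDINGS))
```

Setting `alpha=1.0` reduces the score to plain TF-IDF and `alpha=0.0` to the similarity average alone, which makes it easy to compare the statistical, semantic, and combined rankings on the same article.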

Acknowledgement

Grant: Development of a Big Data Automatic Tagging and Tag-Based DaaS System

Supported by: Institute for Information & Communications Technology Promotion (IITP)
