A Study on Data Cleansing Techniques for Word Cloud Analysis of Text Data

  • Lee, Won-Jo (이원조), Dept. of Industrial Management Engineering, Ulsan College
  • Received : 2021.09.30
  • Accepted : 2021.10.18
  • Published : 2021.11.30

Abstract

In big data visualization analysis of unstructured text, raw data is usually large and unstructured, so analysis techniques cannot be applied until it has been cleansed. The collected raw data therefore passes through two stages: a first, heuristic cleansing pass removes unnecessary data, and a second, machine cleansing pass removes stopwords. Word frequencies are then calculated and visualized as a word cloud, from which key issues are extracted, turned into information, and analyzed. This study proposes a new stopword cleansing technique that uses an external stopword set (DB) with the Python word cloud, and identifies the problems and effectiveness of the technique through a practical case analysis. Based on these verification results, it presents the practical utility of word cloud analysis that applies the proposed cleansing technique.
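
The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`load_stopwords`, `heuristic_clean`, `word_frequencies`, `render_word_cloud`), the specific heuristic rules (dropping URLs, digits, and punctuation), and the one-word-per-line stopword format are all assumptions made for illustration. The frequency step uses only the standard library; the optional rendering step assumes the third-party `wordcloud` package is installed.

```python
import re
from collections import Counter


def load_stopwords(lines):
    """Build a stopword set from an external source (assumed: one word per line).

    Blank lines and lines starting with '#' are ignored.
    """
    words = (line.strip().lower() for line in lines)
    return {w for w in words if w and not w.startswith("#")}


def heuristic_clean(raw_text):
    """First pass (assumed rules): drop URLs, digits, and punctuation."""
    text = re.sub(r"https?://\S+", " ", raw_text)
    text = re.sub(r"[^\w\s]|\d", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def word_frequencies(raw_text, stopwords):
    """Second pass: remove stopwords, then count the remaining words."""
    tokens = heuristic_clean(raw_text).split()
    return Counter(t for t in tokens if t not in stopwords)


def render_word_cloud(frequencies, path="wordcloud.png"):
    """Optional: draw the word cloud (requires the third-party wordcloud package).

    Korean text additionally needs a font_path argument pointing to a font
    that contains Hangul glyphs.
    """
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies).to_file(path)


if __name__ == "__main__":
    # In practice the stopword set would be read from an external file or DB;
    # an in-memory list stands in for it here.
    stopwords = load_stopwords(["the", "and", "of"])
    freq = word_frequencies("The cloud and the data: data of 2021", stopwords)
    print(freq.most_common())
```

Keeping the stopword set outside the code, as the abstract proposes, means the list can be extended after each analysis run without touching the pipeline itself. In this sketch, any iterable of lines works as the source, so an open text file can be passed directly to `load_stopwords`.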
