Detecting Spam Data for Securing the Reliability of Text Analysis

A Method for Identifying Spam Data to Secure the Reliability of Text Analysis

  • Hyun, Yoonjin (Kookmin University The Graduate School of Business Information Technology)
  • Kim, Namgyu (Kookmin University School of MIS)
  • Received : 2017.01.09
  • Accepted : 2017.02.13
  • Published : 2017.02.28

Abstract

Recently, the tremendous amount of unstructured text data distributed through news, blogs, and social media has gained much attention from researchers and practitioners because it contains abundant information about consumers' opinions. However, as the usefulness of text data grows, so do attempts to profit by distorting it, whether maliciously or not. This increase in spam text data not only burdens users seeking useful information with a large amount of inappropriate content, but also damages the reliability of the information and of its providers. Therefore, efforts must be made to improve the reliability of information and the quality of analysis results by detecting and removing spam data in advance. To this end, spam detection has been actively studied in areas such as opinion spam detection, spam e-mail detection, and web spam detection. In this study, we introduce the core concepts and current research trends of spam detection and propose a methodology for detecting spam tags in blogs as one attempt to improve the reliability of blog information.

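Recently, vast amounts of unstructured text data have been pouring out through news, blogs, and social media. Because this data reflects abundant information and opinions in near real time, it is highly useful, and demand for analyzing it is growing in industry as well as academia. However, as the usefulness of text data increases, so do attempts to distort it to achieve particular ends. The growth of such spam text data not only makes it harder to find needed information within a vast volume of content, but also lowers trust in the information itself and in the media that provide it. Efforts are therefore essential to improve the reliability of information and the quality of analysis results by identifying and removing spam data from the original data. To this end, research on spam identification has been very actively conducted in areas such as opinion spam detection, spam e-mail detection, and web spam detection. This study describes existing research trends in spam identification in detail and, as one way to improve the reliability of blog information, proposes a method for identifying spam tags in blogs.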
