Spam Filter Using χ² Statistics and Support Vector Machines

  • 이성욱 (Department of Computer and Information Engineering, Chungju National University)
  • Received : 2009.12.18
  • Accepted : 2010.03.22
  • Published : 2010.06.30

Abstract

We propose an automatic spam filter for e-mail that uses Support Vector Machines (SVM). The features are the lexical form of each word together with its part-of-speech (POS) tag, and we select among them using the chi-square (χ²) statistic. For the experiments, each selected feature is represented by one of three weights: term frequency (TF), TF-IDF, or a binary weight. The SVM is trained on the selected features and then classifies each e-mail as spam or non-spam. In our experiments, feature selection improved the performance of the system, and we achieved an overall accuracy of about 98.9% on the TREC05-p1 spam corpus.
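
The feature selection and weighting steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the standard 2x2 contingency form of the chi-square statistic, an IDF of log(N/df), and features written as hypothetical `word/POS` strings. The function names and the four toy messages are invented for illustration, and SVM training is omitted.

```python
import math
from collections import Counter

def chi_square(docs, labels, term, positive="spam"):
    """Chi-square statistic between a term and the positive class (2x2 table)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        is_pos = label == positive
        if has_term and is_pos:
            a += 1  # term present, spam
        elif has_term:
            b += 1  # term present, not spam
        elif is_pos:
            c += 1  # term absent, spam
        else:
            d += 1  # term absent, not spam
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def select_features(docs, labels, k):
    """Return the k terms with the highest chi-square score."""
    vocab = sorted({t for tokens in docs for t in tokens})
    ranked = sorted(vocab, key=lambda t: chi_square(docs, labels, t), reverse=True)
    return ranked[:k]

def weight(tokens, features, docs, scheme="tf"):
    """Represent one message over the selected features (binary, tf, or tfidf)."""
    counts = Counter(tokens)
    vec = []
    for f in features:
        tf = counts[f]
        if scheme == "binary":
            vec.append(1.0 if tf else 0.0)
        elif scheme == "tf":
            vec.append(float(tf))
        elif scheme == "tfidf":
            df = sum(1 for doc in docs if f in doc)  # document frequency
            vec.append(tf * math.log(len(docs) / df) if df else 0.0)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return vec

# Toy corpus (invented): each message is a list of word/POS features.
docs = [
    ["free/JJ", "money/NN", "now/RB"],
    ["free/JJ", "offer/NN", "now/RB"],
    ["meeting/NN", "agenda/NN", "now/RB"],
    ["project/NN", "meeting/NN", "notes/NNS"],
]
labels = ["spam", "spam", "ham", "ham"]

features = select_features(docs, labels, k=2)
vectors = [weight(doc, features, docs, scheme="tfidf") for doc in docs]
```

With two classes, the statistic is the same whether spam or non-spam is treated as the positive class, so a single score per term is enough. The resulting `vectors` are what an SVM trainer would take as input.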
