Distributed Processing System Design and Implementation for Feature Extraction from Large-Scale Malicious Code


  • 이현종 (Department of Software, Dankook University) ;
  • 어성율 (Department of Software, Dankook University) ;
  • 황두성 (Department of Software, Dankook University)
  • Received : 2018.09.07
  • Accepted : 2018.12.27
  • Published : 2019.02.28

Abstract

Traditional malware detection is weak at detecting variants that have been modified by polymorphism or obfuscation techniques. Machine learning algorithms can learn the patterns embedded in malicious code, detect similar behavior, and thereby replace existing detection methods. Because malicious code patterns change over time, data must be collected continuously for training. However, storing and processing a large number of malware files incurs high space and time complexity. In this paper, we design an HDFS-based distributed processing system to reduce space complexity and accelerate feature extraction. Using this system, we extract a 2-gram feature, two API features that differ in their filtering criteria, and an APICFG feature, and we compare the generalization performance of ensemble learning models trained on them. In experiments, feature extraction was about 3.75 times faster than on a single computer, and space efficiency improved by a factor of about 5. Among the features, the 2-gram feature gave the best classification performance, but its high dimensionality led to long training times.
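The abstract does not specify how the 2-gram feature is computed or how the work is split across nodes, so the following is only a minimal sketch under stated assumptions. It assumes each malware file has already been reduced to a token sequence (the opcode-like tokens here are hypothetical), counts adjacent token pairs per file in a map-style step, and merges the partial counts in a reduce-style step to build a shared vocabulary. It runs in a single Python process and does not use any Hadoop API; the function names are illustrative and are not taken from the paper.

```python
from collections import Counter
from functools import reduce


def map_two_grams(tokens):
    """Map step: count adjacent token pairs (2-grams) in one file."""
    return Counter(zip(tokens, tokens[1:]))


def reduce_counts(left, right):
    """Reduce step: merge two partial 2-gram counts."""
    return left + right


def build_vocabulary(per_file_counts):
    """Collect every distinct 2-gram in the corpus, in sorted order."""
    total = reduce(reduce_counts, per_file_counts, Counter())
    return sorted(total)


def to_vector(counts, vocabulary):
    """Turn one file's 2-gram counts into a fixed-length feature vector."""
    return [counts.get(gram, 0) for gram in vocabulary]


# Hypothetical token sequences standing in for disassembled samples.
samples = {
    "sample_a": ["push", "mov", "call", "mov", "call"],
    "sample_b": ["mov", "call", "ret"],
}

per_file = {name: map_two_grams(toks) for name, toks in samples.items()}
vocabulary = build_vocabulary(per_file.values())
vectors = {name: to_vector(c, vocabulary) for name, c in per_file.items()}

print(vocabulary)
print(vectors)
```

With V distinct tokens there can be up to V² distinct 2-grams, which is consistent with the high dimensionality and long training time reported for the 2-gram feature. Because the per-file map step is independent of every other file, it is the part that a distributed system can run in parallel.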


Fig. 1. Architecture of OpenDPU

Fig. 2. Data processing steps of OpenDPU

Fig. 3. API Call Statement in Assembly Code

Fig. 4. Example of APICFG

Fig. 5. Space Complexity of 2-gram Features

Fig. 6. Scalability of OpenDPU

Table 1. Processing Time per Feature

Table 2. Performance Comparison
