Similarity Analysis of Programs through Linear Regression of Code Distribution

  • Lim, Hyun-il (Department of Computer Engineering, Kyungnam University)
  • 임현일 (경남대학교 컴퓨터공학부)
  • Received : 2018.06.25
  • Accepted : 2018.07.25
  • Published : 2018.07.31

Abstract

Alongside advances in information technology, machine learning approaches are being applied to a growing range of applications and areas. In this paper, we propose a software analysis method that applies linear regression to the code distribution of software in order to analyze software similarity. The characteristics of software can be expressed by the instructions contained in a program, so the distribution of instructions is used as learning data. A learning procedure over this data generates a linear regression model for software similarity analysis. The proposed method is evaluated with real-world Java applications. It is expected to serve as a basic technique for determining software similarity and to be applicable to various software analysis techniques based on machine learning.

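With advances in information technology, artificial intelligence and machine learning have demonstrated their performance in many application domains and continue to expand into new ones. This paper proposes a software analysis method that applies machine learning. To represent the characteristics of software, we analyze its code distribution and apply linear regression, a machine learning method, to this information in order to identify similar software. The characteristics of software can be expressed by the instructions contained in a program, and the distribution of those instructions is used as learning data. The learning process over this data then builds a linear regression model for software similarity analysis. The accuracy of the proposed method is verified through implementation and experiments. The proposed method is expected to serve as a basic technique for judging software similarity and to be applicable to software analysis techniques based on machine learning.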
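
The abstract does not spell out how the regression model is built, so the following is only a minimal sketch of one way the idea could work, not the author's implementation. It assumes each program is given as a list of JVM opcode mnemonics, turns each list into a relative-frequency distribution over a small hypothetical vocabulary, fits a one-variable least-squares line between the two distributions, and uses the coefficient of determination as a similarity score. The names `VOCABULARY`, `code_distribution`, and `similarity`, as well as the rule that a non-positive slope scores zero, are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical opcode vocabulary (assumption); a real analysis would
# cover the full JVM instruction set.
VOCABULARY = ["aload", "getfield", "iload", "iadd", "istore",
              "goto", "invokevirtual", "return"]


def code_distribution(instructions, vocabulary=VOCABULARY):
    """Return the relative frequency of each opcode in one program."""
    counts = Counter(instructions)
    total = len(instructions)
    return [counts[op] / total for op in vocabulary]


def similarity(program_a, program_b, vocabulary=VOCABULARY):
    """Fit y = slope * x + intercept by ordinary least squares, where x and y
    are the opcode frequencies of the two programs, and return R^2 of the fit.
    A non-positive slope is scored as 0.0 (illustrative assumption)."""
    xs = code_distribution(program_a, vocabulary)
    ys = code_distribution(program_b, vocabulary)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # a flat distribution gives the regression nothing to fit
    slope = sxy / sxx
    if slope <= 0:
        return 0.0  # inversely related distributions are not "similar"
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / syy


# Toy opcode sequences (made up for illustration, not taken from the paper).
original = ["aload", "getfield", "iload", "iadd", "istore",
            "iload", "iadd", "istore", "goto", "return"]
variant = original + ["aload", "invokevirtual"]
unrelated = ["aload", "invokevirtual", "aload", "invokevirtual", "aload",
             "getfield", "invokevirtual", "invokevirtual", "goto", "return"]

print(similarity(original, original))
print(similarity(original, variant))
print(similarity(original, unrelated))
```

In this sketch a program compared with itself scores 1, and the lightly modified variant scores higher than the unrelated program. A model trained on many labeled program pairs, for example with scikit-learn's `LinearRegression`, would be closer to the learning procedure the abstract describes, but it needs training data that is not available here.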

Acknowledgement

Supported by: National Research Foundation of Korea
