Plagiarism Detection among Source Codes using Adaptive Methods

  • Lee, Yun-Jung (Center of U-Port IT Research and Education, Pusan National University) ;
  • Lim, Jin-Su (HA Control R&D Lab in LG Electronics) ;
  • Ji, Jeong-Hoon (Korean Intellectual Property Office) ;
  • Cho, Hwaun-Gue (Dept. of Computer Science and Engineering, Pusan National University) ;
  • Woo, Gyun (Dept. of Computer Science and Engineering, Pusan National University)
  • Received : 2011.06.29
  • Accepted : 2012.05.17
  • Published : 2012.06.30

Abstract

We propose an adaptive method for detecting plagiarized pairs in a large set of source code. The method is adaptive in two respects: its alignment algorithm adapts to the program set, and it provides an adaptive threshold for determining plagiarism. Conventional algorithms are based on greedy string tiling or on local alignment of two code strings. However, most of them are not adaptive: they do not consider the characteristics of the program set, which causes problems for a program set in which all the programs are inherently similar. We propose adaptive local alignment, a variant of local alignment that uses an adaptive similarity matrix. Each entry of this matrix is the logarithm of the probabilities of the keywords, based on their frequency in the given program set. We also propose an adaptive threshold based on the local outlier factor (LOF), which represents the likelihood of an entity being an outlier. Experimental results indicate that our method is more sensitive than JPlag, which uses greedy string tiling, in detecting plagiarism-suspected code pairs. Further, the LOF-based adaptive threshold is shown to be effective: compared with a fixed threshold, detection achieves high sensitivity with negligible loss of specificity.
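The idea of a frequency-dependent similarity matrix can be illustrated with a minimal Python sketch. It is not the authors' implementation: the scoring rule (a match on keyword k earns -log2 of its relative frequency in the program set), the fixed mismatch and gap penalties, and all function names are assumptions made for illustration. The alignment itself is the standard Smith-Waterman recurrence, and the LOF-based threshold is not covered here.

```python
import math
from collections import Counter


def keyword_probabilities(programs):
    """Estimate each keyword's relative frequency over the whole program set."""
    counts = Counter(tok for prog in programs for tok in prog)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def match_score(tok, probs):
    """Assumed scoring rule: rarer keywords earn a larger reward, -log2(p)."""
    return -math.log2(probs[tok])


def local_alignment_score(a, b, probs, mismatch=-1.0, gap=-1.0):
    """Best Smith-Waterman local alignment score of keyword sequences a and b.

    The mismatch and gap penalties are fixed, hypothetical values.
    """
    prev = [0.0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0.0
    for x in a:
        curr = [0.0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match_score(x, probs) if x == y else mismatch)
            cell = max(0.0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Toy program set: each program is reduced to its keyword sequence.
programs = [
    ["int", "for", "if", "return"],
    ["int", "for", "if", "return"],
    ["int", "while", "switch", "return"],
]
probs = keyword_probabilities(programs)
same = local_alignment_score(programs[0], programs[1], probs)
diff = local_alignment_score(programs[0], programs[2], probs)
print(f"identical pair: {same:.3f}, different pair: {diff:.3f}")
```

In this sketch, keywords that appear in nearly every program (here `int` and `return`) contribute little to the score, so a pair stands out only when it shares keywords that are rare in the set. This is one way to read the abstract's point about program sets whose members are inherently similar.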

Cited by

  1. A source code plagiarism detection technique considering data structures, vol.3, pp.6, 2012, https://doi.org/10.3745/ktccs.2014.3.6.189