A Hybrid Plagiarism Detection Framework Using Lexical and Semantic Similarity with Lightweight Sentence Transformers
Abstract
Plagiarism detection has become increasingly difficult as paraphrasing tools and generative artificial intelligence systems have become widely available. Traditional techniques based on lexical similarity, such as TF-IDF and n-gram matching, often fail to identify text that is semantically equivalent but lexically modified. This paper presents a hybrid plagiarism detection framework that combines lexical similarity measures with semantic similarity derived from sentence transformer models. The proposed approach integrates TF-IDF–based cosine similarity with lightweight sentence embeddings generated by MiniLM and SBERT models. To improve semantic detection, a MiniLM-based sentence transformer is fine-tuned on the PAN 2011 plagiarism detection corpus. Experimental evaluation demonstrates that the hybrid similarity approach significantly improves detection accuracy over purely lexical methods, particularly for paraphrased plagiarism. The framework is further validated through threshold-based analysis and on real-world web content retrieved by automated scraping. The proposed system offers a scalable solution that balances computational efficiency with semantic understanding and is suitable for academic and real-world forensic applications.
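To make the idea of a hybrid score concrete, the following minimal sketch combines a TF-IDF cosine similarity with a semantic cosine similarity and applies a decision threshold. The abstract does not state how the two scores are fused, so the weighted sum, the weight `alpha`, the threshold value, and all function names here are illustrative assumptions, not the paper's method. The semantic part takes precomputed embedding vectors; in practice these might come from a sentence transformer such as a MiniLM model, which is not loaded here so that the sketch stays dependency-free.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (token -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF, so tokens present in every document keep a nonzero weight.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(toks).items()}
            for toks in tokenized]


def sparse_cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dense_cosine(a, b):
    """Cosine similarity between two dense embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexical_similarity(source, suspect):
    """TF-IDF cosine similarity computed over the two texts only."""
    vec_source, vec_suspect = tfidf_vectors([source, suspect])
    return sparse_cosine(vec_source, vec_suspect)


def hybrid_score(lexical, semantic, alpha=0.5):
    """Weighted sum of both scores; alpha is an assumed fusion weight."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * lexical + (1.0 - alpha) * semantic


def is_flagged(score, threshold=0.6):
    """Threshold-based decision; 0.6 is a placeholder, not a reported value."""
    return score >= threshold


if __name__ == "__main__":
    source = "the committee approved the new budget proposal"
    suspect = "the board accepted the revised spending plan"
    # Toy embeddings standing in for sentence-transformer output (assumption).
    emb_source, emb_suspect = [0.9, 0.1, 0.3], [0.8, 0.2, 0.4]
    lex = lexical_similarity(source, suspect)
    sem = dense_cosine(emb_source, emb_suspect)
    score = hybrid_score(lex, sem, alpha=0.4)
    print(f"lexical={lex:.3f} semantic={sem:.3f} hybrid={score:.3f} "
          f"flagged={is_flagged(score)}")
```

A paraphrased pair like the one in the demo shares few tokens, so its lexical score stays low while the semantic score can remain high. Lowering `alpha` shifts weight toward the semantic score, which is the case the hybrid approach targets. Both the weight and the threshold would need to be tuned on labelled data.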
I (we), the author(s), hereby declare under full moral, financial and criminal liability that the manuscript submitted for publication to the Journal of Computer and Forensic Sciences:
a) is the result of my (our) own original research and that I (we) hold the right to publish it;
b) does not infringe any copyright or other third-party proprietary rights;
c) complies with the Journal’s research and publishing ethics standards;
d) has not been published elsewhere, under this or any other title;
e) is not under consideration by another publication, under this or any other title.
I (we) also declare under full moral, financial and criminal liability:
f) that all conflicts of interest that may directly or potentially influence or impart bias on the work have been disclosed in the manuscript;
g) that if the article has been accepted for publishing I (we) will transfer all copyright ownership of the manuscript to the University of Criminal Investigation and Police Studies in Belgrade.
Signed by the Corresponding Author on behalf of all the other authors.
