Journal of Computer and Forensic Sciences

A Hybrid Plagiarism Detection Framework Using Lexical and Semantic Similarity with Lightweight Sentence Transformers

RAHUL Birwadkar — 2026-05-06

Plagiarism detection has become increasingly challenging due to the widespread availability of paraphrasing tools and generative artificial intelligence systems. Traditional plagiarism detection techniques based on lexical similarity, such as TF-IDF and n-gram matching, often fail to identify semantically similar but lexically modified text. This paper presents a hybrid plagiarism detection framework that combines lexical similarity measures with semantic similarity derived from sentence transformer models. The proposed approach integrates TF-IDF–based cosine similarity with lightweight sentence embeddings generated using MiniLM and SBERT models. To enhance semantic detection performance, a MiniLM-based sentence transformer is fine-tuned on the PAN 2011 plagiarism detection corpus. Experimental evaluation demonstrates that the hybrid similarity approach significantly improves detection accuracy compared to purely lexical methods, particularly for paraphrased plagiarism cases. The framework is further validated using threshold-based analysis and real-world web content retrieved through automated scraping. The proposed system provides an efficient and scalable solution for plagiarism detection, balancing computational efficiency with semantic understanding, and is suitable for academic and real-world forensic applications.
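As a rough illustration of how a lexical score and a semantic score might be combined, the following minimal sketch computes TF-IDF cosine similarity in pure Python and blends it with a semantic similarity supplied by the caller. The weight `alpha`, the threshold value, the tokenizer, and all function names are assumptions made for illustration and are not taken from the paper. In the framework described above, the semantic score would come from MiniLM or SBERT sentence embeddings rather than being passed in by hand.

```python
import math
from collections import Counter

PUNCT = ".,;:!?()\"'"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(PUNCT) for w in text.lower().split())
    return [t for t in tokens if t]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document, using smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_score(lexical, semantic, alpha=0.5):
    """Weighted blend of the two scores; alpha is an assumed weight, not the paper's."""
    return alpha * lexical + (1.0 - alpha) * semantic


def is_flagged(score, threshold=0.6):
    """Threshold-based decision; the default threshold is an illustrative assumption."""
    return score >= threshold
```

For a paraphrased pair, the lexical cosine is typically low while the semantic score stays high, so the blended score can still cross the threshold where a purely lexical method would not.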

Semantic Paraphrase Generation Using Transformer Architectures: A Comparative Study of Pre-trained and Fine-Tuned Models

RAHUL Birwadkar — 2026-05-06

Semantic paraphrase generation plays a crucial role in academic and technical writing by enabling authors to restate content while preserving its original meaning. Traditional paraphrasing approaches, such as rule-based rewriting and statistical methods, often struggle to maintain semantic consistency and linguistic fluency, especially for complex or longer text segments. Recent advances in transformer-based architectures have significantly improved text generation capabilities by leveraging contextual representations and self-attention mechanisms. This paper presents a comparative study of pre-trained and fine-tuned transformer models for semantic paraphrase generation. We evaluate encoder-decoder transformer architectures, with a primary focus on the BART model in both pre-trained and fine-tuned settings, alongside a large generative language model used for paraphrase generation. Fine-tuning adapts the pre-trained models to the paraphrasing task using task-specific data, enabling improved control over semantic preservation and output consistency. The evaluation combines quantitative and qualitative analysis, including training and validation loss trends and a comparative examination of generated paraphrases. Experimental results demonstrate that fine-tuned transformer models produce paraphrases with higher semantic fidelity and structural coherence than their pre-trained counterparts, while large generative models offer fluent but less deterministic outputs. The findings highlight the importance of task-specific fine-tuning for controlled and semantically accurate paraphrase generation. This study contributes practical insights into the selection and adaptation of transformer architectures for paraphrasing applications, particularly in academic and research-oriented writing contexts.

Comparative Analysis of Clustering Textual and Numerical Data Using the K-Means Algorithm

Sanja Raičević — 2026-05-06

This paper presents a comparative analysis of the K-Means clustering algorithm applied to two different types of data: textual and numerical. The aim of the research was to examine the reliability, stability, and interpretability of the results when the same algorithm is applied to semantically diverse datasets. The textual data were taken from the articles of the Criminal Code of the Republic of Serbia, and clustering was performed after preprocessing and TF-IDF vectorization. The numerical data are traffic accident statistics from 2015 to 2021, covering parameters such as the number of property-damage-only accidents, the number of injured persons, and the number of fatalities.

The results showed that clustering on textual data produced a relatively clear separation of thematic groups of articles, but with a moderate silhouette coefficient value due to a high degree of semantic similarity among documents. On the other hand, clustering on numerical data demonstrated a more stable structure, where the optimal number of clusters was two, indicating the possibility of distinguishing periods with different intensity and severity of traffic accidents.

It was concluded that the K-Means algorithm provides more reliable and interpretable results for numerical data, while in the case of textual data, it requires more precise vector space modeling and possibly the application of semantic models such as Word2Vec or BERT. The paper serves as a basis for further research in the field of integrating machine learning techniques for analyzing heterogeneous data sources.
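The role of the silhouette coefficient in choosing the number of clusters can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the helper names, the random-sample initialization, the candidate range of k, and the toy points in the usage note are all assumptions, and the data are synthetic rather than the traffic accident statistics or the Criminal Code articles.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k random points
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)  # mean intra-cluster distance
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def best_k(points, candidates=(2, 3, 4)):
    """Pick the candidate k with the highest mean silhouette coefficient."""
    return max(candidates, key=lambda k: silhouette(points, kmeans(points, k)))
```

On a toy set of two well-separated groups, such as `[(1, 1), (1, 2), (2, 1), (2, 2), (10, 10), (10, 11), (11, 10), (11, 11)]`, `best_k` selects two clusters, which mirrors the kind of result reported for the numerical data. Overlapping groups, as with highly similar documents, would yield a lower coefficient.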