Comparative Analysis of Clustering Textual and Numerical Data Using the K-Means Algorithm

Sanja Raičević

doi:10.5937/jcfs4-64411

Sanja Raičević, Ministry of Interior

Keywords: clustering, K-Means, textual data, numerical data, TF-IDF, PCA, silhouette analysis

Abstract

This paper presents a comparative analysis of the K-Means clustering algorithm applied to two types of data: textual and numerical. The aim of the research was to examine the reliability, stability, and interpretability of the results when the same algorithm is applied to semantically diverse datasets. The textual data consist of articles of the Criminal Code of the Republic of Serbia, which were clustered after preprocessing and TF-IDF vectorization. The numerical data are traffic accident statistics for the period 2015 to 2021, with features such as the number of property-damage-only accidents, the number of injured persons, and the number of fatalities.
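The TF-IDF vectorization step can be illustrated with a minimal sketch. It is not the paper's pipeline: the three short documents are invented stand-ins rather than Criminal Code text, tokenization is a plain whitespace split, and the smoothed IDF formula is one common variant chosen here as an assumption.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights per document."""
    # Naive preprocessing (assumption): lowercase and split on whitespace.
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed IDF, ln((1 + N) / (1 + df)) + 1 (one common variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in vec)) or 1.0
        rows.append([w / norm for w in vec])
    return vocab, rows

# Hypothetical toy documents, invented for illustration only.
docs = [
    "theft of movable property",
    "aggravated theft of property",
    "endangering public traffic",
]
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    top = max(range(len(vocab)), key=lambda i: row[i])
    print(f"{doc!r}: highest-weighted term = {vocab[top]!r}")
```

Terms shared by several documents receive lower weights than terms unique to one document, which is why a corpus of legal articles with heavily overlapping vocabulary can yield vectors that lie close together.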

The results showed that clustering the textual data produced a relatively clear separation of thematic groups of articles, but with only a moderate silhouette coefficient, attributable to the high degree of semantic similarity among the documents. By contrast, clustering the numerical data produced a more stable structure with an optimal number of clusters of two, which suggests that periods differing in the intensity and severity of traffic accidents can be distinguished.
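Choosing the number of clusters by silhouette score can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the seven rows are invented values on a comparable scale (not the actual accident statistics), `kmeans` is a plain Lloyd's iteration with a fixed random seed, and singleton clusters are given a silhouette of 0 by convention.

```python
import math
import random

def kmeans(points, k, seed=0, iters=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids are data points
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items() if m != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical rows: (damage-only accidents, injured, fatalities),
# invented and pre-scaled for illustration only.
points = [
    (1.0, 1.2, 0.9), (1.1, 0.9, 1.0), (0.9, 1.0, 1.1), (1.2, 1.1, 1.0),
    (3.0, 3.1, 2.9), (3.2, 2.9, 3.0), (2.9, 3.0, 3.1),
]
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

With real statistics the features would normally be standardized first, since raw counts of accidents, injuries, and fatalities differ by orders of magnitude and the largest one would otherwise dominate the Euclidean distance.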

The paper concludes that the K-Means algorithm yields more reliable and interpretable results for numerical data, whereas textual data require more careful vector space modeling and possibly the use of semantic models such as Word2Vec or BERT. The paper serves as a basis for further research on integrating machine learning techniques for the analysis of heterogeneous data sources.
