The clustering approach to malware dataset analysis
Abstract
Malware analysis is an active research area, with particular emphasis on machine learning algorithms that help automate it. One of the leading portals that helps researchers obtain datasets is VirusTotal, which provides free academic accounts with access to hundreds of thousands of malware samples and their metadata. This work analyzes 429,058 malware samples from VirusTotal with the aim of overcoming the inconsistent labeling of antivirus scan results across vendors. Two methods were used, latent semantic analysis (LSA) and latent Dirichlet allocation (LDA), both with automatic calibration of parameters, to find the optimal number of clusters; both yielded 5. The clusters were visualized by k-means clustering in two-dimensional space. An additional examination of the most informative words in each cluster showed four classes that are similar across the two methods, and one cluster per method (LSA and LDA) whose words had no counterpart in any cluster of the other method. The results suggest that this is a useful approach to malware data analysis when dealing with an inconsistently labeled dataset.
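As an illustration of how inconsistent vendor labels can be turned into comparable word features, the following minimal Python sketch splits each label into lowercase words, weights them with a simplified TF-IDF, and lists the heaviest words for a group of samples. It is a sketch under stated assumptions, not the implementation used in this work: the vendor names, labels, and sample identifiers are invented, the TF-IDF weighting and its exact formula are assumptions, and the LSA, LDA, and k-means steps are not shown.

```python
import math
import re
from collections import Counter

# Hypothetical scan results (sample id -> {vendor: label}); invented for illustration.
SCANS = {
    "s1": {"VendorA": "Trojan.Win32.Agent", "VendorB": "Win32/Agent.Trojan"},
    "s2": {"VendorA": "Trojan.Win32.Zbot", "VendorB": "Win32/Zbot.Trojan"},
    "s3": {"VendorA": "Adware.Win32.Hotbar", "VendorB": "Win32/Hotbar.Adware"},
}


def tokenize(labels):
    """Lowercase every vendor label and split it on non-alphanumeric characters."""
    tokens = []
    for label in labels.values():
        tokens.extend(t for t in re.split(r"[^a-z0-9]+", label.lower()) if t)
    return tokens


def tfidf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * (ln(N / df) + 1).

    This is a simplified variant chosen for readability (an assumption),
    without smoothing or vector normalization.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {
            term: count * (math.log(n_docs / df[term]) + 1.0)
            for term, count in tf.items()
        }
    return weights


def top_terms(weights, doc_ids, k=2):
    """Sum the weights over a group of samples and return the k heaviest terms.

    Ties are broken alphabetically so the result is deterministic.
    """
    total = {}
    for doc_id in doc_ids:
        for term, weight in weights[doc_id].items():
            total[term] = total.get(term, 0.0) + weight
    ranked = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


docs = {sample_id: tokenize(labels) for sample_id, labels in SCANS.items()}
weights = tfidf(docs)

print(top_terms(weights, ["s1", "s2"], k=1))
print(top_terms(weights, ["s3"], k=2))
```

In this toy data the word shared by every sample ("win32") receives the lowest weight, while words that distinguish a group rise to the top, which is the property that makes word-level features usable despite vendors ordering and punctuating their labels differently.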
I (we), the author(s), hereby declare under full moral, financial and criminal liability that the manuscript submitted for publication to the Journal of Computer and Forensic Sciences
a) is the result of my (our) own original research and that I (we) hold the right to publish it;
b) does not infringe any copyright or other third-party proprietary rights;
c) complies with the Journal’s research and publishing ethics standards;
d) has not been published elsewhere, under this or any other title;
e) is not under consideration by another publication, under this or any other title.
I (we) also declare under full moral, financial and criminal liability:
f) that all conflicts of interest that may directly or potentially influence or impart bias on the work have been disclosed in the manuscript;
g) that if the article has been accepted for publishing I (we) will transfer all copyright ownership of the manuscript to the University of Criminal Investigation and Police Studies in Belgrade.
Signed by the Corresponding Author on behalf of all other authors.
