The clustering approach to malware dataset analysis

  • Slaviša Ilić, Ministry of Defence
  • Prof. Ph.D. Department of Information Technology, University of Criminal Investigation and Police Studies, Cara Dušana 196, 11080 Beograd, Serbia;
  • Prof. Ph.D. Department of Information Technology, University of Criminal Investigation and Police Studies, Cara Dušana 196, 11080 Beograd, Serbia;
  • Mr. Ministry of Defence, Republic of Serbia, Bircaninova 5, 11000 Belgrade, Serbia;
Keywords: malware, analysis, dataset, virus total, clustering, LSA, LDA

Abstract


Malware analysis is an active research area, with particular emphasis on machine learning algorithms that help automate the task. One of the leading portals that helps researchers obtain datasets is VirusTotal, which provides free academic accounts with access to hundreds of thousands of malware samples and their metadata. This work analyzes 429058 malware samples from VirusTotal with the aim of overcoming the problem of inconsistent labeling of antivirus scan results across vendors. Two methods were used, LSA and LDA, both with automatic calibration of parameters, to find the optimal number of clusters; both yielded 5. The clusters were visualized by k-means clustering in two-dimensional space. Additional analysis of the most informative words in each cluster showed 4 similar classes across the two methods, plus one cluster per method (LSA and LDA) whose words did not correspond in meaning to any cluster of the other method. The results suggest that this is a sound approach to malware data analysis when dealing with an inconsistently labeled dataset.
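The abstract does not spell out how the vendor labels are turned into text features, so the following is only a minimal sketch under stated assumptions: each sample's antivirus labels are split into word tokens and weighted with a plain TF-IDF scheme, which is a common input representation for methods such as LSA and LDA. The sample labels, the tokenization rule, and the function names `tokenize` and `tfidf` are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical antivirus labels for three samples (one inner list per sample).
# Real VirusTotal reports contain many more vendors and different strings.
SAMPLES = [
    ["Trojan.Win32.Agent", "Win32/Agent.trojan", "TR/Agent.gen"],
    ["Ransom.WannaCry", "Win32/Ransom.WannaCry", "Trojan-Ransom.Wanna"],
    ["Adware.Generic", "Win32/Adware.gen", "PUA.Adware"],
]


def tokenize(labels):
    """Lowercase the vendor labels and split them on non-alphanumeric characters."""
    tokens = []
    for label in labels:
        tokens.extend(t for t in re.split(r"[^a-z0-9]+", label.lower()) if t)
    return tokens


def tfidf(documents):
    """Return one {token: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each token once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            tok: (cnt / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights


docs = [tokenize(labels) for labels in SAMPLES]
weights = tfidf(docs)

# Show the three highest-weighted tokens per sample.
for sample_weights in weights:
    top = sorted(sample_weights.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
    print([tok for tok, _ in top])
```

In this toy input, a token that every vendor emits for every sample (here `win32`) receives zero weight, while family-like tokens that are specific to one sample dominate, which illustrates how word weighting can sidestep vendor-specific label formats.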

Author Biographies

Prof. Ph.D., Department of Information Technology, University of Criminal Investigation and Police Studies, Cara Dušana 196, 11080 Beograd, Serbia;

Department of Information Technology, Full Professor

Prof. Ph.D., Department of Information Technology, University of Criminal Investigation and Police Studies, Cara Dušana 196, 11080 Beograd, Serbia;

Department of Information Technology, Full Professor

References

[1] Greig J., „Cybercriminals raking in $1.5 trillion every year“, TechRepublic on 12 March 2020. [online], available at: https://www.techrepublic.com/article/cybercriminals-raking-in-1-5-trillion-every-year/. [Accessed: 01 March 2022].
[2] Anthon P., „Cybercrime annual revenue is 3 times bigger than Walmart's“, Atlas VPN on 11 March 2020. [online], available at: https://atlasvpn.com/blog/cybercrime-annual-revenue-is-3-times-bigger-than-walmarts. [Accessed: 01 March 2022].
[3] McGuire M., Into the web of profit, An in-depth study of cybercrime, criminals and money, Book, Project funded by Bromium, Inc., available at: https://www.bromium.com/wp-content/uploads/2018/05/Into-the-Web-of-Profit_Bromium.pdf . [Accessed: 01 March 2022].
[4] Sood G., VirusTotal web portal [online]. Available at: https://www.virustotal.com. [Accessed: 1 March 2022].
[5] https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
[6] Ortega, Joaquín & Almanza-Ortega, Nelva & Vega-Villalobos, Andrea & Pazos-Rangel, Rodolfo & Zavala-Diaz, José Crispin & Martínez-Rebollar, Alicia. (2019). The K-Means Algorithm Evolution. 10.5772/intechopen.85447.
[7] Giuseppe Bonaccorso (2018) Machine Learning Algorithms: Popular Algorithms for Data Science and Machine Learning, 2nd Edition. Birmingham: Packt Publishing. Available at: https://ezproxy.nb.rs:2076/login.aspx?direct=true&db=e000xww&AN=1881497&site=eds-live (Accessed: 9 March 2022).
[8] Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
[9] Slavisa Ilić, Milan Gnjatović, Brankica Popović, Nemanja Maček (2022). A PILOT COMPARATIVE ANALYSIS OF THE CUCKOO AND DRAKVUF SANDBOXES: AN END-USER PERSPECTIVE, Military technical courier, https://doi.org/10.5937/vojtehg70-39196
[10] Markelle Kelly, Rachel Longjohn, Kolby Nottingham, The UCI Machine Learning Repository, https://archive.ics.uci.edu
[11] Ilić, Slaviša, Milan Gnjatović, Ivan Tot, Boriša Jovanović, Nemanja Maček, and Marijana Gavrilović Božović. 2024. "Going beyond API Calls in Dynamic Malware Analysis: A Novel Dataset" Electronics 13, no. 17: 3553. https://doi.org/10.3390/electronics13173553
Published
2024/12/27
Section
Articles