Classification and analysis of the MNIST dataset using PCA and SVM algorithms

  • Mokhaled N.A. Al-Hamadani University of Debrecen, Doctoral School of Informatics, Department of Data Science and Visualization, Debrecen, Hungary; Northern Technical University, Technical Institute/Alhawija, Department of Electronic Techniques, Adan, Kirkuk, Republic of Iraq https://orcid.org/0000-0002-7042-3178
Keywords: statistical analysis, machine learning, SVM, PCA, classification

Abstract


Introduction/purpose: Machine learning methods have become indispensable for analyzing large-scale, complex data, with applications ranging from optimizing business operations to advancing scientific research. Such voluminous datasets offer considerable potential for insight and innovation, but they also pose significant challenges in areas such as data quality and structure, which call for effective management strategies. Machine learning techniques are essential tools for identifying and mitigating these challenges. MNIST, a large collection of handwritten digits, is one of the most widely used datasets in this field and is frequently employed for classification and analysis tasks, as in the present study.

Methods: This study applied several statistical techniques to the MNIST dataset, including Principal Component Analysis (PCA) implemented in Python. In addition, Support Vector Machine (SVM) models were applied to both linear and non-linear classification problems, and their accuracy was assessed.
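As a rough illustration of the workflow described in the Methods, the sketch below chains PCA with a linear and a non-linear (RBF-kernel) SVM in scikit-learn. It is not the study's code. It assumes scikit-learn's small bundled 8x8 digits dataset (load_digits) as a stand-in for MNIST, and the standardization step, the 30 retained components, the 75/25 split, and the helper name fit_pca_svm are all illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def fit_pca_svm(kernel, n_components=30, seed=0):
    """Fit scaler -> PCA -> SVC and return accuracy on a held-out split.

    Hypothetical helper for illustration; the parameters are assumptions.
    """
    # Assumption: 8x8 digits (64 features) stand in for 28x28 MNIST images.
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = make_pipeline(
        StandardScaler(),                                   # zero mean, unit variance per pixel
        PCA(n_components=n_components, random_state=seed),  # dimensionality reduction
        SVC(kernel=kernel),                                 # "linear" or "rbf" (non-linear)
    )
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


linear_acc = fit_pca_svm("linear")
rbf_acc = fit_pca_svm("rbf")
print(f"linear SVM accuracy: {linear_acc:.3f}")
print(f"RBF SVM accuracy:    {rbf_acc:.3f}")
```

Wrapping the three steps in one pipeline means the scaler and the PCA projection are fitted on the training split only, so the test accuracy is not contaminated by test data.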

Results: The results indicate that PCA is effective for dimensionality reduction but may be less effective for visualization. Both the linear and the non-linear SVM models classified the dataset effectively.
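One plausible reading of the visualization result is that a two-dimensional projection keeps only a small share of the total variance, while many more components are needed to preserve most of it. The sketch below checks this on scikit-learn's bundled 8x8 digits dataset, which is an assumption standing in for MNIST. The 95% threshold and the variable names are illustrative, and the numbers it prints are not the study's results.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: the small bundled digits dataset stands in for MNIST.
X, _ = load_digits(return_X_y=True)

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Share of the variance retained by a 2-D projection (as used for scatter plots).
var_2d = cumulative[1]

# Smallest number of components whose cumulative share reaches 95%.
k95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(f"variance kept by 2 components: {var_2d:.1%}")
print(f"components needed for 95%:     {k95} of {X.shape[1]}")
```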

Conclusion: The findings show that SVM can be an effective technique for classification problems.


Published: 2023/03/27
Section: Original Scientific Papers