OPTIMIZATION OF TOKENIZATION AND MEMORY MANAGEMENT FOR PROCESSING LARGE TEXTUAL CORPORA IN MULTILINGUAL APPLICATIONS
Abstract
Optimizing tokenization and memory management when processing large datasets is a key challenge in the contemporary development of language models. This paper focuses on improving the processing of large textual corpora in Serbian with the GPT-2 model, adapted through transfer learning. Tokenization was optimized by adding language-specific tokens for Serbian, while memory management was improved through advanced resource-management methods during training. Key findings demonstrate a significant reduction in memory consumption and an accelerated training process, enabling more efficient use of the available computational resources. This research contributes to the development of language models tailored to Serbian and provides a foundation for further studies in natural language processing (NLP). The implications of this work are multifaceted: it facilitates more efficient development of NLP applications for Serbian-speaking regions, improves the accuracy and performance of language models, and opens opportunities for applications across various domains, from automated translation to sentiment analysis. The study also paves the way for future research on further optimization of language models, including adaptation to other languages with similar characteristics, and on new methods for even more efficient memory management when processing large-scale textual data.
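To make the two optimization steps concrete, the following is a minimal sketch of one common way to add language-specific tokens to a GPT-2 tokenizer and to reduce training memory through gradient checkpointing, gradient accumulation, and mixed precision. It assumes the HuggingFace Transformers library with a PyTorch backend; the token list and hyperparameters are illustrative only and are not the settings used in the paper.

# A minimal sketch, assuming HuggingFace Transformers and PyTorch; the token
# list and hyperparameters below are illustrative, not the paper's settings.
from transformers import GPT2TokenizerFast, GPT2LMHeadModel, TrainingArguments

# Tokenization: extend the GPT-2 vocabulary with Serbian-specific tokens
# (hypothetical examples; the paper's actual token set is not reproduced here).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
serbian_tokens = ["č", "ć", "đ", "š", "ž", "dž", "lj", "nj"]
tokenizer.add_tokens(serbian_tokens)

# Transfer learning: reuse the pretrained weights and resize the embedding
# matrix so the newly added tokens receive trainable embedding rows.
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

# Memory management: recompute activations instead of storing them, and keep
# the per-step memory footprint low while preserving the effective batch size.
model.gradient_checkpointing_enable()
training_args = TrainingArguments(
    output_dir="gpt2-serbian",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch size of 32 with a smaller memory peak
    fp16=True,                      # mixed precision; assumes a CUDA-capable GPU
    num_train_epochs=3,
)
# training_args would then be passed to a Trainer together with the tokenized Serbian corpus.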