OPTIMIZATION OF TOKENIZATION AND MEMORY MANAGEMENT FOR PROCESSING LARGE TEXTUAL CORPORA IN MULTILINGUAL APPLICATIONS
Abstract
Optimizing tokenization and memory management when processing large datasets is a key challenge in the contemporary development of language models. This paper focuses on improving the processing of large textual corpora in Serbian with the GPT-2 model, adapted through transfer learning. Tokenization was optimized by adding language-specific tokens for Serbian, while memory management was improved through advanced resource-management methods during training. Key findings demonstrate a significant reduction in memory consumption and an accelerated training process, enabling more efficient use of the available computational resources. This research contributes to the development of language models tailored to Serbian and provides a foundation for further studies in natural language processing (NLP). The implications of this work are multifaceted: it facilitates more efficient development of NLP applications for Serbian-speaking regions, improves the accuracy and performance of language models, and opens opportunities for applications across various domains, from automated translation to sentiment analysis. The study also paves the way for future research on further optimization of language models, including adaptation to other languages with similar characteristics, and on new methods for even more efficient memory management when processing large-scale textual data.
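To make the two optimization steps concrete, the following is a minimal sketch of one common way to add language-specific tokens to a GPT-2 tokenizer and to reduce training memory through gradient checkpointing, gradient accumulation, and mixed precision. It assumes the HuggingFace Transformers library with a PyTorch backend; the token list and hyperparameters are illustrative only and are not the settings used in the paper.

# A minimal sketch, assuming HuggingFace Transformers and PyTorch; the token
# list and hyperparameters below are illustrative, not the paper's settings.
from transformers import GPT2TokenizerFast, GPT2LMHeadModel, TrainingArguments

# Tokenization: extend the GPT-2 vocabulary with Serbian-specific tokens
# (hypothetical examples; the paper's actual token set is not reproduced here).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
serbian_tokens = ["č", "ć", "đ", "š", "ž", "dž", "lj", "nj"]
tokenizer.add_tokens(serbian_tokens)

# Transfer learning: reuse the pretrained weights and resize the embedding
# matrix so the newly added tokens receive trainable embedding rows.
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

# Memory management: recompute activations instead of storing them, and keep
# the per-step memory footprint low while preserving the effective batch size.
model.gradient_checkpointing_enable()
training_args = TrainingArguments(
    output_dir="gpt2-serbian",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch size of 32 with a smaller memory peak
    fp16=True,                      # mixed precision; assumes a CUDA-capable GPU
    num_train_epochs=3,
)
# training_args would then be passed to a Trainer together with the tokenized Serbian corpus.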