Application of Convolutional Neural Networks to the Evaluation of Spoken Words Based on Lip Movements Without an Accompanying Sound Signal
Abstract
This paper proposes an approach to evaluating spoken words from lip movements alone, without an accompanying sound signal, using convolutional neural networks. The main goal of this research is to demonstrate the effectiveness of neural networks in a setting where all data are obtained from an array of images. The modeling and the hypotheses are validated on the results obtained for a specific case study. Our study reports on speech recognition from a provided sequence of images only, from which all crucial data and features are extracted, processed, and used in a model aimed at creating artificial consciousness.
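To make the described pipeline concrete, the following minimal sketch in Python with Keras shows how a convolutional network could map a fixed-length sequence of preprocessed lip-region frames to a word class. The frame geometry, layer sizes, and the vocabulary size NUM_WORDS are illustrative assumptions only, not the architecture or dataset used in this study.

# Minimal sketch (assumed architecture, not the authors' exact model):
# a 3D CNN that classifies a fixed-length sequence of grayscale
# mouth-region frames into one of NUM_WORDS spoken words.
import numpy as np
from tensorflow.keras import layers, models

FRAMES, HEIGHT, WIDTH, CHANNELS = 20, 60, 100, 1  # assumed frame geometry
NUM_WORDS = 10                                     # assumed vocabulary size

model = models.Sequential([
    layers.Input(shape=(FRAMES, HEIGHT, WIDTH, CHANNELS)),
    # Spatio-temporal convolutions over the image sequence
    layers.Conv3D(32, kernel_size=(3, 5, 5), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    layers.Conv3D(64, kernel_size=(3, 3, 3), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_WORDS, activation="softmax"),  # one score per word class
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy batch standing in for preprocessed lip-region sequences and word labels
x = np.random.rand(4, FRAMES, HEIGHT, WIDTH, CHANNELS).astype("float32")
y = np.random.randint(0, NUM_WORDS, size=(4,))
model.fit(x, y, epochs=1, verbose=0)

In practice, the random arrays above would be replaced by the extracted mouth-region crops and their corresponding word labels; the softmax output layer simply assigns one probability per word in the chosen vocabulary.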