Application of Convolutional Neural Networks to the Evaluation of Spoken Words Based on Lip Movements Without an Accompanying Sound Signal
Abstract
This paper proposes an approach to evaluating spoken words from lip movements alone, without an accompanying sound signal, using convolutional neural networks. The main goal of this research is to demonstrate the effectiveness of neural networks in a setting where all data are obtained from an array of images. The modeling and the hypotheses are validated on the results obtained for a specific case study. Our study reports on speech recognition from a provided sequence of images only, from which all crucial data and features are extracted, processed, and used in a model aimed at creating artificial consciousness.
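To make the described pipeline concrete, the following minimal sketch in Python with Keras shows how a convolutional network could map a fixed-length sequence of preprocessed lip-region frames to a word class. The frame geometry, layer sizes, and the vocabulary size NUM_WORDS are illustrative assumptions only, not the architecture or dataset used in this study.

# Minimal sketch (assumed architecture, not the authors' exact model):
# a 3D CNN that classifies a fixed-length sequence of grayscale
# mouth-region frames into one of NUM_WORDS spoken words.
import numpy as np
from tensorflow.keras import layers, models

FRAMES, HEIGHT, WIDTH, CHANNELS = 20, 60, 100, 1  # assumed frame geometry
NUM_WORDS = 10                                     # assumed vocabulary size

model = models.Sequential([
    layers.Input(shape=(FRAMES, HEIGHT, WIDTH, CHANNELS)),
    # Spatio-temporal convolutions over the image sequence
    layers.Conv3D(32, kernel_size=(3, 5, 5), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    layers.Conv3D(64, kernel_size=(3, 3, 3), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_WORDS, activation="softmax"),  # one score per word class
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy batch standing in for preprocessed lip-region sequences and word labels
x = np.random.rand(4, FRAMES, HEIGHT, WIDTH, CHANNELS).astype("float32")
y = np.random.randint(0, NUM_WORDS, size=(4,))
model.fit(x, y, epochs=1, verbose=0)

In practice, the random arrays above would be replaced by the extracted mouth-region crops and their corresponding word labels; the softmax output layer simply assigns one probability per word in the chosen vocabulary.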