Help | Advanced Search
Electrical Engineering and Systems Science > Audio and Speech Processing
Title: recent advances in end-to-end automatic speech recognition.
Abstract: Recently, the speech community is seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve the state-of-the-art results in most benchmarks in terms of ASR accuracy, hybrid models are still used in a large proportion of commercial ASR systems at the current time. There are lots of practical factors that affect the production model deployment decision. Traditional hybrid models, being optimized for production for decades, are usually good at these factors. Without providing excellent solutions to all these factors, it is hard for E2E models to be widely commercialized. In this paper, we will overview the recent advances in E2E models, focusing on technologies addressing those challenges from the industry's perspective.
- Download PDF
- Other Formats
References & Citations
- Google Scholar
- Semantic Scholar
BibTeX formatted citation
Bibliographic and Citation Tools
Code, data and media associated with this article, recommenders and search tools.
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs .
Subscribe to the PwC Newsletter
Join the community, add a new evaluation result row, speech recognition.
1020 papers with code • 312 benchmarks • 84 datasets
Speech Recognition is the task of converting spoken language into text. It involves recognizing the words spoken in an audio recording and transcribing them into a written format. The goal is to accurately transcribe the speech in real-time or from recorded audio, taking into account factors such as accents, speaking speed, and background noise.
( Image credit: SpecAugment )
Benchmarks Add a Result
Most implemented papers
Listen, attend and spell.
Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly.
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages.
Communication-Efficient Learning of Deep Networks from Decentralized Data
Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device.
Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition
Describes an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
On LibriSpeech, we achieve 6. 8% WER on test-other without the use of a language model, and 5. 8% WER with shallow fusion with a language model.
Deep Speech: Scaling up end-to-end speech recognition
We present a state-of-the-art speech recognition system developed using end-to-end deep learning.
Conformer: Convolution-augmented Transformer for Speech Recognition
Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs).
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
Recurrent Neural Network Regularization
wojzaremba/lstm • 8 Sep 2014
We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units.
Split Computing and Early Exiting for Deep Learning Applications: Survey and Research Challenges
Mobile devices such as smartphones and autonomous vehicles increasingly rely on deep neural networks (DNNs) to execute complex inference tasks such as image classification and speech recognition, among others.
voice recognition Recently Published Documents
- Latest Documents
- Most Cited Documents
- Contributed Authors
- Related Sources
- Related Keywords
Classification of Arabic fricative consonants according to their places of articulation
<span>Many technology systems have used voice recognition applications to transcribe a speaker’s speech into text that can be used by these systems. One of the most complex tasks in speech identification is to know, which acoustic cues will be used to classify sounds. This study presents an approach for characterizing Arabic fricative consonants in two groups (sibilant and non-sibilant). From an acoustic point of view, our approach is based on the analysis of the energy distribution, in frequency bands, in a syllable of the consonant-vowel type. From a practical point of view, our technique has been implemented, in the MATLAB software, and tested on a corpus built in our laboratory. The results obtained show that the percentage energy distribution in a speech signal is a very powerful parameter in the classification of Arabic fricatives. We obtained an accuracy of 92% for non-sibilant consonants /f, χ, ɣ, ʕ, ћ, and h/, 84% for sibilants /s, sҁ, z, Ӡ and ∫/, and 89% for the whole classification rate. In comparison to other algorithms based on neural networks and support vector machines (SVM), our classification system was able to provide a higher classification rate.</span>
THE EFFECT OF DIFFERENT OCCUPATIONAL BACKGROUND NOISES ON VOICE RECOGNITION ACCURACY
Abstract Voice recognition has become an integral part of our lives, commonly used in call centers and as part of virtual assistants. However, voice recognition is increasingly applied to more industrial uses. Each of these use cases has unique characteristics that may impact the effectiveness of voice recognition, which could impact industrial productivity, performance, or even safety. One of the most prominent among them is the unique background noises that are dominant in each industry. The existence of different machinery and different work layouts are primary contributors to this. Another important characteristic is the type of communication that is present in these settings. Daily communication often involves longer sentences uttered under relatively silent conditions, whereas communication in industrial settings is often short and conducted in loud conditions. In this study, we demonstrated the importance of taking these two elements into account by comparing the performances of two voice recognition algorithms under several background noise conditions: a regular Convolutional Neural Network (CNN) based voice recognition algorithm to an Auto Speech Recognition (ASR) based model with a denoising module. Our results indicate that there is a significant performance drop between the typical background noise use (white noise) and the rest of the background noises. Also, our custom ASR model with the denoising module outperformed the CNN based model with an overall performance increase between 14-35% across all background noises. . Both results give proof that specialized voice recognition algorithms need to be developed for these environments to reliably deploy them as control mechanisms.
Supervised Learning Methods and Applications in Medical Research
Machine Learning (ML) and Artificial Intelligence (AI) methods are transforming many commercial and academic areas, including feature extraction, autonomous driving, computational linguistics, and voice recognition. These new technologies are now having a significant effect in radiography, forensics, and many other areas where the accessibility of automated systems may improve the precision and repeatability of essential job performance. In this systematic review, we begin by providing a short overview of the different methods that are currently being developed, with a particular emphasis on those utilized in biomedical studies.
Steering a Robotic Wheelchair Based on Voice Recognition System Using Convolutional Neural Networks
Many wheelchair people depend on others to control the movement of their wheelchairs, which significantly influences their independence and quality of life. Smart wheelchairs offer a degree of self-dependence and freedom to drive their own vehicles. In this work, we designed and implemented a low-cost software and hardware method to steer a robotic wheelchair. Moreover, from our method, we developed our own Android mobile app based on Flutter software. A convolutional neural network (CNN)-based network-in-network (NIN) structure approach integrated with a voice recognition model was also developed and configured to build the mobile app. The technique was also implemented and configured using an offline Wi-Fi network hotspot between software and hardware components. Five voice commands (yes, no, left, right, and stop) guided and controlled the wheelchair through the Raspberry Pi and DC motor drives. The overall system was evaluated based on a trained and validated English speech corpus by Arabic native speakers for isolated words to assess the performance of the Android OS application. The maneuverability performance of indoor and outdoor navigation was also evaluated in terms of accuracy. The results indicated a degree of accuracy of approximately 87.2% of the accurate prediction of some of the five voice commands. Additionally, in the real-time performance test, the root-mean-square deviation (RMSD) values between the planned and actual nodes for indoor/outdoor maneuvering were 1.721 × 10−5 and 1.743 × 10−5, respectively.
Voice Versus Keyboard and Mouse for Text Creation on Arabic User Interfaces
Voice User Interfaces (VUIs) are increasingly popular owing to improvements in automatic speech recognition. However, the understanding of user interaction with VUIs, particularly Arabic VUIs, remains limited. Hence, this research compared user performance, learnability, and satisfaction when using voice and keyboard-and-mouse input modalities for text creation on Arabic user interfaces. A Voice-enabled Email Interface (VEI) and a Traditional Email Interface (TEI) were developed. Forty participants attempted pre-prepared and self-generated message creation tasks using voice on the VEI, and the keyboard-and-mouse modal on the TEI. The results showed that participants were faster (by 1.76 to 2.67 minutes) in pre-prepared message creation using voice than using the keyboard and mouse. Participants were also faster (by 1.72 to 2.49 minutes) in self-generated message creation using voice than using the keyboard and mouse. Although the learning curves were more efficient with the VEI, more participants were satisfied with the TEI. With the VEI, participants reported problems, such as misrecognitions and misspellings, but were satisfied about the visibility of possible executable commands and about the overall accuracy of voice recognition.
Smart Home Automated System Using Raspberry-Pi and Google Cloud Services
Abstract: This paper represents the development of an automated system based on IoT, which can mainly be used in the home and some features can also be implemented in offices, banks, or schools. The main purpose of this project is to save time and manpower along with security and convenience, using Raspberry pi. The salient features of this automated system are gas leakage detection for safety purposes, motion detection for security purposes, and controlling the home appliances as per the user’s need. The system takes command through voice as well as text as per the user’s requirements using google assistant, which further sends a response to Raspberry-Pi via Firebase for the required action. DHT22 sensor is used for the measurement of temperature and humidity, room temperature and Humidity will be displayed through Google assistant. This system consists of Python as the main programming language by default, provided by Raspberry Pi. The system will detect human presence with the help of a motion sensor i.e, whenever a person enters the room, the motion is detected and automatically an alert message will be sent to the user via Google assistant. Keywords: IoT, Raspberry-pi, Google assistant, Firebase, Python, Dialogflow, Voice Recognition.
Digital Technology in Diagnostic Breast Pathology and Immunohistochemistry
Digital technology has been used in the field of diagnostic breast pathology and immunohistochemistry (IHC) for decades. Examples include automated tissue processing and staining, digital data processing, storing and management, voice recognition systems, and digital technology-based production of antibodies and other IHC reagents. However, the recent application of whole slide imaging technology and artificial intelligence (AI)-based tools has attracted a lot of attention. The use of AI tools in breast pathology is discussed briefly as it is covered in other reviews. Here, we present the main application of digital technology in IHC. This includes automation of IHC staining, using image analysis systems and computer vision technology to interpret IHC staining, and the use of AI-based tools to predict marker expression from haematoxylin and eosin-stained digitalized images.
Designing Depression Screening Chatbots
Advances in voice recognition, natural language processing, and artificial intelligence have led to the increasing availability and use of conversational agents (chatbots) in different settings. Chatbots are systems that mimic human dialogue interaction through text or voice. This paper describes a series of design considerations for integrating chatbots interfaces with health services. The present paper is part of ongoing work that explores the overall implementation of chatbots in the healthcare context. The findings have been created using a research through design process, combining (1) literature survey of existing body of knowledge on designing chatbots, (2) analysis on state-of-the-practice in using chatbots as service interfaces, and (3) generative process of designing a chatbot interface for depression screening. In this paper we describe considerations that would be useful for the design of a chatbot for a healthcare context.
Early Prediction of DNN Activation Using Hierarchical Computations
Deep Neural Networks (DNNs) have set state-of-the-art performance numbers in diverse fields of electronics (computer vision, voice recognition), biology, bioinformatics, etc. However, the process of learning (training) from the data and application of the learnt information (inference) process requires huge computational resources. Approximate computing is a common method to reduce computation cost, but it introduces loss in task accuracy, which limits their application. Using an inherent property of Rectified Linear Unit (ReLU), a popular activation function, we propose a mathematical model to perform MAC operation using reduced precision for predicting negative values early. We also propose a method to perform hierarchical computation to achieve the same results as IEEE754 full precision compute. Applying this method on ResNet50 and VGG16 shows that up to 80% of ReLU zeros (which is 50% of all ReLU outputs) can be predicted and detected early by using just 3 out of 23 mantissa bits. This method is equally applicable to other floating-point representations.
Export Citation Format
- Published: 22 February 2023
Trends in speech emotion recognition: a comprehensive survey
- Kamaldeep Kaur ORCID: orcid.org/0000-0002-3542-1214 1 , 2 &
- Parminder Singh 2
Multimedia Tools and Applications volume 82 , pages 29307–29351 ( 2023 ) Cite this article
Among the other modes of communication, such as text, body language, facial expressions, and so on, human beings employ speech as the most common. It contains a great deal of information, including the speaker’s feelings. Detecting the speaker’s emotions from his or her speech has shown to be quite useful in a variety of real-world applications. The dataset development, feature extraction, feature selection/dimensionality reduction, and classification are the four primary processes in the Speech Emotion Recognition process. In this context, more than 70 studies are thoroughly examined in terms of their databases, emotions, features extracted, and classifiers employed. The databases, characteristics, extraction and classification methods, as well as the results, are all thoroughly examined. The study also includes a comparative analysis of these research papers.
This is a preview of subscription content, access via your institution .
Buy single article.
Instant access to the full article PDF.
Price includes VAT (Russian Federation)
Rent this article via DeepDyve.
Abdel-Hamid L (2020) Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features. Speech Commun 122(May):19–30. https://doi.org/10.1016/j.specom.2020.04.005
Article Google Scholar
Abdelwahab M, Busso C (2019) “Active Learning for Speech Emotion Recognition Using Deep Neural Network,” 2019 8th International Conference on Affective Computing and Intelligent Interaction, ACII 2019, pp. 441–447, https://doi.org/10.1109/ACII.2019.8925524 .
Abdi H, Williams LJ (2010) Principal component analysis. WIREs Comput Stat 2(4):433–459. https://doi.org/10.1002/wics.101
Agrawal SS (2011) “Emotions in Hindi speech- Analysis, perception and recognition,” 2011 International Conference on Speech Database and Assessments, Oriental COCOSDA 2011 - Proceedings, pp. 7–13, https://doi.org/10.1109/ICSDA.2011.6085972 .
Akçay MB, Oğuz K (2020) Speech emotion recognition: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Comm 116:56–76. https://doi.org/10.1016/j.specom.2019.12.001
Albawi S, Abed Mohammed T, Alzawi S (2017) Understanding of a convolutional neural network. In: 2017 IEEE International Conference on Engineering and Technology (ICET). https://doi.org/10.1109/ICEngTechnol.2017.8308186
Albornoz EM, Milone DH (2017) Emotion recognition in never-seen languages using a novel ensemble method with emotion profiles. IEEE Trans Affect Comput 8(1):43–53. https://doi.org/10.1109/TAFFC.2015.2503757
Albornoz EM, Milone DH, Rufiner HL (2011) Spoken emotion recognition using hierarchical classifiers. Comput Speech Lang 25(3):556–570. https://doi.org/10.1016/j.csl.2010.10.001
Anagnostopoulos CN, Iliou T, Giannoukos I (2012) Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011. Artif Intell Rev 43(2):155–177. https://doi.org/10.1007/s10462-012-9368-5
Balakrishnama S, Ganapathiraju A (1998) “Linear Discriminant Analysis—A Brief Tutorial,” accessed on 10.09.2021
Bansal S, Dev A (2013) “Emotional hindi speech database,” 2013 International Conference Oriental COCOSDA Held Jointly with 2013 Conference on Asian Spoken Language Research and Evaluation, O-COCOSDA/CASLRE 2013, pp. 1–4, https://doi.org/10.1109/ICSDA.2013.6709867 .
Bansal S, Dev A (2015) Emotional Hindi speech: Feature extraction and classification. 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom) 03:1865–1868
Beaufays F (1995) Transform-domain adaptive filters: an analytical approach. IEEE Trans Signal Process 43(2):422–431. https://doi.org/10.1109/78.348125
Bhattacharyya S et al (2018) Speech Background Noise Removal Using Different Linear Filtering Techniques. Lect Notes Electr Eng 475:297–307. https://doi.org/10.1007/978-981-10-8240-5
Boersma P, Weenink D (2001) PRAAT, a system for doing phonetics by computer. Glot Int 5:341–345
Boggs K, Liam (2017) Performance measures for machine learning, accessed on 11.08.2021
Bou-Ghazale SE, Hansen JHL (Jul. 2000) A comparative study of traditional and newly proposed features for recognition of speech under stress. IEEE Transact Speech Aud Process 8(4):429–442. https://doi.org/10.1109/89.848224
Brookes M (1997) Voicebox: Speech processing toolbox for matlab. Imperial College, London. http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html . Accessed 06.09.2021
Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Disc 2(2):121–167. https://doi.org/10.1023/A:1009715923555
Burkhardt F, Paeschke A, Rolfes M, Sendlmeier W, Weiss B (2005) “A database of German emotional speech,” in 9th European Conference on Speech Communication and Technology, , vol. 5, pp. 1517–1520
Busso C, Bulut M, Lee CC, Kazemzadeh A, Mower E, Kim S, Chang JN, Lee S, Narayanan SS (2008) IEMOCAP: interactive emotional dyadic motion capture database. Lang Resour Eval 42:335–359. https://doi.org/10.1007/s10579-008-9076-6
Busso C, Metallinou A, Narayanan SS (2011) “Iterative feature normalization for emotional speech detection,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5692–5695, https://doi.org/10.1109/ICASSP.2011.5947652 .
C. academic of science Institute of automation (2005) CASIA-Chinese emotional speech corpus, Chin Linguist Data Consortium (CLDC). http://shachi.org/resources/27 . Accessed 17 Oct 2021
Cao H, Verma R, Nenkova A (2015) Speaker-sensitive emotion recognition via ranking: studies on acted and spontaneous speech. Comput Speech Lang 29(1):186–202. https://doi.org/10.1016/j.csl.2014.01.003
Chakroborty S, Saha G (2010) Feature selection using singular value decomposition and QR factorization with column pivoting for text-independent speaker identification. Speech Commun 52(9):693–709. https://doi.org/10.1016/j.specom.2010.04.002
Chandrasekar P, Chapaneri S, Jayaswal D (2014) “Automatic speech emotion recognition: A survey,” in 2014 International Conference on Circuits, Systems, Communication and Information Technology Applications, CSCITA 2014, pp. 341–346, https://doi.org/10.1109/CSCITA.2014.6839284 .
Chen X, Jeong JC (2007) “Enhanced recursive feature elimination,” in Sixth International Conference on Machine Learning and Applications (ICMLA 2007), pp. 429–435, https://doi.org/10.1109/ICMLA.2007.35 .
Chen Y, Xie J (2012) “Emotional speech recognition based on SVM with GMM supervector,” Journal of Electronics (China), vol. 29, https://doi.org/10.1007/s11767-012-0871-2 .
Chen C, You M, Song M, Bu J, Liu J (2006) “An Enhanced Speech Emotion Recognition System Based on Discourse Information BT - Computational Science – ICCS 2006,” in ICCS, pp. 449–456
Chen B, Yin Q, Guo P (2014) “A study of deep belief network based Chinese speech emotion recognition,” Proceedings - 2014 10th International Conference on Computational Intelligence and Security, CIS 2014, pp. 180–184, https://doi.org/10.1109/CIS.2014.148 .
Chen M, He X, Yang J, Zhang H (2018) 3-D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Process Lett 25(10):1440–1444. https://doi.org/10.1109/LSP.2018.2860246
Chen Z, Jiang F, Cheng Y, Gu X, Liu W, Peng J (2018) “XGBoost Classifier for DDoS Attack Detection and Analysis in SDN-Based Cloud,” in 2018 IEEE international conference on big data and smart computing (BigComp), pp. 251–256, https://doi.org/10.1109/BigComp.2018.00044 .
Chenchen Huang DF, Gong W, Wenlong F (2014) A research of speech emotion recognition based on deep belief network and SVM. Math Problems Eng, Article ID 749604. https://doi.org/10.1155/2014/749604
Chiu S, Tavella D (2008) Introduction to data mining. Data Min Market Intel Optimal Market Returns:137–192. https://doi.org/10.1016/b978-0-7506-8234-3.00007-1
Choudhury AR, Ghosh A, Pandey R, Barman S (2018) “Emotion recognition from speech signals using excitation source and spectral features,” Proceedings of 2018 IEEE Applied Signal Processing Conference, ASPCON 2018, pp. 257–261, https://doi.org/10.1109/ASPCON.2018.8748626 .
Clavel C, Vasilescu I, Devillers L, Richard G, Ehrette T (2008) Fear-type emotion recognition for future audio-based surveillance systems. Speech Commun 50(6):487–503. https://doi.org/10.1016/j.specom.2008.03.012
Darekar RV, Dhande AP (2018) Emotion recognition from Marathi speech database using adaptive artificial neural network. Biol Inspired Cogn Architect 23(January):35–42. https://doi.org/10.1016/j.bica.2018.01.002
Dellaert F, Polzin T, Waibel A (1996) “Recognizing emotion in speech,” in Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP ‘96, vol. 3, pp. 1970–1973, https://doi.org/10.1109/ICSLP.1996.608022 .
Devillers L, Vidrascu L (2007) Real-Life Emotion Recognition in Speech BT - Speaker Classification II: Selected Projects, C. Müller, Ed. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 34–42
Dey A, Chattopadhyay S, Singh PK, Ahmadian A, Ferrara M, Sarkar R (2020) A hybrid Meta-heuristic feature selection method using Golden ratio and equilibrium optimization algorithms for speech emotion recognition. IEEE Access 8:200953–200970. https://doi.org/10.1109/ACCESS.2020.3035531
Dhall A, Goecke R, Gedeon T (2011) Acted facial expressions in the wild database. Tech Rep, no, [Online]. Available: http://cs.anu.edu.au/techreports/ . Accessed 27 Oct 2021
Duda PEHRO, Hart PE, Duda RO (1973) Pattern classification and scene analysis. Leonardo 19(4):462–463
MATH Google Scholar
Dupuis K, Pichora-Fuller M (2011) Recognition of emotional speech for younger and older talkers: Behavioural findings from the Toronto emotional speech set. Can Acoust Acoustique Canadienne 39:182–183
Ekman P (1992) An argument for basic emotions. Cognit Emot 6(3):169–200
El Ayadi M, Kamel MS, Karray F (2011) Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recogn 44(3):572–587. https://doi.org/10.1016/j.patcog.2010.09.020
Article MATH Google Scholar
Engberg IS, Hansen AV, Andersen O, Dalsgaard P (1997) “Design, recording and verification of a danish emotional speech database,” in 5th European Conference on Speech Communication and Technology, Rhodes, Greece, pp. 1–4
Er MB (2020) “A Novel Approach for Classification of Speech Emotions Based on Deep and Acoustic Features,” IEEE Access, vol. 8, https://doi.org/10.1109/ACCESS.2020.3043201 .
Essentia Toolkit (n.d.) https://essentia.upf.edu . Accessed 16 Nov 2021
Eyben F (n.d.) Eight emotional speech databases. https://mediatum.ub.tum.de/ . Accessed 18 Nov 2021
Eyben F, Schuller B (2015) OpenSMILE: the Munich open-source large-scale multimedia feature extractor. SIG Multimed Rec 6(4):4–13. https://doi.org/10.1145/2729095.2729097
Eyben F, Wöllmer M, Schuller B (2009) “OpenEAR — Introducing the munich open-source emotion and affect recognition toolkit,” in 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, pp. 1–6, https://doi.org/10.1109/ACII.2009.5349350 .
Eyben F, Wöllmer M, Schuller B (2010) “Opensmile: The Munich Versatile and Fast Open-Source Audio Feature Extractor,” in Proceedings of the 18th ACM International Conference on Multimedia, pp. 1459–1462, https://doi.org/10.1145/1873951.1874246 .
Eyben F, Scherer KR, Schuller BW, Sundberg J, Andre E, Busso C, Devillers LY, Epps J, Laukka P, Narayanan SS, Truong KP (2016) The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans Affect Comput 7(2):190–202. https://doi.org/10.1109/TAFFC.2015.2457417
Ezz-Eldin M, Khalaf AAM, Hamed HFA, Hussein AI (2021) Efficient feature-aware hybrid model of deep learning architectures for speech emotion recognition. IEEE Access 9:1–1. https://doi.org/10.1109/access.2021.3054345
Faramarzi A, Heidarinejad M, Stephens B, Mirjalili S (2020) Equilibrium optimizer: A novel optimization algorithm. Knowl-Based Syst 191:105190. https://doi.org/10.1016/j.knosys.2019.105190
Farhoudi Z, Setayeshi S, Rabiee A (2017) Using learning automata in brain emotional learning for speech emotion recognition. Int J Speech Technol 20(3):553–562. https://doi.org/10.1007/s10772-017-9426-0
Fayek HM, Lech M, Cavedon L (2017) Evaluating deep learning architectures for speech emotion recognition. Neural Netw 92:60–68. https://doi.org/10.1016/j.neunet.2017.02.013
Ferdib-Al-Islam, L. Akter, and M. M. Islam (2021) “Hepatocellular Carcinoma Patient’s Survival Prediction Using Oversampling and Machine Learning Techniques,” Int Conf Robot Electr Signal Process Tech, pp. 445–450, https://doi.org/10.1109/ICREST51555.2021.9331108 .
Fernandez R, Picard RW (2003) Modeling drivers’ speech under stress. Speech Commun 40(1):145–159. https://doi.org/10.1016/S0167-6393(02)00080-8
Fischer A, Igel C (2014) Training restricted Boltzmann machines: An introduction. Pattern Recognit 47(1):25–39. https://doi.org/10.1016/j.patcog.2013.05.025
Fonti V (2017) Feature selection using LASSO. VU Amsterdam:1–26
Fukunaga K, Mantock JM (1983) Nonparametric discriminant analysis. IEEE Trans Pattern Anal Mach Intell 5(6):671–678. https://doi.org/10.1109/tpami.1983.4767461
Giannakopoulos T (2015) PyAudioAnalysis: An open-source python library for audio signal analysis. PLoS One 10(12):1–17. https://doi.org/10.1371/journal.pone.0144610
Giannakopoulos T, Pikrakis A, Theodoridis S (2009) “A dimensional approach to emotion recognition of speech from movies,” in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 65–68, https://doi.org/10.1109/ICASSP.2009.4959521 .
Gomes J, El-Sharkawy M (2015) “i-Vector Algorithm with Gaussian Mixture Model for Efficient Speech Emotion Recognition,” 2015 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 476–480, https://doi.org/10.1109/CSCI.2015.17 .
Grimm M, Kroschel K, Narayanan S (2007) “Support Vector Regression for Automatic Recognition of Spontaneous Emotions in Speech,” in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP ‘07, vol. 4, pp. IV-1085-IV–1088, https://doi.org/10.1109/ICASSP.2007.367262 .
Hansen JHL, Bou-Ghazale SE (1997) Getting started with SUSAS: a speech under simulated and actual stress database. https://catalog.ldc.upenn.edu/LDC99S78 . Accessed 28 Nov 2021
Hifny Y, Ali A (2019) “Efficient Arabic Emotion Recognition Using Deep Neural Networks,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2019-May, pp. 6710–6714, https://doi.org/10.1109/ICASSP.2019.8683632 .
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780. https://doi.org/10.1162/neco.19126.96.36.1995
Hossain MS, Muhammad G (2019) Emotion recognition using deep learning approach from audio–visual emotional big data. Inf Fus 49:69–78. https://doi.org/10.1016/j.inffus.2018.09.008
Hossin M, Sulaiman MN (2015) A review on evaluation metrics for data classification evaluations. Int J Data Min Knowl Manag Process 5(2):01–11. https://doi.org/10.5121/ijdkp.2015.5201
Hozjan V, Kacic Z, Moreno A, Bonafonte A, Nogueiras A (2002) Interface databases: design and collection of a multilingual emotional speech database. http://www.lrec-conf.org/proceedings/lrec2002
Huang R, Ma C (2006) “Toward A Speaker-Independent Real-Time Affect Detection System,” in 18th International Conference on Pattern Recognition (ICPR’06), vol. 1, pp. 1204–1207, https://doi.org/10.1109/ICPR.2006.1127 .
Inger AVH, Engberg S (1996) Documentation of the danish emotional speech database DES. Aalborg. https://vbn.aau.dk/en . Accessed 14 Aug 2021
Islam MM, Islam MR, Islam MS (2020) “An Efficient Human Computer Interaction through Hand Gesture Using Deep Convolutional Neural Network,” SN Comput Sci, vol. 1, no. 4, https://doi.org/10.1007/s42979-020-00223-x .
Islam MM, Islam MZ, Asraf A, Ding W (2020) “Diagnosis of COVID-19 from X-rays using combined CNN-RNN architecture with transfer learning,” medRxiv, https://doi.org/10.1101/2020.08.24.20181339 .
Islam MR, Moni MA, Islam MM, Rashed-al-Mahfuz M, Islam MS, Hasan MK, Hossain MS, Ahmad M, Uddin S, Azad A, Alyami SA, Ahad MAR, Lio P (2021) Emotion recognition from EEG signal focusing on deep learning and shallow learning techniques. IEEE Access 9:94601–94624. https://doi.org/10.1109/ACCESS.2021.3091487
Islam MR et al (2021) EEG Channel Correlation Based Model for Emotion Recognition. Comput Biol Med 136(May):104757. https://doi.org/10.1016/j.compbiomed.2021.104757
Issa D, Fatih Demirci M, Yazici A (2020) Speech emotion recognition with deep convolutional neural networks. Biomed Signal Process Control 59:1–11. https://doi.org/10.1016/j.bspc.2020.101894
Jackson P, Haq S (2011) Surrey Audio-Visual Expressed Emotion (SAVEE) database. http://kahlan.eps.surrey.ac.uk/savee . Accessed 17 Sept 2021
Jaiswal JK, Samikannu R (2017) “Application of Random Forest Algorithm on Feature Subset Selection and Classification and Regression,” in 2017 World congress on computing and communication technologies (WCCCT), pp. 65–68, https://doi.org/10.1109/WCCCT.2016.25 .
Jaratrotkamjorn A, Choksuriwong A (2019) “Bimodal Emotion Recognition using Deep Belief Network,” ICSEC 2019 - 23rd International Computer Science and Engineering Conference, pp. 103–109, https://doi.org/10.1109/ICSEC47112.2019.8974707 .
Jiang P, Fu H, Tao H, Lei P, Zhao L (2019) Parallelized convolutional recurrent neural network with spectral features for speech emotion recognition. IEEE Access 7:90368–90377. https://doi.org/10.1109/ACCESS.2019.2927384
Jin Y, Zha C, Zhao L, Song P (2015) Speech emotion recognition method based on hidden factor analysis. Electron Lett 51(1):112–114. https://doi.org/10.1049/el.2014.3339
Jing S, Mao X, Chen L (2018) Prominence features: effective emotional features for speech emotion recognition. Digit Signal Process A Rev J 72:216–231. https://doi.org/10.1016/j.dsp.2017.10.016
Kamble VV, Gaikwad BP, Rana DM (2014) “Spontaneous emotion recognition for Marathi Spoken Words,” International Conference on Communication and Signal Processing, ICCSP 2014 - Proceedings, pp. 1984–1990, https://doi.org/10.1109/ICCSP.2014.6950191 .
Kandali AB, Routray A, Basu TK (2008) “Emotion recognition from Assamese speeches using MFCC features and GMM classifier,” IEEE Region 10 Annual International Conference, Proceedings/TENCON, https://doi.org/10.1109/TENCON.2008.4766487 .
Kate Dupuis MKP-F (2010) Toronto emotional speech set (TESS). University of Toronto , Psychology Department. https://tspace.library.utoronto.ca/handle/1807/24487 . Accessed 08.10.2021
Kattubadi IB, Garimella RM (2019) “Emotion Classification: Novel Deep Learning Architectures,” 2019 5th International Conference on Advanced Computing and Communication Systems, ICACCS 2019, pp. 285–290, https://doi.org/10.1109/ICACCS.2019.8728519 .
Khalil RA, Jones E, Babar MI, Jan T, Zafar MH, Alhussain T (2019) Speech emotion recognition using deep learning techniques: a review. IEEE Access 7:117327–117345. https://doi.org/10.1109/access.2019.2936124
Khan A, Islam M (2016) Deep belief networks. In: Proceedings of Introduction to Deep Neural Networks At: PIEAS, Islamabad, Pakistan. https://doi.org/10.13140/RG.2.2.17217.15200
King DE (2009) Dlib-ml: a machine learning toolkit. J Mach Learn Res 10:1755–1758
Koolagudi SG, Rao KS (2012) Emotion recognition from speech: a review. Int J Speech Technol 15(2):99–117. https://doi.org/10.1007/s10772-011-9125-1
Koolagudi SG, Rao KS (2012) Emotion recognition from speech using source, system, and prosodic features. Int J Speech Technol 15(2):265–289. https://doi.org/10.1007/s10772-012-9139-3
Koolagudi SG, Maity S, Kumar VA, Chakrabarti S, Rao KS (2009) “IITKGP-SESC: Speech Database for Emotion Analysis,” in Contemporary Computing, pp. 485–492
Koolagudi SG, Reddy R, Yadav J, Rao KS (2011) “IITKGP-SEHSC : Hindi speech corpus for emotion analysis,” 2011 International Conference on Devices and Communications, ICDeCom 2011 - Proceedings, https://doi.org/10.1109/ICDECOM.2011.5738540 .
Koolagudi SG, Murthy YVS, Bhaskar SP (2018) Choice of a classifier, based on properties of a dataset: case study-speech emotion recognition. Int J Speech Technol 21(1):167–183. https://doi.org/10.1007/s10772-018-9495-8
Krothapalli SR, Koolagudi SG (2013) Characterization and recognition of emotions from speech using excitation source information. Int J Speech Technol 16(2):181–201. https://doi.org/10.1007/s10772-012-9175-z
Kubat M (1999) Neural networks: a comprehensive foundation. Knowl Eng Rev 13(4):409–412. https://doi.org/10.1017/S0269888998214044
Kuchibhotla S, Vankayalapati HD, Vaddi RS, Anne KR (2014) A comparative analysis of classifiers in emotion recognition through acoustic features. Int J Speech Technol 17(4):401–408. https://doi.org/10.1007/s10772-014-9239-3
Kuchibhotla S, Deepthi H, Koteswara V, Anne R (2016) An optimal two stage feature selection for speech emotion recognition using acoustic features. Int J Speech Technol 19(4):657–667. https://doi.org/10.1007/s10772-016-9358-0
Kwon O-W, Chan K, Hao J, Lee T-W (2003) Emotion recognition by speech signals. In: 8th European Conference on Speech Communication and Technology. https://doi.org/10.21437/eurospeech.2003-80
Lalitha S, Mudupu A, Nandyala BV, Munagala R (2015) “Speech emotion recognition using DWT,” 2015 IEEE international conference on computational intelligence and computing research, ICCIC, 2016, https://doi.org/10.1109/ICCIC.2015.7435630 .
Lalitha S, Tripathi S, Gupta D (2019) Enhanced speech emotion detection using deep neural networks. Int J Speech Technol 22(3):497–510. https://doi.org/10.1007/s10772-018-09572-8
Langley P, Iba W, Thompson K (1998) “An Analysis of Bayesian Classifiers,” Proceedings of the Tenth National Conference on Artificial Intelligence, vol. 90
Lee C-Y, Chen B-S (2018) Mutually-exclusive-and-collectively-exhaustive feature selection scheme. Appl Soft Comput 68:961–971. https://doi.org/10.1016/j.asoc.2017.04.055
Lee C, Narayanan SS, Pieraccini R (2002) Classifying emotions in human-machine spoken dialogs. Proceed IEEE Int Conf Multimed Expo 1:737–740
Lee KH, Kyun Choi H, Jang BT, Kim DH (2019) “A Study on Speech Emotion Recognition Using a Deep Neural Network,” ICTC 2019 - 10th International Conference on ICT Convergence: ICT Convergence Leading the Autonomous Future, pp. 1162–1165, https://doi.org/10.1109/ICTC46691.2019.8939830 .
Li X (2007) SPEech Feature Toolbox (SPEFT) design and emotional speech feature extraction. https://epublications.marquette.edu/theses/1315 . Accessed 25 Aug 2021
Li J, Fu X, Shao Z, Shang Y (2019) “Improvement on Speech Depression Recognition Based on Deep Networks,” Proceedings 2018 Chinese Automation Congress, CAC 2018, pp. 2705–2709, https://doi.org/10.1109/CAC.2018.8623055 .
Li Y, Baidoo C, Cai T, Kusi GA (2019) “Speech Emotion Recognition Using 1D CNN with No Attention,” ICSEC 2019 - 23rd International Computer Science and Engineering Conference, pp. 351–356, https://doi.org/10.1109/ICSEC47112.2019.8974716 .
Li Z, Li J, Ma S, Ren H (2019) “Speech emotion recognition based on residual neural network with different classifiers,” Proceedings - 18th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2019, pp. 186–190, https://doi.org/10.1109/ICIS46139.2019.8940308 .
Li D, Liu J, Yang Z, Sun L, Wang Z (2021) Speech Emotion Recognition Using Recurrent Neural Networks with Directional Self-Attention. Exp Syst Appl 173:11468. https://doi.org/10.1016/j.eswa.2021.114683
Liberman M (2002) Emotional prosody speech and transcripts. University of Pennsylvania. https://catalog.ldc.upenn.edu/LDC2002S28 . Accessed 14 Oct 2021
Lim W, Jang D, Lee T (2016) “Speech emotion recognition using convolutional and Recurrent Neural Networks,” 2016 Asia-Pacific signal and information processing association annual summit and conference , APSIPA, 2017, https://doi.org/10.1109/APSIPA.2016.7820699 .
Litman DJ, Forbes-Riley K (2004) “Predicting Student Emotions in Computer-Human Tutoring Dialogues,” in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pp. 351–358, https://doi.org/10.3115/1218955.1219000 .
Liu Y, Zhou Y, Wen S, Tang C (2014) A strategy on selecting performance metrics for classifier evaluation. Int J Mob Comput Multimed Commun 6(4):20–35. https://doi.org/10.4018/IJMCMC.2014100102
Liu B, Zhou Y, Xia Z, Liu P, Yan Q, Xu H (2018) Spectral regression based marginal Fisher analysis dimensionality reduction algorithm. Neurocomputing 277:101–107. https://doi.org/10.1016/j.neucom.2017.05.097
Liu ZT, Xie Q, Wu M, Cao WH, Mei Y, Mao JW (2018) Speech emotion recognition based on an improved brain emotion learning model. Neurocomputing 309:145–156. https://doi.org/10.1016/j.neucom.2018.05.005
Liu ZT, Wu M, Cao WH, Mao JW, Xu JP, Tan GZ (2018) Speech emotion recognition based on feature selection and extreme learning machine decision tree. Neurocomputing 273:271–280. https://doi.org/10.1016/j.neucom.2017.07.050
Livingstone S, Russo F (2018) The Ryerson audio-visual database of emotional speech and Song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in north American English. PLoS One 13:e0196391. https://doi.org/10.1371/journal.pone.0196391
Loizou PC (1998) COLEA: A MATLAB software tool for speech analysis. https://ecs.utdallas.edu/loizou/speech/colea.htm . Accessed 20 Oct 2021
Lotfian R, Busso C (2019) Building naturalistic emotionally balanced speech Corpus by retrieving emotional speech from existing podcast recordings. IEEE Trans Affect Comput 10(4):471–483. https://doi.org/10.1109/TAFFC.2017.2736999
Luengo I, Navas E, Hernáez I (2010) Feature analysis and evaluation for automatic emotion identification in speech. IEEE Trans Multimed 12(6):490–501. https://doi.org/10.1109/TMM.2010.2051872
Majkowski A, Kołodziej M, Rak RJ, Korczynski R (2016) “Classification of emotions from speech signal,” Signal Processing - Algorithms, Architectures, Arrangements, and Applications Conference Proceedings, SPA, pp. 276–281, https://doi.org/10.1109/SPA.2016.7763627 .
Manjunath R (2013) Dimensionality reduction and classification of color features data using svm and knn. Int J Image Process Vis Commun 1:16–21
Mannepalli K, Sastry PN, Suman M (2016) FDBN: design and development of fractional deep belief networks for speaker emotion recognition. Int J Speech Technol 19(4):779–790. https://doi.org/10.1007/s10772-016-9368-y
Mao KZ (2004) Orthogonal forward selection and backward elimination algorithms for feature subset selection. IEEE Trans Syst Man, Cybern Part B (Cybernetics) 34(1):629–634. https://doi.org/10.1109/TSMCB.2002.804363
Mao X, Chen L (2010) Speech emotion recognition based on parametric filter and fractal dimension. IEICE Trans Inf Syst E93-D(8):2324–2326. https://doi.org/10.1587/transinf.E93.D.2324
Mao Q, Dong M, Huang Z, Zhan Y (2014) Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans Multimed 16(8):2203–2213. https://doi.org/10.1109/TMM.2014.2360798
Marill T, Green D (1963) On the effectiveness of receptors in recognition systems. IEEE Trans Inf Theory 9(1):11–17. https://doi.org/10.1109/TIT.1963.1057810
Martin O, Kotsia I, Macq B, Pitas I (2006) “The eNTERFACE’ 05 audio-visual emotion database,” https://doi.org/10.1109/ICDEW.2006.145 .
Martinez AM, Kak AC (Feb. 2001) PCA versus LDA. IEEE Trans Pattern Anal Mach Intell 23(2):228–233. https://doi.org/10.1109/34.908974
McFee B et al. (2015) “librosa: Audio and Music Signal Analysis in Python,” Proceedings of the 14th Python in Science Conference, no. Scipy, pp. 18–24, https://doi.org/10.25080/majora-7b98e3ed-003 .
Meftah A, Alotaibi Y, Selouani SA (2014) Designing, building, and analyzing an Arabic speech emotional Corpus. In: Ninth International Conference on Language Resources and Evaluation at: Reykjavik, Iceland
Meftah A, Alotaibi Y, Selouani S-A (2016) “Emotional Speech Recognition: A Multilingual Perspective,” 2016 International Conference on Bio-Engineering for Smart Technologies(Biosmart)
Milton A, Tamil Selvi S (2014) Class-specific multiple classifiers scheme to recognize emotions from speech signals. Comput Speech Lang 28(3):727–742. https://doi.org/10.1016/j.csl.2013.08.004
Mohanta A, Sharma U (2016) “Bengali Speech Emotion Recognition,” in 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), pp. 2812–2814
Montenegro CS, Maravillas EA (2015) “Acoustic-prosodic recognition of emotion in speech,” 8th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management, HNICEM, 2016, https://doi.org/10.1109/HNICEM.2015.7393229 .
Morin D (2004) “Beads-on-A-string,” Encyclopedic Dictionary of Genetics, Genomics and Proteomics, pp. 1–22, https://doi.org/10.1002/0471684228.egp01270 .
Mustafa MB, Yusoof MAM, Don ZM, Malekzadeh M (2018) Speech emotion recognition research: an analysis of research focus. Int J Speech Technol 21(1):137–156. https://doi.org/10.1007/s10772-018-9493-x
Nanavare VV, Jagtap SK (2015) Recognition of human emotions from speech processing. Proced Comput Sci 49(1):24–32. https://doi.org/10.1016/j.procs.2015.04.223
Nematollahi AF, Rahiminejad A, Vahidi B (2020) A novel meta-heuristic optimization method based on golden ratio in nature. Soft Comput 24(2):1117–1151. https://doi.org/10.1007/s00500-019-03949-w
Nicholson J, Takahashi K, Nakatsu R (2000) Emotion recognition in speech using neural networks. Neural Comput Applic 9(4):290–296. https://doi.org/10.1007/s005210070006
Nooteboom S (1997) The prosody of speech: Melody and rhythm. Handbook Phon Sci, vol 5
O’Shea K, Nash R (2015) An introduction to convolutional neural networks. ArXiv e-prints. https://doi.org/10.48550/arXiv.1511.08458
Ortony A, Clore GL, Collins A (1988) The cognitive structure of emotions. Cambridge University Press, Cambridge
Book Google Scholar
Özseven T, Düğenci M (2018) SPeech ACoustic (SPAC): A novel tool for speech feature extraction and classification. Appl Acoust 136(February):1–8. https://doi.org/10.1016/j.apacoust.2018.02.009
Palo HK, Sagar S (2018) Comparison of Neural Network Models for Speech Emotion Recognition. Proceed 2nd Int Conf Data Sci Bus Anal ICDSBA 20:127–131. https://doi.org/10.1109/ICDSBA.2018.00030
Palo HK, Mohanty MN, Chandra M (2016) Efficient feature combination techniques for emotional speech classification. Int J Speech Technol 19(1):135–150. https://doi.org/10.1007/s10772-016-9333-9
Pandey SK, Shekhawat HS, Prasanna SRM (2019) “Deep Learning Techniques for Speech Emotion Recognition: A Review,” in 2019 29th international conference RADIOELEKTRONIKA (RADIOELEKTRONIKA), pp. 1–6, https://doi.org/10.1109/RADIOELEK.2019.8733432 .
Partila P, Tovarek J, Voznak M, Rozhon J, Sevcik L, Baran R (2018) “Multi-Classifier Speech Emotion Recognition System,” 2018 26th Telecommunications Forum, TELFOR 2018 - Proceedings, pp. 1–4, https://doi.org/10.1109/TELFOR.2018.8612050 .
Pathak BV, Patil DR, More SD, Mhetre NR (2019) “Comparison between five classification techniques for classifying emotions in human speech,” 2019 International Conference on Intelligent Computing and Control Systems, ICCS 2019, pp. 201–207, https://doi.org/10.1109/ICCS45141.2019.9065620 .
Pedregosa F et al. (2012) “Scikit-learn: Machine Learning in Python,” J Mach Learn Res, vol. 12
Petrushin V (2000) “Emotion in speech: recognition and application to call centers,” Proceedings of Artificial Neural Networks in Engineering
Picard RW (1997) Affective computing. MIT Press. https://direct.mit.edu/books/book/4296/Affective-Computing . Accessed 24 Jun 2021
Pratiwi O, Rahardjo B, Supangkat S (2015) “Attribute Selection Based on Information Gain for Automatic Grouping Student System,” in Communications in Computer and Information Science, vol. 516, pp. 205–211, https://doi.org/10.1007/978-3-662-46742-8_19 .
Prinz J (2004) Which emotions are basic? Oxford University Press. https://doi.org/10.1093/acprof:oso/9780198528975.003.0004
Pudil S, Pavel, N, Jana, Bláha (1991) “Statistical approach to pattern recognition: Theory and practical solution by means of PREDITAS system,” Kybernetika 27, vol. 1, no. 76
Pyrczak F, Oh DM, Pyrczak F, Oh DM (2019) “Introduction to the t test,” https://doi.org/10.4324/9781315179803-28 .
Qayyum CSABA, Arefeen A (2019) “Convolutional Neural Network ( CNN ) Based Speech Recognition,” in 2019IEEE International Conference onSignal Processing, Information, Communication & Systems(SPICSCON, pp. 122–125
Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77(2):257–286. https://doi.org/10.1109/5.18626
Rahman MM, Islam MM, Manik MMH, Islam MR, Al-Rakhami MS (2021) “Machine Learning Approaches for Tackling Novel Coronavirus (COVID-19) Pandemic,” SN Comput Sci, vol. 2, no. 5, https://doi.org/10.1007/s42979-021-00774-7 .
Rajisha TM, Sunija AP, Riyas KS (2016) Performance analysis of Malayalam language speech emotion recognition system using ANN/SVM. Proced Technol 24:1097–1104. https://doi.org/10.1016/j.protcy.2016.05.242
Rajoo R, Aun CC (2016) “Influences of languages in speech emotion recognition: A comparative study using Malay, English and Mandarin languages,” ISCAIE 2016–2016 IEEE Symposium on Computer Applications and Industrial Electronics, pp. 35–39, https://doi.org/10.1109/ISCAIE.2016.7575033 .
Ram CS, Ponnusamy R (2014) “An effective automatic speech emotion recognition for Tamil language using Support Vector Machine,” in 2014 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), pp. 19–23, https://doi.org/10.1109/ICICICT.2014.6781245 .
Rao KS, Koolagudi SG, Vempada RR (2013) Emotion recognition from speech using global and local prosodic features. Int J Speech Technol 16(2):143–160. https://doi.org/10.1007/s10772-012-9172-2
Ren M, Nie W, Liu A, Su Y (2019) Multi-modal correlated network for emotion recognition in speech. Vis Inf 3(3):150–155. https://doi.org/10.1016/j.visinf.2019.10.003
Revathi A, Jeyalakshmi C (2019) Emotions recognition: different sets of features and models. Int J Speech Technol 22(3):473–482. https://doi.org/10.1007/s10772-018-9533-6
Roccetti M, Delnevo G, Casini L, Mirri S (2021) An alternative approach to dimension reduction for pareto distributed data: a case study. J Big Data 8(1):39. https://doi.org/10.1186/s40537-021-00428-8
Rong J, Li G, Chen YPP (2009) Acoustic feature selection for automatic emotion recognition from speech. Inf Process Manag 45(3):315–328. https://doi.org/10.1016/j.ipm.2008.09.003
Roubos H, Setnes M, Abonyi J (2000) Learning fuzzy classification rules from data, vol 150. In: Proceedings Developments in Soft Computing, pp 108–115. https://doi.org/10.1007/978-3-7908-1829-1_13
Ryerson RU (2017) Multimedia research lab. RML Emotion Database. http://shachi.org/resources/4965 . Accessed 30 Oct 2021
Sadeghyan S (2018) A new robust feature selection method using variance-based sensitivity analysis. arXiv. https://doi.org/10.48550/arXiv.1804.05092
Sari H, Cochet PY (1996) “Transform-Domain Signal Processing in Digital Communications,” in Signal Processing in Telecommunications, pp. 374–384
Sarikaya R, Hinton GE, Deoras A (2014) Application of deep belief networks for natural language understanding. IEEE Trans Audio Speech Lang Process 22(4):778–784. https://doi.org/10.1109/TASLP.2014.2303296
Savargiv M, Bastanfard A (2013) “Text material design for fuzzy emotional speech corpus based on persian semantic and structure,” in 2013 International Conference on Fuzzy Theory and Its Applications (iFUZZY), pp. 380–384, https://doi.org/10.1109/iFuzzy.2013.6825469 .
Savargiv M, Bastanfard A (2015) “Persian speech emotion recognition,” in 2015 7th Conference on Information and Knowledge Technology (IKT), pp. 1–5, https://doi.org/10.1109/IKT.2015.7288756 .
Schlosberg H (1954) Three dimensions of emotion. Psychol Rev 61(2):81–88
Schmidhuber J (2015) Deep learning in neural networks: An overview. Neural Netw 61:85–117. https://doi.org/10.1016/j.neunet.2014.09.003
Schubiger M (1958) English intonation:its form and function. M. Niemeyer Verlag, Tübingen
Schuller B, Batliner A, Steidl S, Seppi D (2011) Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge. Speech Comm 53(9–10):1062–1087. https://doi.org/10.1016/j.specom.2011.01.011
Schuller B et al (2013) The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism. In: Proceedings 14th Annual Conference of the International Speech Communication Association. https://doi.org/10.21437/Interspeech.2013-56
Shah Fahad M, Ranjan A, Yadav J, Deepak A (2021) A survey of speech emotion recognition in natural environment. Digit Signal Process A Rev J 110:10295. https://doi.org/10.1016/j.dsp.2020.102951
Shahin I, Nassif AB, Hamsa S (2019) Emotion recognition using hybrid Gaussian mixture model and deep neural network. IEEE Access 7:26777–26787. https://doi.org/10.1109/ACCESS.2019.2901352
Sheikhan M, Bejani M, Gharavian D (2013) Modular neural-SVM scheme for speech emotion recognition using ANOVA feature selection method. Neural Comput & Applic 23(1):215–227. https://doi.org/10.1007/s00521-012-0814-8
Spolaôr N, Cherman EA, Monard MC, Lee HD (2013) “ReliefF for multi-label feature selection,” Proceedings - 2013 Brazilian Conference on Intelligent Systems, BRACIS 2013, pp. 6–11, https://doi.org/10.1109/BRACIS.2013.10 .
Steidl S (2009) Automatic classification of emotion related user states in spontaneous children’s speech. Logos Verlag. http://www5.informatik.uni-erlangen.de . Accessed 03.06.2021
Suganya S, Charles E (2019) “Speech emotion recognition using deep learning on audio recordings,” https://doi.org/10.1109/ICTer48817.2019.9023737 .
Sun Y, Wong AKC, Kamel MS (2009) Classification of imbalanced data: a review. Int J Pattern Recognit Artif Intell 23(4):687–719. https://doi.org/10.1142/S0218001409007326
Sun L, Zou B, Fu S, Chen J, Wang F (2019) “Speech emotion recognition based on DNN-decision tree SVM model,” Speech Commun, https://doi.org/10.1016/j.specom.2019.10.004 .
Swain M, Routray A, Kabisatpathy P, Kundu JN (2017) “Study of prosodic feature extraction for multidialectal Odia speech emotion recognition,” IEEE Region 10 Annual International Conference, Proceedings/TENCON, pp 1644–1649, https://doi.org/10.1109/TENCON.2016.7848296 .
Swain M, Routray A, Kabisatpathy P (2018) Databases, features and classifiers for speech emotion recognition: a review. Int J Speech Technol 21(1):93–120. https://doi.org/10.1007/s10772-018-9491-z
Tacconi D et al. (2008) “Activity and emotion recognition to support early diagnosis of psychiatric diseases,” in 2008 Second International Conference on Pervasive Computing Technologies for Healthcare, pp. 100–102, https://doi.org/10.1109/PCTHEALTH.2008.4571041 .
Taha M, Adeel A, Hussain A (2018) A survey on techniques for enhancing speech. Int J Comput Appl 179(17):1–14. https://doi.org/10.5120/ijca2018916290
Trigeorgis G, Nicolaou MA, Schuller W (2017) End-to-end multimodal emotion recognition using deep neural networks. IEEE J Select Top Signal Process 11(8):1301–1309
Uhrin D, Partila P, Voznak M, Chmelikova Z, Hlozak M, Orcik L (2014) “Design and implementation of Czech database of speech emotions,” 2014 22nd Telecommunications Forum, TELFOR 2014 - Proceedings of Papers, no. November, pp. 529–532, https://doi.org/10.1109/TELFOR.2014.7034463 .
Urbanowicz RJ, Meeker M, La Cava W, Olson RS, Moore JH (2018) Relief-based feature selection: Introduction and review. J Biomed Inf 85:189–203. https://doi.org/10.1016/j.jbi.2018.07.014
Valstar M et al. (2014) “AVEC 2014 - 3D dimensional affect and depression recognition challenge,” AVEC 2014 - Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, Workshop of MM 2014, no. January 2021, pp. 3–10, https://doi.org/10.1145/2661806.2661807 .
Van Der Maaten LJP, Postma EO, Van Den Herik HJ (2009) Dimensionality reduction: a comparative review. J Mach Learn Res 10:1–41. https://doi.org/10.1080/13506280444000102
Van Lierde K, Moerman M, Vermeersch H, Van Cauwenberge P (1996) An introduction to computerised speech lab. Acta Otorhinolaryngol Belg 50(4):309–314
Ververidis D, Kotropoulos C (2006) Emotional speech recognition: resources, features, and methods. Speech Comm 48(9):1162–1181. https://doi.org/10.1016/j.specom.2006.04.003
Ververidis D, Kotropoulos C (2006) “Emotional speech recognition: resources, features, and methods,” Speech Commun, https://doi.org/10.1016/j.specom.2006.04.003 .
Vihari S, Murthy AS, Soni P, Naik DC (2016) Comparison of speech enhancement algorithms. Proced Comput Sci 89:666–676. https://doi.org/10.1016/j.procs.2016.06.032
Vlasenko B, Wendemuth A (2007) Tuning hidden Markov model for speech emotion recognition. DAGA 1:1
Vrebcevic N, Mijic I, Petrinovic D (2019) “Emotion classification based on convolutional neural network using speech data,” 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics, MIPRO 2019 - Proceedings, pp. 1007–1012, https://doi.org/10.23919/MIPRO.2019.8756867 .
Wang X, Paliwal KK (2003) Feature extraction and dimensionality reduction algorithms and their applications in vowel recognition. Pattern Recognit 36(10):2429–2439. https://doi.org/10.1016/S0031-3203(03)00044-X
Wang K, An N, Li BN, Zhang Y, Li L (2015) “Speech emotion recognition using Fourier parameters,” IEEE Trans Affect Comput
Wei H, Shi X, Yang J, Pu Y (2010) “Speech Independent Component Analysis,” in 2010 International Conference on Measuring Technology and Mechatronics Automation, vol. 3, pp. 445–448, https://doi.org/10.1109/ICMTMA.2010.604 .
Whitney AW (1971) A Direct Method of Nonparametric Measurement Selection. IEEE Trans Comput C–20(9):1100–1103. https://doi.org/10.1109/T-C.1971.223410
Williams CE, Stevens KN (1981) “Vocal correlates of emotional states,” in Speech Eval Psychiatry
Wu J (2017) Introduction to convolutional neural networks. Introduct Convolutional Neural Netw, pp 1–31. https://cs.nju.edu.cn/wujx/paper/CNN.pdf . Accessed 14 Nov 2021
Wu G, Li F (2021) A randomized exponential canonical correlation analysis method for data analysis and dimensionality reduction. Appl Numer Math 164:101–124. https://doi.org/10.1016/j.apnum.2020.09.013
Article MathSciNet MATH Google Scholar
Yao Z, Wang Z, Liu W, Liu Y, Pan J (2020) Speech emotion recognition using fusion of three multi-task learning-based classifiers: HSF-DNN, MS-CNN and LLD-RNN. Speech Commun 120(April 2019):11–19. https://doi.org/10.1016/j.specom.2020.03.005
Ye C, Liu J, Chen C, Song M, Bu J (2008) “Speech Emotion Classification on a Riemannian Manifold,” in Advances in Multimedia Information Processing - PCM 2008, pp. 61–69
Yegnanarayana B (2009) Artificial neural networks. PHI Learning Pvt. Ltd
Yildirim S, Narayanan S, Potamianos A (2011) Detecting emotional state of a child in a conversational computer game. Comput Speech Lang 25(1):29–44. https://doi.org/10.1016/j.csl.2009.12.004
Yu C, Aoki P, Woodruff A (2004) Detecting user engagement in everyday conversations. ArXiv. https://doi.org/10.48550/arXiv.cs/0410027
Zang Q, Wang SLK (2013) A database of elderly emotional speech. In: 2013 International symposium on signal processing, biomedical engineering and informatics
Zeynep Inanoglu RC (2005) Emotive alert: hmm-based emotion detection in voicemail messages. https://vismod.media.mit.edu/tech-reports/TR-585.pdf
Zhalehpour S, Onder O, Akhtar Z, Erdem C (2016) BAUM-1: A Spontaneous Audio-Visual Face Database of Affective and Mental States. IEEE Trans Affect Comput PP:1. https://doi.org/10.1109/TAFFC.2016.2553038
Zhang Z, Coutinho E, Deng J, Schuller B (2015) Cooperative learning and its application to emotion recognition from speech. IEEE/ACM Trans Audio Speech Lang Process 23(1):115–126. https://doi.org/10.1109/TASLP.2014.2375558
Zhang S, Zhang S, Huang T, Gao W (2018) Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Trans Multimed 20(6):1576–1590. https://doi.org/10.1109/TMM.2017.2766843
Zhang X, Wu G, Ren F (2018) “Searching Audio-Visual Clips for Dual-mode Chinese Emotional Speech Database,” 2018 1st Asian Conference on Affective Computing and Intelligent Interaction, ACII Asia 2018, pp. 1–6, https://doi.org/10.1109/ACIIAsia.2018.8470387 .
Zhang S, Chen A, Guo W, Cui Y, Zhao X, Liu L (2020) Learning deep binaural representations with deep convolutional neural networks for spontaneous speech emotion recognition. IEEE Access 8:23496–23505. https://doi.org/10.1109/ACCESS.2020.2969032
Zhang H, Huang H, Han H (2021) Attention-based convolution skip bidirectional long short-term memory network for speech emotion recognition. IEEE Access 9:5332–5342. https://doi.org/10.1109/ACCESS.2020.3047395
Zhang C, Liu Y, Fu H (n.d.) “AE 2 -Nets : Autoencoder in Autoencoder Networks,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2577–2585
Zhao J, Mao X, Chen L (2018) Learning deep features to recognise speech emotion using merged deep CNN. IET Signal Process 12(6):713–721. https://doi.org/10.1049/iet-spr.2017.0320
Zhao Z, Bao Z, Zhao Y, Zhang Z, Cummins N, Ren Z, Schuller B (2019) Exploring deep Spectrum representations via attention-based recurrent and convolutional neural networks for speech emotion recognition. IEEE Access 7:97515–97525. https://doi.org/10.1109/ACCESS.2019.2928625
We would like to thank IKG Punjab Technical University, Kapurthala, Punjab (India) for providing the opportunity to carry out the research work.
Authors and affiliations.
IKG Punjab Technical University, Kapurthala, Punjab, India
Department of Computer Science & Engineering, Guru Nanak Dev Engineering College, Ludhiana, Punjab, India
Kamaldeep Kaur & Parminder Singh
You can also search for this author in PubMed Google Scholar
Correspondence to Kamaldeep Kaur .
Conflicts of interests.
The authors declare that they have no conflict of interest.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Reprints and Permissions
About this article
Cite this article.
Kaur, K., Singh, P. Trends in speech emotion recognition: a comprehensive survey. Multimed Tools Appl 82 , 29307–29351 (2023). https://doi.org/10.1007/s11042-023-14656-y
Received : 15 December 2021
Revised : 06 April 2022
Accepted : 03 February 2023
Published : 22 February 2023
Issue Date : August 2023
DOI : https://doi.org/10.1007/s11042-023-14656-y
Share this article
Anyone you share the following link with will be able to read this content:
Sorry, a shareable link is not currently available for this article.
Provided by the Springer Nature SharedIt content-sharing initiative
- Find a journal
- Publish with us
- Artificial Intelligence
- Data Science
- Hardware & Sensors
- Machine Learning
- Defense & Cyber Security
- Healthcare & Sports
- Hospitality & Retail
- Logistics & Industrial
- Office & Household
- Write for Us
What are the different types of robot power sources?
8 essential books on statistics and mathematics for data science, top automotive accessories: a comprehensive guide to general truck items, how do robots make decisions, how can i get a job in robotics, how can i build my own robot, how will robots change the workforce, 15 tips for defending against unauthorized network access in manufacturing, how renting office space in fortitude valley can unlock opportunities, ui/ux design for gaming applications: balancing user experience, aesthetics, and intuitive…, 5 technologies you didn’t know your business can afford (and should…, examine the role of sd-wan in remote work environments, revolutionize your performance review: new employee welcome kit ideas, 35 research papers and projects in speech recognition – download.
For centuries, engineers and scientists were intrigued by the idea of creating a machine that can converse with humans fluently in a natural spoken language. The earlier examples of this fascination can be found in science fiction movies in the 60s and 70s.
Stanley Kubrick’s 1968 movie “2001: A Space Odyssey” and George Lucas’ famous Star Wars saga, for instance, had an intelligent computer named “HAL” and mobile droids like R2D2 and C3PO respectively, that could move around interacting with people and other droids in natural human language.
Among the earliest genuine attempts of speech recognition was a “digit recognizer” called Audrey (1952) Bell Laboratories which could only recognize spoken numbers, Shoebox by IBM (1962) which could understand 16 words in English, and Harpy (1976) by Carnegie Mellon which could comprehend 1011 words.
In a significant breakthrough, the development of the Markov model in the 1980s was able to determine a word from an unknown sound, without relying on speech patterns or fixed templates. It led to the invention of several industrial and business applications. But all speech recognition systems in the 80s had a big flaw: you had to take a break between each spoken word.
The world had to wait until 1997 to get a glimpse of the world’s first continuous speech recognizer – Dragon NaturallySpeaking. Capable of understanding 100 words per minute without any pause, it is still in use today.
In a human to machine interface, a speech signal is transformed into an analog and digital waveform which can be understood by the machine. Speech technologies are vastly used and have unlimited uses. These technologies enable devices to respond correctly and reliably to human voices and provide useful and valuable services.
In this post, we list the top 35 research papers and projects in speech recognition , published recently. Feel free to download. Share your own research papers with us to be added to this list.
- Scaling Up Online Speech Recognition Using Convnets
- Recurrent neural network based speech recognition using MATLAB
- Minute Meeting System Using Speech Recognition
- Improving Children Speech Recognition through Feature Learning from Raw Speech Signal
- Speech Recognition in High Noise Environment
- Boosting Neuro Evolutionary Techniques for Speech Recognition
- Segment-level training of ANNs based on acoustic confidence measures for hybrid HMM/ANN Speech Recognition
- Recurrent Poisson Process Unit for Speech Recognition
- On the Effect of the Implementation of Human Auditory Systems on Q-Log-Based Features for Robustness of Speech Recognition Against Noise
- Speech Recognition using AANN
- Robust feature extraction for visual speech and speaker recognition
- Hindi Speech Vowel Recognition Using Hidden Markov Model
- A Study of Formation and Recognition of Speech in Speech Signal Processing
- Deep Variational Filter Learning Models For Speech Recognition
- The Art of Developing Accurate Speech Recognition for Military Training
- Application of Speech Recognition Technology on the Evaluation of English Pronunciation Teaching
- On Using 2d Sequence-To-Sequence Models For Speech Recognition
- Inter speech 2018 Low Resource Automatic Speech Recognition Challenge for Indian Languages
- Recurrent neural network language model adaptation for conversational speech recognition
- Investigation on LSTM Recurrent N-gram Language Models for Speech Recognition
- Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition
- Semantic Lattice Processing in Contextual Automatic Speech Recognition for Google Assistant
- A Pruned RNNLM Lattice-Rescoring Algorithm for Automatic Speech Recognition
- Data Augmentation Improves Recognition of Foreign Accented Speech
- Zero-shot keyword spotting for visual speech recognition in-the-wild
- Output-Gate Projected Gated Recurrent Unit for Speech Recognition
- The combination of Sparse Principle Component Analysis and Kernel Ridge Regression methods applied to speech recognition problem.
- Domain Adaptation Using Factorized Hidden Layer for Robust Automatic Speech Recognition
- Combined Speaker Clustering and Role Recognition in Conversational Speech
- Temporal Sensitivity Measured Shortly After Cochlear Implantation Predicts 6-Month Speech Recognition Outcome
- Domain-Adversarial Training for Session Independent EMG-based Speech Recognition
- End-to-end speech recognition using lattice-free MMI
- Language Recognition for Telephone and Video Speech : The JHU HLTCOE Submission for NIST LRE17
- Automatic speech recognition system development in the wild ,
- Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning
RELATED ARTICLES MORE FROM AUTHOR
Key technical skills required for ai and ml professionals, top data science and machine learning podcasts to listen to in 2023, impact of machine vision in automated perception and recognition, harnessing the power of machine learning for predictive marketing in robotics industry, top companies in document digitization using machine learning, how ai and machine learning solve crimes.
- Write For Us
- Terms & Conditions
The field of generative AI is rapidly evolving, showing remarkable potential to augment human creativity and self-expression. In 2022, we made the leap from image generation to video generation in the span of a few months. And at this year’s Meta Connect, we announced several new developments , including Emu , our first foundational model for image generation. Technology from Emu underpins many of our generative AI experiences, some AI image editing tools for Instagram that let you take a photo and change its visual style or background, and the Imagine feature within Meta AI that lets you generate photorealistic images directly in messages with that assistant or in group chats across our family of apps. Our work in this exciting field is ongoing, and today, we’re announcing new research into controlled image editing based solely on text instructions and a method for text-to-video generation based on diffusion models.
Emu Video: A simple factorized method for high-quality video generation
Whether or not you’ve personally used an AI image generation tool, you’ve likely seen the results: Visually distinct, often highly stylized and detailed, these images on their own can be quite striking—and the impact increases when you bring them to life by adding movement.
With Emu Video, which leverages our Emu model, we present a simple method for text-to-video generation based on diffusion models. This is a unified architecture for video generation tasks that can respond to a variety of inputs: text only, image only, and both text and image. We’ve split the process into two steps: first, generating images conditioned on a text prompt, and then generating video conditioned on both the text and the generated image. This “factorized” or split approach to video generation lets us train video generation models efficiently. We show that factorized video generation can be implemented via a single diffusion model. We present critical design decisions, like adjusting noise schedules for video diffusion, and multi-stage training that allows us to directly generate higher-resolution videos.
Unlike prior work that requires a deep cascade of models (e.g., five models for Make-A-Video ), our state-of-the-art approach is simple to implement and uses just two diffusion models to generate 512x512 four-second long videos at 16 frames per second. In human evaluations, our video generations are strongly preferred compared to prior work—in fact, this model was preferred over Make-A-Video by 96% of respondents based on quality and by 85% of respondents based on faithfulness to the text prompt. Finally, the same model can “animate” user-provided images based on a text prompt where it once again sets a new state-of-the-art outperforming prior work by a significant margin.
Emu Edit: Precise image editing via recognition and generation tasks
Of course, the use of generative AI is often a process. You try a prompt, the generated image isn’t quite what you had in mind, so you continue tweaking the prompt until you get to a more desired outcome. That’s why prompt engineering has become a thing. And while instructable image generative models have made significant strides in recent years, they still face limitations when it comes to offering precise control. That’s why we’re introducing Emu Edit, a novel approach that aims to streamline various image manipulation tasks and bring enhanced capabilities and precision to image editing.
Emu Edit is capable of free-form editing through instructions, encompassing tasks such as local and global editing, removing and adding a background, color and geometry transformations, detection and segmentation, and more. Current methods often lean towards either over-modifying or under-performing on various editing tasks. We argue that the primary objective shouldn’t just be about producing a “believable” image. Instead, the model should focus on precisely altering only the pixels relevant to the edit request. Unlike many generative AI models today, Emu Edit precisely follows instructions, ensuring that pixels in the input image unrelated to the instructions remain untouched. For instance, when adding the text “Aloha!” to a baseball cap, the cap itself should remain unchanged.
Our key insight is that incorporating computer vision tasks as instructions to image generation models offers unprecedented control in image generation and editing. Through a detailed examination of both local and global editing tasks, we highlight the vast potential of Emu Edit in executing detailed edit instructions.
In order to train the model, we’ve developed a dataset that contains 10 million synthesized samples, each including an input image, a description of the task to be performed, and the targeted output image. We believe it’s the largest dataset of its kind to date. As a result, our model displays unprecedented edit results in terms of both instruction faithfulness and image quality. In our evaluations, Emu Edit demonstrates superior performance over current methods, producing new state-of-the-art results in both qualitative and quantitative evaluations for a range of image editing tasks.
The road ahead
Although this work is purely fundamental research right now, the potential use cases are clearly evident. Imagine generating your own animated stickers or clever GIFs on the fly to send in the group chat rather than having to search for the perfect media for your reply. Or editing your own photos and images, no technical skills required. Or adding some extra oomph to your Instagram posts by animating static photos. Or generating something entirely new.
While certainly no replacement for professional artists and animators, Emu Video, Emu Edit, and new technologies like them could help people express themselves in new ways—from an art director ideating on a new concept or a creator livening up their latest reel to a best friend sharing a unique birthday greeting. And we think that’s something worth celebrating.
Our latest updates delivered to your inbox
Subscribe to our newsletter to keep up with Meta AI news, events, research breakthroughs, and more.
Join us in the pursuit of what’s possible with AI.
Meta © 2023
- International edition
- Australia edition
- Europe edition
Microsoft releases AI tool for photorealistic copying of faces and voices
In response to criticism that Azure AI Speech was simply a ‘deepfakes creator’, Microsoft said it had implemented safeguards
Microsoft announced its latest contribution to the artificial intelligence race at its developer conference this week: software that can generate new avatars and voices or replicate the existing appearance and speech of a user – raising concerns that it could supercharge the creation of deepfakes , AI-made videos of events that didn’t happen.
Announced at Microsoft Ignite 2023, Azure AI Speech is trained with human images and allows users to input a script that can then be “read” aloud by a photorealistic avatar created with artificial intelligence. Users can either choose a preloaded Microsoft avatar or upload footage of a person whose voice and likeness they want to replicate. Microsoft said in a blog post published on Wednesday that the tool could be used to build “conversational agents, virtual assistants, chatbots and more”.
The post reads: “Customers can choose either a prebuilt or a custom neural voice for their avatar. If the same person’s voice and likeness are used for both the custom neural voice and the custom text to speech avatar, the avatar will closely resemble that person.”
The company said the new text-to-speech software is being released with a variety of limits and safeguards to prevent misuse. “As part of Microsoft’s commitment to responsible AI, text to speech avatar is designed with the intention of protecting the rights of individuals and society, fostering transparent human-computer interaction, and counteracting the proliferation of harmful deepfakes and misleading content,” the company said.
“Customers can upload their own video recording of avatar talent, which the feature uses to train a synthetic video of the custom avatar speaking,” the blog post reads. “Avatar talent” is a human posing for the AI’s proverbial camera.
The announcement quickly elicited criticism that Microsoft had launched a “deepfakes creator” – which would more easily allow a person’s likeness to be replicated and made to say and do things the person has not said or done. Microsoft’s own president said in May that deepfakes are his “biggest concern” when it comes to the rise of artificial intelligence.
In a statement, the company pushed back on the criticism, saying the customized avatars are now a “limited access” tool for which customers have to apply and be approved for by Microsoft. Users will also be required to disclose when AI was used to create a synthetic voice or avatar.
“With these safeguards in place, we help limit potential risks and empower customers to infuse advanced voice and speech capabilities into their AI applications in a transparent and safe manner,” Sarah Bird of Microsoft’s responsible AI engineering division said in a statement.
The text-to-speech avatar maker is the latest tool as major tech firms have raced to capitalize on the artificial intelligence boom of recent years. After the runaway popularity of ChatGPT – launched by Microsoft-backed firm OpenAI – companies like Meta and Google have pushed their own artificial intelligence tools to the market.
after newsletter promotion
With AI’s rise has come growing concerns about the capabilities of the technology, with the OpenAI CEO, Sam Altman, warning Congress that it could be used for election interference and safeguards must be implemented.
Deepfakes pose particular danger when it comes to election interference, experts say . Microsoft launched a tool earlier this month to allow politicians and campaigns to authenticate and watermark their videos to verify their legitimacy and prevent the spread of deepfakes. Meta announced a policy this week requiring the disclosure of the use of AI in political ads and banning campaigns from using Meta’s own generative AI tools for ads.
- Artificial intelligence (AI)
- Share full article
X Sues Media Matters Over Research on Ads Next to Antisemitic Posts
X asked a federal court to order the advocacy group to take down its findings, accusing it of “manipulating the algorithms.”
By Kate Conger and Ryan Mac
X, the social media service formerly known as Twitter, sued Media Matters in federal court on Monday after the advocacy organization published research showing that ads on X appeared next to antisemitic content.
A post last week from Elon Musk that endorsed an antisemitic conspiracy theory, which he wrote a day before the Media Matters research was published, kicked off an advertiser exodus, with major brands like IBM, Apple, Warner Bros. Discovery and Sony pausing their spending on the platform.
X has rejected Media Matters’ findings, saying they were not representative of a regular user’s experience on the platform. On Friday, Mr. Musk promised a “thermonuclear lawsuit” against Media Matters and its backers.
The lawsuit, filed in U.S. District Court for the Northern District of Texas, claims that Media Matters tried to damage X’s relationships with advertisers. “Media Matters has manipulated the algorithms governing the user experience on X to bypass safeguards and create images of X’s largest advertisers’ paid posts adjacent to racist, incendiary content, leaving the false impression that these pairings are anything but what they actually are: manufactured, inorganic and extraordinarily rare,” lawyers for X wrote in the complaint.
Angelo Carusone, the president of Media Matters, said in a statement, “This is a frivolous lawsuit meant to bully X’s critics into silence.” He added that his organization “stands behind its reporting and looks forward to winning in court.”
On Monday night, Ken Paxton, the Texas attorney general , also announced that his office would open an investigation into Media Matters for “potential fraudulent activity.”
Brands have been hesitant to advertise on X since Mr. Musk bought the company a year ago and said he would relax its content-moderation policies. X has sought to woo hesitant advertisers back to the platform, an initiative that has been overseen by Linda Yaccarino, a longtime advertising executive who became X’s chief executive in June.
But Mr. Musk’s post on Wednesday , in which he agreed with a post from an X account accusing Jewish communities of pushing “hatred against whites that they claim to want people to stop using against them,” was a setback.
“You have said the actual truth,” Mr. Musk replied to the post.
Jewish groups quickly condemned the statement that Mr. Musk endorsed, likening it to the “ Great Replacement Theory ,” a conspiracy theory that claims minorities are replacing white European populations as part of a coordinated effort by Jewish people. The White House condemned Mr. Musk’s remark, and top-tier brands quickly pulled their advertising off X.
Mr. Musk said in a post to X on Sunday that claims that he was antisemitic were untrue. “Nothing could be further from the truth,” he wrote.
In its lawsuit, X asked the court to order Media Matters to take down its published research. The lawsuit also seeks unspecified monetary damages and lawyers’ fees.
Ms. Yaccarino said in a statement that the X account that Media Matters had used in its research was the only account that saw some of the ads next to the antisemitic posts in question. In the case of Apple, its ad was placed next to an antisemitic post and viewed by another user, she added.
“If you know me, you know I’m committed to truth and fairness,” Ms. Yaccarino said in a post on X . “Data wins over manipulation or allegations. Don’t be manipulated. Stand with X.”
During an all-staff meeting on Monday, Ms. Yaccarino said she had discussed the issue with advertisers and was committed to defending X, according to audio of the meeting heard by The New York Times. In those conversations, some advertisers had asked her to be more communicative about problems to get in front of them, and to share more data about how ads are displayed on the platform, she said.
“We want to work with all of our partners who stand with us and believe in the power and the necessity of free speech,” Ms. Yaccarino said. “Sometimes in life and in business, standing for your values is the truly defining thing that makes leaders, and we’re going to hold on to that and keep pushing forward. No critic is ever going to deter us from our mission to keep fighting and protecting free speech .”
In the roughly 30-minute meeting, Ms. Yaccarino focused the blame on Media Matters and did not address Mr. Musk’s endorsement of the antisemitic post.
She also encouraged employees to be frugal during a period of decreased revenue caused by the advertiser pauses and think of ways the company could bring in more money.
“I would say be as fiscally responsible as possible,” the chief executive told her workers. “And that’s on a spectrum of critical-only and necessary travel to being a good steward of anything that might be related to an expenditure at the company.”
Kate Conger is a technology reporter based in San Francisco. She can be reached at [email protected]. More about Kate Conger
Ryan Mac is a technology reporter focused on corporate accountability across the global tech industry. He won a 2020 George Polk award for his coverage of Facebook and is based in Los Angeles. More about Ryan Mac
The World of Elon Musk
The billionaire’s portfolio includes the world’s most valuable automaker, an innovative rocket company and plenty of drama..
X : More advertisers have paused their spending on the social media service formerly known as Twitter as the backlash continued over Elon Musk’s endorsement of an antisemitic post .
SpaceX: Musk's spaceflight company made major progress toward some of its most ambitious goals with the second test flight of its powerful Starship rocket, but the launch was not a complete success .
Tesla: After a Tesla employee put out a fire, he received a congratulatory email from Musk. Next, he was in the middle of an epic battle with the company .
Starlink: Musk has become the dominant power in satellite internet technology. The ways he is wielding that influence are raising global alarms .
A.I. Efforts: Musk plans to compete with OpenAI, the ChatGPT developer he helped found, and has ramped up his own A.I. activities, even as he calls out the potential harms of the technology .