
Speech Sentiment Analysis Using Acoustic Features




Img Credit: SRI International

  As technology has developed, so have our methods of interacting with it. From physical keys, to virtual Graphical User Interfaces, and now Voice User Interfaces, Human-Computer Interaction has come a long way. Today, we can successfully use our voice to give commands to different software. This is achieved through Natural Language Processing, i.e., the linguistic aspect of voice interaction. In this project we try to understand whether the acoustic approach can be used for overall better voice interaction. Here, we use the sound waves from human voices to analyze the underlying tone of the speaker and classify it according to their emotion. This provides additional information about the user: their emotional state at that moment. This sentiment detection is achieved using machine learning. The project consists of two important parts: the first is the extraction of features from the sound clips, and the second is training a model on the acquired feature matrix. The feature extraction step is carried out to improve model performance by pulling out the important features and using them as inputs to the model, rather than feeding in the raw sound waves directly. A comparison between the individual features and the resulting accuracy is also carried out, along with a comparison of multiple features combined as input to the model instead of a single one.


  Human–computer interaction (HCI) studies the design and use of computer technology, focusing on the interfaces between people (users) and computers. Graphical User Interfaces (GUIs) have been the industry standard for decades. While a GUI relies on virtual keys and buttons, a voice-user interface (VUI) makes spoken human interaction with computers possible, using speech recognition to understand spoken commands and answer questions, typically replying through a text-to-speech module. This field has been driven mainly by the linguistic part of speech: recent advancements in voice interaction have come chiefly from Natural Language Processing, whereas the acoustic part of speech, the underlying sentiment behind the words, has not been explored as much. In this project, we analyze the sentiment in the voice of the speaker using machine learning.


  The challenging aspect of this idea is feeding the data to the model. Since raw sound waves look more or less like a set of random values, shaping them into meaningful features that can serve as input is the core of this project. This feature extraction stage is carried out using the functions in the Librosa Python package, a library focused on audio processing that provides useful algorithms for feature extraction. In this project we use several such features: MFCC (Mel Frequency Cepstral Coefficients), Spectral Contrast, Chroma, Mel Spectrogram Frequency and Tonnetz. The datasets used in this project are the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) and CORPUS JL open-source datasets.
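As a minimal sketch of this stage, the Librosa calls for these five features look roughly like the snippet below. The pooling of each feature over time into one fixed-length vector per clip is an assumption for illustration, not a detail taken from this section:

```python
import numpy as np
import librosa

def extract_features(path, n_mfcc=40):
    """Load one clip and return a fixed-length vector (mean of each feature over time)."""
    y, sr = librosa.load(path, sr=None)      # keep the file's native sample rate
    stft = np.abs(librosa.stft(y))           # magnitude spectrogram for chroma/contrast

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    chroma = librosa.feature.chroma_stft(S=stft, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    contrast = librosa.feature.spectral_contrast(S=stft, sr=sr)
    tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)

    # Pool each (coefficients x frames) matrix over time so every clip maps
    # to a vector of the same length, regardless of clip duration.
    return np.hstack([f.mean(axis=1) for f in (mfcc, chroma, mel, contrast, tonnetz)])
```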


   To read more about the methodology and procedure, please read the report linked at the end.
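The classifier itself is detailed in the report; purely as an illustration (the MLPClassifier, the split ratio, and the layer size below are assumptions, not the reported setup), a feature matrix built with the sketch above can be fed to a scikit-learn model to produce train and test scores like the ones tabulated in the Results section:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# `clips` is a hypothetical list of (wav_path, emotion_label) pairs drawn from the
# RAVDESS / CORPUS JL datasets; extract_features() is the sketch shown above.
X = np.array([extract_features(path) for path, _ in clips])
y = np.array([label for _, label in clips])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = MLPClassifier(hidden_layer_sizes=(300,), max_iter=500)  # placeholder architecture
clf.fit(X_train, y_train)

print("Train score:", clf.score(X_train, y_train))  # accuracy on the training split
print("Test score:", clf.score(X_test, y_test))     # accuracy on the held-out split
```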


Results

  The following tables contain the results obtained from the project, covering the performance of the model and a comparison between the different feature extraction techniques and their combinations (see the short sketch below for one way to read the two "Selection Type" modes).
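The labels "avg" and "minmaxavg" are not defined in this section. One plausible reading, shown purely as an assumption in the sketch below, is that "avg" keeps only the per-coefficient mean over time frames, while "minmaxavg" also stacks the per-coefficient minimum and maximum:

```python
import numpy as np

def aggregate(feature_matrix, mode="avg"):
    """Collapse a (n_coefficients, n_frames) feature matrix into one vector per clip."""
    mean = feature_matrix.mean(axis=1)
    if mode == "avg":
        return mean                                    # per-coefficient mean only
    if mode == "minmaxavg":
        return np.hstack([feature_matrix.min(axis=1),  # per-coefficient minimum
                          feature_matrix.max(axis=1),  # per-coefficient maximum
                          mean])                       # plus the mean
    raise ValueError(f"unknown selection type: {mode}")
```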

Table 1: Individual feature comparison table

| Dataset Used | Feature(s) | Selection Type | Test Score | Train Score | Average Score |
|---|---|---|---|---|---|
| CORPUS JL | chroma | avg | 0.5469 | 0.6901 | 0.6185 |
| CORPUS JL | chroma | minmaxavg | 0.5521 | 0.7201 | 0.6361 |
| CORPUS JL | contrast | avg | 0.6667 | 0.7747 | 0.7207 |
| CORPUS JL | contrast | minmaxavg | 0.6667 | 0.7747 | 0.7207 |
| CORPUS JL | mel | avg | 0.8542 | 0.9362 | 0.8952 |
| CORPUS JL | mel | minmaxavg | 0.8385 | 0.9089 | 0.8737 |
| CORPUS JL | mfcc | avg | 0.8698 | 0.9219 | 0.8958 |
| CORPUS JL | mfcc | minmaxavg | 0.8802 | 0.9232 | 0.9017 |
| CORPUS JL | tonnetz | avg | 0.4323 | 0.6237 | 0.5280 |
| CORPUS JL | tonnetz | minmaxavg | 0.3229 | 0.5951 | 0.4590 |
| RAVDESS | chroma | avg | 0.3556 | 0.6164 | 0.4860 |
| RAVDESS | chroma | minmaxavg | 0.3556 | 0.6462 | 0.5009 |
| RAVDESS | contrast | avg | 0.5259 | 0.6611 | 0.5935 |
| RAVDESS | contrast | minmaxavg | 0.5259 | 0.6611 | 0.5935 |
| RAVDESS | mel | avg | 0.5259 | 0.7430 | 0.6345 |
| RAVDESS | mel | minmaxavg | 0.5556 | 0.6890 | 0.6223 |
| RAVDESS | mfcc | avg | 0.6741 | 0.8343 | 0.7542 |
| RAVDESS | mfcc | minmaxavg | 0.6667 | 0.8287 | 0.7477 |
| RAVDESS | tonnetz | avg | 0.3556 | 0.5885 | 0.4720 |
| RAVDESS | tonnetz | minmaxavg | 0.2741 | 0.5512 | 0.4126 |


Table 2: Comparison of feature extraction technique combinations

| Dataset Used | Feature(s) | Selection Type | Test Score | Train Score | Average Score |
|---|---|---|---|---|---|
| CORPUS JL | contrast, mel | avg | 0.547 | 0.690 | 0.8880 |
| CORPUS JL | contrast, mel | minmaxavg | 0.552 | 0.720 | 0.8770 |
| CORPUS JL | mfcc, contrast | avg | 0.667 | 0.775 | 0.8926 |
| CORPUS JL | mfcc, contrast | minmaxavg | 0.667 | 0.775 | 0.9010 |
| CORPUS JL | mfcc, mel | avg | 0.854 | 0.936 | 0.9010 |
| CORPUS JL | mfcc, mel | minmaxavg | 0.839 | 0.909 | 0.9121 |
| CORPUS JL | mfcc, contrast, mel | avg | 0.870 | 0.922 | 0.8971 |
| CORPUS JL | mfcc, contrast, mel | minmaxavg | 0.880 | 0.923 | 0.9128 |
| RAVDESS | contrast, mel | avg | 0.356 | 0.616 | 0.6817 |
| RAVDESS | contrast, mel | minmaxavg | 0.356 | 0.646 | 0.6316 |
| RAVDESS | mfcc, contrast | avg | 0.526 | 0.661 | 0.7644 |
| RAVDESS | mfcc, contrast | minmaxavg | 0.526 | 0.661 | 0.7356 |
| RAVDESS | mfcc, mel | avg | 0.526 | 0.743 | 0.7644 |
| RAVDESS | mfcc, mel | minmaxavg | 0.556 | 0.689 | 0.7598 |
| RAVDESS | mfcc, contrast, mel | avg | 0.674 | 0.834 | 0.7644 |
| RAVDESS | mfcc, contrast, mel | minmaxavg | 0.667 | 0.829 | 0.7598 |

An application was also built to classify the emotion in real time. PyQt5 was used to build its GUI.
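A minimal sketch of such a front end is given below, assuming the sounddevice package for microphone capture, a hypothetical extract_features_from_signal() helper (an in-memory variant of the earlier extract_features() sketch), and the trained clf from the training sketch; the real application's layout and recording pipeline may well differ:

```python
import sys
import numpy as np
import sounddevice as sd
from PyQt5.QtWidgets import QApplication, QWidget, QPushButton, QLabel, QVBoxLayout

SAMPLE_RATE = 22050  # assumed capture rate
DURATION = 3         # seconds of audio captured per prediction

class EmotionWindow(QWidget):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("Speech Emotion Classifier")
        self.label = QLabel("Press Record and speak")
        button = QPushButton("Record")
        button.clicked.connect(self.record_and_classify)
        layout = QVBoxLayout(self)
        layout.addWidget(self.label)
        layout.addWidget(button)

    def record_and_classify(self):
        # Capture a short mono clip from the default microphone.
        audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
        sd.wait()
        signal = audio.flatten().astype(np.float32)
        # extract_features_from_signal() is a hypothetical variant of the earlier
        # extract_features() that works on an in-memory signal; clf is the trained model.
        features = extract_features_from_signal(signal, SAMPLE_RATE).reshape(1, -1)
        self.label.setText("Detected emotion: " + str(clf.predict(features)[0]))

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = EmotionWindow()
    window.show()
    sys.exit(app.exec_())
```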



Conclusion

  Using just the intensity and variance of the sound waves in human speech, the underlying emotions can be understood to some extent without any linguistic processing. This means that acoustic features can act as an important factor in Speech Emotion Recognition (SER), alongside its big brother, Natural Language Processing (NLP). Even without knowing the actual words, the sound waves can be analyzed to understand the emotion; this result could be developed further by also classifying the words in the speech as positive or negative in sentiment. This combination would yield much better results.

  MFCC still stands out as the best feature extraction technique for speech sentiment analysis on short-duration sound clips. Combining feature extraction techniques, or adding the minimum and maximum values of the features, does not provide a substantial increase in accuracy over a single technique alone.




The entire report can be found here.

The published paper can be found on this link: https://ijisrt.com/feature-extraction-techniques-comparison-for-emotion-recognition-using-acoustic-features