Analysis of Spectral Features for Speaker Verification
Abstract- This paper presents speech features that serve as complementary information sources for speaker verification. Speaker verification uses a speaker's voice to confirm identity and to control access to services in applications such as voice dialing, telephone banking, phone shopping, database access services, information access services, voice mail, and security management for confidential information areas. In this paper, the performance of MFCC and LPCC features is evaluated under different noisy conditions. Clean speech signals from the NIST 2006 database were contaminated with different types of noise at different signal-to-noise ratios, and the deviations in the MFCC and LPCC features were analysed with two statistical methods, namely the t-test and the correlation coefficient.
Keywords—Speaker verification, Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), t-test, correlation coefficient, spectral analysis, speaker recognition.
Introduction- Biometric recognition systems are increasingly being deployed as a means for the recognition of individuals [1]. One of the most widely used biometric modalities is the human voice. Speaker recognition systems are technologies that recognize a person from his or her speech signal by exploiting speaker-specific characteristics [2]. Speaker verification is the task performed by a machine of verifying a person's claimed identity through his or her voice [3]. Verification from voice is based mainly on the anatomy of the vocal tract, the properties of the articulation control system, and voice source features. The anatomical features of sounds are captured by spectral parameters; articulatory properties control the speech rate, the rate of transient processes, and the duration of speech segments; and the voice source determines the fundamental frequency and the timbre parameters of the speech signal [4]. Speaker verification can be used in multi-user applications such as speaker tracking or speaker diarization, to find the segments of a given speaker in an audio recording, or in automatic segmentation of teleconferences. It is also valuable in assisting transcription of courtroom discussion and in forensic applications. Speaker recognition has much potential as a biometric tool, since many tasks can be performed remotely using speech, for example telephone-based applications such as banking or customer care services [5]. In numerous criminal cases where offenders cannot be identified for lack of fingerprints, voice may be used effectively [6], [7]. Since cepstral features are computationally efficient and give satisfactory performance in noise-free environments, they are commonly used as feature vectors for speaker verification. It has been observed, however, that the performance of speaker verification systems degrades in noisy environments.
Speech Features: The selection of speech features is a crucial decision in developing any speech system, and it depends entirely on the intended application. Different types of speech features represent different types of speech-related information (speaker identity, linguistic content, and so on). Several features have been explored, including spectral features [11], glottal waveforms [12], phase information [13], [14], [15], and prosodic features [16]. Spectral features are used in most tasks because of their robust performance in various environments [17]. They are computed from short frames of around 20-30 ms in duration; within this interval, the speech signal is assumed to remain stationary. Spectral features are the most widely used way to characterize the speech signal. Fourier analysis gives a natural way of analyzing the spectral properties of a signal in the frequency domain. In speech analysis the phase spectrum is usually neglected, since it is generally accepted to have little effect on the perception of speech [18]. The simplest way of analyzing the spectral properties of a signal is with filter banks; this approach to spectral feature extraction is called subband filtering, where the subband outputs are taken directly as the features [19]. The most frequently used spectral features for speaker recognition are mel-frequency cepstral coefficients [20], which are based on mel-scale filter banks. MFCC and LPCC are widely used spectral features for speaker recognition, and they have become among the most important features for estimating common speech parameters such as the pitch period, frame energy, and formants.
Parameterization of Speech- Parameterization is an important step in speech recognition systems; it extracts relevant information, such as phonemes, from the audio signal. In this paper we discuss two speech parameterization methods, Linear Predictive Cepstral Coefficients (LPCC) and Mel-Frequency Cepstral Coefficients (MFCC), for characterizing the speech signal.
Mel Frequency Cepstral Coefficients (MFCC) – MFCC is the most commonly used speech feature; it is popular because it gives an accurate and computationally efficient estimate of the speech parameters and provides a robust model of speech [7]. MFCC feature vectors are usually 39-dimensional, composed of 13 static features and their first and second derivatives. The MFCC procedure is as follows.
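The 39-dimensional vector mentioned above is typically built by appending first-derivative ("delta") and second-derivative ("delta-delta") coefficients to the 13 static features. The paper does not give the derivative formula; the sketch below uses the standard regression formula over +/-N neighbouring frames, which is the common choice:

```python
import numpy as np

def delta(features, N=2):
    """Delta (derivative) coefficients via the standard regression
    formula over +/-N neighbouring frames.
    features: (num_frames, num_coeffs) array of static coefficients."""
    num_frames = features.shape[0]
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    out = np.empty_like(features, dtype=float)
    for t in range(num_frames):
        # centre of frame t inside the padded array is t + N
        out[t] = sum(n * (padded[t + N + n] - padded[t + N - n])
                     for n in range(1, N + 1)) / denom
    return out

# 13 static coefficients per frame -> 39-dimensional vector per frame
static = np.random.randn(100, 13)
d1 = delta(static)                      # delta
d2 = delta(d1)                          # delta-delta
feature_39 = np.hstack([static, d1, d2])
```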
Pre-emphasis: In this step the signal is passed through a first-order filter. Pre-emphasis emphasizes the higher frequencies, increasing the energy of the signal at high frequency [21]:
y(n) = x(n) − 0.95 x(n−1)    (1)
The goal of pre-emphasis is to compensate for the high-frequency part of the spectrum that is suppressed during the human sound production mechanism; it also amplifies the importance of the high-frequency formants [22].
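Equation (1) can be sketched in a couple of lines of numpy (a minimal illustration, with the first sample kept unfiltered):

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    """y[n] = x[n] - alpha * x[n-1]; boosts high-frequency content."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

# a constant (pure DC, lowest-frequency) signal is almost cancelled
x = np.array([1.0, 1.0, 1.0, 1.0])
y = pre_emphasis(x)   # [1.0, 0.05, 0.05, 0.05]
```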
Framing: After pre-emphasis the signal is split into short-time frames of 20-40 ms duration. Usually the frame size (in sample points) is a power of two in order to facilitate the use of the FFT. Overlap is used to preserve continuity between frames: the signal is divided into frames of N samples, with adjacent frames separated by M samples (M < N).
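The framing step above can be sketched as follows; the 25 ms frame length and 10 ms hop at a 16 kHz sampling rate are illustrative values, not taken from the paper:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames of frame_len samples,
    advancing by hop samples (hop < frame_len gives overlap)."""
    num_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len]
                     for i in range(num_frames)])

fs = 16000
x = np.zeros(fs)                                  # 1 s of audio at 16 kHz
frames = frame_signal(x, frame_len=400, hop=160)  # 25 ms frames, 10 ms hop
```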
Hamming windowing: After dividing the signal into frames, a Hamming window function is applied to each frame. The Hamming window is the most popular window used in speech processing:
w(n) = 0.54 − 0.46 cos(2πn/(N−1)),  0 ≤ n ≤ N−1    (2)
Equation (2) defines the N-point Hamming window, where N is the number of samples in a frame. Windowing is used to minimize the discontinuities of the signal at the frame boundaries: it smooths the edges, reduces the edge effect, and preserves the harmonics [23].
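Equation (2) evaluated directly matches numpy's built-in window (a small sketch; N = 400 is an illustrative 25 ms frame at 16 kHz):

```python
import numpy as np

N = 400                                              # samples per frame
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))    # Eq. (2)
frame = np.random.randn(N)
windowed = frame * w             # window applied sample-by-sample
```

Note that the window tapers to 0.08 (not zero) at the edges, which is what distinguishes the Hamming window from the Hann window.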
Fast Fourier Transform: This step converts each frame from the time domain to the frequency domain, where the convolution of the glottal pulse u(n) with the vocal tract impulse response h(n) becomes a multiplication:
Y(ω) = FFT[h(t) * x(t)] = H(ω) · X(ω)    (3)
where X(ω), H(ω) and Y(ω) are the Fourier transforms of x(t), h(t) and y(t), respectively, and * denotes convolution.
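In practice the FFT step computes a power spectrum per windowed frame. A minimal sketch, assuming a 400-sample frame zero-padded to a 512-point FFT (illustrative sizes):

```python
import numpy as np

frame = np.hamming(400) * np.random.randn(400)   # windowed frame
NFFT = 512                     # next power of two above the frame length
spectrum = np.fft.rfft(frame, NFFT)              # complex spectrum
power = (np.abs(spectrum) ** 2) / NFFT           # power spectrum
```

For a real-valued input, `rfft` returns only the NFFT/2 + 1 non-redundant bins.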
Mel Filter Bank Processing: The mel spectrum is computed by passing the Fourier-transformed signal through a set of band-pass filters known as the mel filter bank. The mel scale does not correspond linearly to the physical frequency of a tone, because the human auditory system does not perceive pitch linearly: the scale is approximately linear below 1 kHz and logarithmic above 1 kHz. The mapping from physical frequency to mel can be approximated as
f_mel = 2595 · log10(1 + f/700)    (4)
where f denotes the physical frequency in Hz and f_mel the perceived frequency in mels [24].
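Equation (4), its inverse, and the resulting triangular filter bank can be sketched as below. The filter count (26), FFT size (512) and sampling rate (16 kHz) are illustrative assumptions, not values stated in the paper; the final lines also preview the log-plus-DCT step of Eq. (5):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)        # Eq. (4)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)      # inverse of Eq. (4)

def mel_filter_bank(num_filters, nfft, fs):
    """Triangular filters equally spaced on the mel scale."""
    mel_pts = np.linspace(0.0, hz_to_mel(fs / 2.0), num_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for i in range(1, num_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):                # rising slope
            fbank[i - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):               # falling slope
            fbank[i - 1, k] = (right - k) / (right - centre)
    return fbank

fbank = mel_filter_bank(26, 512, 16000)

# Log mel energies followed by a DCT give the cepstral coefficients:
power = np.abs(np.fft.rfft(np.random.randn(400), 512)) ** 2
mel_energies = fbank @ power
mfcc = np.array([np.sum(np.log10(mel_energies) *
                        np.cos(np.pi * q * (np.arange(26) + 0.5) / 26))
                 for q in range(1, 13)])             # c1..c12
```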
Discrete Cosine Transform: Since the vocal tract is smooth, the energy levels in adjacent bands tend to be correlated. Applying the DCT to the mel-frequency coefficients produces a set of cepstral coefficients; prior to the DCT, the mel spectrum is usually represented on a log scale.
Finally, the MFCCs are calculated as
c(n) = Σ_{m=1}^{M} log10(s(m)) · cos(πn(m − 0.5)/M)    (5)
where c(n) are the cepstral coefficients and s(m) is the output of the m-th mel filter.
Linear Predictive Cepstral Coefficients (LPCC) – LPCC consists of the Linear Prediction Coefficients (LPC) represented in the cepstrum domain. LPCC is obtained directly from the LPC by a recursion, instead of applying the inverse Fourier transform to the logarithm of the spectrum of the original signal. LPCC is therefore less computationally expensive, since no Fourier transform is needed in the early stages to move the signal from the time domain to the frequency domain, and it inherits the advantages of LPC. LPC itself is based on the speech production model: its all-pole model gives a good estimate of the vocal tract spectral envelope when applied to speech signals, which leads to a moderate source-vocal tract separation.
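The LPC-to-LPCC recursion mentioned above can be sketched as follows. Note that this uses one common sign convention; conventions differ depending on how the prediction polynomial is defined, and the paper does not state which it uses. The LPC coefficients in the example are illustrative, not derived from real speech:

```python
import numpy as np

def lpc_to_lpcc(a, num_ceps):
    """Convert LPC coefficients a[0..p-1] (i.e. a_1..a_p) to cepstral
    coefficients with the recursion
        c[n] = a[n] + sum_{k=1}^{n-1} (k/n) * c[k] * a[n-k]
    (one common sign convention)."""
    p = len(a)
    c = np.zeros(num_ceps + 1)            # c[0] (gain term) unused here
    for n in range(1, num_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

a = np.array([0.5, -0.2, 0.1])            # illustrative LPC coefficients
lpcc = lpc_to_lpcc(a, 12)                 # 12 cepstral coefficients
```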
Experimental Setup – The experiments were performed using the NIST 2003 database. Noisy backgrounds were simulated artificially by adding additive noises (airport, babble, car, exhibition, restaurant, street, subway and train) collected from the AURORA database. Each type of noise was added at 0, 5, 10, 15, 20, and 25 dB SNR. The MFCC and LPCC feature extraction methods described above were used.
Results and Discussions – The performance of MFCC and LPCC is compared through the mean values of the cepstral coefficients, the t-test, and the correlation coefficient, in different noisy environments and at various signal-to-noise ratios (SNR). From the experimental results it has been observed that at low SNR levels, from 0 to 10 dB, MFCC performs comparatively better than LPCC (Fig. 1 to Fig. 6), whereas LPCC performs comparatively better than MFCC at higher SNR levels (Fig. 7 to Fig. 12).
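Mixing a noise recording into clean speech at a prescribed SNR, as in the experimental setup, amounts to scaling the noise so the power ratio matches the target. A minimal sketch (random signals stand in for the NIST speech and AURORA noise files):

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture clean + noise has the requested
    signal-to-noise ratio in dB, then return the noisy mixture."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)        # stand-in for a clean utterance
noise = rng.standard_normal(16000)        # stand-in for AURORA noise
noisy = add_noise_at_snr(clean, noise, snr_db=10)
```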
Fig. 1 Mean values of MFCC cepstral coefficients c1-c12 at 0 dB SNR
Fig. 2 Mean values of MFCC cepstral coefficients c1-c12 at 5 dB SNR
Fig. 3 Mean values of MFCC cepstral coefficients c1-c12 at 10 dB SNR
Fig. 4 Mean values of MFCC cepstral coefficients c1-c12 at 15 dB SNR
Fig. 5 Mean values of MFCC cepstral coefficients c1-c12 at 20 dB SNR
Fig. 6 Mean values of MFCC cepstral coefficients c1-c12 at 25 dB SNR
Fig. 7 Mean values of LPCC cepstral coefficients c1-c12 at 0 dB SNR
Fig. 8 Mean values of LPCC cepstral coefficients c1-c12 at 5 dB SNR
Fig. 9 Mean values of LPCC cepstral coefficients c1-c12 at 10 dB SNR
Fig. 10 Mean values of LPCC cepstral coefficients c1-c12 at 15 dB SNR
Fig. 11 Mean values of LPCC cepstral coefficients c1-c12 at 20 dB SNR
Type of Noise    0 dB     5 dB     10 dB    15 dB    20 dB    25 dB
Airport          0.9671   0.9791   0.9791   0.9944   0.9996   1.0000
Babble           0.9683   0.9810   0.9810   0.9961   0.9961   1.0000
Car              0.9678   0.9792   0.9886   0.9946   0.9998   1.0000
Exhibition       0.9695   0.9933   0.9933   0.9976   0.9998   1.0000
Restaurant       0.9663   0.9796   0.9901   0.9960   0.9996   1.0000
Street           0.9600   0.9749   0.9877   0.9948   0.9997   1.0000
Subway           0.9682   0.9835   0.9921   0.9976   0.9995   1.0000
Train            0.9702   0.9822   0.9889   0.9943   0.9997   1.0000
Fig. 12 Mean values of LPCC cepstral coefficients c1-c12 at 25 dB SNR
Table 1. Correlation coefficients of MFCC at different SNR levels (dB)
Table 2. Correlation coefficients of LPCC at different SNR levels (dB)
Type of Noise    0 dB     5 dB     10 dB    15 dB    20 dB    25 dB
Airport          0.1005   0.0757   0.0757   0.1072   0.1462   0.1488
Babble           0.0225   0.0087   0.0087   0.1405   0.1405   0.1488
Car              0.0826   0.0833   0.0166   0.1116   0.1474   0.1488
Restaurant       0.1305   0.1471   0.1507   0.1701   0.1521   0.1487
Exhibition       0.2052   0.1529   0.1529   0.1433   0.1466   0.1487
Street           0.0232   0.0292   0.0382   0.0911   0.1448   0.1488
Subway           0.1192   0.1304   0.1315   0.1270   0.1471   0.1487
Train            0.1347   0.1416   0.0080   0.1103   0.1523   0.1488
Table 1 and Table 2 show the correlation coefficients of the cepstral coefficients c1-c12 obtained with the MFCC and LPCC methods at different SNR levels. The results show that all types of noise listed in the tables are equally correlated at 25 dB signal-to-noise ratio. In Table 1 the pattern of change in the correlation values is not clearly visible up to 10 dB for any of the noise types, whereas in Table 2 a visible correlation trend occurs at all SNR levels.
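The two statistical measures used throughout the results, the correlation coefficient and the t-test on per-coefficient means, are available directly in scipy. A minimal sketch with hypothetical mean vectors for c1-c12 (in the paper these would come from clean and noise-contaminated MFCC/LPCC frames):

```python
import numpy as np
from scipy import stats

# Hypothetical per-coefficient mean values (c1-c12) for clean speech
# and for the same speech contaminated with noise (illustrative data).
rng = np.random.default_rng(1)
clean_means = rng.standard_normal(12)
noisy_means = clean_means + 0.05 * rng.standard_normal(12)

# Correlation coefficient between clean and noisy coefficient means
r, r_pvalue = stats.pearsonr(clean_means, noisy_means)

# Paired t-test on the same two mean vectors
t_stat, p_val = stats.ttest_rel(clean_means, noisy_means)
```

A correlation r near 1 indicates the noise leaves the shape of the coefficient profile largely intact, which is how the 25 dB columns of the tables should be read.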
Conclusions- This paper discussed the MFCC and LPCC techniques used to extract features from the voices of individuals, and showed the significance of statistical analysis of spectral features under different types of noisy conditions at different SNR levels. The aim of the paper was to perform a statistical analysis on 12th-order Mel-Frequency Cepstral Coefficient (MFCC) and Linear Prediction Cepstral Coefficient (LPCC) feature sets at various signal-to-noise ratios. From the outcomes it has been observed that both MFCC and LPCC perform well, but LPCC performs slightly better than MFCC at higher SNR levels, whereas MFCC is better at low SNR.
[1] Joseph N. Pato and Lynette I. Millett, "Biometric Recognition: Challenges and Opportunities," National Academies Press, Washington, 2010.
[2] Boško Božilović, Branislav M. Todorović and Miroslav Obradović, "Text-Independent Speaker Recognition using two-dimensional Information Entropy," Journal of Electrical Engineering, vol. 66(3), pp. 169-173, 2015.
[3] T. R. Jayanthi Kumari and H. S. Jayanna, "Limited Data Speaker Verification: Fusion of Features," International Journal of Electrical and Computer Engineering (IJECE), vol. 7(6), pp. 3344-3357, December 2017.
[4] V. N. Sorokin and A. I. Tsyplikhin, "Speaker Verification Using the Spectral and Time Parameters of Voice Signal," Journal of Communications Technology and Electronics, vol. 55(12), pp. 1561-1574, 2010.
[5] J. P. Campbell Jr., "Speaker Recognition: a tutorial," Proceedings of the IEEE, vol. 85(9), pp. 1437-1462, 1997.
[6] K. K. Ang and A. C. Kot, "Speaker Verification for home security system," in Proceedings of the IEEE International Symposium on Consumer Electronics (ISCE '97), pp. 27-30, Singapore, December 1997.
[7] Piyush Loti and M. R. Khan, "Significance of Complementary Spectral Features for Speaker Recognition," International Journal of Research in Computer and Communication Technology, vol. 2(8), August 2013.
[8] Tomi Kinnunen and Haizhou Li, "An overview of text-independent speaker recognition: from features to supervectors," Speech Communication, July 2009.
[9] Marcos Faundez-Zanuy and Enric Monte-Moreno, "State-of-the-art in Speaker Recognition," IEEE A&E Systems Magazine, May 2005.
[10] Sharada V. Chougule and Mahesh S. Chavan, "Robust Spectral Features for Automatic Speaker Recognition in Mismatch Condition," Procedia Computer Science, vol. 58, pp. 272-279, 2015.
[11] D. Reynolds, T. Quatieri and R. Dunn, "Speaker Verification Using Adapted Gaussian Mixture Models," Digital Signal Processing, vol. 10, pp. 19-41, 2000.
[12] M. Plumpe, T. Quatieri and D. Reynolds, "Modeling of the glottal flow derivative waveform with application to speaker identification," IEEE Transactions on Speech and Audio Processing, vol. 7(5), pp. 569-586, 1999.
[13] L. Wang, S. Nakagawa and S. Ohtsuka, "High improvement of speaker identification and verification by combining MFCC and phase information," Proc. of IEEE ICASSP 2009, pp. 4529-4532, April 2009.
[14] L. Wang, K. Minami, K. Yamamoto and S. Nakagawa, "Speaker identification by combining MFCC and phase information in noisy environments," Proc. of IEEE ICASSP 2010, pp. 4502-4505, March 2010.
[15] S. Nakagawa, L. Wang and S. Ohtsuka, "Speaker identification and verification by combining MFCC and phase information," IEEE Transactions on Audio, Speech and Language Processing, vol. 20(4), pp. 1085-1095, 2012.
[16] A. Adami, "Modeling prosodic differences for speaker recognition," Speech Communication, vol. 49(4), pp. 277-291, 2007.
[17] Zhaofeng Zhang, Jing Deng, Longbiao Wang and Xiong Xiao, "A Spectrum Smoothing Method for Speaker Verification," Proceedings of APSIPA Annual Summit and Conference, pp. 1291-1295, December 2015.
[18] S. Furui, Digital Speech Processing, Synthesis, and Recognition, 2nd ed., Marcel Dekker, New York, 2001.
[19] P. Sivakumaran, A. M. Ariyaeeinia and M. J. Loomes, "Sub-Band Based Text-Dependent Speaker Verification," Speech Communication, vol. 41, pp. 485-509, 2003.
[20] S. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Trans. Acoustics, Speech, Signal Processing, vol. 28(4), pp. 357-366, 1980.
[21] R. M. Sneha and K. L. Hemalatha, "Implementation of MFCC Extraction Architecture and DTW Technique in Speech Recognition System," International Journal of Emerging Trends in Science and Technology, vol. 3(5), pp. 753-757, 2016.
[22] Koustav Chakraborty, Asmita Talele and Savitha Upadhya, "Voice Recognition Using MFCC Algorithm," International Journal of Innovative Research in Advanced Engineering, vol. 1(10), pp. 158-161, 2014.
[23] Surekha Rathod and Sangita Nikumbh, "Security Based on Speech Recognition Using MFCC Method with MATLAB Approach," International Journal of Soft Computing and Artificial Intelligence, vol. 3(2), pp. 105-109, 2015.
[24] J. R. Deller, J. H. Hansen and J. G. Proakis, Discrete Time Processing of Speech Signals, 1st ed., Prentice Hall PTR, Upper Saddle River, 1993.
[25] Elias Nemer, Rafik Goubran and Samy Mahmoud, "Robust Voice Activity Detection Using Higher-Order Statistics in the LPC Residual Domain," IEEE Transactions on Speech and Audio Processing, vol. 9(3), pp. 217-231, 2001.
[26] B. Yegnanarayana, R. Kumara Swamy and K. Sri Rama Murty, "Determining Mixing Parameters From Multispeaker Data Using Speech-Specific Information," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17(6), pp. 1196-1207, 2009.
[27] R. Rajeswara Rao, V. Kamakshi Prasad and A. Nagesh, "Performance Evaluation of Statistical Approaches for Text-Independent Speaker Recognition Using Source Feature," InterJRI Computer Science and Networking, vol. 2(1), 2010.
[28] K. Sri Rama Murty and B. Yegnanarayana, "Combining Evidence From Residual Phase and MFCC Features for Speaker Recognition," IEEE Signal Processing Letters, vol. 13(1), pp. 52-55, 2006.
[29] Hossein Zeinali, Hossein Sameti and Lukáš Burget, "HMM-Based Phrase-Independent i-Vector Extractor for Text-Dependent Speaker Verification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25(7), pp. 1421-1435, 2017.
[30] Danyang Cao, Xue Gao and Lei Gao, "An Improved Endpoint Detection Algorithm Based on MFCC Cosine Value," Wireless Personal Communications, vol. 95(3), Springer US, 2017. https://doi.org/10.1007/s11277-017-3958-0
[31] Seiichi Nakagawa, Longbiao Wang and Shinji Ohtsuka, "Speaker Identification and Verification by Combining MFCC and Phase Information," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20(4), pp. 1085-1095, 2012.
[32] Md Jahangir Alam, Tomi Kinnunen, Patrick Kenny, Pierre Ouellet and Douglas O'Shaughnessy, "Multi-taper MFCC Features for Speaker Verification using i-vectors," ASRU 2011, IEEE, pp. 547-552, 2011.