Speaker Identification System, Progress Report 1: Implementation and Applications in Computer Sciences

This report documents a final-year project towards a degree in Computer Science, with emphasis on applications of computer science. It was supervised by Dr. Abhisri Yashwant at Bengal Engineering and Science University.


Table of Contents

1. Introduction
   1.1 Project Objectives
   1.2 Overview of the Report
2. Database Used for the Experiments
   2.1 Literature Survey of Existing Databases
   2.2 Overview of Database Protocol
       2.2.1 Database Characterization
       2.2.2 Designed Tasks and Distribution of Age
   2.3 Comprehensive Information of Speakers in SDSRS
   2.4 Working of Speaker Recognition System
       2.4.1 Results Obtained
   2.5 Experiments Performed on Database for Analysis
       2.5.1 Addition of Noise in Voice Samples
       2.5.2 Pitch Alteration
3. Gaussian Mixture Model
   3.1 Gaussian Mixture Model for Speaker Recognition
       3.1.1 Frames as Classifiers
4. References

List of Figures

Figure 2.1: On-line certificate status protocol (OCSP) publication
Figure 2.2: Screenshot of the poster
Figure 2.3: Components of speaker identification system
Figure 2.4: Computing of mel-cepstrum [1]
Figure 2.5: Comparison of % accuracy between noisy samples
Figure 2.6: Screenshot of the WavLab environment
Figure 3.1: GMMs for speaker recognition [1]

Time Schedule

Task                                                                      Duration
1. Literature survey                                                      -
2. Requirement capturing                                                  -
3. SRS and project plan development                                       37 days
4. Database collection                                                    56 days
5. Research on different approaches of feature extraction techniques      5 days
6. More fundamental research into issues related to speech recognition    58 days
7. Development of the necessary software and hardware tools               40 days
8. Evaluations and prototype development                                  35 days
9. Results dissemination and exploitation                                 38 days
10. Thesis writing                                                        90 days

1. Introduction

As speech interaction with computers becomes more pervasive in activities such as financial services and information retrieval from speech databases, the utility of automatically recognizing a speaker based entirely on vocal characteristics increases. Given a speech sample, speaker recognition is concerned with extracting clues to the identity of the person who was the source of that utterance. Speaker recognition divides into two specific tasks: verification and identification. In speaker verification, the goal is to determine from a voice sample whether a person is who he or she claims to be. In speaker identification, the goal is to determine which one of a group of known voices best matches the input voice sample. In either case the speech can be constrained to a known phrase (text-dependent) or totally unconstrained (text-independent). The focus of this work is on achieving significantly higher speaker identification rates using short utterances from unconstrained conversational speech, with robustness to the degradations produced by sound transmission.

There have been numerous approaches aimed at understanding the underlying processes involved in the perception and production of speech. These approaches involve disciplines as diverse as pattern classification and signal processing, physiology and linguistics. The interdisciplinary nature of the problem is one thing that makes speech recognition such a complex and fascinating problem.

1.1 Project Objectives

The main objectives of my research project are:

- To survey some of the techniques used today for Speaker Recognition (SR): how they can be implemented, on what principles they work, what their advantages and disadvantages are, and whether they can be applied in practice today. These were some of the questions I had to sort out.
- To study different techniques for extracting features from the human speech signal and then modeling them, e.g. Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) for extraction, and Learning Vector Quantization (LVQ) for classification. In the implementation phase I used MFCCs and LPCs as the feature extraction techniques and Vector Quantization (VQ) as the feature modeling method; a sketch of this pipeline follows.
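As a concrete illustration of this MFCC + VQ approach, the MATLAB sketch below enrolls each speaker as a VQ codebook and identifies a test utterance by minimum average distortion. It is an outline, not the project's actual code: it assumes the Voicebox functions melcepst and kmeanlbg are on the path, and the file names in trainFiles are hypothetical placeholders.

    % Hedged sketch of the MFCC + VQ identification pipeline
    % (placeholder file names; assumes Voicebox's melcepst and kmeanlbg).
    trainFiles = {'spk1.wav', 'spk2.wav', 'spk3.wav'};  % one enrollment file per speaker
    nSpeakers  = numel(trainFiles);
    codebooks  = cell(nSpeakers, 1);

    for k = 1:nSpeakers
        [s, fs] = wavread(trainFiles{k});     % read enrollment utterance
        feat = melcepst(s, fs);               % one MFCC vector per frame
        codebooks{k} = kmeanlbg(feat, 16);    % 16-entry VQ codebook (LBG)
    end

    [s, fs] = wavread('unknown.wav');         % utterance to identify
    testFeat = melcepst(s, fs);

    % Score each speaker by the average distance from each test frame to
    % its nearest codebook entry; the smallest distortion wins.
    score = zeros(nSpeakers, 1);
    for k = 1:nSpeakers
        cb = codebooks{k};
        dmin = zeros(size(testFeat, 1), 1);
        for t = 1:size(testFeat, 1)
            diffs   = cb - repmat(testFeat(t, :), size(cb, 1), 1);
            dmin(t) = min(sum(diffs .^ 2, 2));  % squared Euclidean distance
        end
        score(k) = mean(dmin);
    end
    [~, identified] = min(score);             % index of best-matching speaker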
1.2 Overview of the Report

The rest of this report is organized as follows. Chapter 2 gives a detailed description of the database that has been built for evaluating the various algorithms of the speaker recognition system. It also covers the experiments performed on the database to check the performance of the classifiers in a real-time application. Chapter 3 gives a more detailed description of the standard Gaussian Mixture Model for speaker recognition and the situations over which the method is to be evaluated.

2. Database Used for the Experiments

2.2 Overview of Database Protocol

... rapidly changing or highly degraded, acquisition processes are not always under control, incriminated people exhibit a low degree of cooperativeness, etc., inducing a wide range of variability sources on the speech utterances. In this sense, real approaches to speaker identification necessarily imply taking all these variability factors into account.

In order to isolate, analyze and measure the effect of some of the main variability sources found in real commercial and computer-human interaction applications, and their influence on speaker recognition systems, a specific speech database in English called SDSRS has been designed and acquired under controlled conditions. In this report, together with a detailed description of the database, some experimental results are also presented.

2.2.1 Database Characterization

An important secondary outcome of the work on the database survey is a series of questions for characterizing a speaker recognition corpus:

i. name and availability;
ii. speaker material (including questions on the number of speakers, inter-speaker variation, intra-speaker variation, and impostor characterization);
iii. speech contents;
iv. recording equipment;
v. recording environment;
vi. other information.

Statistics: 50 persons (males and females)
Recording equipment: recorder, lab
Format: .wav

Text prompts:

- and that feeling of untouched wilderness continues as we go deep into the mangroves in search of an isolated place called Crocodile Creek
- then judges wander around giving point scores to each bird
- Mr. Wright should write to Ms. Wright right away about his Ford or four door Honda
- One two three four five six seven eight nine ten.
- Gorgeous

2.2.2 Designed Tasks and Distribution of Age

Consequently, delimiting the problem of speech variability, together with analyzing the quantitative results of speaker recognition systems, will lead to an integral and comprehensive approach to commercial and forensic speaker recognition. All speakers uttered the same sentences:

- Gorgeous
- One two three four five six seven eight nine ten.

In order to determine an adequate age distribution of speakers in the database, the sociological implications of the technology should be taken into account, as an equi-distribution of ages may not reflect the real age distribution of users in a specific commercial application. On the other hand, in forensic applications criminals are also unequally distributed in age.

During the recording sessions two different microphones (Somic and A-tech4) were used. All the voice samples were recorded in Lab-215 (B-Block) at a sampling rate of 22050 Hz. Posters publicizing the recording sessions were pasted on different notice boards; moreover, announcements were made in the junior classes regarding the recording date, and requests were made in person to faculty and staff members. A screenshot of the poster is shown in Figure 2.2.

Figure 2.2: Screenshot of the poster
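Note that the recordings are captured at 22050 Hz, whereas the mel-cepstrum front end described in Section 2.4.1 assumes an 8 kHz sampling rate, so the samples presumably have to be rate-converted before feature extraction. A minimal MATLAB sketch under that assumption ('sample.wav' is a placeholder name):

    % Assumed preprocessing step: convert a 22050 Hz recording to the
    % 8 kHz rate used by the MFCC front end (see Table 2.3).
    [y, fsIn] = wavread('sample.wav');    % fsIn is expected to be 22050
    fsOut = 8000;
    y8k = resample(y, fsOut, fsIn);       % polyphase rational-rate resampling
    wavwrite(y8k, fsOut, 'sample_8k.wav');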
2.3 Comprehensive Information of Speakers in SDSRS

Detailed information on all the speakers in the Speech Database for Speaker Recognition System is given in Table 2.1.

ID   Name                       Semester         Recording environment
1    Rahat Ali Shah             2nd              Lab
2    Asim Hammeed               2nd              Lab
3    Zafar Murtaza              2nd              Lab
4    Tariq Habib Afridi         2nd              Lab
5    Muhammad Arslan            2nd              Lab
6    Talah Bin Tariq            2nd              Lab
7    M. Hamza Qamar             2nd              Lab
8    Salman Shahid              2nd              Lab
9    Rao M. Tahir               2nd              Lab
10   Hira Anwar                 2nd              Lab
11   M. Mudasir Feroz           4th              Lab
12   Adnan                      4th              Lab
13   Zubira                     4th              Lab
14   Ijaz Ahmad                 4th              Lab
15   Ahsan Ibrahim              4th              Lab
16   Sidra Malik                4th              Lab
17   M. Naeem                   4th              Lab
18   M. Nauman Sajid            4th              Lab
19   M. Waqas Butt              6th              Lab
20   M. Shoaib                  6th              Lab
21   M. Shoaib Zafar            6th              Lab
22   Adeel Iqbal                6th              Lab
23   M. Usman Akram             6th              Lab
24   Amna Akram Khan            6th              Lab
25   M. Taimoor                 6th              Lab
26   Sumayya Munib              6th              Lab
27   Ali Imran                  6th              Lab
28   Abid Munir                 Staff member     Lab
29   Nazir Ahmad                Staff member     Lab
30   Shahzadi Farah             8th              Lab
31   Rafia Mailk                8th              Lab
32   Emmen Farooq               6th              Lab
33   Abrar Hussain              Staff member     Lab
34   Allah Ditta                Staff member     Lab
35   Sania Tanveer              6th              Lab
36   Dr. Muhammad Arif          Faculty member   Lab
37   Hafiz Zahoor Ahmad Shah    Staff member     Lab
38   Dr. Abdul Jalil            Faculty member   Lab
39   Dr. Mutawarra Hussain      Faculty member   Lab
40   Dr. Anila Usman            Faculty member   Lab
41   Dr. Anila Usman            Faculty member   Lab

Table 2.1: Speakers in the SDSRS database

2.4 Working of Speaker Recognition System

2.4.1 Results Obtained

... classifier. Each speaker has ten recorded voice samples, of which one is used for training and one for testing; the system will later be tested with more training and testing samples. Both the training and testing results obtained are 100% when MFCC and VQ are used as the feature extraction and feature classification techniques, respectively. No misclassification has been reported.

Features   No. of speakers tested   Accuracy (%)
MFCCs      50                       100

Table 2.2: Testing results for text-dependent SR

The Mel-Frequency Cepstral Coefficients have been implemented in MATLAB following the pattern shown in Figure 2.4. To compute the coefficients, my code uses the function melcepst.m from the Voicebox toolbox for MATLAB [5].

Figure 2.4: Computing of mel-cepstrum [1]

The number of coefficients I selected was 13, as the most popular range is 10 to 20. The outputs of the function are the feature vectors (the MFCC coefficients). The number of feature vectors depends on the number of speech samples in a particular speaker's sentence. It may also vary for the same sentence spoken by different speakers because of each speaker's style, and even for a single speaker because of his condition in different sessions, i.e. he may be ill, thirsty, tired, or in some other condition. The parameters of the MFCC computation are shown in Table 2.3.

Parameter                              Value
Sampling frequency                     8 kHz
Window type                            Hamming
Number of coefficients                 19
Number of filters in the filter bank   20
Length of the frame                    256 samples
Frame increment                        100 samples

Table 2.3: Mel-cepstrum parameters

The number of filters in the filter bank was selected to be 20, keeping in mind the coverage of the telephone bandwidth. The length of the frame was selected to contain 256 samples; at a sampling rate of 8 kHz, 256 samples corresponds to a frame length of 32 ms (256/8000 s = 32 ms), over which the speech signal can be assumed stationary. These MFCC feature vectors were then given to a vector quantization classifier; this function needs the MFCC features and the number of codebook entries to build. The codebook size could be 16, 32, 64, 128 or 256, depending on the signal variations; the most suitable codebook size for my data was 16. A complete description of the functioning of the algorithms is given in [6].
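For reference, the Table 2.3 settings map onto melcepst's argument list roughly as follows. This is a hedged sketch, not the project's actual call: the argument order shown (window, number of coefficients, filters, frame length, frame increment) follows the Voicebox documentation, and 'sample_8k.wav' is a placeholder file name.

    % Hedged sketch: calling Voicebox's melcepst with the Table 2.3 values,
    % following the documented order c = melcepst(s, fs, w, nc, p, n, inc).
    [s, fs] = wavread('sample_8k.wav');   % placeholder 8 kHz sample
    nc  = 19;    % number of cepstral coefficients
    p   = 20;    % filters in the mel filter bank (telephone bandwidth)
    n   = 256;   % frame length: 256 samples = 256/8000 s = 32 ms
    inc = 100;   % frame increment: 100 samples = 12.5 ms hop
    c = melcepst(s, fs, 'M', nc, p, n, inc);   % 'M' selects a Hamming window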
2.5 Experiments Performed on Database for Analysis

Different experiments have been performed on the database to analyze the performance of the classifiers in real-time applications.

2.5.1 Addition of Noise in Voice Samples

Gaussian noise has been added to all the voice samples used for training and testing, because it models most of the natural noise that arises from many random sources acting together. The classification accuracy achieved with the noisy samples is given in Table 2.4.

Features   No. of speakers tested   Accuracy (%)
MFCCs      50                       86

Table 2.4: Testing results for text-dependent SR on noisy voice samples

Figure 2.5 compares the percentage accuracy between samples with different noise ratios, with MFCC used as the feature extraction technique and VQ as the feature matching technique:

Noise ratio               0.02   0.03   0.04   0.05   0.06   0.07
Classification accuracy   90%    94%    88%    88%    88%    84%

Figure 2.5: Comparison of % accuracy between noisy samples
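The report does not spell out exactly how the noise level was set; the sketch below assumes the "noise ratio" on the x-axis of Figure 2.5 scales the standard deviation of zero-mean Gaussian noise relative to the clean signal's standard deviation, and uses a placeholder file name.

    % Assumed noise-injection step (the exact procedure is not given in
    % the report): zero-mean Gaussian noise scaled relative to the signal.
    [y, fs] = wavread('sample_8k.wav');             % placeholder file name
    noiseRatio = 0.05;                              % one of the Fig. 2.5 ratios
    yNoisy = y + noiseRatio * std(y) * randn(size(y));
    % yNoisy is then passed through the same MFCC + VQ pipeline as the
    % clean samples to produce the accuracies in Table 2.4 and Figure 2.5.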
2.5.2 Pitch Alteration

To analyze the performance of the classifiers when the pitch of a voice sample is altered, the WavLab software has been used. WavLab is a proprietary application for professional mastering, high-resolution multi-channel audio editing, audio restoration, sample design and radio broadcast work, right through to complete CD/DVD-A production, and is already a standard application for digital audio editing and processing owing to its outstanding flexibility and pristine audio quality.

3. Gaussian Mixture Model

3.1 Gaussian Mixture Model for Speaker Recognition

... A Gaussian mixture density is the weighted sum of M component densities, as shown in Figure 3.1 and given by the equation

p(x \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(x)    (3.1)

where x is a D-dimensional speech feature vector (and X denotes a sequence of such feature vectors from the audio data), b_i(x), i = 1, \dots, M are the component densities and p_i, i = 1, \dots, M are the mixture weights. Each component density is a D-variate Gaussian function of the form

b_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - u_i)' \Sigma_i^{-1} (x - u_i) \right\}    (3.2)

with mean vector u_i and covariance matrix \Sigma_i. The mixture weights satisfy \sum_{i=1}^{M} p_i = 1. For speaker identification, each speaker is represented by a GMM \lambda_i which is completely parameterized by its mixture weights, means and covariance matrices,

\lambda_i = \{ p_i, u_i, \Sigma_i \}, \quad i = 1, \dots, M    (3.3)

There are two principal motivations for using GMMs to model speaker identity. The first is that the components of such a multi-modal density may represent some underlying set of acoustic classes. It is reasonable to assume that the acoustic space corresponding to a speaker's voice can be characterized by a set of acoustic classes, which reflect some general speaker-dependent vocal tract configurations that are useful for characterizing speaker identity. The spectral shape of the i-th acoustic class can in turn be represented by the mean u_i and covariance matrix \Sigma_i. Because all the training and testing speech is unlabeled, the acoustic classes are hidden, in that the class of an observation is unknown. The second motivation is that a linear combination of Gaussian basis functions is capable of modeling a large class of sample distributions: a GMM can form smooth approximations to arbitrarily shaped densities.

Several techniques can be used to estimate the parameters of a GMM \lambda_i describing the distribution of the training feature vectors; by far the most popular and well-established is Maximum Likelihood (ML) estimation. The GMMs are trained separately on each speaker's enrollment data using the Expectation Maximization (EM) algorithm [8]. The update equations that guarantee a monotonic increase in the model's likelihood value are:

Mixture weights:   \bar{p}_i = \frac{1}{T} \sum_{t=1}^{T} p(i \mid x_t, \lambda)

Means:   \bar{u}_i = \frac{\sum_{t=1}^{T} p(i \mid x_t, \lambda) \, x_t}{\sum_{t=1}^{T} p(i \mid x_t, \lambda)}

Variances:   \bar{\sigma}_i^2 = \frac{\sum_{t=1}^{T} p(i \mid x_t, \lambda) \, x_t^2}{\sum_{t=1}^{T} p(i \mid x_t, \lambda)} - \bar{u}_i^2

where \sigma_i^2, x_t and u_i refer to corresponding scalar elements of the (diagonal) covariance \Sigma_i, of the vector x_t and of the mean u_i, respectively. The a posteriori probability of acoustic class i is given by

p(i \mid x_t, \lambda) = \frac{p_i \, b_i(x_t)}{\sum_{k=1}^{M} p_k \, b_k(x_t)}    (3.4)

In speaker identification, given a group of speakers S = \{1, 2, \dots, S\} represented by GMMs \lambda_1, \lambda_2, \dots, \lambda_S, the objective is to find the speaker model with the maximum a posteriori probability for a given test sequence X:

\hat{S} = \arg\max_{1 \le k \le S} \Pr(\lambda_k \mid X) = \arg\max_{1 \le k \le S} \frac{p(X \mid \lambda_k) \Pr(\lambda_k)}{p(X)}    (3.5)

Assuming that all speakers are equally likely, that the observations are independent, and noting that p(X) is the same for all speakers, this simplifies to

\hat{S} = \arg\max_{1 \le k \le S} p(X \mid \lambda_k) = \arg\max_{1 \le k \le S} \prod_{t=1}^{T} p(x_t \mid \lambda_k)    (3.6)

Each GMM outputs a probability for each frame, and these are multiplied across all the frames; the classifier then makes its decision based on the resulting product posterior probabilities.
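The decision rule in Eq. (3.6) can be sketched in MATLAB as follows. This is an illustrative sketch rather than the project's code: it assumes diagonal-covariance GMMs stored in a hypothetical struct array (fields w, mu and sig2 for the weights, means and variances), and it sums log-likelihoods instead of multiplying raw frame probabilities, which is equivalent under the argmax but avoids numerical underflow.

    % Sketch of Eq. (3.6) for diagonal-covariance GMMs. "models" is a
    % hypothetical struct array: models(k).w [M x 1] mixture weights,
    % models(k).mu [M x D] means, models(k).sig2 [M x D] variances.
    function sHat = identifySpeaker(X, models)
    % X: [T x D] matrix of test feature vectors, one frame per row.
    S = numel(models);
    logLik = zeros(S, 1);
    for k = 1:S
        % Sum of per-frame log-likelihoods: equivalent to the product in
        % Eq. (3.6) under the argmax, but numerically safer.
        logLik(k) = sum(log(gmmFrameProb(X, models(k)) + realmin));
    end
    [~, sHat] = max(logLik);
    end

    function p = gmmFrameProb(X, m)
    % p(t) = sum_i w_i * b_i(x_t), Eqs. (3.1) and (3.2) with diagonal
    % covariance matrices.
    [T, D] = size(X);
    p = zeros(T, 1);
    for i = 1:numel(m.w)
        diff = X - repmat(m.mu(i, :), T, 1);
        expo = -0.5 * sum((diff .^ 2) ./ repmat(m.sig2(i, :), T, 1), 2);
        coef = 1 / ((2 * pi)^(D / 2) * sqrt(prod(m.sig2(i, :))));
        p = p + m.w(i) * coef * exp(expo);
    end
    end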
3.1.1 Frames as Classifiers

An alternative view of the speaker recognition problem is that each frame is an independent classifier. Using the GMM parameters, each classifier makes an independent decision as to who the speaker is. In the case of classical GMMs, the outputs of the frames are the probabilities p(x \mid \lambda_i), which are then combined by multiplication. But there are alternative methods for combining the outputs of multiple classifiers. Since we believe that errors are being made because a few outlier frames with small probability values dominate the final result, we are looking for a method that weighs each classifier (frame) equally.

In the proposed method, the decisions of all the classifiers (frames) are combined by voting. In the voting scheme, for each frame j we find the most likely speaker for that frame:

\hat{S}_j = \arg\max_{1 \le k \le S} p(x_j \mid \lambda_k)

The frames together function as an ensemble classifier: each classifier is run and casts a vote as to who the correct speaker is. The votes are then collated, and the speaker with the greatest number of votes becomes the final classification. This is also a good way to prevent a few bad frames from having an unreasonably large effect on the result, as each frame contributes equally to the final decision. Pseudo code for the algorithm is shown below.

    Initialize a counter for each speaker to 0
    For each frame j                                      (LOOP 1)
        For each speaker k                                (LOOP 2)
            Evaluate p(x_j | lambda_k) = sum_i p_i b_i(x_j)
        End For                                           (LOOP 2)
        Find the speaker v with the maximum probability for frame j:
            v = argmax_{1 <= k <= S} p(x_j | lambda_k)
        Increment the counter for speaker v by one
    End For                                               (LOOP 1)

The speaker with the largest counter (i.e. the largest number of votes) is hypothesized to be the correct speaker. MatlabArsenal [9], a MATLAB package of classification algorithms, has been used for the GMMs. The implementation work on the GMMs is in progress.
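A runnable MATLAB rendering of this pseudo code might look as follows (again a sketch, reusing the hypothetical gmmFrameProb helper and models struct array from the previous listing):

    % Frame-voting identification: each frame votes for its most likely
    % speaker, and the speaker with the most votes wins.
    function sHat = identifyByVoting(X, models)
    S = numel(models);
    T = size(X, 1);
    frameProb = zeros(T, S);
    for k = 1:S
        frameProb(:, k) = gmmFrameProb(X, models(k));  % p(x_j | lambda_k)
    end
    [~, votes] = max(frameProb, [], 2);   % per-frame most likely speaker
    counts = histc(votes, 1:S);           % tally one vote per frame
    [~, sHat] = max(counts);              % speaker with the most votes
    end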