Identity Verification - Applications of Computer Sciences - Project Report

This report is a final year project submitted to complete a degree in Computer Science, with an emphasis on applications of computer sciences. It was supervised by Dr. Abhisri Yashwant at Bengal Engineering and Science University. Main topics: parameters, biometrics, system, technologies, comparison, architecture, applications, thesis organization.

He [Allah] grants wisdom to whom He pleases; and he to whom wisdom is granted indeed receives a benefit overflowing, But none will grasp the Message except men of understanding.

Dedicated to My Beloved Parents

Contents

Chapter 2: Voice Based Person Recognition
  2.1 Representation of the Speech
    2.1.1 Speech Perception System
    2.1.2 Speech Production System
  2.2 Speaker Variability
  2.3 Environment Variability
  2.4 Types of Speech Recognition
    2.4.1 Isolated Words
    2.4.2 Connected Words
    2.4.3 Continuous Speech
    2.4.4 Spontaneous Speech
  2.5 Features of Human Speech Signal
  2.6 Characteristics of the Features
  2.7 Features Derived from Speech
  2.8 Frequency Band Analysis
    2.8.1 Formant Frequencies
    2.8.2 Pitch Contours
    2.8.3 Co-Articulations
    2.8.4 Features Derived from Short Term Processing
    2.8.5 Short Time Average Energy and Magnitude
    2.8.6 Short Time Average Zero Crossing Rate
    2.8.7 Short Time Autocorrelation
    2.8.8 Harmonic Features
    2.8.9 Cepstrum Coefficients
Chapter 3: Principles of Speech Recognition
  3.1 Speaker Verification
  3.2 Speaker Identification
    3.2.1 Open-Set Speaker Identification
    3.2.2 Closed-Set Speaker Identification
  3.3 Process of Speaker Identification
    3.3.1 Speaker Enrollment Phase
    3.3.2 Speaker Verification Phase
  3.4 Applications of Speaker Recognition System
  3.5 Literature Survey
Chapter 4: Feature Extraction and Classification Technique
  4.1 Feature Extraction
    4.1.1 Linear Predictive Coding
    4.1.2 Mel-Frequency Cepstrum Coefficients (MFCC)
    4.1.3 Comparison of MFCC and LPC
  4.2 Speaker Modeling (Classification)
  4.3 Approaches for Classification
    4.3.1 Vector Quantization
    4.3.2 Nearest Neighbors
Chapter 5: Experimental Results and Analysis
  5.1 Database Used for Experiments
    5.1.1 Literature Survey of Existing Databases
    5.1.2 Speech Material
    5.1.3 Decision Making and Performance
  5.2 Experimental Results
  5.3 SR System Using LPC and VQ
    5.3.1 Training Phase
    5.3.2 Testing Phase
    5.3.3 Results
    5.3.4 Comments
  5.4 SR System Using MFCC and VQ
    5.4.1 Training Phase
    5.4.2 Data Reduction
    5.4.3 Speaker Modeling Using Vector Quantization
    5.4.4 Testing Phase
    5.4.5 Results
    5.4.6 Comments
  5.5 Experiments Performed on Database for Analysis
    5.5.1 Addition of Noise in the Voice Samples
    5.5.2 Pitch Alteration
  5.6 Comparative Analysis
Chapter 6: Graphical User Interface
  6.1 Enrollment Phase
  6.2 Identification Phase
Chapter 7: Conclusion and Future Work
  7.1 Future Directions
References

Abstract

Identity verification systems are an important part of our everyday life.
A typical example is the Automatic Teller Machine (ATM), which employs a simple identity verification scheme: the user is asked to enter a secret password after inserting the ATM card; if the password matches the one assigned to the card, the user is allowed access to the bank account. This scheme suffers from a major drawback: only the validity of the combination of a certain possession (the ATM card) and certain knowledge (the password) is verified. The ATM card can be lost or stolen, and the password can be compromised. Thus new verification methods have emerged, where the password has either been replaced by, or used in addition to, biometrics such as the person's speech, face image or fingerprints.

Given a speech signal, there are two kinds of information that may be extracted from it. On one hand there is the linguistic information about what is being said, and on the other there is speaker-specific information. Nowadays it is obvious that speakers can be identified from their voices. Therefore in this work, "Speaker Recognition System", the details of speaker identification have been reviewed in order to use some well-known techniques of speaker identification. Starting from the feature extraction module, I have studied various types of feature extraction methods, with particular emphasis on Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coefficients (LPC) and their variant, Linear Predictive Cepstral Coefficients (LPCC). Then some speaker modeling and classification techniques were reviewed, but for implementation purposes I have used Vector Quantization.

Chapter 1
Introduction

Humans and animals have an outstanding ability to discriminate friends from foes. Such characteristics are vital for living things to survive, because identifying a foe as a friend could mean goodwill abused, property or wealth stolen, valuable information lost to unwanted hands, the life of oneself and family members endangered or, on a larger scale, the society or nation threatened. The reverse will equally cost one an extreme loss in social and personal relationships. The ability to recognize people is innate to every individual and is pervasive in our daily life.

Identity verification is very important in the current era of information technology. The revolution in the field of IT has brought the capability for electronic transactions where face-to-face or other forms of personal contact are not essential. The lack of actual contact makes identifying the real user indispensable and difficult as well. Conventional means of identity authentication and verification using tokens such as keys or personal identification numbers (PINs) and passwords can be stolen and are not well suited for such critical purposes. The main concern is to verify the claimed identity without resorting to difficult and cumbersome mechanisms for authentication. Biometrics is seen as one of the best candidates to solve this problem. Now, increased computing power and decreased microchip size have given a thrust for implementing realistic biometric authentication methods.

1.1 Biometrics

Biometrics is the science of techniques for identifying humans uniquely using their intrinsic physical and behavioral traits. Modern advancements in the field of science and technology require applications, like credit card payments, border control and forensics, in which person authentication is of utmost importance and direly needed.
Essentially, biometrics automates the process of authentication using unique characteristics of a person, e.g. DNA, fingerprints, voiceprints, face, signature, etc. As all these characteristics are unique to every individual, they cannot be copied, duplicated or stolen. Thus biometrics offers a more secure and forthcoming means of identity authentication.

1.1.1 Classification of Biometric Traits

Biometric characteristics can be divided into two main classes, as represented in Figure 1.1.

a. Physiological Biometrics
Physiological biometrics are related to the shape of the body. Examples of physical (or physiological) characteristics include fingerprints, eye retinas and irises, facial patterns and hand measurements.

b. Behavioral Biometrics
Behavioral biometrics are related to the behavior of a person, e.g. signature, gait and typing patterns. Strictly speaking, voice should be regarded as a physiological trait because every individual has a unique pitch, but voice recognition is mainly based on the study of the way a person speaks, which is why it is commonly classified as behavioral.

Figure 1.1: Biometric traits

1.1.2 Parameters of Biometric System

The following parameters must be fulfilled by biometric features if they are to be used in a biometric system:

... of this feature are sub-optimal, which results in a low performance of hand geometry based identification systems in terms of recognition accuracy. This problem inhibits its use in one-to-many searches using hand geometry.

iv. Iris
The iris texture in the eye of an individual is permanent throughout life and is unique to everyone. Even the iris patterns of identical twins are different, as they are independent of genetic makeup. It is very difficult to change this pattern surgically. All these properties make the iris one of the most accurate biometric technologies, with a large number of systems in operation. However, because of a somewhat complicated and costly acquisition process, the iris has a lower acceptability than some of the other biometric technologies. It also suffers from poor lighting and reflections, and some imaging systems need the user to remain motionless for a while.

v. Voice
Voiceprint based person identification systems use the difference in the voice patterns of different persons for identification purposes. A large number of commercial products such as Tespar, BHS-1024 etc. make use of the voiceprints of individuals for recognition. Voiceprints possess high acceptability, but the uniqueness of these patterns is questionable, as they are prone to environmental and behavioral changes, which prevents such systems from achieving the accuracy required for high security applications. Moreover, voiceprints can easily be mimicked, and variations in microphones and channel mismatch are further factors which lessen their potential for widespread use.

vi. Other Biometric Technologies
There is ongoing research on other biometric technologies, including signature, retinal patterns, facial thermogram, DNA typing, hand vein, keystroke dynamics, gait, body odour, lip shape and ear shape, for the authentication of a person. But their achievable accuracy currently discourages their use.

1.1.4 Comparison of Biometric Technologies

Table 1.1 presents a comparison of the different biometric techniques in relation to the earlier mentioned requirements of a biometric identifier.
Different biometric identifiers have different applications that vary enormously in nature; therefore the practical usage of a biometric technique at a particular location depends on a large number of factors, with performance and cost being the most highly weighted ones.

Table 1.1: Comparison of different biometric technologies

Biometrics          Universality  Uniqueness  Permanence  Collectability  Performance  Acceptability  Circumvention
Face                High          Low         Medium      High            Low          High           Low
Fingerprints        Medium        High        High        Medium          High         Medium         High
Hand Geometry       Medium        Medium      Medium      High            Medium       Medium         Medium
Keystroke Dynamics  Low           Low         Low         Medium          Low          Medium         Medium
Hand Vein           Medium        Medium      Medium      Medium          Medium       Medium         High
Iris                High          High        High        Medium          High         Low            High
Retina              High          High        Medium      Low             High         Low            High
Signature           Low           Low         Low         High            Low          High           Low
Voice               Medium        Low         Low         Medium          Low          High           Low
Facial Thermogram   High          High        Low         High            Medium       High           High
DNA                 High          High        High        Low             High         Low            Low

1.2 Architecture of Biometric Systems

A biometric system includes the hardware, linked software and interconnecting communications needed to enable the end-to-end biometric process. Technically, a biometric system is a pattern matching system which makes an identification or verification decision by analyzing one or more biometric characteristics of a person. The different logical modules in a biometric system are the acquisition, enrollment and testing modules.

a. Acquisition Module
The first block (sensor) is the interface between the real world and our system; it has to acquire all the data necessary for capturing a biometric feature from a subject, using a sensor technology suited for operation with the particular type of biometric characteristic being used. Examples include fingerprint scanners, signature tablets, cameras, etc.

Figure 1.2: The basic block diagram of a biometric system

b. Enrollment Module
During enrollment, the biometric information of an individual is stored. This block performs all the necessary pre-processing: it has to remove artifacts from the sensor, enhance the input (e.g. removing some noise), apply some kind of normalization, etc. It then extracts the features we need. This step is really important: we have to choose which features to extract and how, and we have to do it with a certain efficiency. After extracting the features, a template (a vector of numbers) is created. A template contains all the extracted features necessary for authentication without any loss of information. During enrollment, the template is simply stored somewhere (it can be on a card or within a database), while during identification the template of the unknown person is compared against the stored ones.

c. Testing Module
During testing, biometric information is captured and compared with the stored templates. A distance between them is estimated using a suitable algorithm (e.g. the Hamming distance). The decision taken by the matcher is sent as output, so that it can be used for any purpose (e.g. it can allow a purchase or entrance to a restricted area). Depending upon its operational needs, a biometric system can either be an identification system or a verification system.
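To illustrate how the enrollment and testing modules fit together, here is a minimal Python sketch of a template store with Hamming-distance matching. The binary feature vectors, the fixed threshold and the in-memory dictionary standing in for the template database are illustrative assumptions, not details taken from this report.

```python
import numpy as np

def enroll(feature_vector: np.ndarray, database: dict, user_id: str) -> None:
    """Enrollment module: store a binary template for a user
    (the dictionary is a hypothetical stand-in for the template database)."""
    database[user_id] = feature_vector.astype(np.uint8)

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of positions where two binary templates disagree."""
    return int(np.sum(a != b))

def verify(feature_vector: np.ndarray, database: dict, claimed_id: str,
           threshold: int = 10) -> bool:
    """Testing module (verification): accept the claim if the distance to
    the stored template is below a pre-set threshold."""
    template = database[claimed_id]
    return hamming_distance(feature_vector.astype(np.uint8), template) <= threshold

def identify(feature_vector: np.ndarray, database: dict) -> str:
    """Testing module (identification): return the enrolled identity whose
    template has the smallest distance to the probe."""
    probe = feature_vector.astype(np.uint8)
    return min(database, key=lambda uid: hamming_distance(probe, database[uid]))
```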
i. To give an overview of some of the techniques used today for Speaker Recognition (SR): how they can be implemented, on what principles they work, what their advantages and disadvantages are, and whether these techniques can be applied today. These were some of the questions I had to sort out.
ii. First, I studied different techniques for extracting features from the human speech signal, e.g. Linear Predictive Coding (LPC), Mel-Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP) and some of their variants.
iii. After the feature extraction module, for classification purposes, I studied Vector Quantization (VQ), Gaussian Mixture Modeling (GMM) and Nearest Neighbor.
iv. In the implementation phase I used MFCC and LPC as feature extraction techniques and VQ as the feature modeling method.
v. Certain experiments were also performed, e.g. addition of noise to the voice samples and pitch alteration of the stored samples of the speakers, to check the robustness and performance of the speaker recognition system.

1.5 Thesis Organization

The thesis is organized as follows. Chapter 2 gives the structural and functional design of a voice based person identification system, after presenting a brief overview of the history and anatomy of the speech perception and production systems. Chapter 3 describes the various processes involved in a speaker identification and verification system; the characteristics of features which can be selected for speaker recognition, and the features developed so far, are also discussed in that chapter. Chapter 4 details feature extraction and classification and presents an in-depth view of the classification schemes implemented during this project. In Chapter 5, the implementation results are given with a special focus on database acquisition. Chapter 6 gives an overview of the graphical user interface made for the speaker recognition system, and Chapter 7 concludes this project with some future recommendations.

Chapter 2
Voice Based Person Recognition

Several different parameters that supplement each other make up speaker individuality, and only a small subset of the available cues is used by a human listener. Based on our own subjective impression, we tend to think that speaker recognition technology is not reliable, but the main advantage of speaker recognition is its naturalness. Speaker recognition deals with the most common way of communication, i.e. speaking, and embedding speaker recognition technology into applications is not invasive from the user's viewpoint. Another strong advantage is its low cost: no special equipment is needed. In order to capture a speech signal, only a microphone is required, as contrasted with fingerprint and retinal scanners, for instance. Signal processing and pattern matching algorithms for speaker recognition are low-cost and memory-efficient, and thus applicable to mobile devices. Last but not least, the performance of automatic speaker recognition is considerably high in the right conditions. In order to understand the speaker recognition process and the difficulties that lie in this technique, one must have knowledge of the human speech production and perception systems, as described below.

2.1 Representation of the Speech

It is difficult to cope with the speech recognition problem without first establishing some way of representing spoken utterances by a group of symbols representing the sounds produced. The pronunciation of the letters we use for writing varies, so they are not adequate for this purpose; for instance, the letter "o" is pronounced differently in the words "pot", "most" and "one". One possible way of representing speech sounds is by using phonemes. Formally, a phoneme is the smallest unit of the sound system of a language and is represented between slashes.
Two sounds sharing the same phoneme are treated equally. The meaning of a word could change if one phoneme is substituted for another in a word. A finite set of phonemes exists in a language. However, when different languages are compared, there are differences; for instance, in English, /l/ and /r/ (as in "lot" and "rot") are two different phonemes, whereas in Japanese they are not. Individual sounds such as the "clicks" or the velar fricatives (introduced later) found in some sub-Saharan African and Arabic languages, respectively, are readily apparent to listeners fluent in languages that do not contain these phonemes. As the total number of phonemes is finite, there is much overlap between the phoneme sets.

Speech sounds can also be distinguished based solely on the way they are produced. In this case, the units are known as phones. Phones are produced in different ways depending on the context, thus there are many more phones than phonemes.

Stress, rhythm and intonation are prosodic features of speech; they contribute, along with the speech organs, to the way an utterance is spoken and subsequently interpreted. In sentences, stress designates the most important words, while in words it specifies the major syllables - for instance, the word "object" could be inferred as either a noun or a verb, depending on whether the stress is placed on the first or second syllable. The timing aspect of utterances communicates rhythm. Languages like English, which have approximately equal time intervals between stresses, are called stress-timed. Intonation, or pitch movement, plays an important role in indicating the meaning of an English sentence. In tonal languages like Mandarin and Vietnamese, the intonation also decides the meaning of individual words.

2.1.1 Speech Perception System

In speech research, a lot of effort has been put into studying the way we as humans recognize and interpret speech, which makes sense since the best and most accurate speech recognition (and language identification, for that matter) system in existence today is the one that most of us possess. This field of study has still to answer many crucial questions, but a lot has been achieved to date. According to research, the two lowest formants are essential to generate the entire set of English vowels, and for good speech intelligibility the three formants lowest in frequency are necessary. More formants give more natural sounds.

Tongue: a flexible articulator, shaped away from the palate for vowels, placed close to or on the palate or other hard surfaces for consonant articulation.
Teeth: another place of articulation, used to brace the tongue for certain consonants.
Lips: can be rounded or spread to affect vowel quality, and closed completely to stop the oral air flow in certain consonants (p, b and m) [2].

The ways in which the speech organs can be used in different applications are discussed from now onwards.

2.2 Speaker Variability

Every individual speaker is different. The speech he or she produces reflects the physical vocal tract size, the length and width of the neck, a range of physical characteristics, age, sex, dialect, health, education, and personal style. As such, one person's speech patterns can be entirely different from those of another person. Even if we exclude these inter-speaker differences, the same speaker is often unable to precisely produce the same utterance.
Thus, the shape of the vocal tract, its movement and the rate of delivery may vary from utterance to utterance, even with a dedicated effort to minimize the variability. For speaker-independent speech recognition, we typically use more than 500 speakers to build a combined model. Such an approach exhibits large performance fluctuations among new speakers because of possible mismatches in the training data between existing speakers and new ones. In particular, speakers with accents have a tangible error-rate increase of 2 to 3 times. To improve the performance of a speaker-independent speech recognizer, a number of constraints can be imposed on its use. For example, we can have a user enrollment that requires the user to speak for about 30 minutes. With the speaker-dependent data and training, we may be able to capture various speaker-dependent acoustic characteristics that can significantly improve the speech recognizer's performance. In practice, speaker-dependent speech recognition offers not only improved accuracy but also improved speed, since decoding can be more efficient with an accurate acoustic and phonetic model. A typical speaker-dependent speech recognition system can reduce the word recognition error by more than 30% compared with a comparable speaker-independent speech recognition system.

The disadvantage of speaker-dependent speech recognition is that it takes time to collect speaker-dependent data, which may be impractical for some applications such as an automatic telephone operator. Many applications have to support walk-in speakers, so speaker-independent speech recognition remains an important feature. When the amount of speaker-dependent data is limited, it is important to make use of both speaker-dependent and speaker-independent data using speaker-adaptive training techniques. Even for speaker-independent speech recognition, speaker-adaptive training based on recognition results can still be used to quickly adapt to each individual speaker during usage.

2.3 Environment Variability

The world we live in is full of sounds of varying loudness from different sources. When we interact with computers, we may have people speaking in the background. Someone may slam the door, or the air conditioning may start humming without notice. If speech recognition is embedded in mobile devices, such as PDAs (personal digital assistants) or cellular phones, the spectrum of noises varies significantly because the owner moves around. These external parameters, such as the characteristics of the environmental noise and the type and placement of the microphone, can greatly affect speech recognition system performance. In addition to the background noises, we have to deal with noises made by speakers, such as lip smacks and non-communication words. Noise may also be present from the input device itself, such as microphone and A/D interference noise. In a similar manner to speaker-independent training, we can build a system by using a large amount of data collected from a number of environments; this is referred to as multi-style training. We can use adaptive techniques to normalize the mismatch across different environment conditions in a manner similar to speaker-adaptive training. Despite the progress being made in the field, environment variability remains one of the most severe challenges facing today's state-of-the-art speech systems [3].
2.4 Types of Speech Recognition

Speech recognition systems can be separated into several different classes according to the types of utterances they are able to recognize. These classes are based on the fact that one of the difficulties of ASR is determining when a speaker starts and finishes an utterance. Most packages can fit into more than one class, depending on which mode they are using. Some classes of SR are described below [4].

2.4.1 Isolated Words
Isolated word recognizers generally require each utterance to have a silent pause on both sides of the sample window. They accept a single utterance at a time. Often, these systems have "Listen/Not-Listen" states, where they require the speaker to wait between utterances. Isolated Utterance might be a better name for this class.

2.4.2 Connected Words
Connected word systems allow separate utterances to be 'run together' with a minimal pause between them.

2.4.3 Continuous Speech
Continuous recognition is the next step. Recognizers with continuous speech capabilities are some of the most difficult to create, because they must use special methods to determine utterance boundaries. Continuous speech recognizers allow users to speak more or less naturally while the computer determines the content. Basically, it is computer dictation.

2.4.4 Spontaneous Speech
The definition of spontaneous speech varies; at a basic level, it is taken as speech that is natural sounding and not rehearsed. Thus a spontaneous speech recognizer must be able to cope with the intricacies of natural speech features.

2.5 Features of Human Speech Signal

The purpose of the feature extraction module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing; this is referred to as the signal-processing front end. The acoustic speech signal contains different kinds of information about the speaker. This includes "high-level" properties such as dialect, context, speaking style, emotional

... few of these formant frequencies can be sampled at an appropriate rate and used for speaker recognition. These features are normally used in combination with other features.

2.8.2 Pitch Contours
The variations of the fundamental frequency (pitch) over the duration of the utterance, if followed, provide a contour, which can be used as a feature for speech recognition. The speech utterance is normalized and the contour is determined. Normalization of the speech utterance is required because accurate time alignment of utterances is crucial; otherwise utterances from the same speaker could be interpreted as utterances from two different speakers. The contour is divided into a set of segments and the measured pitch values are averaged over the whole segment. The vector that contains the average pitch values of all segments is thereafter used as a feature for speaker recognition.

2.8.3 Co-Articulations
Co-articulation is a phenomenon where a feature of a phonemic unit is achieved in the articulators well in advance of the time it is needed for that phonemic unit. Variation of the physical form of the speech organs causes variation in the sounds that they produce. The process of co-articulation, in which the speech organs prepare to produce a new sound while transiting from one sound to another, is characteristic of a speaker.
This is due to the following reasons: the construction and shape of the vocal tract, and the motor abilities of the speaker in producing the sequences of speech. Therefore, for speaker recognition using this feature, the points in the speech signal where co-articulation takes place are analyzed spectrographically.

2.8.4 Features Derived from Short Term Processing
The speech signal is a slowly time-varying signal (called quasi-stationary). When examined over a sufficiently short period of time (5 ~ 100 ms), its characteristics are fairly stationary, i.e. it has quite stable acoustic characteristics. However, over long periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal, where only a portion of the signal is used to extract signal features at one time. It works in the following way: a window of predefined length (usually 20-30 milliseconds) is moved along the signal with an overlap (usually 30-50% of the window length) between adjacent frames. Overlapping is needed to avoid loss of information. Parts of the signal formed in such a way are called frames. In order to prevent an abrupt change at the end points of the frame, it is usually multiplied by a window function. The operation of dividing the signal into short intervals is called windowing, and such segments are called windowed frames (or sometimes just frames). There are several window functions used in the speaker recognition area, but the most popular is the Hamming window. The following features of the short-term processing of speech can be applied: short-term autocorrelation, average magnitude difference function, zero crossing measure, short-term power and energy measures, and short-term Fourier analysis. The short-term processing techniques provide signals of the following form:

Q(n) = \sum_{m=-\infty}^{\infty} T[s(m)] \, w(n - m)    (2.1)

T[s(m)] is a transformation which is applied to the speech signal, and the signal is thereafter weighted by a window w(n). The summation of T[s(m)] convolved with w(n) represents a certain property of the signal averaged over the window duration.

2.8.5 Short Time Average Energy and Magnitude
The output in (2.1) represents the short-time energy or magnitude if the transformation T is the squaring or absolute magnitude operation, respectively. The energy emphasizes high amplitudes, since the signal is squared when calculating Q(n). Such techniques enable the segmentation of speech into smaller phonetic units, e.g. phonemes or syllables. There is a large variation in amplitude between the voiced and the unvoiced segments. Also, the variation between phonemes with different manners of articulation is small. This feature permits speech segmentation based on the energy Q(n) in automatic recognition systems.
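As an illustration of the short-time processing just described, the following sketch frames a signal and evaluates (2.1) with T chosen as the squaring and the absolute-magnitude operations. The frame length, hop size and use of NumPy are illustrative choices, not the configuration used later in this project.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=200):
    """Split the signal into overlapping frames, e.g. 25 ms frames with
    50% overlap at a 16 kHz sampling rate."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames, window=None):
    """Eq. (2.1) with T[s(m)] = s(m)^2, weighted by a Hamming window."""
    if window is None:
        window = np.hamming(frames.shape[1])
    return np.sum((frames ** 2) * window, axis=1)

def short_time_magnitude(frames, window=None):
    """Eq. (2.1) with T[s(m)] = |s(m)|, weighted by a Hamming window."""
    if window is None:
        window = np.hamming(frames.shape[1])
    return np.sum(np.abs(frames) * window, axis=1)

# Example: a 16 kHz signal whose second half is much quieter.
fs = 16000
t = np.arange(fs) / fs
x = np.concatenate([np.sin(2 * np.pi * 200 * t), 0.1 * np.sin(2 * np.pi * 200 * t)])
frames = frame_signal(x)
print(short_time_energy(frames)[:3])
print(short_time_magnitude(frames)[:3])
```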
2.8.6 Short Time Average Zero Crossing Rate
A zero crossing is said to have occurred in a signal when its waveform crosses the time axis, i.e. changes its algebraic sign. For a discrete-time signal with a zero crossing rate (ZCR) measured in zero crossings per sample and a sampling frequency F_s, the dominant frequency F_o is given as

F_o = \frac{ZCR \cdot F_s}{2}    (2.2)

The speech signal contains most of its energy at low frequencies for voiced sounds. For unvoiced sounds, broadband noise excitation occurs at higher frequencies due to the short length of the vocal tract. Therefore a high and a low ZCR relate to unvoiced and voiced speech, respectively.

2.8.7 Short Time Autocorrelation
The autocorrelation function for a discrete-time signal is given as

\phi(k) = \sum_{m=-\infty}^{\infty} s(m) \, y(m + k)    (2.3)

This function measures the similarity of two signals s(n) and y(n) by summing the product of a signal sample and a delayed sample of the other signal. The short-time autocorrelation function is obtained by windowing s(n) and applying the autocorrelation, which results in

R_n(k) = \sum_{m=-\infty}^{\infty} s(m) \, w(n - m) \, s(m + k) \, w(n - m - k)    (2.4)

This short-time autocorrelation function provides information about the harmonic and formant amplitudes of s(n) and also indicates its periodicity. Thus pitch estimation and voiced/unvoiced speech detection can be carried out using this feature.

2.8.8 Harmonic Features
The harmonic decomposition of the high-resolution spectral line estimate of the speech signal results in the harmonic features. The line spectral pairs represent the variations in the glottis and the vocal tract of a speaker, which are transformed into the frequency domain.
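The sketch below renders the zero crossing rate of (2.2) and the short-time autocorrelation of (2.4) in Python. It is an illustrative simplification: the rectangular window placement and the synthetic test tone are assumptions, not part of the system described in this thesis.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Zero crossings per sample: fraction of adjacent sample pairs whose
    signs differ (Section 2.8.6)."""
    signs = np.sign(frame)
    return np.mean(signs[1:] != signs[:-1])

def dominant_frequency(frame, fs):
    """Eq. (2.2): Fo ~= ZCR * Fs / 2, with ZCR in crossings per sample."""
    return zero_crossing_rate(frame) * fs / 2.0

def short_time_autocorrelation(signal, n, k, frame_len=400):
    """Eq. (2.4) with a rectangular window of length frame_len starting at
    sample n: R_n(k) = sum_m s(m) s(m+k) over the windowed region."""
    s = np.asarray(signal, dtype=float)
    r = 0.0
    for m in range(n, n + frame_len - k):
        r += s[m] * s[m + k]
    return r

# Example: a 100 Hz sine sampled at 8 kHz.
fs = 8000
t = np.arange(2 * fs) / fs
x = np.sin(2 * np.pi * 100 * t)
frame = x[:400]
print(dominant_frequency(frame, fs))         # close to 100 Hz
print(short_time_autocorrelation(x, 0, 80))  # peak near one pitch period (80 samples)
```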
Chapter 3
Principles of Speech Recognition

Speaker recognition is among the most widely used biometrics when it comes to our behavioral characteristics. It consists of the following two modules:
i. Speaker Verification
ii. Speaker Identification

3.1 Speaker Verification
The speaker verification task is to verify the claimed identity of a person from his voice. This process involves only a binary decision about the claimed identity.

3.2 Speaker Identification
In speaker identification there is no identity claim, and the system decides who the speaking person is. Speaker identification can be further divided into two branches:
i. Open-set speaker identification
ii. Closed-set speaker identification

3.2.1 Open-Set Speaker Identification
It decides which of the registered speakers an unknown speech sample belongs to, or concludes that the speech sample comes from an unknown (non-registered) speaker.

3.2.2 Closed-Set Speaker Identification
It is the process of deciding which of the registered speakers is most likely the author of the unknown speech sample.

Depending on the algorithm used for the identification, the task can also be divided into:
i. Text-dependent speaker identification
ii. Text-independent speaker identification
The difference is that in the first case the system knows the text spoken by the person, while in the second case the system must be able to recognize the speaker from any text.

Applications such as access control (physical access to a room or access to computer memory) very often require confirmation of a person's identity. The person submits his claimed identity (in terms of a code number) and has to prove that he really is who he claims to be. Such proof is made by means of a biometric measurement: fingerprints, facial image, retina image or speech. In such applications the speaker is cooperative and is willing to utter a predetermined code word. Speaker recognition is thus performed on an a priori known text. In military, forensic and some commercial applications the speaker is non-cooperative and will not agree to utter a desired utterance when requested. The speaker has to be identified from his speech signal when the text is a priori unknown. The speaker recognition taxonomy is represented in Figure 3.1.

Figure 3.1: Speaker recognition taxonomy

3.3 Process of Speaker Identification
The process of speaker identification can be divided into two main phases:
i. Speaker Enrollment Phase
ii. Speaker Verification Phase

3.3.1 Speaker Enrollment Phase
During the first phase, i.e. the speaker enrollment phase, speech samples are collected from the speakers and used to train their models. The collection of enrolled models is also called a speaker database. This process is represented in Figure 3.2.

Figure 3.2: Speaker enrollment phase

3.3.2 Speaker Verification Phase
In the second phase, the speaker verification phase, a test sample from an unknown speaker is compared against the speaker database.

Figure 3.3: Speaker verification phase

Both phases include the same first step, feature extraction, which is used to extract speaker-dependent characteristics from speech. The main purpose of this step is to perform data reduction of the test data while retaining speaker-discriminative information. These features are modeled and stored in the speaker database in the enrollment phase. In the verification phase, the extracted features are compared against the models stored in the

... Roberto A. B. Sória and Euvaldo F. Cabral Jr. [8] is related to the fact that classical approaches for speaker recognition yield satisfactory results but at the expense of long training and test utterances. They introduced a novel method of speaker recognition using statistical features based on the cross correlation of MFCCs, called Mel-Frequency Cepstral Coefficients Correlations (MFC3). An advantage of the MFC3 approach is that there is no need for long utterances; segments of only 100 ms to 150 ms duration are sufficient. Final results reached an identification rate of 100% for a set of ten speakers and a verification rate of 99%, using the MLP as classifier.

Usually, speaker recognition systems do not take into account the dependence between the vocal source and the vocal tract. A model of joint probability functions of the pitch and the feature vectors is proposed by Hassan Ezzaidi, Jean Rouat and Douglas O'Shaughnessy [9]. They have used two pattern recognizers: LVQ-SLP and GMM. In all cases, they observe an increase of the identification rates, more specifically when using a time duration of 500 ms (6% higher). Most methods for speaker identification are based on parameter estimation, so Bo Li, WenJu Liu and QiuHai Zhong [10] have tried to put forward a non-parametric method for speaker identification based on the Fisher differentiation vector. The experiments show that it is an effective method for text-dependent speaker identification.

There is also a big need for the reduction of data vectors coming from the feature extraction module. For this purpose several works have been carried out in this field using Principal and Independent Component Analysis techniques. Arkadiusz Nagórski and Lou Boves [11] have presented a method to select a limited set of maximally information-rich speech data from a database for optimal training and diagnostic testing of ASR systems. The method uses Principal Component Analysis (PCA) to map the variance of the speech material in a database into a low-dimensional space, followed by clustering and a selection technique. Another method that improves the accuracy of text-dependent speaker identification systems, by R. M. Nickel et al. [12], exploits a set of novel speech features derived from a Principal Component Analysis (PCA) of voiced speech segments. The new PC features are only weakly correlated with the corresponding cepstral features.
A distance measure that combines both cepstral and PC pitch features provides a discriminative power that cannot be achieved with cepstral features alone. It is well known that the discriminative power of cepstral features declines if the dimensionality of the feature space is increased beyond its optimal value. By augmenting the feature space of a cepstral baseline system with PC pitch features, they reduced the equal error probability of incorrect customer rejection versus incorrect impostor acceptance by 12.5% beyond the discriminative limit of the cepstral analysis. Peilv Ding et al. [13] have proposed a new feature vector, the Mel Frequency Principal Coefficient (MFPC), applied to speaker recognition. It is derived by performing Principal Component Analysis on the mel scale spectrum vector. Compared with conventional MFCC, MFPC efficiently exploits the correlation information among different frequency channels. These correlations, which are mainly caused by the vocal tract resonance, have been found to vary consistently from one speaker to another. They select these feature coefficients according to their Fisher ratio, which guarantees the largest discriminability between classes for the given dimensionality. The experimental results demonstrate that their proposed feature vector has the characteristics of compactness, large discriminability and low redundancy.

Another speaker recognition method, presented by Justinian Rosca and Andri Kofmehl [14], is based on short-time spectra; however, the feature extraction process does not correspond to the MFCC process. The motivation was to avoid what they see as shortcomings of present approaches, particularly the blurring effect in the frequency domain, which confuses rather than helps in distinguishing speakers. They introduced a speech synthesis model that can be identified using Independent Component Analysis (ICA). The ICA representations of log spectral data result in cepstral-like, independent coefficients, which capture correlations among frequency bands specific to the given speaker. It also results in speaker-specific basis functions. The resulting speaker recognition method is text-independent, invariant over time, and robust to channel variability.

After the feature extraction module there is a need to work on speaker modeling. For that purpose, techniques based on conventional methods and on artificial neural networks are briefly overviewed here. Petre G. Pop and Eugen Lupu [15] have presented an approach to the speaker verification task using single-section Vector Quantization. As parameters they use LPC-derived cepstrum and MFCC. The results obtained in their experiments showed that the VQ method can be used for text-dependent speaker verification. A. D. Constantinou et al. [16] have also proposed a new class of VQ codebook design algorithms: they introduced the notion of an adjacency map (AM), which provides a heuristic template for improved codebook design by reducing the search space required for exhaustive optimization, while providing solutions close to the global optimum, independent of the initial codewords or the target codebook size. Weighted distance measures and discriminative training are two different approaches to enhance VQ-based solutions for speaker identification. To account for the varying importance of the LPC coefficients in SV, the so-called Partition Normalized Distance Measure (PNDM) successfully used normalized feature components.
Ningping Fan and Justinian Rosca [17] have introduced an alternative, called heuristic weighted distance, to lift up higher-order MFCC feature vector components using a linear formula. Experiments using the TIMIT corpus suggest that the new combined approach is superior to current VQ-based solutions (50% error reduction). It also outperforms the Gaussian Mixture Model using the wavelet features tested in a similar setting.

Methods of combining multiple classifiers with different features are viewed as a general problem in various application areas of pattern recognition. Ke Chen et al. [18] have made a systematic investigation of this and classified the possible solutions into three frameworks, i.e. linear opinion pools, winner-take-all and evidential reasoning. The simulations show that the results are better not only than the individual classifiers' but also than those obtained by combining multiple classifiers with the same feature. Mehmet Tunçkanat et al. [19] have presented an approach based on neural networks for SR. The experimental results for the text-dependent and text-independent recognition cases were achieved with 94% and 88% accuracy, respectively. Todor Ganchev et al. [20] have studied the applicability of Probabilistic Neural Networks (PNNs) as core classifiers for medium scale speaker recognition over fixed telephone networks. They have presented two PNN-based open-set text-independent systems, for speaker identification and speaker verification respectively. The application of recurrent neural nets in an open-set text-dependent speaker identification task is addressed by Shahla Parveen and Phil Green [21]. Their motivation for applying recurrent neural nets to this domain was that their ability to take short-term spectral features and yet respond to long-term temporal events is advantageous for speaker identification.

E = \sum_{n} \left( s[n] - \sum_{k=1}^{p} a[k] \, s[n-k] \right)^2    (4.4)

When the prediction residual e[n] is small, the predictor (4.3) approximates s[n] well. The total squared prediction error E in (4.4) is minimized by setting the partial derivatives of E with respect to the model parameters {a[k]} to zero:

\frac{\partial E}{\partial a[k]} = 0, \quad k = 1, \dots, p    (4.5)

The problem of finding the optimal predictor coefficients results in solving the so-called (Yule-Walker) autoregression (AR) equations [22], obtained by evaluating (4.5) for k = 1, …, p. Depending on the choice of the error minimization interval in (4.4), there are two methods for solving the AR equations: the covariance method and the autocorrelation method [23]. According to [22], for unvoiced speech the two methods do not differ greatly, but for voiced speech the covariance method can be more accurate. However, according to [23], the autocorrelation method is the preferred method since it is computationally more efficient and always guarantees a stable filter. The AR equations for the autocorrelation method are of the following form:

R a = r    (4.6)

where R is a special type of matrix called a Toeplitz matrix, a is the vector of the LPC coefficients and r is the autocorrelation vector. Both the matrix R and the vector r are entirely defined by p autocorrelation samples. The autocorrelation sequence of s[n] is defined as [24]

R[k] = \sum_{n=0}^{N-1-k} s[n] \, s[n+k]    (4.7)

There exists an efficient algorithm, known as the Levinson-Durbin recursion [23], which exploits the redundancy in the AR equations to find the solution.
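The following sketch illustrates the autocorrelation method just outlined: it computes the autocorrelation sequence of (4.7) and then solves the Toeplitz system (4.6) with a plain Levinson-Durbin recursion. It is a minimal NumPy rendering for illustration; the model order, the synthetic frame and the small added noise term are arbitrary choices, not those used in this project.

```python
import numpy as np

def autocorrelation(s, p):
    """Eq. (4.7): R[k] = sum_n s[n] s[n+k] for k = 0..p."""
    s = np.asarray(s, dtype=float)
    return np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(p + 1)])

def levinson_durbin(R, p):
    """Solve the Toeplitz system R a = r of eq. (4.6) in O(p^2) operations.
    Returns the predictor coefficients a[1..p] and the reflection coefficients."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    E = R[0]                      # prediction error of the order-0 predictor
    k = np.zeros(p)               # reflection coefficients, |k[i]| <= 1
    for i in range(1, p + 1):
        acc = R[i] - np.dot(a[1:i], R[i - 1:0:-1])
        k[i - 1] = acc / E
        a_new = a.copy()
        a_new[i] = k[i - 1]
        a_new[1:i] = a[1:i] - k[i - 1] * a[i - 1:0:-1]
        a, E = a_new, E * (1.0 - k[i - 1] ** 2)
    return a[1:], k

# Example: estimate an order-10 LPC model for one 30 ms frame at 8 kHz.
fs = 8000
t = np.arange(240) / fs
rng = np.random.default_rng(0)
# A small noise term keeps the autocorrelation matrix well conditioned.
frame = (np.sin(2 * np.pi * 500 * t) + 0.3 * np.sin(2 * np.pi * 1500 * t)
         + 0.01 * rng.normal(size=t.size))
R = autocorrelation(frame * np.hamming(len(frame)), 10)
lpc, refl = levinson_durbin(R, 10)
print(lpc)
```

The reflection coefficients returned alongside the predictor coefficients are the |k[i]| <= 1 values interpreted through the lossless tube model of the vocal tract, as discussed next.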
The Levinson-Durbin procedure takes the autocorrelation sequence as its input and produces the coefficients a[k], k = 1, …, p. The time complexity of the procedure is O(p^2), as opposed to the standard Gaussian elimination method [25], whose complexity is O(p^3). The steps in computing the predictor coefficients using the autocorrelation method are summarized in Figure 4.1. The Levinson-Durbin procedure produces predictors of order 1, 2, …, p-1 as a side-product. Another side-product of the procedure are intermediate variables called reflection coefficients k[i], i = 1, …, p, which are bounded by |k[i]| <= 1. These are interpreted as the reflection coefficients between the tubes in the lossless tube model of the vocal tract [24].

Figure 4.1: LPC coefficient computation using the autocorrelation method.

Makhoul [25] has shown that if the original spectrum has a wide dynamic range, the LP model becomes numerically unstable. This justifies the use of a pre-emphasis filter prior to LP analysis: the spectrum of the signal is whitened and the dynamic range is reduced. An adaptive formula for the pre-emphasis coefficient can be used with LPC analysis [25]:

\alpha = \frac{R[1]}{R[0]}    (4.8)

where R[i] is the autocorrelation sequence as defined in (4.7). The criterion (4.8) represents a simple voicing degree detector [24], which emphasizes the voiced segments more. An example is shown in Figure 4.2.

Figure 4.2: Example of a voicing degree detector [25].

Any signal can be approximated with the LP model with an arbitrarily small prediction error [25]. The optimal model order depends on what kind of information one wants to extract from the spectrum. More insight into this can be gained by considering the frequency-domain interpretation of LP. Makhoul [25] has proved that the minimization in (4.4) is equivalent to minimizing the squared error between the signal magnitude spectrum and the model magnitude response. Thus it can be said that the LP model transfer function is a least squares approximation of the original magnitude spectrum.

4.1.2 Mel-Frequency Cepstrum Coefficients (MFCC)

The speech signal is called quasi-stationary as it varies gradually with time. An example of a speech signal is shown in Figure 4.3. Its features are quite stationary when examined over an adequately short period of time (between 5 and 100 msec). However, over long periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal [26]. There are various techniques and methods for parametrically representing the speech signal for the identification and verification tasks, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech.
This approach is based on the psychophysical studies of human perception of the frequency content of sounds [31]. Mel cepstrum can be calculated using filter bank, which has one filter for each and every desired component of mel-frequency. Every filter in this bank has triangular band pass frequency response. Around the center frequency with increasing bandwidths, average spectrum is computed by each filter, as displayed in Figure 4.5 [26]. This filter bank is applied in frequency domain and therefore, it simply amounts to taking these triangular filters on the spectrum. In practice the last step of taking inverse DFT is replaced by taking discrete cosine transform (DCT) for computational efficiency. Figure 0.5: Triangular filters used to compute mel-cepstrum [26]. The number of resulting mel-frequency cepstrum coefficients is taken comparatively low, in the order of 12 to 20 coefficients. The zeroth coefficient is usually dropped out because it represents the average log energy of the frame and carries only a little speaker specific information [29]. A block diagram of the structure of an MFCC processor is given in Figure 4.4 [26]. The speech input is typically recorded at a sampling docsity.com 41 rate above 10000 Hz. This sampling frequency was chosen to reduce the effects of aliasing in the analog-to-digital conversion. These sampled signals can acquire all the frequencies with no aliasing in analog-to-digital conversion up to 5 kHz, which cover most energy of sounds that are generated by humans. The main use of the MFCC processor is to imitate the function of the human ears. Next step is frame blocking, in which the continuous speech signal is blocked into frames of N samples, with adjacent frames being separated by M (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N - 2M samples. This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 (which is equivalent to ~ 30 msec windowing and facilitate the fast radix-2 FFT) and M = 100 [26]. Next frames are required to be windowed to minimize the signal discontinuities at the start and end of them. Signals are tapered to zero to minimize the spectral distortion. If we take the window as ( ),0 1w n n N   , where N is the number of samples in each frame, then the result of windowing is the signal 1 1( ) ( ) ( ),0 1y n x n w n n N    . Typically the Hamming window is used, which has the form: [26]. 2 ( ) 0.54 0.46cos ,0 1 1 n w n n N N           (0.9) After windowing, each frame is converted from the time domain into the frequency domain using fast fourier transform. The FFT is a fast algorithm to apply the Discrete Fourier Transform (DFT) which is defined on the set of N samples {Xn}, as follow: 1 2 / 0 , 0,1,2,..., 1 N jkn N n k k X x e n N       (0.10) Note that we use j here to denote the imaginary unit, i.e. 1j   . In general Xn‘s are complex numbers. The resulting sequence {Xn} is interpreted as follow: the zero frequency corresponds to n = 0, positive frequencies 0 / 2sf F  yield to values1 / 2 1n N   , while negative frequencies / 2 0sF f   correspond to docsity.com 42 / 2 1 1N n N    .Here, Fs denote the sampling frequency. 
As mentioned above, psychophysical studies have shown that human perception of the frequency content of sounds does not follow a linear scale. Thus, for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the 'mel' scale. The mel-frequency scale has linear spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. The following approximate formula can therefore be used to compute the mels for a given frequency f in Hz:

    mel(f) = 2595 · log10(1 + f / 700).                                   (0.11)

In the final step, the log mel spectrum is converted back to the time domain, and the result is called the mel-frequency cepstrum coefficients (MFCC). This gives a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and hence their logarithms) are real numbers, they can be converted back to the time domain using the Discrete Cosine Transform (DCT). Denoting the mel power spectrum coefficients resulting from the previous step by S_k, k = 1, 2, …, K [26], the MFCCs are computed as

    c_n = Σ_{k=1}^{K} (log S_k) cos[ n (k - 1/2) π / K ],   n = 1, 2, …, K.   (0.12)

4.1.3 Comparison of MFCC and LPC

MFCC and LPC, described above, are well-known techniques used in speaker identification to describe signal characteristics relative to the speaker-discriminative vocal tract properties. They are similar in some respects and different in others. MFCC is based on filtering the spectrum using properties of the human speech perception mechanism. LPC, on the other hand, is based on the speech production system only, i.e. on the autocorrelation of the speech frame. There is no general agreement in the literature about which method is better. However, it is generally considered that LPC is computationally less expensive while MFCC provides more precise results. The reason for this view lies in the properties of the all-pole model.
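Pulling together the steps of Section 4.1.2, the sketch below converts the magnitude spectrum of one frame into mel-frequency cepstrum coefficients using equations (0.11) and (0.12). It is a hedged illustration rather than the project's actual MFCC routine: the function name melcepstrum, the uniform mel spacing of the filter centres and the parameter names are all assumptions.

function c = melcepstrum(magSpec, fs, K)
% Illustrative sketch of the mel filter bank, log energies and DCT.
% magSpec : magnitude spectrum of one frame (length N), fs : sampling rate,
% K : number of triangular mel filters (also the number of coefficients here).
magSpec = magSpec(:);
N = length(magSpec);
half = magSpec(1:floor(N/2)+1).^2;               % power spectrum for 0..fs/2
hz2mel = @(f) 2595*log10(1 + f/700);             % equation (0.11)
mel2hz = @(m) 700*(10.^(m/2595) - 1);
edges = mel2hz(linspace(0, hz2mel(fs/2), K+2));  % filter edges, equally spaced on the mel scale
bins = (0:length(half)-1)' * fs / N;             % frequency of each FFT bin
S = zeros(K, 1);                                 % mel power spectrum coefficients S_k
for k = 1:K
    lo = edges(k); ce = edges(k+1); hi = edges(k+2);
    rising  = (bins - lo)/(ce - lo) .* (bins >= lo & bins <= ce);
    falling = (hi - bins)/(hi - ce) .* (bins >  ce & bins <= hi);
    S(k) = (rising + falling)' * half;           % triangular filter applied to the spectrum
end
c = zeros(K, 1);
for nn = 1:K                                     % DCT of the log energies, equation (0.12)
    c(nn) = sum(log(S + eps) .* cos(nn*((1:K)' - 0.5)*pi/K));
end
end

In practice only the first 12 to 20 of these coefficients per frame would be kept, as discussed above.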
However, mathematically, VQ finds a partitioning of the feature vector space into a predefined number of regions when the extracted set of feature vectors is given as its input. These regions do not overlap and together they cover the whole feature vector space. Every vector inside such a region is represented by the corresponding centroid. The process of VQ for two speakers is illustrated in Figure 0.6. There are two important design issues in VQ:

i. the method for generating the codebook, and
ii. the codebook size.

Figure 0.6: Vector quantization of two speakers [33]

The codebook size is a trade-off between running time and identification accuracy. With a large size, high classification accuracy can be achieved, but at the cost of running time, and vice versa. Experimental results show that a good saturation-point choice is 64 vectors in the codebook.

The advantages of vector quantization in speaker recognition are as follows:

i. Reduced storage space for the spectral analysis information.
ii. Reduced computation for determining the similarity of spectral analysis vectors. In speech recognition, a major component of the computation is the determination of spectral similarity between a pair of vectors. Based on the VQ representation this is often reduced to a table lookup of similarities between pairs of codebook vectors.
iii. Discrete representation of speech sounds.

The disadvantage of this method lies in the fact that a very good initial estimate of the codebook vectors is needed. It may happen that the random initial selection is clustered in one area of the vector space. If this happens, the final codebook will not be global, which can be a serious problem.

4.3.2 Nearest Neighbors

The nearest neighbor method combines the strengths of the dynamic time warping and vector quantization methods. This technique does not cluster the training data to obtain a codebook; instead, it keeps the data acquired in the training phase and makes use of the temporal information. The distance between the input frames and the stored frames is calculated and stored in a matrix. The nearest neighbor distance is the minimum stored value in the distance matrix. The nearest neighbor distances for all input frames are averaged to arrive at a match score. These match scores are then combined to form an estimate of the likelihood ratio. This method is memory intensive, but it is one of the most powerful methods.

Chapter 5
Experimental Results and Analysis

As speech interaction with computers becomes more pervasive in activities such as financial services and information retrieval from speech databases, the utility of automatically recognizing a speaker based entirely on vocal characteristics increases. Given a speech sample, speaker recognition is concerned with extracting clues to the identity of the person who was the source of that utterance. There have been numerous approaches aimed at understanding the underlying processes involved in the perception and production of speech. These approaches involve disciplines as diverse as pattern classification, signal processing, physiology and linguistics. The interdisciplinary nature of the problem is one thing that makes speech recognition such a complex and fascinating problem. This chapter gives a detailed description of the database which has been built for the speaker recognition system to evaluate the various algorithms. It also describes the experiments performed on the database to check the performance of the classifiers in a real-time application.

5.1 Database used for Experiments

Over the last two decades there has been an increasing interest in speaker recognition. In order to get adequate amounts of speech to train and test a speaker recognition system, speech databases are needed. There are several applications of speaker recognition, leading to a diversity of structure and content among speaker recognition databases. The most obvious benefit of using standard and readily available (public) databases is that they enable quantitative evaluation of methods and speaker recognition protocols.

5.1.1 Literature Survey of Existing Databases

According to the survey, the organization of speaker recognition databases may be based on features such as

i. the recording protocol,
ii. the population of participating subjects,

of ages may not correspond to a real age distribution of users in a specific commercial application. On the other hand, in forensic applications criminals are also unequally distributed in age. In the speech corpus, voice samples have been recorded from speakers belonging to different age groups ranging from 20 to 55 years.

5.1.3 Decision Making and Performance

The absolute performance of the different feature sets is measured by classification experiments. We use vector quantization based classification; from each speaker's training set, a fixed-size codebook is generated.
The speaker models are trained independently of each other, and therefore the recognition rates might not be as high as they would be if a discriminative training algorithm were used instead. It is believed that a certain modeling technique might be better for a certain feature set but not as good for another. However, since VQ is a non-parametric modeling approach, minimal assumptions about the underlying feature distribution have been made. It is believed that the results generalize to other modeling techniques such as GMM modeling.

The evaluation sets are matched against the speaker models in the database. As the matching function, the average quantization distortion with the Euclidean distance metric is used unless otherwise mentioned. A closed speaker database is assumed, and therefore the speaker whose codebook yields the smallest distortion for the test sequence is selected as the identification decision. The type of the recognition task (closed-set identification, open-set identification, verification) is not considered here, since it only affects the type of the decision. In other words, if a certain feature set gives good performance in the closed-set identification task, it is expected to generalize to the other two tasks as well. The performance of the classification is measured by the identification error rate:

    Error = (Ne / N) × 100 %,

where Ne is the number of incorrectly classified test sequences and N is the total number of sequences. If the application is closed-set speaker identification, then the decision-making process is similar to that in speech recognition: we select the candidate model with the smallest distance or the largest probability measure. In speaker verification applications, however, the decision as to whether to accept the claim of identity is more complicated, since this is an open-set problem.

5.2 Experimental Results

The implementation of a speaker recognizer is done in two phases, i.e. a training phase and a testing phase, irrespective of the techniques used in the feature extraction and feature matching modules.

5.3 SR System Using LPC and VQ

There were two phases of implementation for the speaker recognizer using LPC for feature extraction and VQ for feature matching:

1. Training phase
2. Testing phase

5.3.1 Training Phase

In the training phase, the LPC coefficients have been computed using the lpcauto.m function of MATLAB [36]. The real values of these features give the linear predictive coefficients (LPC). In speaker recognition, after calculating the coefficients we need a model of the acoustic properties of the speaker's voice. In this system a vector quantization codebook has been used as the speaker model.

5.3.2 Testing Phase

In the second phase of testing, the same module was used to make a model for the tested speaker so as to compare it with the stored speaker model. A codebook was built for each speaker, and the distance to an unknown sample was estimated from the VQ distortion for each codebook: that is, the input was encoded with each codebook and the difference between the original speech and the codebook cluster centers was measured. This procedure was used both for text-dependent and text-independent speaker recognition. In the text-dependent case the same text files were used for the training and testing purposes, whereas in the other case the speaker models were compared against those stored models which were not used in the training phase.
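The matching step just described, together with the error rate of Section 5.1.3, can be sketched as follows in MATLAB. The cell array codebooks (one codebook per enrolled speaker, code vectors as rows) and the matrix testFeatures (one feature vector per row for a single test utterance) are assumed to exist already; implicit expansion (MATLAB R2016b or later) is assumed, and all names are illustrative rather than the project's actual code.

numSpeakers = numel(codebooks);
distortion = zeros(numSpeakers, 1);
for s = 1:numSpeakers
    cb = codebooks{s};                              % codebook of speaker s
    % squared Euclidean distance from every test vector to every code vector
    d2 = sum(testFeatures.^2, 2) + sum(cb.^2, 2)' - 2*(testFeatures*cb');
    d  = sqrt(max(d2, 0));
    distortion(s) = mean(min(d, [], 2));            % average quantization distortion
end
[~, decision] = min(distortion);                    % smallest distortion wins (closed set)
% Repeating this over all test sequences and counting the misclassified ones
% gives the identification error rate: Error = 100*Ne/N.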
5.3.3 Results

The results are given in the following tables for the 50 enrolled speakers. Each speaker recorded a single session in which both text 'i.' and text 'ii.' were recorded six times; three files have been used for training and three files for testing. In the text dependent case I have tested the accuracy for text 'i.' and text 'ii.'; the results for the text dependent and text independent modules are given in Table 0.2 and Table 0.3 respectively. In both cases three files are used for training and three for testing; results are obtained for eight combinations of different numbers of files, and then their mean and standard deviation values are calculated.

Table 0.2: Text dependent SR results for LPC and VQ

Features (LPC) | No. of Speakers Tested | No. of Training files | No. of Testing files | Mean Error Percentage | Std +/-
For text 'i'   | 50 | 3 | 3 | 26   | 3.37
For text 'ii'  | 50 | 3 | 3 | 6.33 | 1.78

Similarly, in the text independent case I have selected three files from text 'i' for training and three files from text 'ii' for testing. Table 0.3 shows the results of the text independent module of the speaker recognition system using LPC.

Table 0.3: Text independent SR results for LPC

Features | No. of Speakers Tested | No. of Training Files | No. of Testing Files | Mean Error Percentage | Std +/-
LPC      | 50 | 3 | 3 | 64 | 46

5.3.4 Comments

The results show a clear difference between the performance of LPC's text dependent and text independent modules. There is a large performance degradation when we move from the text dependent situation to the text independent case. Although the text dependent recognition gives good and acceptable results, the text independent recognition results using LPC are very poor (accuracy below 50%).

5.4.3 Speaker Modeling Using Vector Quantization

These reduced-order feature vectors were then given to a vector quantization classifier, which built the codebooks for each speaker using the vqlbg.m function from the signal processing and voice toolbox of MATLAB [36]. This function takes the desired codebook size and the MFCC features as input. The size of the codebooks could be 32, 64, 128 or 256, depending upon the signal variations; the most suitable codebook size for my data was 16. The codebooks for each speaker were stored in a .mat file so that they could be compared with the tested codebooks.

The size of the codebook is definitely a variable parameter and an intrinsic element of the system, with special significance for performance. This characteristic determines the fluctuation of the quality of the codebook that can be obtained. The larger the number of code vectors in the codebook, the better the quality of codebook that can be obtained. The quality of a codebook is measured by the amount of quantization error obtained from the codebook. Increasing the size of the codebook also reduces the possible fluctuation from obtaining different codebooks. This means that, for a fixed codebook size, if different codebooks are obtained, the difference in quantization error among the computed codebooks is decreased. In other words, one may observe that the quantization errors of codebooks obtained for a particular speaker tend to converge when the codebook size is increased. Therefore, not only is it possible to obtain a codebook with lower quantization error by increasing the allowable size, but it also facilitates the selection of a codebook to reference a given speaker. Of course, a limit does exist both on the lowest quantization error that a codebook can produce and on how much the quantization errors of codebooks of a specific size will converge. Once again, compromise is the key when selecting the codebook size. It should be selected in such a way as to reduce the variability of the codebooks to an acceptable degree. Not only does this ensure that a codebook of approximately equal quality is obtained if the algorithm is run again, it also implies that a good quality codebook is obtained.
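For illustration, a rough LBG-style codebook training loop in the spirit of the description above is sketched below. It is not the vqlbg.m routine itself, only an assumed re-implementation: the centroid splitting, the fixed number of Lloyd iterations and the perturbation factor are all illustrative choices, and implicit expansion (MATLAB R2016b or later) is assumed.

function code = train_codebook(features, targetSize)
% Illustrative LBG-style sketch: grow a codebook by centroid splitting.
% features : training feature vectors (one per row), targetSize : power of two, e.g. 16.
splitEps = 0.01;                          % perturbation used when splitting centroids
code = mean(features, 1);                 % start from the global centroid
while size(code, 1) < targetSize
    code = [code*(1 + splitEps); code*(1 - splitEps)];   % split every centroid in two
    for iter = 1:20                       % a few Lloyd iterations to refine the centroids
        d2 = sum(features.^2, 2) + sum(code.^2, 2)' - 2*(features*code');
        [~, nearest] = min(d2, [], 2);    % assign each vector to its nearest centroid
        for c = 1:size(code, 1)
            members = features(nearest == c, :);
            if ~isempty(members)
                code(c, :) = mean(members, 1);   % move the centroid to the cell mean
            end
        end
    end
end
end

Doubling from a single centroid yields codebooks of size 2, 4, 8, 16, …, which is why power-of-two sizes such as 16 or 64 are the natural choices mentioned above.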
5.4.4 Testing Phase

In the testing phase, the same procedure was followed as in the training phase. First the signal was preprocessed, and then I computed the feature vectors using mfcc.m. After that I built their codebooks using the same vqlbg function. These codebooks were then compared against the stored codebooks for each speaker using the Euclidean distance as the similarity measure. Again, a built-in function of MATLAB, disteusq.m [33], was used for that purpose. The speaker with the minimum distance from the tested speaker's codebook was declared to be the registered person. This procedure was used both for text-dependent and text-independent speaker recognition. In the text-dependent case the same text files were used for the training and testing purposes, whereas in the other case the speaker models were compared against those stored models which were not used in the training phase.

5.4.5 Results

The results are given in the tables below for the 50 registered speakers. Each speaker recorded six files in a single session; of those six files, three were used for training and three for testing. In the text dependent case, experiments have been performed on both text 'i.' and text 'ii.'. For text 'i.', three files of each speaker have been trained and three files have been tested; I then repeated the same procedure for text 'ii.'. In both cases results are obtained for eight combinations of different numbers of files, and then their mean and standard deviation values are calculated. The results are shown in Table 0.5. The same procedure applies to the text independent module.

Table 0.5: MFCC results for text dependent SR

Features (MFCC) | No. of Speakers Tested | No. of Training files | No. of Testing files | Mean Error Percentage | Std +/-
For text 'i'  | 50 | 3 | 3 | 11.49 | 3.98
For text 'ii' | 50 | 3 | 3 | 2.99  | 0.793

Table 0.6: MFCC results for text independent SR

Features | No. of Speakers Tested | No. of Training Files | No. of Testing Files | Mean Error Percentage | Std +/-
MFCC     | 50 | 3 | 3 | 61.05 | 3.389

5.4.6 Comments

Table 0.5 and Table 0.6 show the difference between the two cases. The results of the text dependent speaker recognizer are much better than those of the text independent one, clearly because the same text was used for both training and testing in the first case. In the text dependent case, the classification accuracy is 97% and only one speaker (speaker #7) is not correctly classified. The reason for the mismatch is that speaker #7 took longer than expected, and there is much distortion in the voice samples along with the silent periods.

Figure 0.2: Waveform of a sentence uttered by speaker #7

Similarly, in the second case the text independent recognition accuracy is 61%. This is because of the time difference between the recorded texts 'i' and 'ii', and also because text 'i' contains more silent pauses than the text itself.
That is why the coefficients of one speaker show similarity with those of other speakers.

It can be seen that the classification accuracy is very low, for the reason stated in the paragraph below Table 0.8; this indicates that pitch is an important feature of the human voice.

5.6 Comparative Analysis

A comparison of the error rates reported against various corpora, along with the percentage error of the speaker recognition system implemented for this project, is shown in Table 0.9.

Table 0.9: Comparisons of the error rates for speaker recognition

Source | Feature | Method | Input | Text | Pop | Error
Atal, 1974 | Cepstrum | Pattern Match | Lab | Dependent | 10 | i: 2%@0.5s, v: 2%@1s
Markel and Davis, 1979 | LP | Long term statistics | Lab | Independent | 17 | i: 2%@39s
Schwartz, 1982 | LAR | Non-parametric pdf | Telephone | Independent | 21 | i: 2.5%@2s
Li and Wrench, 1983 | LP, Cepstrum | Pattern Match | Lab | Independent | 11 | i: 21%@3s, v: 4%@10s
Doddington, 1985 | Filter Bank | DTW | Lab | Dependent | 200 | v: 0.8%@6s
Soong et al., 1986 | LP | VQ | Telephone | 10 isolated digits | 100 | i: 5%@1.5s, i: 1.5%@3.5s
Higgins and Wohlford, 1988 | Cepstrum | DTW likelihood scoring | Lab | Independent | 11 | v: 10%@2.5s, i: 4.5%@10s
Attili, 1991 | Cepstrum, LP, Autocorr | Projected long term statistics | Lab | Dependent | 90 | v: 1%@3s
Higgins et al., 1991 | LAR, LP-Cepstrum | DTW likelihood scoring | Office | Dependent | 186 | v: 1.7%@10s
Tishby, 1995 | LP | HMM (AR mix) | Telephone | 10 isolated digits | 100 | v: 2.8%@1.5s, i: 0.8%@3.5s
Reynolds, 1995 | Mel-Cepstrum | HMM (GMM) | Office | Dependent | 138 | v: 0.8%@10s, i: 0.12%@10s
Che and Lin, 1991 | Cepstrum | HMM | Office | Dependent | 138 | i: 0.56%@2.5s, i: 0.14%@10s, v: 0.62%@2.5s
Colombi et al., 1995 | Cepstrum | HMM microphone | Office | Dependent | 138 | i: 0.22%@10s, i: 0.28%@10s
Reynolds, 1995 | Mel-Cepstrum | HMM (GMM) | Telephone | Independent | 416 | v: 6%/8%@10s, v: 3%/5%@30s
SDSRS, 2007 | LPC, MFCC | VQ | Microphone | Dependent | 50 | i: 6%@5s, i: 3%@5s

Chapter 6
Graphical User Interface

It is the user interface that facilitates the communication of a human with a computer (software program or product). All forms of computer user interfaces, ranging from classical text-only interfaces to graphical user interfaces, take input from the user, perform the requested tasks, and produce output for the user. Numerous techniques have been applied to enhance the effectiveness and efficiency of user interfaces. This chapter presents the design of the graphical user interface for the speaker recognition system's software and describes how it works.

The graphical user interface designed to conduct the experiments consists of a set of typical window screens. Figure 0.1 shows the first window of the graphical user interface for the speaker recognition system. It has four menus, i.e. enrollment, identification, tools and help.

Figure 0.1: First window displayed in graphical user interface

6.1 Enrollment Phase

On clicking the enrollment menu, a new window is opened in which speakers can be enrolled. Figure 0.2 shows the enrollment phase of the GUI; speakers from the preset group are enrolled in the system by taking their voice samples. The name of the speaker is entered in the text boxes along with the ID of the person. After entering the required information, a voice sample of the speaker is required to complete the enrollment. On pressing the highlighted button, a browse window is opened in which voice samples can be browsed and saved for enrollment purposes. After pressing this button, the speaker is enrolled in the system.
Figure 0.2: Enrollment phase for graphical user interface

Figure 0.3: Enrollment completion message

The information of the enrolled subjects can also be edited. This requires the speaker's ID, as shown in Figure 0.4.

Figure 0.4: Editing speaker's information

Chapter 7
Conclusion and Future Work

This thesis attempted to achieve two goals. The first goal was to develop an understanding of the different speaker recognition techniques that have been developed to date, and the second goal was to build a computationally efficient system. For the understanding of the speaker recognition process I had to carry out detailed research on the two modules of a speaker recognizer, i.e. the feature extraction techniques and the feature modeling methods used for classification. In the feature extraction module there are a number of techniques which have been developed over the last two decades: Mel-Frequency Cepstrum Coefficients (MFCC), Mel-Frequency Principal Components (MFPC), Mel-Frequency Independent Components (MFIC), Linear Predictive Coding (LPC), Linear Predictive Cepstrum Coefficients (LPCC), Perceptual Linear Prediction (PLP), and many more. In the feature or speaker modeling module there are again many techniques, e.g. Vector Quantization (VQ), Gaussian Mixture Modeling (GMM), Hidden Markov Models (HMM), and artificial neural network based classifiers such as the Back Propagation Neural Network (BPNN), which have been used for speaker identification applications.

While attempting to achieve the second goal, the main objective was first to search for the feature extraction method that best represents the human speech signal. After finding the best features from the speech signal, I had to search again for the best classifier to classify these features depending on the similarity of the feature vectors. During the implementation, I found that the best feature extraction technique was MFCC when compared with the others, i.e. LPC, as seen from the results in Table 0.5. For the classification purpose, the best feature classification technique turned out to be vector quantization.

7.1 Future Directions

Although many recent advances and successes in speaker recognition have been achieved, there are still many problems for which good solutions remain to be found. Most of these problems arise from variability, including speaker-generated variability and variability in channel and recording conditions. It is very important to investigate feature parameters that are stable over time, insensitive to variations in speaking manner, including the speaking rate and level, and robust against variations in voice quality due to causes such as voice disguise or colds. It is also important to develop a method to cope with the problem of distortion due to recording instruments, and background and channel noise. From the human-interface point of view, it is important to consider how the users should be prompted, and how recognition errors should be handled. Studies on ways to automatically extract the speech periods of each person separately from a dialogue involving more than two people have recently appeared as an extension of speaker recognition technology.

References

[1]. R. Chellappa et al., "Human and Machine Recognition of Faces: A Survey", Technical Report CAR-TR-731, University of Maryland Computer Vision Laboratory, 1994.

[2].
Uros Rapajic, "An Introduction to Multi-Lingual Speech Recognition", URL: http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol1/ur1/article1.html.

[3]. X. Huang et al., "Acoustic Modeling", Spoken Language Processing, Chapter 9, URL: http://www.iis.sinica.edu.tw/~whm/course/Speech-NTUT-2004S/slides/Acoustic Modeling.pdf.

[4]. Alpay Koc, "Acoustic Feature Analysis for Robust Speech Recognition", M.Sc. Thesis, Institute for Graduate Studies in Science and Engineering, 2002.

[5]. Todor Ganchev, Anastasios Tsopanoglou, Nikos Fakotakis and George Kokkinakis, "Probabilistic Neural Networks Combined with GMMs for Speaker Recognition over Telephone Channels", 14th International Conference on Digital Signal Processing (DSP2002), Volume II, pp. 1081-1084, July 1-3, 2002, Santorini, Greece.

[6]. Evgeny Karpov, "Real-Time Speaker Identification", M.Sc. Thesis, University of Joensuu, Department of Computer Science, Jan 15, 2003.

[7]. Joseph Picone, "Signal Modeling Techniques in Speech Recognition", Texas Instruments Inc., Proceedings of the IEEE, Vol. 81, pp. 1215-1247, Sep 1993.

[8]. Roberto A. B. Sória and Euvaldo F. Cabral Jr., "Combining Neural Networks Paradigms and Mel-Frequency Cepstral Coefficients Correlations in a Speaker Recognition Task", Proceedings of the Seventh International Conference on Signal Processing Applications and Technology, Vol. 2, pp. 1725-1729, Oct 1996.

[9]. Hassan Ezzaidi, Jean Rouat and Douglas O'Shaughnessy, "Combining Pitch and MFCC for Speaker Recognition Systems", A Speaker Odyssey - The Speaker Recognition Workshop, June 18-22, 2001, Crete, Greece.