Advances in Intelligent Systems and Technologies
Proceedings ECIT2008 – 5th European Conference on Intelligent Systems and Technologies
Iasi, Romania, July 10-12, 2008

Spontaneous Speech Database for Romanian Language

Vladimir Popescu, Cristina Petrea, Diana Haneş, Andi Buzo, Corneliu Burileanu
Faculty of Electronics and Telecommunications, “Politehnica” University of Bucharest
cburileanu@mesnet.pub.ro

Abstract. Spontaneous speech recognition for the Romanian language is an open domain, not yet explored as thoroughly as other fields of speech recognition. The goal is to reach state-of-the-art performance with worthwhile results. This paper presents research work in the field of spontaneous speech recognition. The work begins with a new Romanian corpus, built from scratch at word and triphone level, which led to important statistical results regarding the linguistic structures of the Romanian language. Conventions were used in order to reach high standards of performance, with applicability to further research. This paper summarizes the phases completed so far, the statistical results and the achievements obtained in the corpus building phase, on the challenging way to a new spontaneous speech recognition tool for the Romanian language.

Keywords: Speech Recognition, Spontaneous Speech, Romanian Word Corpus, Romanian Triphones Corpus, Romanian Speech Database

1. Introduction

Fields of speech recognition such as isolated-word speech recognition or continuous speech recognition are territories that have been explored over and over again, with considerable results. For international languages (English, French) there are complete human-computer dialogue systems.
In other languages such as Romanian, considered “under-resourced, from a software point of view” according to recent studies [8], [9], the development of spoken dialogue systems is a long-term process. However, by reusing language-independent components from architectures designed in developed countries with which Romanian researchers cooperate (e.g. France), and by creating only the language-specific components for Romanian, human-computer dialogue applications in Romanian become feasible in the mid-term.

The issue of Romanian spontaneous speech recognition in dialogue is of high importance, as spontaneous speech recognition is a barely explored territory with poorer results. This is because there are major differences between speech read from a written script and speech that is composed in real time and expressed freely. In spontaneous speech the speaker does not follow rules; he talks freely, not necessarily in a grammatically correct way. He may use slang or shortened forms, come up with unexpected interjections, make pauses or speak too fast, stammer, breathe deeply, hesitate, or be incoherent. He may or may not be aware that his words are not always grammatically correct and that his language contains disfluencies. The speaker’s mood has a strong influence on his spontaneous speech: he may yell, whisper, get confused and begin to stammer, laugh or cry, and all his emotions are reflected in the way he expresses himself. The issues that appear in spontaneous speech are false starts, filled pauses, incoherence of the speaker with possible ungrammatical constructions, and other similar behaviors. These problems make spontaneous speech more difficult to recognize than read speech.
The state of the art in this field of speech recognition reveals poor accuracy for freely spoken spontaneous speech. Because spontaneous speech recognition makes use of acoustic and linguistic models that were constructed for speech read from scripts, the results are far from the desired ones. The corpora used for freely spoken speech have to be wide in order to improve knowledge of the structure of spontaneous speech, which is limited for the moment. Speech recognition systems based on statistical approaches are available in the research community (SPHINX system - Carnegie Mellon University, HTK toolkit - Cambridge University, RAPHAEL system - Laboratoire d’Informatique de Grenoble, etc.), as well as in the commercial domain.

[...] to represent the corresponding triphone. This step will be realized using the HRest and HERest HTK tools [5].

The testing phase

The testing phase will use all the output obtained in the training phase.

Fig. 2. The system architecture considered for the testing phase

Testing sentences will be used as input, and the voice signal will be chosen from the already existing database. The wave files used as input are first parameterized, and the MFCC (Mel Frequency Cepstral Coefficients) parameters are extracted using the HCopy HTK tool [5]. The MFCC acoustic parameters will be used as input for the next step, the triphone decoding phase. At this step the acoustic parameters will be decoded using the HMMs trained for each of the existing triphones, and a sequence of triphones will be obtained. At the next step this sequence of triphones will be converted into a word sequence, also taking into account a grammar with a finite number of states. Given the reference text, the results can be evaluated.
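The evaluation against a reference text can be illustrated with a minimal Python sketch. It assumes the common textbook definitions: Word Error Rate as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and Sentence Error Rate as the fraction of sentences that are not recognized exactly. The function names are hypothetical, and the paper's own scoring tool may count errors differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Total word-level edits divided by the total number of reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    total_words = sum(len(r.split()) for r in references)
    return errors / total_words


def sentence_error_rate(references, hypotheses):
    """Fraction of sentences whose word sequence differs from the reference."""
    wrong = sum(r.split() != h.split() for r, h in zip(references, hypotheses))
    return wrong / len(references)


if __name__ == "__main__":
    # Placeholder word tokens, not taken from the corpus.
    refs = ["a b c d", "e f"]
    hyps = ["a x c", "e f"]
    print(word_error_rate(refs, hyps), sentence_error_rate(refs, hyps))
```

Both scores are computed over the whole test set rather than averaged per sentence, so long sentences weigh more in the Word Error Rate.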
The reference text will be compared to the text produced by the system, and the performance of the spontaneous speech recognition tool can be determined with two kinds of evaluation scores: Sentence Error Rate and Word Error Rate.

3. Spontaneous Speech Data Base Building

The corpora used in speech recognition should be wide: the wider a corpus is, the more useful it becomes in the spontaneous speech recognition domain. The corpus was created from scratch. The intention is to keep the corpus scalable, so that as many words can be added as further investigations and research will need. Efforts have been concentrated on creating conventions and well-defined steps for future use in database creation and enlargement.

3.1. Issues in Building Spontaneous Speech Data Bases

A series of problems and specific elements have been identified in relation to the construction of databases for continuous speech recognition. Speech recognition may be considered a process of pattern recognition, which can usually be achieved [1] based on rules or based on statistical methods. The latter option is currently the preferred one because of the good performance obtained at acceptable production costs [2]. Such an approach assumes the use of input data for training, so that the system can automatically create information to use at a later stage. For the speech recognition process, the specific aspects of the input data used for training depend on the type of application considered initially: speech recognition or speaker recognition. Databases corresponding to these two types of applications have common parts but also specific parts [3]. When performing speech recognition, the important considerations are the system type (speaker dependent or speaker independent) and the vocabulary size (small vocabulary: 10-100 words, medium vocabulary: 100-1000 words, large vocabulary: 10000-100000 words).
Databases used for speech recognition must:
- ensure good coverage of the given vocabulary and of the most significant acoustic units (phonemes, phonemes in certain contexts);
- ensure inter-phoneme separation that is as good as possible;
- be speaker voice independent (for speaker independent recognition purposes) or reflect the voice of the speaker (for speaker dependent recognition purposes).

Options available for database creation:
- direct recordings, which may cause specific problems: selection of the premises for recording (studio), microphone selection (unidirectional/multidirectional, with/without active filter);
- data acquisition from radio or TV shows that have already been transmitted over the Internet, which may have specific problems: recordings are performed under different conditions (open air/studio, movies, etc.), non-uniform coding types (A-law PCM, μ-law PCM, etc.), use of different sampling frequencies (4, 8, 16 kHz);
- direct acquisition from TV and radio channels, which may have specific problems: digitization of the acquired analog signal, homogeneity of recording conditions, control of possible power outages or jammed transmission.

Fields needed in a vocal signal database for spontaneous speech recognition:
- vocal signal files, in various formats (WAV, OGG, AIFF, RAW, etc.);
- characteristics of the vocal signal files: acquisition moment, recording length, speaker identity, speech type (read, spontaneous, etc.);
- label files, indicating the words or phonemes corresponding to each segment of the vocal signal recording;
- acoustic parameter files, compactly representing the vocal signal: linear prediction coefficients (LPC), cepstral coefficients (which may be filtered on a mel frequency scale), etc.; an acoustic parameter file is associated with each vocal signal file.

Table 1 shows the main characteristics of the newly created database.

TABLE 1.
Data base characteristics

Collecting procedure                        Recording Internet-broadcast Romanian TV shows
Used language                               Spoken Romanian
Recordings duration                         ~4 hours with vocal signal
Speakers                                    12
  Females                                   8
  Males                                     4
Sessions per speaker                        3-20
Time between recording sessions             One day to two weeks
Words total occurrences                     37604
Words unique occurrences                    8068
Speech type                                 Oral, spontaneous
Recording environment                       TV studio
Vocal recorded signal sampling frequency    8 kHz

Based on these elements, the following set of essential problems was determined:
- vocal signal recording segmentation: for the ease and reliability of vocal signal processing by the human expert, the preferred length of a vocal signal file corresponds to a recording between 60 s and 180 s;
- vocal signal labeling: labeling can be performed at word level (a time-consuming, manual process) or at phoneme or triphone level (a semi-automated bootstrapping process starting from an initial manual labeling [5]); the latter lacks reliability, due to the statistical algorithms involved, e.g. forced Viterbi alignment of hidden Markov models [6];
- vocal signal parameterization: specific criteria need to be fulfilled. In speech recognition, the criteria are maximizing inter-phoneme dispersion and minimizing [...]

[...] first and the last labels are "sil"s representing ambient noise, and the middle label contains all the spoken text with "sp"s and "sil"s.

Triphones labeling: from the phonetic word transcriptions, the triphonetic transcriptions were created using the HTK HLEd tool [5]. The histograms created for the triphone occurrences are illustrated in figures 4, 5, 6, 7 and 8. From the word-labeled files and the triphonetic transcriptions, the triphonetic labeled files were generated.

3.4. Data Coding

For this phase [5], Mel Frequency Cepstral Coefficients (MFCCs) were used. The HTK tool HCopy automatically converted the input data into MFCC vectors. The mfc files were obtained from the wave files.
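To make the framing arithmetic of the data coding step concrete, the following Python sketch converts the values quoted in this section (10 msec frame period, 20 msec Hamming window, pre-emphasis coefficient 0.97) and the 8 kHz sampling frequency from Table 1 into HTK time units and sample counts. It is only an illustration: the function names are hypothetical, the frame count assumes that only complete windows are kept, and the pre-emphasis leaves the first sample unchanged, which may differ from what HCopy does internally.

```python
SAMPLE_RATE_HZ = 8000   # sampling frequency from Table 1
FRAME_PERIOD_MS = 10    # frame period quoted in the text
WINDOW_MS = 20          # Hamming window length quoted in the text
PREEMPHASIS = 0.97      # first-order pre-emphasis coefficient quoted in the text


def ms_to_htk_units(ms):
    """HTK expresses durations in units of 100 ns, i.e. 10,000 units per msec."""
    return ms * 10_000


def ms_to_samples(ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples covered by a duration given in msec."""
    return sample_rate_hz * ms // 1000


def count_frames(n_samples, window=None, shift=None):
    """Number of complete analysis windows that fit in the signal (assumption)."""
    window = ms_to_samples(WINDOW_MS) if window is None else window
    shift = ms_to_samples(FRAME_PERIOD_MS) if shift is None else shift
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // shift


def preemphasize(samples, coeff=PREEMPHASIS):
    """y[n] = x[n] - coeff * x[n-1]; the first sample is kept as is (simplification)."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - coeff * samples[n - 1]
                           for n in range(1, len(samples))]


if __name__ == "__main__":
    print(ms_to_htk_units(FRAME_PERIOD_MS), ms_to_htk_units(WINDOW_MS))
    print(ms_to_samples(WINDOW_MS), ms_to_samples(FRAME_PERIOD_MS))
    print(count_frames(SAMPLE_RATE_HZ))  # frames in one second of signal
```

At 8 kHz each window therefore covers 160 samples and successive windows are shifted by 80 samples, so adjacent frames overlap by half.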
The target parameters are Mel-Frequency Cepstral Coefficients (MFCCs). The delta component is used and not the acceleration component (MFCC_E), the frame period is 10 msec (HTK uses units of 100 ns), the output is not saved in compressed format, and a CRC checksum is not added. The FFT uses a Hamming window of 20 msec, and the signal has first-order pre-emphasis applied using a coefficient of 0.97. The filterbank has 26 channels, and 12 MFCC coefficients are output. The variable ENORMALISE is true by default and performs energy normalization on recorded audio files. It cannot be used with live audio. Creating these files reduces the amount of preprocessing required during training, which can itself be a time-consuming process.

4. Statistics for Romanian Language

The database was created from scratch for future work and research in the spontaneous speech recognition domain. The results of the work, including the labeling, the dictionaries and all the conventions taken into consideration in creating the corpus, are illustrated in the histograms and in the statistics on the number of occurrences at word level and at triphone level. Instead of simple phonemes, sets of three phonemes were used. The reason for using triphones is that they permit context analysis: they preserve the neighbors before and after a phoneme, by including the left and right context phonemes in the triphonetic construction. The most representative triphones in the newly created corpus are presented in the following figures, as they have the most occurrences in the words used. Figure 4 illustrates the ten triphones with the most occurrences in the corpus words, ranging from 983 to 2305 occurrences. The triphones that have the most occurrences are “d+e” with 2305 occurrences and “d-e” with 1897 occurrences. Figure 5 illustrates the next ten triphones, with more than 729 and fewer than 984 occurrences among the words considered for the corpus.
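The way a triphone carries its left and right context can be sketched in Python. The sketch assumes word-internal expansion with the HTK-style notation left-phone+right, which is consistent with labels such as “d+e”, “d-e” and “a-r+e” in the figures; it is not the HLEd tool itself, and the function names and phone transcriptions below are purely illustrative.

```python
from collections import Counter


def word_internal_triphones(phones):
    """Expand the phone sequence of one word into context-dependent labels.

    The first phone has only a right context (p+r), the last one only a
    left context (l-p), and inner phones have both (l-p+r).
    """
    labels = []
    last = len(phones) - 1
    for i, phone in enumerate(phones):
        left = f"{phones[i - 1]}-" if i > 0 else ""
        right = f"+{phones[i + 1]}" if i < last else ""
        labels.append(f"{left}{phone}{right}")
    return labels


def count_triphones(transcriptions):
    """Count label occurrences over a list of per-word phone sequences."""
    counts = Counter()
    for phones in transcriptions:
        counts.update(word_internal_triphones(phones))
    return counts


if __name__ == "__main__":
    # Hypothetical transcriptions, used only to show the notation.
    words = [["d", "e"], ["a", "r", "e"], ["d", "e"]]
    for label, n in count_triphones(words).most_common():
        print(label, n)
```

Sorting the counter by frequency, as `most_common()` does, gives the kind of ranking that the histograms in this section display.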
Triphones “l+a” and “l-e” are dominant in this histogram. Figure 6 shows the next ten triphones, which appear more than 618 times and fewer than 710 times among the words considered in the corpus.

Fig. 4. The triphones with most occurrences in the corpus
Triphones (in order): d+e d-e i_+n t-e r-e u-l S+i S-i p+e l-a
Occurrences (in order): 2305 1897 1511 1324 1217 1146 1115 1094 1041 983

Fig. 5. The next ten triphones with most occurrences
Triphones (in order): l+a l-e i_-n e_X-a k+a k+u k-@ a-r+e p+r m+a
Occurrences (in order): 982 953 852 829 818 814 797 788 765 730

Fig. 6. Next ten triphones with most occurrences
Triphones (in order): tS+e s-@ t-@ s+@ a i-i_O k-u d+i d-i+n n-e
Occurrences (in order): 709 694 691 689 689 678 658 636 633 619

The percentage representation of the triphones, according to the values in table 3, appears in figure 7.

TABLE 2. The first ten most representative triphones

Number of occurrences    Number of triphones
> 1000                   9
500 - 1000               40
100 - 500                371
50 - 100                 401
30 - 40                  421
10 - 30                  950
5 - 10                   799
< 5                      2104

From a total of 5095 triphones, which represent one hundred percent, the percentage distribution of the triphones represented in figures 4, 5, 6 is illustrated in figure 7. In figure 7, one percent of the total number of triphones has more than five hundred and fewer than one thousand occurrences in the words considered for the database. Seven percent of the total number of triphones has more than one hundred and fewer than five hundred occurrences in the database words. Eight percent of the total number of triphones has more than fifty and fewer than one hundred occurrences in the database words. Another eight percent of the total number of triphones has more than thirty and fewer than forty occurrences in the database words.
Nineteen percent of the total number of triphones has more than ten and fewer than thirty occurrences in the database words.
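The percentages quoted above can be checked directly against the triphone counts per occurrence range listed in the table. The short Python sketch below recomputes them; the bin labels are copied from the table, while the variable and function names are illustrative.

```python
# Number of distinct triphones per occurrence range, as listed in the table.
TRIPHONE_BINS = [
    ("> 1000", 9),
    ("500 - 1000", 40),
    ("100 - 500", 371),
    ("50 - 100", 401),
    ("30 - 40", 421),
    ("10 - 30", 950),
    ("5 - 10", 799),
    ("< 5", 2104),
]


def percentage_shares(bins):
    """Share of each bin in the total number of triphones, rounded to whole percent."""
    total = sum(count for _, count in bins)
    return {label: round(100 * count / total) for label, count in bins}


if __name__ == "__main__":
    print("total triphones:", sum(count for _, count in TRIPHONE_BINS))
    for label, share in percentage_shares(TRIPHONE_BINS).items():
        print(f"{label:>10}: {share}%")
```

The bin counts add up to the stated total of 5095 triphones, and the rounded shares for the 500 - 1000, 100 - 500, 50 - 100, 30 - 40 and 10 - 30 ranges reproduce the one, seven, eight, eight and nineteen percent given in the text.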