Design of Speech Database for Unit Selection in Kiswahili Text to Speech System

K. DRGAKURUMUCEMI. "Design of Speech Database for Unit Selection in Kiswahili Text to Speech System." In: E-Tech conference, Nairobi, August 2004. FARA; 2004.


When developing a concatenative text-to-speech system [1, 3, 4] (i.e. a form of synthesis in which waveforms are created by concatenating parts of natural speech recorded from humans), it is necessary to record all the acoustically and perceptually significant sound variations (allophones) in the language so that they can be played back each time the system synthesises speech. The system is improved by assuming that co-articulation (the mutual influence between adjoining sounds) does not extend beyond the phone-phone boundary [1]. In this case all possible phone-phone combinations are read and recorded. Each two-phone unit is referred to as a diphone. Synthesis is then based on concatenation of the diphones, which takes care of the overlap at the phone-phone boundary. An even better system can be realised when each diphone is captured within the context of several words and synthesis is carried out using the best selection from the recorded words. This procedure therefore requires careful selection of the sentences from which the diphones are to be captured. In other words, such sentences must be phonetically balanced, meaning that they must have the same phone distribution as the language as a whole.
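
The sentence-selection idea can be illustrated with a minimal sketch. It is not the method used in the paper: it assumes a simple greedy coverage heuristic, and the function names (diphones, greedy_select, phone_distribution) and the toy corpus of three pre-segmented Kiswahili words are hypothetical, chosen only for illustration. Each item is assumed to be already transcribed into a list of phones.

```python
from collections import Counter


def diphones(phones):
    """Return the adjacent phone pairs (diphones) of a phone sequence."""
    return list(zip(phones, phones[1:]))


def greedy_select(corpus, max_items):
    """Greedily pick items that add the most not-yet-covered diphones.

    corpus maps an item label to its phone list. This heuristic is an
    assumption of this sketch, not the procedure described in the paper.
    """
    covered = set()
    chosen = []
    remaining = dict(corpus)
    while remaining and len(chosen) < max_items:
        # Item whose diphones add the largest number of new pairs.
        best = max(
            remaining,
            key=lambda k: len(set(diphones(remaining[k])) - covered),
        )
        gain = set(diphones(remaining[best])) - covered
        if not gain:
            break  # nothing left adds new diphones
        covered |= gain
        chosen.append(best)
        del remaining[best]
    return chosen, covered


def phone_distribution(phone_lists):
    """Relative frequency of each phone, for checking phonetic balance."""
    counts = Counter(p for phones in phone_lists for p in phones)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}


# Hypothetical toy corpus: label -> phone sequence.
corpus = {
    "habari": ["h", "a", "b", "a", "r", "i"],
    "baba": ["b", "a", "b", "a"],
    "rafiki": ["r", "a", "f", "i", "k", "i"],
}

chosen, covered = greedy_select(corpus, max_items=3)
selected_dist = phone_distribution([corpus[k] for k in chosen])
language_dist = phone_distribution(corpus.values())
```

Here "baba" is skipped because its diphones already occur in "habari". Comparing selected_dist with language_dist gives a rough indication of how closely the chosen material follows the phone distribution of the full corpus; a real design would estimate that distribution from a large Kiswahili text collection.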




