
Re: dtk-soft



Tim Cross writes:
 > 
 > If viavoice/elaquence create speech simply by using collections of
 > clicks, you have to wonder why anybody attempts any other technique -
 > such as that used by festival, mbrolla or dectalk. 

Let's try to clarify a few matters. At the outset, it should be noted
that I have no expertise whatsoever in speech synthesis. Those who are
more knowledgeable are welcome to correct my errors.

DECTALK is a "formant synthesizer", meaning that it uses a
mathematical model of human speech production: rules generate control
parameters (such as formant frequencies and amplitudes), and the model
turns these into a waveform that is passed to audio hardware such as a
sound card. Basically, various parameters and "sound sources" in the
model are combined to produce the final speech signal.
The research underlying DECTALK was carried out in the 1970s and
early 1980s by Dennis Klatt at MIT, who regrettably died in 1988.
Klatt's research was also commercialized by Telesensory Systems in the
early 1980s as the Prose 2000 and Prose 4000 synthesizers, which
offered speech quality similar to that of DECTALK. I don't know what
ultimately happened to the Prose 2000/4000 speech synthesis software.
There is an article by Klatt available online, published in 1987,
which reviews the different techniques of speech synthesis that have
been tried. Klatt's recordings of various synthesizers can be heard at
http://cslu.cse.ogi.edu/tts/research/history/
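
To make the formant idea concrete, here is a minimal sketch in Python.
It is not DECTALK's code. It assumes a very simplified model: a train
of pulses stands in for the voicing source, and each formant is a
second-order digital resonator of the kind used in Klatt-style
synthesizers. The function names, the formant values for an "ah"-like
vowel, and the 10 kHz sample rate are all illustrative assumptions.

```python
import math

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter a signal through one second-order resonator (one formant).

    Difference equation: y[n] = a*x[n] + b*y[n-1] + c*y[n-2],
    with coefficients chosen so that the gain at 0 Hz is 1.
    """
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c
    out = []
    y1 = y2 = 0.0  # the two previous output samples
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0_hz, duration_s, sample_rate):
    """Crude voicing source: one unit pulse per pitch period."""
    n = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def synthesize_vowel(formants, f0_hz=100.0, duration_s=0.3,
                     sample_rate=10000):
    """Pass the source through the formant resonators in cascade."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal

# Illustrative (frequency, bandwidth) pairs in Hz, not measured data.
AH = [(700.0, 60.0), (1200.0, 90.0), (2600.0, 150.0)]
samples = synthesize_vowel(AH)
```

The resulting list of samples would still need to be scaled and
written to a sound device or file. A real synthesizer changes the
parameters every few milliseconds and adds noise sources for
consonants; this sketch holds everything constant.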

While the above discussion focuses on the actual generation of the
speech signal, the most complicated part of a text to speech system is
the component that converts the text into the phonetic data from which
the speech is constructed. It is not simply a matter of applying rules
that define the correspondences between words and letter strings, on
the one side, and phonemes on the other. Due to coarticulation, a
single phoneme can sound somewhat different depending on its context;
there are also issues of stress to be considered, as well as the
overall contour of a phrase or sentence, determined by an analysis of
punctuation and other clues to be found in the text itself. There is a
good paper available on the Web (I can find the reference if required)
describing the DECTALK software and the various stages of text to
speech conversion, which are implemented as separate threads for
reasons of efficiency.
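
As a toy illustration of why context matters, the following Python
sketch picks a variant of the phoneme "T" according to its neighbours
and guesses a sentence contour from the final punctuation. The phoneme
symbols, the "T_aspirated" label, the two rules, and the contour names
are simplifying assumptions of mine; they are not taken from DECTALK
or any other real system, which would use far richer rules.

```python
# A small, incomplete set of vowel symbols, for illustration only.
VOWELS = {"AA", "AE", "AH", "AO", "EH", "ER", "IH", "IY", "UW"}

def select_allophones(phonemes):
    """Choose a context-dependent variant for each phoneme."""
    out = []
    for i, p in enumerate(phonemes):
        prev = phonemes[i - 1] if i > 0 else None
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if p == "T" and prev in VOWELS and nxt in VOWELS:
            out.append("DX")           # flap, as in American "water"
        elif p == "T" and prev is None:
            out.append("T_aspirated")  # word-initial, as in "top"
        else:
            out.append(p)
    return out

def phrase_contour(text):
    """Guess an intonation contour from the final punctuation mark."""
    stripped = text.rstrip()
    if stripped.endswith("?"):
        return "rising"        # simplification: not all questions rise
    if stripped.endswith(","):
        return "continuation"
    return "falling"

print(select_allophones(["W", "AO", "T", "ER"]))  # "water"
print(select_allophones(["T", "AA", "P"]))        # "top"
print(phrase_contour("Is it raining?"))
```

The same phoneme "T" comes out differently in "water" and "top", which
is the kind of context dependence a plain letter-to-phoneme table
cannot express.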

I don't know what techniques of speech synthesis Viavoice uses; I
suspect it may also be a formant synthesizer. Festival uses diphone
synthesis: there is a large database of pre-recorded sound segments,
and these are combined to produce the speech.
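
The concatenation idea can be sketched in a few lines of Python. This
is not Festival's code or API. The diphone naming scheme ("A-B", with
"_" for silence), the function names, and the tiny database of made-up
sample values are assumptions for illustration only.

```python
def to_diphones(phonemes):
    """Turn a phoneme sequence into diphone names.

    Each diphone spans the transition from one phoneme to the next;
    "_" marks the silence before and after the utterance.
    """
    padded = ["_"] + list(phonemes) + ["_"]
    return [a + "-" + b for a, b in zip(padded, padded[1:])]

def concatenate(diphone_names, database):
    """Join the recorded segments for the given diphones end to end."""
    samples = []
    for name in diphone_names:
        samples.extend(database[name])  # KeyError if a unit is missing
    return samples

# Hypothetical database: real entries would be recorded audio.
TOY_DATABASE = {
    "_-HH": [0.0, 0.1],
    "HH-AY": [0.2, 0.3, 0.4],
    "AY-_": [0.1, 0.0],
}

names = to_diphones(["HH", "AY"])          # the word "hi"
audio = concatenate(names, TOY_DATABASE)
print(names)
print(audio)
```

A practical system would also smooth the joins between segments and
adjust pitch and duration, which this sketch leaves out entirely.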

As indicated earlier, I am really not an expert in this area, which
explains why the preceding explanations are so general and likely to
be inaccurate in some of their detailed aspects.

-----------------------------------------------------------------------------
To unsubscribe from the emacspeak list or change your address on the
emacspeak list send mail to "emacspeak-request@cs.vassar.edu" with a
subject of "unsubscribe" or "help"

