
There are different ways to artificially produce speech. AI-based TTS systems can take phonemes and intonation into account. The challenge for good TTS systems lies in the complexity of human language: we intone words differently depending on where they are in a sentence, what we want to convey with that sentence, what our mood is, and so on.

A very important method is Unit Selection synthesis. With this method, the text is first normalized and divided into smaller entities that represent sentences, words, syllables, phonemes, etc. The structure (e.g. the pronunciation) of these entities is then learned in context. We call this part Natural Language Processing (NLP). The resulting segments are usually stored in a database (either as human voice recordings or synthetically generated) that can be searched for suitable speech parts (unit selection). This search is often done with decision trees, neural networks, or Hidden Markov Models.
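
To illustrate the idea, here is a minimal sketch of a unit-selection search over a hypothetical in-memory database: each candidate unit is scored with a target cost (how well it matches the requested prosody) and a join cost (how smoothly it connects to its predecessor), and a small beam search keeps the cheapest sequence. The `Unit` fields, the cost formulas, and the tiny `DATABASE` are illustrative assumptions, not the data model of any particular system; real systems search large corpora of labelled recordings.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    phoneme: str
    pitch: float      # mean F0 of the recorded segment (Hz), 0.0 if unvoiced
    duration: float   # segment length in seconds
    audio_id: int     # reference into the (hypothetical) recording store

# Hypothetical candidates per phoneme, e.g. for the word "hi" -> HH AY.
DATABASE = {
    "HH": [Unit("HH", 0.0, 0.06, 1), Unit("HH", 0.0, 0.09, 2)],
    "AY": [Unit("AY", 118.0, 0.21, 3), Unit("AY", 142.0, 0.17, 4)],
}

def target_cost(unit, want_pitch, want_dur):
    # How well a candidate matches the requested prosody (pitch scaled to
    # roughly the same order of magnitude as duration).
    return abs(unit.pitch - want_pitch) / 100.0 + abs(unit.duration - want_dur)

def join_cost(prev, unit):
    # Penalise audible discontinuities between consecutive units.
    return abs(prev.pitch - unit.pitch) / 100.0 if prev else 0.0

def select_units(phonemes, targets, beam=5):
    """Beam search for the cheapest unit sequence (Viterbi-like)."""
    paths = [(0.0, [])]                       # (total cost, units chosen so far)
    for ph, (want_pitch, want_dur) in zip(phonemes, targets):
        expanded = []
        for cost, chosen in paths:
            prev = chosen[-1] if chosen else None
            for cand in DATABASE[ph]:
                c = cost + target_cost(cand, want_pitch, want_dur) + join_cost(prev, cand)
                expanded.append((c, chosen + [cand]))
        paths = sorted(expanded, key=lambda p: p[0])[:beam]   # keep the best hypotheses
    return min(paths, key=lambda p: p[0])[1]

best = select_units(["HH", "AY"], [(0.0, 0.08), (130.0, 0.20)])
print([u.audio_id for u in best])             # segments to concatenate, in order
```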

If the speech segments have been generated by a computer rather than recorded, this is called formant synthesis. It offers more flexibility because the collection of words isn't limited to what has been pre-recorded by a human.
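
To make the contrast concrete, here is a minimal sketch of the source-filter idea behind formant synthesis: a pulse train standing in for the vocal folds is passed through second-order resonators tuned to formant frequencies. The formant frequencies and bandwidths below are rough textbook values for an /a/-like vowel, chosen for illustration rather than taken from any particular system.

```python
import wave
import numpy as np

def resonator(x, freq, bw, fs):
    """Second-order IIR resonator (Klatt-style) centred on freq with bandwidth bw (Hz)."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2.0 * np.pi * freq / fs
    a1, a2 = 2.0 * r * np.cos(theta), -r * r
    b0 = 1.0 - a1 - a2               # gain normalisation
    y = np.zeros_like(x)
    y1 = y2 = 0.0                    # filter memory
    for n, xn in enumerate(x):
        y[n] = b0 * xn + a1 * y1 + a2 * y2
        y1, y2 = y[n], y1
    return y

fs, f0, dur = 16000, 120, 0.5        # sample rate, pitch, duration
source = np.zeros(int(fs * dur))
source[::fs // f0] = 1.0             # glottal impulse train as the excitation

# Approximate formant frequencies and bandwidths (Hz) for an /a/-like vowel.
signal = source
for freq, bw in [(700, 130), (1220, 70), (2600, 160)]:
    signal = resonator(signal, freq, bw, fs)
signal /= np.max(np.abs(signal))

with wave.open("vowel_a.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(fs)
    f.writeframes((signal * 32767).astype(np.int16).tobytes())
```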
