We need to decide which solution(s) to use for everything related to text-to-speech. This requires some thought, since TTS is a critical part of Tometo and most of the available solutions cost money. The major problems to solve are:
- Voice generation
- Phoneme/viseme generation
- Forced alignment (getting timestamps for where each word in a status or comment starts)

Current choices include:
- Voice generation: Google TTS, AWS Polly, eSpeak NG
- Phoneme/viseme generation: AWS Polly, eSpeak NG(?)
- Forced alignment: AWS Polly, aeneas

Google TTS and AWS Polly are both priced at $4 per 1 million characters. At roughly 300 characters per status, that works out to about $4 per 3333 statuses for voice generation only, or $8 per 3333 statuses if AWS Polly also handles viseme generation and forced alignment (the same text is then billed twice).
Both aeneas and eSpeak NG run locally, so they have no per-character fees and would be a lot less expensive.
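
To make the per-status figures easier to check, here is a rough sketch of the arithmetic. The 300 characters per status is an assumption inferred from 1,000,000 / 3333, and the function names are made up for illustration; the idea that a second pass over the text (for visemes and word timestamps) is billed at the same rate comes from the $8 estimate above.

```python
from decimal import Decimal

# Price quoted above for both Google TTS and AWS Polly (USD per 1M characters).
PRICE_PER_MILLION_CHARS = Decimal("4")

# Assumption: about 300 characters per status, inferred from 1,000,000 / 3333.
CHARS_PER_STATUS = 300


def statuses_per_million_chars(chars_per_status: int = CHARS_PER_STATUS) -> int:
    """How many whole statuses fit into one million billed characters."""
    return 1_000_000 // chars_per_status


def cost_for(statuses: int, passes: int = 1,
             chars_per_status: int = CHARS_PER_STATUS) -> Decimal:
    """Estimated cost in USD for a number of statuses.

    passes=1: voice generation only.
    passes=2: voice generation plus a second billed pass over the same text
              (viseme generation and forced alignment), as assumed above.
    """
    billed_chars = statuses * chars_per_status * passes
    return PRICE_PER_MILLION_CHARS * billed_chars / 1_000_000


if __name__ == "__main__":
    n = statuses_per_million_chars()
    print(f"{n} statuses per million characters")
    print(f"voice only:          ${cost_for(n, passes=1)}")
    print(f"voice + speech data: ${cost_for(n, passes=2)}")
```

Changing `CHARS_PER_STATUS` shows how sensitive the estimate is to the real average status length, which we have not measured yet.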