
Decide on TTS tooling
Open, Needs Triage, Public


We need to decide what solution(s) to use for everything related to Text-To-Speech. This requires some thought, since it's obviously a critical part of Tometo and most of the solutions cost money. The major problems are:

  • Voice generation
  • Phoneme/viseme generation
  • Forced alignment (getting timestamps for where each word in a status or comment starts)

Current choices include:

  • Voice generation: Google TTS, AWS Polly, eSpeak NG
  • Phoneme/viseme generation: AWS Polly, eSpeak NG(?)
  • Forced alignment: AWS Polly, aeneas

Google TTS and AWS Polly are both priced at $4 per 1 million characters. At roughly 300 characters per status (the length these figures imply), that comes to $4 per 3333 statuses for voice generation only, or $8 per 3333 statuses when AWS Polly also handles viseme generation and forced alignment.
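To make the estimate easy to redo with other numbers, here is a rough sketch of the arithmetic. It is not tied to any provider's API. The 300-character status length is inferred from the 3333 figure above rather than stated anywhere, and the `passes` parameter is a hypothetical way to model the assumption that viseme generation and forced alignment are billed as a second pass over the same text.

```python
CHARS_PER_MILLION = 1_000_000


def statuses_per_million(chars_per_status: int = 300) -> int:
    """How many whole statuses fit into 1 million billed characters."""
    return CHARS_PER_MILLION // chars_per_status


def cost_usd(
    num_statuses: int,
    chars_per_status: int = 300,         # assumption inferred from the 3333 figure
    price_per_million_usd: float = 4.0,  # price quoted in this task
    passes: int = 1,                     # 1 = voice only, 2 = voice + visemes/alignment (assumed)
) -> float:
    """Estimated cost in USD for sending num_statuses through a paid TTS service."""
    billed_chars = num_statuses * chars_per_status * passes
    return billed_chars / CHARS_PER_MILLION * price_per_million_usd


if __name__ == "__main__":
    print(statuses_per_million())
    print(round(cost_usd(3333), 2))
    print(round(cost_usd(3333, passes=2), 2))
```

Changing `chars_per_status` or `price_per_million_usd` shows how sensitive the estimate is to average status length and to any future price changes.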

eSpeak NG and aeneas both run locally, so they have no per-character fees and would therefore be a lot less expensive.

Event Timeline

aggums created this task. Jan 31 2020, 12:35 AM
aun added a subscriber: aun. Jan 31 2020, 12:38 AM