Ref: https://learn.cantrill.io/courses/1820301/lectures/42176865
Amazon Polly 101
- 🔧 Text-to-Speech (TTS) service
- ❗ NO translation done in the conversion, same language!
- Voice engines:
- Standard TTS = Concatenative (concatenates phonemes)
- 💡 Phoneme = smallest unit of sound in a language
- Neural TTS = Phonemes → spectrograms → vocoder → audio
- Much more human/natural sounding, but much more complex and computationally heavy
- Newer engines: long-form, generative (more powerful)
- Features
- Integration with other services and apps
- e.g. WordPress plugin to read WordPress articles out loud
- Many output formats supported (MP3, PCM, Ogg Vorbis…)
- Supports Speech Synthesis Markup Language (SSML) → markup tags provide additional control over how speech is generated
- e.g. emphasis, pronunciation, whispering, over-exaggerated “newscaster” speaking style
- Lexicons: define how to read certain specific text
- e.g. “AWS → Amazon Web Services”
- Speech marks: metadata describing where each sentence/word starts and ends in the audio
- helpful for e.g. lip-syncing or word highlighting
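To make the SSML and speech mark features above more concrete, here is a minimal Python sketch. It assumes the boto3 `synthesize_speech` call is used to reach Polly, but the code itself only builds the request and parses a response, so it runs without AWS credentials. The helper names (`build_ssml`, `speech_request`, `parse_speech_marks`, `word_timings`) and the `SAMPLE_MARKS` payload are hypothetical and written by hand for illustration; the timing values are not real Polly output.

```python
import json
from xml.sax.saxutils import escape


def build_ssml(sentence, whispered):
    """Wrap text in SSML; the second part uses Polly's whispered effect."""
    return (
        "<speak>"
        f"{escape(sentence)} "
        f'<amazon:effect name="whispered">{escape(whispered)}</amazon:effect>'
        "</speak>"
    )


def speech_request(ssml, voice_id="Joanna", marks=False):
    """Build keyword arguments for boto3's polly.synthesize_speech.

    Assumption: audio and speech marks are requested separately, since
    speech marks are only returned with the "json" output format.
    """
    params = {
        "Text": ssml,
        "TextType": "ssml",
        "VoiceId": voice_id,
        "Engine": "standard",  # assumed here: the whispered effect targets standard voices
        "OutputFormat": "json" if marks else "mp3",
    }
    if marks:
        params["SpeechMarkTypes"] = ["sentence", "word"]
    return params


def parse_speech_marks(payload):
    """Speech marks arrive as one JSON object per line."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


def word_timings(marks):
    """Return (word, offset in ms) pairs, e.g. for word highlighting."""
    return [(m["value"], m["time"]) for m in marks if m["type"] == "word"]


# Hand-written sample in the line-delimited JSON shape; values are illustrative only.
SAMPLE_MARKS = """\
{"time": 0, "type": "sentence", "start": 0, "end": 11, "value": "Hello there"}
{"time": 6, "type": "word", "start": 0, "end": 5, "value": "Hello"}
{"time": 370, "type": "word", "start": 6, "end": 11, "value": "there"}
"""

if __name__ == "__main__":
    ssml = build_ssml("AWS & friends", "a secret")
    print(speech_request(ssml))
    print(word_timings(parse_speech_marks(SAMPLE_MARKS)))
    # With credentials configured, the request could be sent like this:
    # import boto3
    # response = boto3.client("polly").synthesize_speech(**speech_request(ssml))
    # audio_bytes = response["AudioStream"].read()
```

A lexicon would be referenced in the same request through the `LexiconNames` parameter, after it has been uploaded to Polly; that step is left out of this sketch.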