Have you ever used Alexa or Siri to help find information, directions or news? If you have, then you've most likely heard a reply in a synthesized voice that is obviously not a human speaker. When you hear Text to Speech content, it is quite easy to understand what the voice is telling you – but you’d never mistake the generated voice for a human one. We understand synthetic voices like Siri and Alexa because they are speaking our language, even in cases where they are translating from other languages. But with so many advances in technology, particularly in the voiceover and narration industry, why can’t Siri sound more human?
The answer is complex and involves not only technology, but an innate sensitivity we have to “near-human” figures and voices. Learning more about Text to Speech (TTS) technology and how it is progressing can help you understand both what is happening in this emerging market and how you can potentially use TTS to your own benefit.
TTS allows your device to read digital text aloud to you. TTS is ideal for aiding those who struggle with reading or for labeling specific items on a screen. It is often used for emerging readers or learners who need extra support, and it is almost universally available and supported on devices around the world. While TTS does speak, it is obviously a synthetic voice or conversational AI without emotion or inflection; it is not narration or a performance. Humans have a particularly fine-tuned awareness of speech and most of us can pick out a synthetic voice when it is used. Conversational intelligence can relay facts, directions, information and reminders, but it is not very adept at portraying subtext or the emotion behind a piece.
Even the most exquisite prose and poetry lose their sparkle when read out by conversational AI. So why haven't we moved further along the path to a less synthetic AI voice, one that packs real emotion? The answer could surprise you. It involves a very real but often misunderstood emotion triggered by "near-human" interactions: some versions become a little too real and trigger feelings of unease and discomfort in listeners (more on this later).
A TTS engine works by converting written text into a phonemic representation which is then converted into waveforms that are sent out as sound. A TTS engine is compatible with most personal devices, from smartphones to laptops, tablets and readers. TTS can read from documents, web pages, books and more, making it a flexible and useful system for providing information. As long as a text file is available, then TTS can read it aloud.
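The two stages described above, text to phonemes and phonemes to waveform, can be sketched in a few lines of Python. This is only a toy illustration, not how any production engine works: the two-word lexicon, the ARPAbet-style phoneme labels, the per-phoneme pitch table and every function name here are assumptions made up for the example, and a plain sine tone stands in for real speech synthesis.

```python
import math

# Hypothetical toy lexicon. Real engines rely on large pronunciation
# dictionaries plus letter-to-sound rules for words they have not seen.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Made-up pitch per phoneme, in Hz. Real speech is not one tone per
# phoneme; this table only stands in for the waveform stage.
PHONEME_HZ = {
    "HH": 180.0, "AH": 220.0, "L": 200.0, "OW": 240.0,
    "W": 190.0, "ER": 210.0, "D": 170.0,
}

SAMPLE_RATE = 8000       # samples per second
PHONEME_SAMPLES = 800    # 0.1 seconds of audio per phoneme at 8 kHz


def text_to_phonemes(text):
    """Stage 1: normalize the text and look each word up in the lexicon."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    phonemes = []
    for word in cleaned.split():
        # Unknown words are skipped here; a real engine would guess.
        phonemes.extend(LEXICON.get(word, []))
    return phonemes


def phonemes_to_waveform(phonemes):
    """Stage 2: render each phoneme as a short sine tone (placeholder)."""
    samples = []
    for phoneme in phonemes:
        hz = PHONEME_HZ[phoneme]
        samples.extend(
            math.sin(2 * math.pi * hz * i / SAMPLE_RATE)
            for i in range(PHONEME_SAMPLES)
        )
    return samples


phonemes = text_to_phonemes("Hello, world!")
waveform = phonemes_to_waveform(phonemes)
print(phonemes)
print(len(waveform), "samples")
```

The resulting list of samples is what a device would hand to its audio output. Everything that makes a voice sound natural, such as timing, stress and intonation, is missing from this sketch, which is part of why simple synthesis sounds so flat.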
In some cases, the words will be highlighted on the screen as they are read; this is often seen in TTS designed for educational purposes. The synthetic voice used is generated by a computer and lacks emotion or emphasis. Optical character recognition can also help support TTS tools and ensure the passages are being read properly and accurately. Ultimately, the listener is provided with a straightforward reading of the text without any insight into the text's meaning, which would normally be gleaned from the narrator's manner of speech.
TTS is used in a wide range of industries, in education and training, and even for entertainment and daily life. If you’ve ever asked Google or Siri to read you something while you are doing another task, you have used TTS and conversational intelligence, whether you realize it or not. TTS and other technologies, including natural language processing (NLP) and natural language understanding (NLU), can be used to create conversational AI that engages and offers support in a variety of applications and settings.
TTS does work better for some applications than others. Since the voice is flat and unable to convey emotion, it is not particularly useful for video applications, video game dialogue or audiobooks. When a performance is required or emotion needs to be conveyed, a human narrator is the superior option.
For defining words, creating scratch videos during production, and providing service or support, TTS is a natural match. In education, short segments of artificial narration can be very clear and useful in cases where additional support is needed or when training adults. When this type of narration is heard repeatedly or for extended periods of time, however, it can become unengaging for some and maybe even a little annoying. Despite this, TTS is excellent for conveying information when using applications like Alexa, Google and Siri.
TTS is already very useful, but it is not yet ready for prime time. Because it lacks emotion, TTS generally only plays a relatively minor supporting role in the production of entertainment – often acting as a placeholder until the real audio for a media project or game is completed.
Developing a “better” TTS voice faces several barriers, both technological and social. There are still challenges to overcome in creating a smooth and conversational synthesized voice and an artificial intelligence sophisticated enough to recognise when to emote. While the technological barriers can be overcome, there is also a human wariness of “almost-human” synthetics that is tough to shake.
Called the “uncanny valley”, this wariness is a reaction to a robot or human-like AI posing as a human. As robotics or CGI become more human-like in sound or appearance, they become more appealing to our senses as we notice familiar traits that "humanize" them and evoke feelings of recognition. But this only occurs up to a certain point. Once a humanoid-type object or sound starts to closely but imperfectly resemble an actual human being, it begins to provoke a feeling of eeriness and revulsion. Neurological research has shown that some people are more sensitive to the uncanny valley than others, but virtually all of us mistrust a synthetic face or voice that is a little too human without actually being human.
Human-sounding conversational intelligence can be a little too good in this regard, sometimes creating a final product that listeners describe as creepy due to the uncanny valley effect. This effect could undermine a listener’s confidence in the brand that is using the TTS or in the information the user is receiving, so imposing some barriers to perfection can actually benefit the technology by preventing the uncanny valley effect from occurring at all.
TTS is a lifesaver for some users who would otherwise not be able to read or understand written materials, but it is not quite ready for eLearning content that uses voiceover. TTS is not emotive enough for use in entertainment materials, but for processes like voiceover data collection and for scratch audio, it can be a huge timesaver and a way to reduce costs and inefficiencies. By collecting voice data via our extensive linguistic network, Andovar is also at the forefront of making TTS more useful without triggering the uncanny valley effect.
Learn more about the right approach to sound for your project or brand – and discover where TTS can be used for support and where a human voiceover is better. We can help you determine what you need for your next eLearning project using the technology that is best for your needs. Get in touch today to learn more or to discover just what a difference our solutions can make for your brand.