Improvements in speech synthesis have driven the growing use of intelligent assistants such as Amazon Alexa and Apple Siri, but advanced speech capabilities are also moving closer to providing an essential service. Speech technologies based on artificial intelligence (AI) are progressing toward the ultimate goal of giving a voice to the millions of people who live with speech impairments or voice loss.
Modern voice technology drives an enormous, highly competitive market for smart devices. According to the 2022 Smart Audio Report [1] from NPR and Edison Research, 62 percent of Americans aged 18 and older use a voice assistant on some type of device. For businesses, taking part in this trend toward sophisticated voice capabilities is crucial, not only for establishing their own synthetic voice but also for the many opportunities to communicate directly with customers through AI-powered agents that listen and respond on the user’s device in natural-sounding conversation.
Complex Speech Synthesis Pipeline
Speech synthesis technology has advanced significantly since the voice encoder, or vocoder, was created more than 80 years ago to reduce the bandwidth needed for transmission over telephone lines. Today’s vocoders are highly sophisticated subsystems that rely on deep learning algorithms such as convolutional neural networks (CNNs). In fact, neural vocoders are only the final stage of speech synthesis pipelines, which also include an acoustic model capable of generating the varied features of speech that listeners use to discern gender, age, and other characteristics of a human speaker. In this pipeline, the acoustic model produces acoustic features, typically in the form of mel-spectrograms, which map the linear frequency spectrum onto a scale thought to better reflect human perception. Neural vocoders such as Google DeepMind’s WaveNet then use these acoustic features to generate high-quality audio output waveforms.
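To make the mel mapping concrete, here is a minimal sketch of how a linear frequency axis can be warped onto a mel scale. It assumes the widely used conversion mel = 2595 * log10(1 + f / 700), which is only one of several variants in circulation; the article does not say which one any particular pipeline uses. The function names are hypothetical and chosen for illustration only.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (assumed 2595*log10(1 + f/700) variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_bands: int) -> list[float]:
    """Return n_bands + 2 edge frequencies in Hz, evenly spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_bands + 2)]


if __name__ == "__main__":
    # Equal steps in mels become progressively wider steps in Hz.
    edges = mel_band_edges(0.0, 8000.0, 8)
    for lo, hi in zip(edges, edges[1:]):
        print(f"{lo:8.1f} Hz to {hi:8.1f} Hz (width {hi - lo:7.1f} Hz)")
```

Running the script prints band edges that are narrow at low frequencies and wide at high frequencies, reflecting the finer resolution of human hearing at the low end. A production acoustic model would typically rely on an audio library to build full mel filter banks rather than hand-rolled helpers like these.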
Text-to-speech (TTS) offerings abound, ranging from downloadable mobile apps and open-source packages such as OpenTTS to comprehensive cloud-based, multi-language services such as Amazon Polly, Google Text-to-Speech, and Microsoft Azure Text to Speech, among others. Most TTS applications and services support the industry-standard Speech Synthesis Markup Language (SSML), which gives speech synthesis applications a consistent way to produce more natural speech patterns, including pauses, phrasing, emphasis, and intonation.
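As an illustration of the kind of markup SSML provides, the sketch below assembles a small SSML document with Python’s standard xml.etree.ElementTree module. The speak, break, emphasis, and prosody elements and the attribute values used here come from the W3C SSML specification. The build_ssml helper and its parameters are hypothetical, and individual TTS services may support only a subset of SSML or require extra attributes such as the SSML XML namespace, so treat this as a starting point rather than a request that any particular service is guaranteed to accept.

```python
import xml.etree.ElementTree as ET


def build_ssml(greeting: str, stressed: str, closing: str) -> str:
    """Build a small SSML document (hypothetical helper, for illustration only)."""
    speak = ET.Element("speak", {"version": "1.1", "xml:lang": "en-US"})
    speak.text = greeting

    # Pause: insert 400 ms of silence after the greeting.
    pause = ET.SubElement(speak, "break", {"time": "400ms"})
    pause.tail = "This part is "

    # Emphasis: ask the synthesizer to stress one word.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = stressed
    emphasis.tail = " important. "

    # Intonation: slow the closing phrase and raise its pitch by two semitones.
    prosody = ET.SubElement(speak, "prosody", {"rate": "slow", "pitch": "+2st"})
    prosody.text = closing

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Hello.", "really", "Thanks for listening."))
```

The resulting string would be submitted to a TTS service in place of plain text. How it is submitted depends on the service and is not shown here.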
Giving Voice to the Individual
Today’s TTS software delivers voice quality that is a far cry from the robotic-sounding electrolarynx, or from the voice that Stephen Hawking famously kept as his signature even after technology with better voice quality became available [2]. Still, these applications and services are geared toward providing a real-time voice interface for websites, applications, videos, automated voice response systems, and the like. Reproducing a specific person’s voice, including their distinct tone and speech patterns, is not their main goal.
Although some services, such as Google’s, offer the option of creating a user-supplied voice under a special arrangement, they are not designed to meet the critical need to reproduce the voice of an individual who has lost theirs. For these individuals, that need is vital because our voice is so closely tied to our identity. Even a simple spoken greeting conveys much more than the individual words. People who lose their voice experience a loss of connection that goes beyond the loss of vocalization. For them, the ability to interact with others in their own voice is the real promise of speech synthesis technology.
The Emergence of Voice Cloning
Efforts continue to lower the barriers to providing synthetic voices matched to each person’s individuality. For instance, in 2021, actor Val Kilmer revealed that after he lost his voice following treatment for throat cancer, UK company Sonantic provided him with a synthetic voice that was recognizably his own. In another highly publicized application of voice cloning, the voice of the late celebrity chef Anthony Bourdain was cloned for a film about his life. The cloned voice spoke words that Bourdain had written but never spoken aloud during his lifetime.
Another voice technology pioneer, VocaliD, provides individuals with custom voices based on recordings that each person “banks” with the company in anticipation of losing their voice. The company also offers custom voices built from recordings banked by volunteers and matched to the person who has lost their voice. The individual can then run the custom voice application on their device of choice, whether iOS, Android, or Windows, and carry on conversations in their own personal voice.
Voice cloning technology is advancing rapidly. In the summer of 2022, Amazon demonstrated the ability to clone a voice from less than 60 seconds of audio. Although pitched as a way to bring back the voice of family members who have passed away, Amazon’s demonstration highlights AI’s potential to deliver speech output in a familiar voice.
Because of the link between voice and identity, high-fidelity speech generation holds both promise and threat. Like deepfake videos, deepfake voice cloning represents a serious security threat. A high-quality voice clone was cited as a key factor in the fraudulent transfer of $35 million in early 2020. In that case, a bank manager wired the funds in response to a transfer request made in a voice he recognized, which later proved to be a deepfake.
Conclusion
Recognizing the market potential, researchers in commercial and academic organizations are actively pursuing new methods for generating speech output that captures all the nuances of human speech to engage consumers more effectively. For all its market opportunities, however, modern speech synthesis technology promises a more personal benefit to the millions of people who were born without a voice or who have lost theirs to accident or disease.
Sources
1. “The Smart Audio Report.” National Public Media, June 2022. https://www.nationalpublicmedia.com/insights/reports/smart-audio-report/.
2. Handley, Rachel. “Stephen Hawking’s voice, made by a man who lost his own.” BeyondWords, July 15, 2021. https://beyondwords.io/blog/stephen-hawkings-voice/.