Microsoft has developed VALL-E, an artificial intelligence (AI) capable of imitating a voice from a sample of just three seconds. Some demonstrations are very convincing. The firm is aware of the danger of such a tool placed in the hands of malicious people.
After the “deep fake” in image or video, will we see the arrival of the audio “deep fake”? It is possible, since Microsoft unveiled a new artificial intelligence (AI) speech synthesis model called VALL-E. Its particularity? It can imitate, and therefore simulate, a person’s voice from a simple three-second audio sample. Once it has learned a specific voice, this AI can synthesize speech in that person’s voice while preserving its timbre and emotion.
Microsoft believes that VALL-E could be used for voice synthesis applications, but also, and this is obviously more worrying, for editing speech in a recording. It would be possible to edit and modify the audio from a text transcription of a speech. Imagine a speech by a politician modified by this artificial intelligence…
“Machine learning” in action
For the firm, VALL-E is what is called a “neural codec language model”, and it relies on an audio compression technology called EnCodec, revealed by Meta (Facebook) last October. Unlike other speech synthesis methods that typically synthesize speech by manipulating waveforms, VALL-E generates audio codec codes from textual and acoustic samples. It basically analyzes a person’s voice, breaks that information down into discrete tokens thanks to EnCodec, and uses machine learning to match the three-second sample with what it has learned.
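To give an intuition for what “breaking audio into tokens” means, here is a toy sketch of residual vector quantization, the general idea behind EnCodec’s discrete codes. This is purely illustrative: the codebooks are random, whereas EnCodec learns its codebooks from audio, and real frames come from a neural encoder rather than a raw vector.

```python
import numpy as np

# Toy residual vector quantization (RVQ): each stage picks the nearest
# codeword, and the next stage quantizes what is left over. Codebooks
# here are random for illustration; real ones are learned.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((16, 4)) for _ in range(3)]  # 3 stages, 16 codes, dim 4

def rvq_encode(frame, codebooks):
    """Quantize one frame into a list of token indices, one per stage."""
    residual = frame.copy()
    tokens = []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruct an approximate frame by summing the chosen codewords."""
    return sum(cb[t] for cb, t in zip(codebooks, tokens))

frame = rng.standard_normal(4)        # stand-in for one encoded audio frame
tokens = rvq_encode(frame, codebooks)  # small integers, e.g. [7, 2, 13]
recon = rvq_decode(tokens, codebooks)
```

A language model like VALL-E then operates on such integer tokens the way a text model operates on words, which is why it can “continue” a voice from a short prompt.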
For this, Microsoft relied on the LibriLight audio library. It contains 60,000 hours of English speech from over 7,000 speakers, mostly taken from LibriVox public domain audiobooks. For VALL-E to generate a meaningful result, the voice in the three-second sample must closely match a voice in the training data.
An example: “I must do something about it.” © VALL-E
Microsoft is aware of the danger
To convince you, Microsoft provides dozens of audio examples of the AI model in action. Some are frighteningly realistic, while others are clearly synthetic, and the human ear can tell that the voice is artificial. What is impressive is that, in addition to preserving the tone and emotion of the person speaking, VALL-E is able to reproduce the environment and conditions of the recording. Microsoft gives the example of a telephone call, with the acoustic and frequency properties specific to this type of conversation.
Asked about the dangers of such an artificial intelligence, Microsoft confirms that the source code is not available, and the firm acknowledges that “this can lead to potential risks of model misuse, such as voice spoofing or impersonating a specific speaker. To mitigate these risks, it is possible to build a detection model to discriminate if an audio clip has been synthesized by VALL-E. We will also put Microsoft AI principles into practice when further developing the models.”