U.S. Air Force Seeks Voice-Transformation Technology

Voice transformation is one part of the Terminator's arsenal that the U.S. Air Force would like to have available. Researchers are being solicited to help ordinary human airmen disguise their voices—even to sound like another person altogether.

The solicitation calls for voice transformation algorithms, along with techniques that can detect transformed voices.

As you may recall, in "Terminator 2," the bad-guy shape-shifting T-1000 impersonates John Connor's foster mother. When John becomes suspicious during a phone conversation with her (it), the good-guy Terminator (Arnold, of course) takes over the conversation, imitating John's spoiled West Coast brat voice perfectly.

Here are the requirements, from the official U.S.A.F. solicitation:

The goal of this phase is to research techniques to analyze a person [sic] voice for voice transformation. While voice transformation have [sic] been around for awhile, the ability [sic] to transform a person's voice to a target voice is not yet solved. Parameters such as the speaking rate, stress, and intonation will provide broad parameters for modeling a person's voice. A finer grain analysis of a person's voice may also be performed by de-convolving an audio signal into its glottal pulse and vocal tract information.
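
The "de-convolving" step in the solicitation can be illustrated with linear predictive coding (LPC), a standard source-filter technique; the solicitation does not name a method, so this is only one plausible reading. The sketch below builds a synthetic "voiced" frame (a pulse train passed through a single resonance), estimates an all-pole vocal tract filter from it, and inverse-filters the frame to approximate the glottal excitation. All function names, the sample rate, and the synthetic signal parameters are assumptions made for illustration.

```python
import math

def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def lpc(frame, order):
    """Levinson-Durbin recursion.

    Returns ([1, a1, ..., ap], prediction_error), where the all-pole
    "vocal tract" model is 1 / (1 + a1*z^-1 + ... + ap*z^-p).
    """
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def inverse_filter(frame, a):
    """Remove the estimated vocal tract filter, leaving the excitation estimate."""
    return [sum(a[j] * frame[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(frame))]

def synth_voiced_frame(sample_rate=8000, pitch_hz=100, resonance_hz=700,
                       pole_radius=0.95, length=800):
    """Hypothetical test signal: a pulse train driving one two-pole resonance."""
    a1 = -2.0 * pole_radius * math.cos(2.0 * math.pi * resonance_hz / sample_rate)
    a2 = pole_radius ** 2
    period = sample_rate // pitch_hz
    x = []
    for n in range(length):
        pulse = 1.0 if n % period == 0 else 0.0
        prev1 = x[n - 1] if n >= 1 else 0.0
        prev2 = x[n - 2] if n >= 2 else 0.0
        x.append(pulse - a1 * prev1 - a2 * prev2)
    return x, [1.0, a1, a2]

frame, true_filter = synth_voiced_frame()
est_filter, _ = lpc(frame, order=2)
residual = inverse_filter(frame, est_filter)   # rough glottal pulse estimate
print("true filter:     ", [round(c, 3) for c in true_filter])
print("estimated filter:", [round(c, 3) for c in est_filter])
```

Real speech would need a higher model order, windowing, and frame-by-frame analysis; the point here is only that the signal splits into a filter (vocal tract) and a residual (excitation).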

Transforming a speaker's voice so it is unrecognizable may be less difficult than you might think. Studies were conducted in 1980 in which subjects were tested on their ability to recognize a group of 53 voices, 29 of which were actually familiar to the listener. In the study, 31 percent of speakers could be identified from a single word and 66 percent from a single sentence, but still only 83 percent from a full 30 seconds of speech. So, some of the time (or for some speakers), voices are just hard to recognize consistently.

Transforming a speaker's voice into a target voice is much more difficult. Some of the difficulties relate to:

  • Formant spectra: This is the coarse structure of the different parts of speech. "Formant" refers to the regions of concentration of energy, prominent on a sound spectrogram, that collectively constitute the frequency spectrum of a speech sound. This is the most common target of voice transformation algorithms, which work by constructing a map between the formant spectra of the two voices.
  • Prosodic features: These are aspects of speech that vary from person to person, such as the fundamental pitch of the voice and timing (the patterns and rhythms of speech).
  • Mannerisms: This refers to word choices and preferred phrases and other high-level behaviors. For example, someone from New Jersey might imitate the voice of someone from Arkansas perfectly, but still fail to convince a listener owing to a failure to select the right phrases.
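
To make the "fundamental pitch" feature above concrete, here is a minimal sketch of one common way to estimate it: pick the lag at which a frame's autocorrelation peaks. The function names, the 8 kHz sample rate, the 50 to 400 Hz search range, and the pure-tone test signals are all assumptions for illustration; real speech needs windowing, voicing detection, and more robust peak picking.

```python
import math

def estimate_pitch_hz(frame, sample_rate, fmin=50.0, fmax=400.0):
    """Estimate fundamental frequency from the strongest autocorrelation lag."""
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag

def synth_tone(freq_hz, sample_rate=8000, length=800):
    """Hypothetical stand-in for a voiced frame: a pure sine tone."""
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
            for n in range(length)]

# Two made-up "speakers" with different fundamental pitch.
source_pitch = estimate_pitch_hz(synth_tone(100), 8000)
target_pitch = estimate_pitch_hz(synth_tone(200), 8000)

# A transformation system would need to scale pitch by roughly this ratio,
# in addition to mapping formants and timing.
pitch_ratio = target_pitch / source_pitch
print(source_pitch, target_pitch, pitch_ratio)
```

Matching average pitch is only the easiest part of prosody; timing and intonation contours, and the mannerisms in the last bullet, are not captured by a single number like this.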

Incredibly, the U.S.A.F. is even looking further ahead for different uses for voice transformation technology, including "medical applications if a person's voice box was damaged, in the gaming industry and animated films for creating and modify voices, for voice dubbing of foreign films, and for creating/reducing a person's accent."

Read more at the USAF voice transformation and detection solicitation and at DefenseTech; see also this interesting short article on voice transformation.

(This Science Fiction in the News story used with permission from Technovelgy.com —where science meets fiction.)