Decoupling segmental and prosodic cues of non-native speech through vector quantization

Anonymous submission to INTERSPEECH 2023

Abstract

Accent conversion (AC) seeks to transform utterances from a non-native speaker to appear native-like. Compared to voice conversion, which generally treats accent and voice quality as one, AC provides a finer-grained decomposition of speech. This paper presents an AC system that further decomposes an accent into its segmental and prosodic characteristics, and provides independent control of both channels. The system uses conventional modules (acoustic model, speaker/prosody encoders, seq2seq model) to generate accent conversions that combine (1) the segmental characteristics from a source utterance, (2) the voice characteristics from a target utterance, and (3) the prosody of a reference utterance. However, naive application of this idea prevents the system from learning and transferring prosody. We show that vector quantization and removal of repeated codewords allow the system to transfer prosody and improve the transfer of voice quality, as verified by objective and perceptual measures.
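The core mechanism in the abstract, vector quantization followed by removal of repeated codewords, can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, codebook size, and embedding dimension are illustrative assumptions. Each frame embedding is snapped to its nearest codebook entry, and runs of identical codewords are then collapsed, which discards frame-level durations and leaves a duration-free segmental sequence.

```python
import numpy as np

def quantize_and_dedup(frames, codebook):
    """Map each frame embedding (T, D) to its nearest codeword in a
    (K, D) codebook, then collapse consecutive repeats so durations
    are discarded. Illustrative sketch, not the paper's code."""
    # Pairwise distances between frames and codebook entries.
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    codes = dists.argmin(axis=1)  # (T,) codeword indices
    # Keep a code only if it differs from its predecessor.
    keep = np.concatenate(([True], codes[1:] != codes[:-1]))
    return codes[keep]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(128, 8))        # e.g. a 128-entry codebook (VQ-128)
frames = codebook[[3, 3, 7, 7, 7, 42]]      # frames held over repeated codes
print(quantize_and_dedup(frames, codebook))  # → [ 3  7 42]
```

Because the deduplicated sequence no longer encodes how long each segment was held, the seq2seq model is free to impose timing from the prosody reference instead.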

Block Diagram

Block diagram of the proposed system. The prosody encoder and seq2seq model are trained jointly as an autoencoder. For accent conversion, segmentals come from U1 and prosody from U3, thus providing independent control of both channels.
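The conversion-time data flow described above can be sketched as a toy composition of modules. All functions here are stand-ins with made-up names (the real system uses an acoustic model, speaker/prosody encoders, and a jointly trained seq2seq model); the sketch only shows which utterance feeds which channel, with U2 supplying voice characteristics as described in the abstract.

```python
# Stand-in modules: each "utterance" is a dict holding a precomputed feature.
def acoustic_codes(utt):      # U1 -> duration-free segmental codewords
    return utt["codes"]

def speaker_embedding(utt):   # U2 -> voice characteristics
    return utt["speaker"]

def prosody_embedding(utt):   # U3 -> prosody reference
    return utt["prosody"]

def accent_convert(u1, u2, u3):
    """Combine segmentals (U1), voice (U2), and prosody (U3) for synthesis."""
    return {
        "codes": acoustic_codes(u1),
        "speaker": speaker_embedding(u2),
        "prosody": prosody_embedding(u3),
    }

converted = accent_convert(
    {"codes": [3, 7, 42]},      # U1: source utterance (segmentals)
    {"speaker": "target-voice"},  # U2: target utterance (voice)
    {"prosody": "ref-prosody"},   # U3: reference utterance (prosody)
)
```

Because the three inputs are independent, swapping only U3 changes the prosody of the output while leaving segmentals and voice untouched.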


Audio Samples

L2 speaker | Text input | L2 speech | VQ-inf (Baseline) | VQ-128 (Proposed)
NJS | We will have to watch our chances. | [audio] | [audio] | [audio]
NJS | It was a curious coincidence. | [audio] | [audio] | [audio]
NJS | It was the same way with our revolvers and rifles. | [audio] | [audio] | [audio]
ZHAA | I'll be out of my head in fifteen minutes. | [audio] | [audio] | [audio]
ZHAA | These rumors may even originate with us. | [audio] | [audio] | [audio]
ZHAA | He was worth nothing to the world. | [audio] | [audio] | [audio]
