Abstract
Accent conversion (AC) seeks to transform utterances from a non-native speaker to appear native-like. Compared to voice conversion, which generally treats accent and voice quality as one, AC provides a finer-grained decomposition of speech. This paper presents an AC system that further decomposes an accent into its segmental and prosodic characteristics, and provides independent control of both channels. The system uses conventional modules (acoustic model, speaker/prosody encoders, seq2seq model) to generate accent conversions that combine (1) the segmental characteristics from a source utterance, (2) the voice characteristics from a target utterance, and (3) the prosody of a reference utterance. However, naive application of this idea prevents the system from learning and transferring prosody. We show that vector quantization and removal of repeated codewords allows the system to transfer prosody and improve transfer of voice quality, as verified by objective and perceptual measures.
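The key mechanism named above — vector quantization followed by removal of repeated codewords — can be illustrated with a minimal sketch. This is not the paper's implementation; the codebook, shapes, and function name are illustrative assumptions. Each acoustic frame is mapped to its nearest codeword, and runs of identical consecutive codes are then collapsed, discarding frame-level durations so the segmental stream no longer carries prosodic timing.

```python
import numpy as np

def quantize_and_dedup(frames, codebook):
    """Map each frame (T, D) to its nearest codeword in codebook (K, D),
    then collapse consecutive repeats so durations are discarded."""
    # Pairwise Euclidean distances between every frame and every codeword
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    codes = dists.argmin(axis=1)  # one codeword index per frame, shape (T,)
    # Keep a code only if it differs from the previous one (always keep the first)
    keep = np.concatenate(([True], codes[1:] != codes[:-1]))
    return codes[keep]

# Toy example: 2-D frames, 3-entry codebook (values are arbitrary)
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
frames = np.array([[0.1, -0.1], [0.0, 0.1], [1.1, 0.9], [0.9, 1.0], [2.1, 0.1]])
print(quantize_and_dedup(frames, codebook))  # [0 1 2]
```

Note how five frames reduce to three codes: the repeated frames near each codeword collapse to a single symbol, which is what lets the downstream seq2seq model impose a new prosody.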
Block Diagram
Notes
- The L1 speaker (BDL) speaks with a General American accent.
- NJS's native language is Spanish.
- ZHAA's native language is Arabic.
- Dataset (L2-ARCTIC corpus [1]): https://psi.engr.tamu.edu/l2-arctic-corpus/
Audio Samples
- Input speech: original unmodified speech recordings
- U1: L1 speaker (BDL) (i.e., native segmentals)
- U2: L2 speaker (NJS / ZHAA) (i.e., non-native speaker's identity)
- U3: L2 speaker (NJS / ZHAA) (i.e., non-native speaker's prosody)
- Baseline: the VQ-inf accent conversion model
- Proposed: the VQ-128 accent conversion model
| L2 speaker | Text | Input L2 speech | VQ-inf (Baseline) | VQ-128 (Proposed) |
|---|---|---|---|---|
| NJS | We will have to watch our chances. | | | |
| NJS | It was a curious coincidence. | | | |
| NJS | It was the same way with our revolvers and rifles. | | | |
| ZHAA | I'll be out of my head in fifteen minutes. | | | |
| ZHAA | These rumors may even originate with us. | | | |
| ZHAA | He was worth nothing to the world. | | | |
References
[1] G. Zhao et al., "L2-ARCTIC: A non-native English speech corpus," in Proc. Interspeech, 2018, pp. 2783-2787.