PHONOS: PHOnetic Neutralization for Online Streaming Applications

Anonymous submission to Interspeech 2026

Streaming accent neutralization with ≤ 241 ms GPU latency

Abstract

Speaker anonymization (SA) systems typically modify voice identity (e.g., timbre) while leaving regional or non-native accents intact. This is problematic because accents can substantially narrow the anonymity set. To address this issue, we present PHONOS, a streaming module for real-time SA that also replaces (masks) the accent of the source speech to match that of a target speaker. Our approach pre-generates golden speaker utterances that preserve the source timbre and rhythm but replace non-native content tokens with native ones, using silence-aware DTW alignment and zero-shot voice conversion. These utterances supervise a causal accent translator that maps non-native content tokens to their native equivalents with at most 40 ms of look-ahead and is trained with joint cross-entropy and CTC losses. Built on TVTSyn, our method achieves under 241 ms end-to-end GPU latency. Our evaluations show an 81% reduction in non-native accent confidence, listening-test ratings consistent with this shift, and reduced speaker linkability, as accent-neutralized utterances move away from the original speaker in embedding space.
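
The alignment step behind the golden speaker utterances can be illustrated with a small sketch. It is not the PHONOS implementation: the exact meaning of "silence-aware" is not specified here, so the sketch assumes a plain DTW over discrete content tokens in which pairing a silence frame with a speech frame costs more than an ordinary mismatch. The token id `SIL`, the cost values, and the function names are all hypothetical.

```python
from typing import List, Sequence, Tuple

SIL = 0  # hypothetical token id reserved for silence (an assumption)


def local_cost(a: int, b: int, sil_penalty: float = 2.0) -> float:
    """Toy frame cost: 0 for equal tokens, 1 for a speech mismatch,
    and a larger (assumed) penalty when silence is paired with speech."""
    if a == b:
        return 0.0
    if a == SIL or b == SIL:
        return sil_penalty
    return 1.0


def dtw_align(src: Sequence[int], tgt: Sequence[int]) -> Tuple[float, List[Tuple[int, int]]]:
    """Classic DTW; returns the total cost and the (src_idx, tgt_idx) path."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = local_cost(src[i - 1], tgt[j - 1]) + best_prev
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(acc[i - 1][j - 1], i - 1, j - 1),
                 (acc[i - 1][j], i - 1, j),
                 (acc[i][j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return acc[n][m], path


def warp_to_source_timing(src: Sequence[int], tgt: Sequence[int]) -> List[int]:
    """Native tokens re-timed to the non-native utterance: one token per source frame."""
    _, path = dtw_align(src, tgt)
    warped = [SIL] * len(src)
    for i, j in path:
        warped[i] = tgt[j]  # a later pair for the same source frame overwrites
    return warped


non_native = [5, 5, 6, 0, 0, 7]   # toy token sequence with a two-frame pause
native = [5, 8, 0, 7]             # same sentence, different token and timing
cost, _ = dtw_align(non_native, native)
golden_tokens = warp_to_source_timing(non_native, native)
```

In this toy case the warped sequence keeps the six-frame rhythm of the non-native input, including its pause, while carrying the native token where the two sequences disagree.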

Figures

Figure 1: PHONOS inference pipeline. Non-native speech is encoded into content tokens, accent-translated to native tokens, and decoded into a waveform conditioned on the original or pseudo-speaker embedding.
Figure 2: TVTSyn training workflow. (a) The content encoder is trained against HuBERT k-means pseudo-labels. (b) The waveform decoder, conditioned on a speaker embedding, is trained with self-supervised and discriminator objectives.
Figure 3: Golden speaker generation. Native and non-native content embeddings are duration-aligned via silence-aware DTW, then synthesized with the non-native speaker’s identity.
Figure 4: PHONOS’s accent translator architecture. Non-native content tokens pass through ConvNeXt and limited-context transformer layers to produce native content tokens.
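
As an illustration of what "limited-context" can mean for the transformer layers, the sketch below builds a boolean attention mask in which each frame sees a bounded number of past frames and a small number of future frames. It is only a sketch under stated assumptions: the 20 ms frame duration, the left-context size, and the function names are hypothetical, and only the 40 ms look-ahead budget comes from the abstract.

```python
from typing import List


def lookahead_frames(lookahead_ms: float, frame_ms: float) -> int:
    """Whole token frames that fit in the look-ahead budget."""
    return int(lookahead_ms // frame_ms)


def limited_context_mask(num_frames: int, left_context: int, lookahead: int) -> List[List[bool]]:
    """mask[q][k] is True when query frame q may attend to key frame k."""
    return [
        [(q - left_context) <= k <= (q + lookahead) for k in range(num_frames)]
        for q in range(num_frames)
    ]


# Assumed 20 ms per content token; the actual TVTSyn frame rate is not stated here.
future = lookahead_frames(lookahead_ms=40, frame_ms=20)
mask = limited_context_mask(num_frames=5, left_context=1, lookahead=future)
```

When several such layers are stacked, the per-layer look-ahead accumulates, so a total budget such as 40 ms would have to be split across layers. How PHONOS allocates that budget is not described in this section.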

Audio Samples: IndicTTS

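Speaker | Utterance | Original | Parallel Native | Golden Speaker | FAC (PHONOS)
Hindi Female | utt_001 | [audio] | [audio] | [audio] | [audio]
Hindi Female | utt_002 | [audio] | [audio] | [audio] | [audio]
Hindi Female | utt_003 | [audio] | [audio] | [audio] | [audio]
Hindi Female | utt_004 | [audio] | [audio] | [audio] | [audio]
Assamese Male | utt_001 | [audio] | [audio] | [audio] | [audio]
Assamese Male | utt_002 | [audio] | [audio] | [audio] | [audio]
Assamese Male | utt_003 | [audio] | [audio] | [audio] | [audio]
Assamese Male | utt_004 | [audio] | [audio] | [audio] | [audio]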

Audio Samples: L2-ARCTIC

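Speaker | Utterance | Original | Parallel Native | Golden Speaker | FAC (PHONOS)
TNI | arctic_b0442 | [audio] | [audio] | [audio] | [audio]
TNI | arctic_b0443 | [audio] | [audio] | [audio] | [audio]
TNI | arctic_b0470 | [audio] | [audio] | [audio] | [audio]
TNI | arctic_b0532 | [audio] | [audio] | [audio] | [audio]
ASI | arctic_b0514 | [audio] | [audio] | [audio] | [audio]
ASI | arctic_b0518 | [audio] | [audio] | [audio] | [audio]
ASI | arctic_b0523 | [audio] | [audio] | [audio] | [audio]
ASI | arctic_b0533 | [audio] | [audio] | [audio] | [audio]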
