Demo for "SpeechTripleNet: End-to-End Disentangled Speech Representation Learning for Content, Timbre and Prosody"

Disentangled speech representation learning aims to separate the different factors of variation in speech into disjoint representations. This paper focuses on disentangling speech into representations for three factors: spoken content, speaker timbre, and speech prosody. Many previous methods for speech disentanglement have focused on separating spoken content from speaker timbre. However, without explicit modeling of prosodic information, speech generation performance degrades and prosody leaks uncontrollably into the content and/or speaker representations. While some recent methods have used explicit speaker labels or pre-trained models to facilitate triple-factor disentanglement, there are no end-to-end methods that disentangle all three factors simultaneously using only unsupervised or self-supervised learning objectives. This paper introduces SpeechTripleNet, an end-to-end method that disentangles speech into representations for content, timbre, and prosody. Built on a variational autoencoder (VAE), SpeechTripleNet restricts both the structure of the latent variables and the amount of information they capture. It is a purely unsupervised/self-supervised learning method that requires only speech data and no additional labels. Our qualitative and quantitative results demonstrate that SpeechTripleNet achieves triple-factor speech disentanglement and enables controllable speech editing with respect to each factor.

Section 1

This section presents speech samples edited with SpeechTripleNet with respect to timbre and emphasis. For each source utterance, we list five possible modifications in five rows. The first row shows the timbre reference speech together with the ground-truth pitch and energy latent variables, which yield voice-converted speech. The second row modifies the pitch latent variable for the bolded word in the text; the span of that word is also marked by two vertical red lines in the pitch latent image. The third row modifies the energy latent variable in the same manner. The fourth row modifies both the pitch and energy latent variables. Finally, the fifth row modifies all three variables simultaneously: timbre, pitch, and energy. Please note that the bolded word is emphasized in some utterances and de-emphasized in others.
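The word-level edits described above can be pictured as changing a latent sequence only inside the frame span of the chosen word. The sketch below is an illustration under stated assumptions, not the procedure used by SpeechTripleNet: it assumes a hypothetical one-value-per-frame latent, a known frame span for the word, and a simple additive offset. The function name `edit_latent_span` and all numeric values are made up for this example.

```python
from typing import List, Sequence


def edit_latent_span(
    latent: Sequence[float], start: int, end: int, offset: float
) -> List[float]:
    """Return a copy of `latent` with `offset` added to frames in [start, end).

    Assumption: the latent is a 1-D sequence with one value per frame.
    The page does not specify the real latent shape or edit rule.
    """
    if not 0 <= start <= end <= len(latent):
        raise ValueError("span must lie within the latent sequence")
    edited = list(latent)  # copy, so the source latent is left untouched
    for i in range(start, end):
        edited[i] += offset
    return edited


# Hypothetical pitch latent for a six-frame utterance.
pitch_latent = [0.0, 0.1, 0.2, 0.1, 0.0, -0.1]

# Frames 2 and 3 stand in for the bolded word (the region between the
# two vertical red lines in the pitch latent image).
emphasized = edit_latent_span(pitch_latent, 2, 4, 0.5)      # raise the span
de_emphasized = edit_latent_span(pitch_latent, 2, 4, -0.5)  # lower the span
```

Applying the same function to an energy latent with the same span would correspond to the third row, and applying it to both latents to the fourth row, again under the assumptions stated above.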

Index Text Source speech Timbre reference Pitch latent Energy latent Edited speech
p245_185 He's not perfect.
same as source
same as source
same as source
p302_073 That alibi is now gone.
same as source
same as source
same as source
p234_309 That was a bonus, but it was not the main objective.
same as source
same as source
same as source
p335_064 This must be wrong.
same as source
same as source
same as source
p347_062 The message is just not getting through.
same as source
same as source
same as source
p238_193 The squad is too small.
same as source
same as source
same as source
p302_316 Is that Titanic?
same as source
same as source
same as source
p234_222 We talk about Mr Michael Johnson, and he is awesome.
same as source
same as source
same as source
p234_083 They were badly prepared for it.
same as source
same as source
same as source

Section 2

This section presents samples that demonstrate SpeechTripleNet's ability to turn statements into questions. For each sample, we manipulate the latent space to increase the pitch and energy of some words, usually the final word, so that the original statement sounds like a question. The first row for each sample converts the original speech to the speaker identity of the timbre reference; the second row shows the sample modified into a question; the third row shows the sample after jointly modifying the speaker identity and the prosody.

Index Text Source speech Timbre reference Prosody modified Edited speech
p302_109 Or so it would appear.
same as source
p302_077 It is simple, really.
same as source
p294_385 However, many questions remained unanswered.
same as source
p238_370 Not that Barber has been unhappy.
same as source
p294_222 She has so much to offer.
same as source
p248_250 It was a hit.
same as source
p335_148 Drink and petrol prices remain untouched.
same as source
p245_063 He was popular.
same as source
p294_143 It's hard not to.
same as source
p302_267 It's a matter of huge concern.
same as source