Demo for "SpeechTripleNet: End-to-End Disentangled Speech Representation Learning for Content, Timbre and Prosody"

Disentangled speech representation learning aims to separate different factors of variation from speech into disjoint representations. This paper focuses on disentangling speech into representations for three factors: spoken content, speaker timbre, and speech prosody. Many previous methods for speech disentanglement have focused on separating spoken content and speaker timbre. However, the lack of explicit modeling of prosodic information leads to degraded speech generation performance and uncontrollable prosody leakage into content and/or speaker representations. While some recent methods have utilized explicit speaker labels or pre-trained models to facilitate triple-factor disentanglement, there are no end-to-end methods to simultaneously disentangle three factors using only unsupervised or self-supervised learning objectives. This paper introduces SpeechTripleNet, an end-to-end method to disentangle speech into representations for content, timbre, and prosody. Built on a variational autoencoder (VAE), SpeechTripleNet restricts the structures of the latent variables and the amount of information captured in them. It is a purely unsupervised/self-supervised learning method that requires only speech data and no additional labels. Our qualitative and quantitative results demonstrate that SpeechTripleNet is effective in achieving triple-factor speech disentanglement, as well as controllable speech editing with respect to different factors.

Section 1: Speech editing: timbre & emphasis
Section 2: Speech editing: statement to question

Section 1

This section presents speech samples edited with SpeechTripleNet in terms of timbre and emphasis. For each source utterance, we list five modifications in five rows. The first row uses the timbre reference speech together with the ground-truth pitch and energy latent variables, which yields voice-converted speech. The second row modifies the pitch latent variable for the bolded word in the text; the modified region is also marked between two vertical red lines in the pitch latent image. The third row modifies the energy latent variable in the same manner. The fourth row modifies both the pitch and energy latent variables. Finally, the fifth row modifies all three variables simultaneously: timbre, pitch, and energy. Note that the bolded word is emphasized in some utterances and de-emphasized in others.
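
The word-level edits described above can be pictured with a minimal sketch. It assumes, purely for illustration, that the pitch and energy latent variables are frame-level sequences of scalars and that the frame range covering the target word is already known; the function name `edit_span`, the latent values, and the offsets are hypothetical and are not taken from SpeechTripleNet itself.

```python
from typing import List, Sequence


def edit_span(latent: Sequence[float], start: int, end: int, offset: float) -> List[float]:
    """Return a copy of a frame-level latent with `offset` added on frames [start, end)."""
    if not 0 <= start <= end <= len(latent):
        raise ValueError("span must lie within the latent sequence")
    edited = list(latent)  # copy, so the source latent is left untouched
    for i in range(start, end):
        edited[i] += offset
    return edited


# Hypothetical 8-frame latents; frames 3..5 are assumed to cover the target word.
pitch = [0.0, 0.1, 0.2, 0.1, 0.0, -0.1, -0.2, -0.1]
energy = [0.5] * 8
word_start, word_end = 3, 6

# Analogue of row 2: change only the pitch latent on the word.
pitch_edited = edit_span(pitch, word_start, word_end, 1.0)
# Analogue of row 3: change only the energy latent on the word.
energy_edited = edit_span(energy, word_start, word_end, 0.5)
# Analogue of row 4: decode with both `pitch_edited` and `energy_edited`.
```

In this sketch a positive offset stands in for emphasis and a negative one for de-emphasis. Decoding the edited latents back to audio, and swapping the timbre representation as in the first and fifth rows, are left out because they depend on the trained model.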

Index	Text	Timbre reference
p245_185	He's not perfect.
		same as source
		same as source
		same as source

p302_073	That alibi is now gone.
		same as source
		same as source
		same as source

p234_309	That was a bonus, but it was not the main objective.
		same as source
		same as source
		same as source

p335_064	This must be wrong.
		same as source
		same as source
		same as source

p347_062	The message is just not getting through.
		same as source
		same as source
		same as source

p238_193	The squad is too small.
		same as source
		same as source
		same as source

p302_316	Is that Titanic?
		same as source
		same as source
		same as source

p234_222	We talk about Mr Michael Johnson, and he is awesome.
		same as source
		same as source
		same as source

p234_083	They were badly prepared for it.
		same as source
		same as source
		same as source

Section 2

This section presents samples that demonstrate the capability of SpeechTripleNet to turn statements into questions. For each sample, we manipulate the latent space to increase the pitch and energy for some words, mostly the ending word, so that the original statement sounds like a question. The first row for each sample converts the original speech to the speaker identity of the timbre reference; the second row presents the sample modified into a question; the third row shows the sample after jointly modifying the speaker identity and prosody.
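
One way to picture the statement-to-question edit is a rising contour applied to the final frames of the pitch and energy latents. The sketch below is an illustration under stated assumptions, not the procedure used to produce the samples: it assumes frame-level scalar latents, a known start frame for the ending word, and a linear ramp as the shape of the increase. The name `add_rising_ramp` and all numeric values are hypothetical.

```python
from typing import List, Sequence


def add_rising_ramp(latent: Sequence[float], start: int, peak: float) -> List[float]:
    """Add a linear ramp that grows from 0 at frame `start` to `peak` at the last frame."""
    n = len(latent)
    if not 0 <= start < n:
        raise ValueError("start must index a frame of the latent")
    edited = list(latent)  # copy, so the source latent is left untouched
    span = n - 1 - start   # number of steps between the first and last ramp frame
    for i in range(start, n):
        weight = 1.0 if span == 0 else (i - start) / span
        edited[i] += peak * weight
    return edited


# Hypothetical 6-frame latents; the ending word is assumed to start at frame 3.
pitch = [0.0] * 6
energy = [1.0] * 6
last_word_start = 3

question_pitch = add_rising_ramp(pitch, last_word_start, 1.5)
question_energy = add_rising_ramp(energy, last_word_start, 0.5)
```

A ramp is chosen here only because a rising final contour is a common cue for questions; a constant increase over the ending word would equally match the description above.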

Index	Text	Timbre reference	Prosody modified
p302_109	Or so it would appear.		✗
		same as source	✓
			✓
p302_077	It is simple, really.		✗
		same as source	✓
			✓
p294_385	However, many questions remained unanswered.		✗
		same as source	✓
			✓
p238_370	Not that Barber has been unhappy.		✗
		same as source	✓
			✓
p294_222	She has so much to offer.		✗
		same as source	✓
			✓
p248_250	It was a hit.		✗
		same as source	✓
			✓
p335_148	Drink and petrol prices remain untouched.		✗
		same as source	✓
			✓
p245_063	He was popular.		✗
		same as source	✓
			✓
p294_143	It's hard not to.		✗
		same as source	✓
			✓
p302_267	It's a matter of huge concern.		✗
		same as source	✓
			✓