Disentangled speech representation learning aims to separate the different factors of variation in speech into disjoint representations. This paper focuses on disentangling speech into representations for three factors: spoken content, speaker timbre, and speech prosody. Many previous speech disentanglement methods separate only spoken content and speaker timbre. Without explicit modeling of prosodic information, however, speech generation quality degrades and prosody leaks uncontrollably into the content and/or speaker representations. Some recent methods use explicit speaker labels or pre-trained models to facilitate triple-factor disentanglement, but no end-to-end method disentangles all three factors simultaneously using only unsupervised or self-supervised learning objectives. This paper introduces SpeechTripleNet, an end-to-end method that disentangles speech into representations for content, timbre, and prosody. Built on a variational autoencoder (VAE), SpeechTripleNet restricts both the structure of the latent variables and the amount of information they capture. It is a purely unsupervised/self-supervised method that requires only speech data and no additional labels. Our qualitative and quantitative results demonstrate that SpeechTripleNet achieves triple-factor speech disentanglement and supports controllable speech editing with respect to each factor.
This section presents speech samples edited with SpeechTripleNet in terms of timbre and emphasis. For each source utterance, we list five modifications in five rows. The first row shows the timbre reference speech and the ground-truth pitch and energy latent variables, which together yield voice-converted speech. The second row modifies the pitch latent variable for the bolded word in the text; the modified region is marked between two vertical red lines in the pitch latent image. The third row modifies the energy latent variable in the same manner. The fourth row modifies both the pitch and energy latent variables. Finally, the fifth row modifies all three factors simultaneously: timbre, pitch, and energy. Note that the bolded word is emphasized in some utterances and de-emphasized in others.

| Index | Text | Source speech | Timbre reference | Pitch latent | Energy latent | Edited speech |
|---|---|---|---|---|---|---|
| p245_185 | He's not perfect. | | | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | | ![]() | ![]() | |
| p302_073 | That alibi is now gone. | | | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | | ![]() | ![]() | |
| p234_309 | That was a bonus, but it was not the main objective. | | | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | | ![]() | ![]() | |
| p335_064 | This must be wrong. | | | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | | ![]() | ![]() | |
| p347_062 | The message is just not getting through. | | | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | | ![]() | ![]() | |
| p238_193 | The squad is too small. | | | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | | ![]() | ![]() | |
| p302_316 | Is that Titanic? | | | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | | ![]() | ![]() | |
| p234_222 | We talk about Mr Michael Johnson, and he is awesome. | | | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | | ![]() | ![]() | |
| p234_083 | They were badly prepared for it. | | | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | same as source | ![]() | ![]() | |
| | | | | ![]() | ![]() | |

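
The emphasis edits described above can be pictured as a change to the frames of one latent variable that fall inside the target word. The following is only a minimal sketch under stated assumptions, not the SpeechTripleNet implementation: it assumes a latent is stored as a list of frames (each a list of floats), that the word's frame boundaries are already known, and that the edit is a constant additive offset. The name `shift_span`, the latent layout, and the additive rule are all hypothetical; the source does not specify how the latent values are changed, and decoding the edited latent back to speech is not shown.

```python
from typing import List

# Hypothetical layout: one list of floats per time frame.
Latent = List[List[float]]


def shift_span(latent: Latent, start: int, end: int, offset: float) -> Latent:
    """Return a copy of `latent` with `offset` added to frames start..end-1.

    A positive offset stands in for emphasis and a negative one for
    de-emphasis. The additive rule is an assumption made for illustration.
    """
    if not 0 <= start <= end <= len(latent):
        raise ValueError("span must lie within the latent sequence")
    edited = [frame[:] for frame in latent]  # leave the input untouched
    for t in range(start, end):
        edited[t] = [value + offset for value in edited[t]]
    return edited


# Toy pitch latent with 6 frames of dimension 2; the target word is
# assumed to cover frames 2 and 3.
pitch = [[0.0, 0.0] for _ in range(6)]
emphasized = shift_span(pitch, 2, 4, 1.5)
deemphasized = shift_span(pitch, 2, 4, -1.5)
```

Applying the same function to an energy latent corresponds to the third row of each sample, and applying it to both latents corresponds to the fourth row.
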
This section presents samples that demonstrate SpeechTripleNet's ability to turn statements into questions. For each sample, we manipulate the latent space to raise the pitch and energy of some words, mostly the final word, so that the original statement sounds like a question. The first row of each sample converts the original speech to the speaker identity of the timbre reference; the second row modifies the sample into a question; the third row jointly modifies the speaker identity and the prosody.

| Index | Text | Source speech | Timbre reference | Prosody modified | Edited speech |
|---|---|---|---|---|---|
| p302_109 | Or so it would appear. | | | ✗ | |
| | | | same as source | ✓ | |
| | | | | ✓ | |
| p302_077 | It is simple, really. | | | ✗ | |
| | | | same as source | ✓ | |
| | | | | ✓ | |
| p294_385 | However, many questions remained unanswered. | | | ✗ | |
| | | | same as source | ✓ | |
| | | | | ✓ | |
| p238_370 | Not that Barber has been unhappy. | | | ✗ | |
| | | | same as source | ✓ | |
| | | | | ✓ | |
| p294_222 | She has so much to offer. | | | ✗ | |
| | | | same as source | ✓ | |
| | | | | ✓ | |
| p248_250 | It was a hit. | | | ✗ | |
| | | | same as source | ✓ | |
| | | | | ✓ | |
| p335_148 | Drink and petrol prices remain untouched. | | | ✗ | |
| | | | same as source | ✓ | |
| | | | | ✓ | |
| p245_063 | He was popular. | | | ✗ | |
| | | | same as source | ✓ | |
| | | | | ✓ | |
| p294_143 | It's hard not to. | | | ✗ | |
| | | | same as source | ✓ | |
| | | | | ✓ | |
| p302_267 | It's a matter of huge concern. | | | ✗ | |
| | | | same as source | ✓ | |
| | | | | ✓ | |
