
Speech Synthesis (LASC11062)

news 2025/10/24 17:27:06

Outline

Module 1 – introduction
Module 2 - unit selection
Module 3 - unit selection target cost functions
Module 4 - the database
Module 5 - evaluation
Module 6 - speech signal analysis & modelling
Module 7 - Statistical Parametric Speech Synthesis (SPSS)
Module 7 bonus - hybrid speech synthesis
Module 8 - Deep Neural Networks (DNNs)
Module 9 - sequence-to-sequence models

Module 1 – introduction

We can identify four main challenges for any builder of a TTS system.

1. Semiotic classification of text (text normalisation): Festival
2. Decoding natural-language text: homographs (POS); shallow (“syntactic”) structure
3. Creating natural, human-sounding speech: low-level signal quality (vocoders); segmental quality (pronunciation, stress, connected speech processes); augmentative prosody (text-related); affective prosody (not necessarily text-related), i.e., generating ‘affective’ or ‘emotional’ speech
4. Creating intelligible speech: closer to a solved problem than naturalness (interestingly, the most natural-sounding systems are not always the most intelligible); human levels of intelligibility can be achieved (straightforward with good statistical parametric systems); unit selection systems are generally less intelligible than natural speech (but this is in lab conditions with semantically unpredictable sentences)

We can also identify two main current and future challenges:

1. Generating affective and augmentative prosody
2. Speaking in a way that takes the listener’s situation and needs into account

Current issues：

What is currently possible in commercial systems

• neutral reading-out-loud speaking style
• some languages only (fewer than 50)
• a few expressive ‘tricks’ but not true expressive speech

What is currently possible in research systems
• more varied speaking style, some expressivity (e.g., acted ‘emotions’)
• adapted and personalised voices
• use of ‘found data’
• more languages

Module 2 – unit selection

Interactive unit selection

1. Key concepts

The context in which a sound is produced affects that sound:

articulatory trajectories (e.g., the tongue moving up and down interacts with the opening of the lips, the opening/closing of the velum, and the vocal folds coming together)
phonological effects, such as assimilation, where entire sounds change category depending on their environment
the prosodic environment changes the way sounds are produced. For example, their fundamental frequency and their duration are strongly influenced by their position in the phrase, and the position of stress differs from phrase to phrase.

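The speech signal (waveform) that we observe is the product of interacting processes operating at different time scales. Context is complex: it is more than just the preceding/following sound. We want to simultaneously
-- model speech as a string of units (so that we can concatenate waveform fragments)
-- take into account the effects of context before/during/after the current moment in time

Context-dependent units offer a solution: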

• engineer the system in terms of a simple linear string of units
• account for context with a different version of each unit for every different context

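But how do we know what all the different contexts are?
- If we enumerate all possible contexts, they are effectively infinite
- A language has an unlimited number of different sentences, and context may span the whole sentence (or further)
Fortunately, what matters most is the effect that context has on the current speech sound. So we can go on to think about reducing the number of effectively different contexts.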

First solution: diphones

Assume that the only context affecting the current sound is the preceding phone and the following phone, which gives

diphone units
number of unit types O(N^2)

Problems with diphone synthesis

Signal processing is required to manipulate:

F0 & duration: fairly easy, within a limited range
Spectrum: not so easy, can do simple smoothing at the joins but otherwise it’s not obvious what aspects to modify

But this extensive signal processing:

introduces artefacts and degrades the signal
cannot faithfully replicate every detail of natural variation in speech (don’t know what to replicate; don’t have powerful enough techniques to manipulate every aspect of speech)

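We cannot reduce the need for manipulation by increasing the number of unit types, because the number of unit types grows exponentially with the number of context factors (adding stressed and unstressed versions doubles it, from 2000 to 4000; adding phrase-final / non-final versions doubles it again, from 4000 to 8000).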
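
The growth in inventory size can be checked with a little arithmetic. The sketch below assumes a phone set of 45 phones (an illustrative figure that is not in these notes, chosen only because 45^2 is close to the 2000 quoted above) and assumes every context factor is binary. The function name is hypothetical.

```python
def num_unit_types(n_phones: int, n_binary_factors: int) -> int:
    """Diphone types (N^2), doubled once for every binary context factor."""
    return n_phones ** 2 * 2 ** n_binary_factors


if __name__ == "__main__":
    n = 45  # assumed phone-set size, for illustration only
    labels = ["plain diphones", "+ stressed/unstressed", "+ phrase-final/non-final"]
    for k, label in enumerate(labels):
        print(f"{label}: {num_unit_types(n, k)} unit types")
```

Each extra binary factor doubles the inventory, which is why recording one copy of every unit-in-context quickly becomes impractical.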

Why unit selection synthesis / statistical parametric synthesis is feasible:
Some contexts are (nearly) equivalent (don't need every speech sound in every possible context; just each speech sound in a sufficient variety of different contexts)

In diphone synthesis, there is just one recorded copy of each diphone (one copy of each unit type in a carrier phrase to ensure that the diphones were in a “neutral” context). The F0 and duration of that recorded copy will be arbitrary. If we simply concatenated these recordings, we would get an arbitrary and very probably discontinuous F0 contour. We must manipulate the recording in order to impose the predicted F0 (e.g., to get gradual declination over a phrase), and to impose predicted duration.

In unit selection synthesis, we have many recordings of each diphone to choose from. In some versions of unit selection, we will use the front end’s predictions of F0 and duration to help us choose the most appropriate one.

But now we want the effects of context (for lots of different contexts)

The key concept of unit selection speech synthesis:

record a database of speech containing natural variation caused by context
at synthesis time, search for the most appropriate sequence of units

Several unit sizes (half phone, diphone, …) are possible - the principles are the same

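Summary: in diphone synthesis each diphone is recorded only once and has to be manipulated at synthesis time; in unit selection there are many copies of each unit in different contexts (the unit can be of various sizes, such as diphone, phone or half phone), and we choose the most suitable units to concatenate.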

2. Target and candidate units

From multi-level / tiered / structured linguistic information
to a linear string of **context-dependent units**

The target unit sequence: the sequence in the figure above, in which each unit carries its context information, i.e., each unit is annotated with linguistic features.
The next step is to find candidate waveform fragments (from the database) to render the utterance.
[Important: the linguistic features are local to each target and each candidate unit]

candidate units from the database:

Retrieve candidate units from the pre-recorded database

The next step is to find the best-sounding sequence of candidates.

1. Quantify “best sounding”
2. Search for the best sequence

There is no need to look at the neighbouring units: we only compare the linguistic features, which are local to each unit.

3. Target cost and join cost

3.1. Target cost function

the target cost function: measuring similarity (similarity between the candidate sequence and the target sequence)

The ideal candidate unit sequence might comprise units taken from identical linguistic contexts to those in the target unit sequence but this will not be possible in general. So we must use less-than-ideal units from non-identical (i.e., mismatched) contexts. We need to quantify how mismatched each candidate is, so we can choose amongst them.

The mismatch ‘distance’ or ‘cost’ between a candidate unit and the ideal (i.e., target) unit is measured by the target cost function

Taylor describes two possible formulations of the target cost function

independent feature formulation (IFF)
- assume that units from similar linguistic contexts will sound similar
- target cost function measures linguistic feature mismatch
acoustic-space formulation (ASF)
- acoustic properties of the candidates are known
- make a prediction of the acoustic properties of the target units
- target cost function measures acoustic distance between candidates and targets

3.2. Join cost function

the join cost function (joins, not joints! It concerns the units being joined, not only the join point itself): measuring concatenation quality; the acoustic mismatch between two candidate units

After candidate units are taken from the database, they will be joined (concatenated). There will be mismatches in acoustic properties around the join point, including the spectral envelope, F0 and energy.

The acoustic mismatch between consecutive candidates is measured by the join cost function

A typical join cost quantifies the acoustic mismatch across the concatenation point
e.g., spectral characteristics (parameterised as MFCCs, perhaps), F0, energy
Festival’s multisyn uses a weighted sum of normalised sub-costs (weights tuned by ear)
It is common to also inject some (rule-based) knowledge into the join cost; e.g., some phones are easier to splice than others

**Compare the spectral envelope, F0 and energy of the two units**

A typical join cost function (e.g., Festival’s multisyn) uses one frame from each side of the join (very local, so it may miss important information)
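
A one-frame-per-side join cost of this kind can be sketched as a weighted sum of normalised sub-costs. This is only an illustration and not Festival's actual code: the frame layout (a dict with "mfcc", "f0" and "energy"), the weights and the normalisation constants are all assumptions, and unvoiced frames are not handled.

```python
import math

# Hypothetical weights and normalisation constants; real systems tune these (e.g., by ear).
WEIGHTS = {"spectrum": 1.0, "f0": 1.0, "energy": 1.0}
SCALE = {"spectrum": 10.0, "f0": 50.0, "energy": 20.0}


def join_cost(left_frame, right_frame, weights=WEIGHTS, scale=SCALE):
    """Acoustic mismatch across one join.

    left_frame is the last frame of the left candidate and right_frame is the
    first frame of the right candidate. Each frame is a dict with keys
    "mfcc" (list of floats), "f0" (Hz) and "energy" (dB).
    """
    sub_costs = {
        "spectrum": math.dist(left_frame["mfcc"], right_frame["mfcc"]),
        "f0": abs(left_frame["f0"] - right_frame["f0"]),
        "energy": abs(left_frame["energy"] - right_frame["energy"]),
    }
    # Normalise each sub-cost to a comparable range, then take the weighted sum.
    return sum(weights[k] * sub_costs[k] / scale[k] for k in sub_costs)


if __name__ == "__main__":
    a = {"mfcc": [0.0, 0.0], "f0": 100.0, "energy": 60.0}
    b = {"mfcc": [3.0, 4.0], "f0": 150.0, "energy": 80.0}
    print(join_cost(a, a), join_cost(a, b))
```

Because only two frames are compared, the cost says nothing about how the parameters were moving before and after the join, which motivates the improvements listed next.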

To improve:

consider several frames around the join, or the entire sequence of frames
consider deltas (the rate of change)
(probabilistic) model of the sequence of frames. e.g., hybrid synthesis, which typically involves predicting acoustic parameter trajectories, addresses this

4. Search (for the best sequence)

The total cost of a particular candidate unit sequence under consideration is the sum of

the target cost for every individual candidate unit in the sequence
the join cost between every pair of consecutive candidate units in the sequence

There is a single globally-optimal (lowest total cost) sequence; a search is required, to find this sequence
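
The search can be sketched as dynamic programming (Viterbi) over a lattice with one column of candidates per target unit. This is a minimal sketch for a homogeneous system without pruning; the function name and the way the two cost functions are passed in are assumptions made for illustration.

```python
def viterbi_search(targets, candidates, target_cost, join_cost):
    """Find the lowest-total-cost candidate sequence.

    targets: list of target units.
    candidates: candidates[i] is a non-empty list of candidates for targets[i].
    target_cost(target, candidate) and join_cost(left, right) return numbers.
    Returns (total_cost, best_sequence).
    """
    # best[i][j] = (lowest cost of any path ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        column = []
        for c in candidates[i]:
            # Cheapest way to arrive at c from any candidate in the previous column.
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + join_cost(p, c), j)
                for j, p in enumerate(candidates[i - 1])
            )
            column.append((prev_cost + target_cost(targets[i], c), prev_j))
        best.append(column)

    # Trace back from the cheapest entry in the final column.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return total, path


if __name__ == "__main__":
    # Toy candidates: (name, F0 at the unit edges); the join cost is the F0 jump.
    cands = [[("a1", 100), ("a2", 200)], [("b1", 210), ("b2", 100)], [("c1", 105)]]
    print(viterbi_search(["a", "b", "c"], cands,
                         lambda t, c: 0.0, lambda p, c: abs(p[1] - c[1])))
```

The cost of this search grows with the square of the number of candidates per target, which is why practical systems add pruning.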

What base unit type is really used? Homogeneous or heterogeneous units?

A homogeneous system (all unit types are the same) will be easier to implement in software:

the start and end points of all candidate units align
the search lattice is simple

A heterogeneous system (multiple unit types) will be a little more complex:

start and end points will not all align
number of target costs and join costs to sum up will vary for different paths through the lattice
some normalisation may be needed to correct for this

Heterogeneous unit type (multi-phone units)

Homogeneous unit type
with the “zero join cost trick”
= heterogeneous units !

ASF will eventually lead us to hybrid methods which use statistical models to make predictions about the acoustic properties of the target unit sequence

We will take a closer look at IFF and ASF in the next module (target cost only).

Module 3 - unit selection target cost functions

1. Independent Feature Formulation (IFF)

Independent Feature Formulation (IFF) target cost function: no prediction of any acoustic properties is involved
count the number of linguistic features in the context of the candidate that do not match those of the corresponding target unit

Example calculation of IFF target cost
for two competing candidates

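We find that:

co-articulation: the left context has a stronger effect on the current sound than the right context, so the left context gets a slightly larger weight than the right
features that affect F0 are compared only as Boolean values, with no notion of a "near match". For example, a left phonetic context of [v] is similar to one of [b] (both are voiced), unlike a radically different one such as a liquid. So in such cases we should reduce the penalty appropriately rather than adding the full weight.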
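
The two observations above can be sketched as a weighted mismatch count with partial credit for near matches. Everything specific here is hypothetical: the feature names, the weights, the phone classes and the 0.5 penalty for a near match are chosen only to illustrate the idea.

```python
# Hypothetical features and weights; the left context is weighted more than the right.
WEIGHTS = {"left_phone": 1.0, "right_phone": 0.8, "stress": 1.0, "phrase_final": 0.5}

# Hypothetical phone classes, used to recognise a "near match" in phonetic context.
PHONE_CLASS = {"v": "voiced_obstruent", "b": "voiced_obstruent", "l": "liquid", "r": "liquid"}


def feature_mismatch(name, target_value, candidate_value):
    """Return 0.0 for a match, 0.5 for a near match and 1.0 for a mismatch."""
    if target_value == candidate_value:
        return 0.0
    if name in ("left_phone", "right_phone"):
        target_class = PHONE_CLASS.get(target_value)
        if target_class is not None and target_class == PHONE_CLASS.get(candidate_value):
            return 0.5
    return 1.0


def iff_target_cost(target, candidate, weights=WEIGHTS):
    """Weighted sum of linguistic feature mismatches; no acoustics involved."""
    return sum(
        w * feature_mismatch(name, target[name], candidate[name])
        for name, w in weights.items()
    )


if __name__ == "__main__":
    target = {"left_phone": "v", "right_phone": "l", "stress": True, "phrase_final": False}
    cand_a = dict(target, left_phone="b")                # near match in left context
    cand_b = dict(target, left_phone="l", stress=False)  # two full mismatches
    print(iff_target_cost(target, cand_a), iff_target_cost(target, cand_b))
```

With these made-up weights the first candidate costs 0.5 and the second 2.0, so the first would be preferred by the target cost alone.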

how is prosody “created” using an IFF target cost function (no explicit predictions of any acoustic properties)?

candidates from appropriate contexts, when selected, will have appropriate prosody
the join cost will ensure that F0 is continuous

So, we simply need to make sure the linguistic features capture sufficient contextual information that is relevant to prosody (e.g., stress status, position in phrase)
Optional: if our front end predicts symbolic prosodic features (e.g., ToBI accents and boundary tones), then we can use them in the target cost function

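An obvious drawback: two units whose linguistic features do not match may still sound very similar. So we can instead compare how similar they sound (their acoustic properties), which is the ASF described next.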

2. Acoustic Space Formulation (ASF) target cost function

What are the acoustic features?

simple acoustic properties such as F0, duration and energy
a more detailed specification such as the spectral envelope (e.g., as cepstral coefficients)

It will only work if we can accurately predict these properties from the linguistic features

How about predicting a complete acoustic specification? Use a regression method, such as a regression tree, which predicts F0 by asking questions about the linguistic features (e.g., voiced? stop? stressed?).

3. Mixed IFF + ASF

A weighted sum of sub-costs: some use linguistic features and some use acoustic properties.
Reasons:
- ASF escapes some of the sparsity problems inherent in IFF
- but our acoustic properties do not capture all possible acoustic variation (e.g., voice quality, such as phrase-final creaky voice)
- and, of course, our predictions of acoustic properties will contain errors

4. unit selection design choices

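unit types: usually diphones or half-phones. The “zero join cost trick” makes it possible to use larger units (i.e., no join cost needs to be computed between units that were contiguous in the database).
target cost: in Festival it is almost entirely IFF, with a small acoustic part
join cost: usually includes F0, energy and spectral envelope. Some systems also apply smoothing at the joins.
search: dynamic programming + pruning
database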

Module 4 – the database

1. key concepts

basic ASR: Hidden Markov Models; finite state language model; decoding
base unit type: relatively small number of types (e.g., diphone)

in unit selection, base unit type is strictly matched between target and candidate; unless database is badly designed: then we would have to back off to a similar type

therefore, target cost does not need to query the base unit type ; only query its context
context: the linguistic and acoustic environment in which a base unit occurs
- phonetic context - the sounds before and after it
- prosodic environment - stress, prosody, …
- position - in the syllable, word, phrase, …
coverage: We would like a database of speech which contains every possible speech base unit type, in every possible context (i.e., every unit-in-context)

However, the Zipf distribution problem also exists at the level of phonemes: a few types are very frequent and most are rare.

2. script design

Why design a script: In practice, it will be impossible to find a set of sentences that includes at least one token of every unit-in-context type
Goals
- Cover as many types (in context) as possible
- With as few tokens as possible - i.e., in as few sentences as possible

Typical approach to script design: a greedy algorithm for text selection

1. Find a very large text corpus (e.g., as used in the ARCTIC corpora)
   • e.g., newspaper text, out-of-copyright novels, web scraping
2. Make an exhaustive ‘wish list’ of all possible types (in context) that we would like
3. Find the sentence in the corpus which provides the largest number of different types that we don’t already have
4. Add that sentence to our recording script
5. Remove those types from the ‘wish list’
6. If the recording script is long enough, stop. Otherwise, go to 3
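
The greedy loop above can be sketched in a few lines. This is a minimal sketch, assuming each sentence has already been passed through the front end and reduced to a set of unit-in-context labels; the function name, the data layout and the `max_sentences` stopping rule are assumptions.

```python
def greedy_select(sentences, wish_list, max_sentences):
    """Greedy text selection.

    sentences: dict mapping sentence id -> set of unit-in-context types it contains.
    wish_list: set of all types we would like to cover.
    Returns (script, remaining): the chosen sentence ids in order, and the
    wish-list entries that are still uncovered.
    """
    remaining = set(wish_list)
    pool = dict(sentences)
    script = []
    while pool and remaining and len(script) < max_sentences:
        # The sentence that adds the most types we do not already have.
        best_id = max(pool, key=lambda s: len(pool[s] & remaining))
        gain = pool[best_id] & remaining
        if not gain:
            break  # no sentence left in the corpus adds anything new
        script.append(best_id)
        remaining -= gain
        del pool[best_id]
    return script, remaining


if __name__ == "__main__":
    corpus = {
        "s1": {"a_b_stressed", "b_c_unstressed"},
        "s2": {"a_b_stressed", "b_c_unstressed", "c_d_stressed"},
        "s3": {"d_e_unstressed"},
        "s4": {"a_b_stressed"},
    }
    wish = {"a_b_stressed", "b_c_unstressed", "c_d_stressed",
            "d_e_unstressed", "e_f_stressed"}
    print(greedy_select(corpus, wish, max_sentences=10))
```

In the toy corpus the type `e_f_stressed` never occurs, so it stays on the wish list: the greedy procedure can only cover what the corpus contains.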

Example of text selection

We’ll assume that we have a large corpus of text to start from

Corpus cleaning
- Define the vocabulary (e.g., only words in our dictionary, or the most frequent words in the corpus)
- Discard all sentences that contain out-of-vocabulary (OOV) words
- Discard all sentences that are too long (hard to read out loud) or too short (atypical prosody)
- Optional: discard hard-to-read sentences
Front-end processing
- Pass the text through the TTS front end to obtain, for each sentence
- base unit sequence (e.g., diphones)
- linguistic context of each unit (e.g., stress)

The wish list (with 'stress' as context)

e.g., aa_aa_unstressed, aa_aa_stressed, aa_ae_unstressed, aa_ae_stressed...

Optional improvements:

Guarantee at least one token of every base unit type
Try to cover the rarest units first
- count occurrences in the original large corpus to find the rarest one
- how to implement: include weights in the “richness” measure that reward rarer units in inverse proportion to their frequency

Optional: domain-specific script

Select (or manually design, or automatically generate) in-domain sentences
Measure coverage obtained so far
Fill in the gaps in coverage, using sentences selected from the large text corpus

3. annotating the database

Now we have: a script composed of sentences, and a recording of each sentence
What needs to be done: make a time-aligned phonetic transcription of the speech, and annotate the speech with supra-segmental linguistic information

Analytical labelling

Forced alignment

Pronunciation model = dictionary + optional vowel reduction

flat start training

Module 5 – evaluation

cross-system comparisons: optionally, control certain components, such as

a common database (as in the Blizzard Challenge)
fixed annotation and label alignments
common front end

Subjective evaluation

ask listeners to perform some task
test design
materials used

Listener task

a simple, obvious task: • “choose the version you prefer” • 5 point scales • “type in the words you heard”
to pay attention to specific aspects of speech, e.g., prosody (time-consuming!)
then perform a more sophisticated analysis of the outcome • e.g., pairwise task followed by multi-dimensional scaling analysis

Test design

absolute vs. relative judgements
- Absolute - in other words, listeners rate a single, isolated stimulus • Mean Opinion Score (MOS)
- Relative - listeners compare multiple stimuli
  more than two stimuli, optionally including references (lower and/or upper) • rating (e.g., multiple MOS), ranking , sorting
interface
MUSHRA: Method for the subjective assessment of intermediate quality levels of coding systems
test / sample size: number of listeners, test duration per listener, number of stimuli per listener and in total
- maximum test duration 45 minutes
- at least 20 listeners, and preferably more
- as many different sentences as possible, to mitigate the effects of any atypical ones
the listeners (“subjects”): type of listener, how to recruit them, quality control of their responses
- within vs. between subjects designs
- within subjects: all listeners hear the same stimuli (too many!); priming or ordering effects
- between subjects: a "virtual subject"; no memory carry-over effect

Materials

Two potentially opposing requirements

expected usage (domain) of the system
goals of the evaluation and the type of analysis we plan to do

e.g. for intelligibility testing we might choose between:

isolated words

can narrow down range of possible errors listener can make
might even design around minimal pairs (e.g., DRT (Diagnostic Rhyme Test) , MRT (Modified Rhyme Test))

full sentences

errors will be more variable & harder to analyse
much more natural task for the listener, perhaps closer to target domain

Materials: intelligibility

‘normal’ material - e.g., sentences from a newspaper. • tend to get a ceiling effect, due to interference from semantics (predictability)
Semantically Unpredictable Sentences (SUS) : e.g., “The unsure steaks overcame the zippy rudder”
Diagnostic Rhyme Test (DRT) or Modified Rhyme Test (MRT): uses minimal pairs
- e.g., “Now we will say cold again.” “Now we will say gold again.”
- specific to individual phonemes - a diagnostic unit test
- very time consuming and therefore rarely used
other ways to avoid a ceiling effect:
- Add noise
- Induce additional cognitive load with another task in parallel

Materials: naturalness

“Randomly” selected text: which domain? newspapers or novels?
Carefully designed text: e.g., Harvard (IEEE) sentences • in phonetically balanced lists

Objective evaluation

simple distances to reference samples
- Compare acoustic properties to a natural reference sample (‘gold standard’)
- Time-align natural and synthetic: frame-by-frame comparison, sum up local differences
- Does not account for natural variation (could use multiple natural examples)
or perhaps more sophisticated auditory models
Based only on properties of the signal
- spectral envelope: Mel-Cepstral Distortion (MCD)
- F0 contour: Root Mean Square Error of F0 (RMSE F0) and/or correlation
Complex objective measures
- from telecommunications (for distorted natural speech): e.g., PESQ (P.862) or POLQA (P.863)
  - PESQ (Perceptual evaluation of speech quality) is based on a weighted combination of differences in many properties of speech, such as the higher-order statistical properties of various spectral coefficients
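
The two signal-based measures can be sketched as follows. The MCD expression is the commonly used form, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames. The sketch assumes the natural and synthetic frames are already time-aligned, that the energy coefficient c0 has been dropped, and that unvoiced frames are marked with F0 = 0; these conventions and the function names are assumptions.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients."""
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need the same non-zero number of frames in both sequences")
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        total += const * math.sqrt(sum((r - s) ** 2 for r, s in zip(ref, syn)))
    return total / len(ref_frames)


def f0_rmse(ref_f0, syn_f0):
    """RMSE of F0 in Hz, over frames that are voiced (F0 > 0) in both contours."""
    pairs = [(r, s) for r, s in zip(ref_f0, syn_f0) if r > 0 and s > 0]
    if not pairs:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum((r - s) ** 2 for r, s in pairs) / len(pairs))


if __name__ == "__main__":
    print(mel_cepstral_distortion([[1.0, 0.0]], [[0.0, 0.0]]))
    print(f0_rmse([100.0, 0.0, 200.0], [110.0, 120.0, 190.0]))
```

Both are sums of local, frame-by-frame differences against a single reference, so they inherit the limitation noted above: they do not account for natural variation.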

NO! (e.g., asking people to type in the text they really heard)

Module 6 – speech signal analysis & modelling

Two parts:

speech signal analysis: generalising source+filter to excitation+spectral envelope

epoch detection (‘pitch marking’): one point in each pitch period
F0 estimation (‘pitch tracking’): the average rate of vocal fold vibration in a local region
spectral envelope estimation

speech signal modelling: representing speech parameters in a form suitable for statistical modelling

speech parameters
representations suitable for modelling
converting back to a waveform

1. speech signal analysis

Often, we don’t really need the ‘true’ source and filter. We just need to work with the speech signal, so that we can

measure: • individual properties: e.g., F0 for use in the join cost
modify: • phonetic identity • prosody
manipulate: • waveforms: e.g., to smoothly concatenate candidates from the database

Epoch detection vs F0 estimation

Epoch detection (also known as pitch marking, Glottal Closure Instant (GCI) detection)

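pitch-synchronous signal processing
- TD-PSOLA (time-domain pitch-synchronous overlap-add)
- or simply just overlap-add joining of units: align the epochs/pitchmarks of the two adjacent units
- a few vocoders operate pitch synchronously

The principle of the PSOLA algorithm is as follows. Multiply the original speech signal by a series of pitch-synchronous windows to obtain a series of short-time analysis signals. Modify these to obtain short-time synthesis signals: using the pitch contours and durations of the original waveform and of the target waveform, determine the mapping between their pitch periods, and hence the required sequence of short-time synthesis signals. Arrange the short-time synthesis signals synchronously with the target pitch periods and overlap-add them to obtain the synthetic waveform, which then has the desired pitch contour and duration.

Time-domain pitch-synchronous overlap-add (TD-PSOLA) processes the waveform only in the time domain and makes no adjustment in the frequency domain. Pitch (F0) is modified by changing the time interval between pitch periods, and duration is modified by repeating or omitting some speech segments.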

F0 estimation (also known as pitch determination, pitch tracking, F0 tracking, …)

a component of the join cost in all unit selection systems
used in the target cost, for systems that predict F0 targets (ASF)
a parameter for most (probably all) vocoders

1.1 Epoch detection

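What we need epoch detection for: PSOLA (Pitch Synchronous Overlap and Add); when two pieces of audio are joined, we need the epochs in order to synchronise them

A simple algorithm for epoch detection:

pre-process • remove unwanted frequencies with a low-pass filter (once everything other than F0 is removed, what remains is close to a simple sinusoid)
peak picking: find the peaks of that sinusoid (differentiate, then find the zeros)
- differentiate
- smooth, to remove spurious low-amplitude variations
- find zero crossings (from positive to negative)
post-process • correct for time offset - e.g., to align pitchmark with largest peak in each period (the algorithm is simple, so some post-processing is needed for alignment and to deal with noise)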
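
The pre-processing and peak-picking steps can be sketched as below. This is a deliberately crude illustration: a centred moving average stands in for a proper low-pass filter, the window widths are arbitrary assumptions, and the post-processing step (aligning each mark with the largest peak in its period) is not included.

```python
import math


def moving_average(x, width):
    """Centred moving average (odd width); a crude low-pass / smoothing filter."""
    half = width // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def detect_epochs(signal, lowpass_width=41, smooth_width=5):
    """Return sample indices of one peak per pitch period (approximately)."""
    low = moving_average(signal, lowpass_width)                # pre-process
    diff = [low[i + 1] - low[i] for i in range(len(low) - 1)]  # differentiate
    diff = moving_average(diff, smooth_width)                  # smooth
    # The slope changing from positive to non-positive marks a local maximum.
    return [i + 1 for i in range(len(diff) - 1) if diff[i] > 0 and diff[i + 1] <= 0]


if __name__ == "__main__":
    fs = 16000
    # Synthetic "voiced" signal: 100 Hz fundamental plus its second harmonic.
    sig = [math.sin(2 * math.pi * 100 * n / fs) + 0.5 * math.sin(2 * math.pi * 200 * n / fs)
           for n in range(1600)]
    epochs = detect_epochs(sig)
    print(epochs)
    print([b - a for a, b in zip(epochs, epochs[1:])])
```

On this clean synthetic signal the marks away from the edges of the signal are one pitch period (160 samples) apart. Real speech is much less regular, which is why the post-processing step matters.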

Epoch detection (pitch marking / Glottal Closure Instant (GCI) detection) is an algorithm used for signal processing, whereas F0 estimation (pitch tracking), described next, is used for parameterising the speech signal.

1.2 F0 estimation

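We could estimate F0 as 1/T from epoch detection, but this gives many local errors! (These errors can be reduced by averaging over several epochs.) A smarter approach is to introduce a lag and look for where the signal matches a shifted copy of itself.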

Cross-correlation (also known as “modified autocorrelation”)

We search for a peak in the (modified) autocorrelation function.

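There will be a large peak at a lag of 0, another at the pitch period and then at every exact multiple of the pitch period. For example, in the second plot the largest peak is at zero lag (where the signal coincides with itself), and the second-largest peak is the one we are looking for.

The autocorrelation method (right-hand side of the figure above)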

Pick the highest non-zero-lag peak over some search range

the corresponding lag = the pitch period (measured in samples)

For a whole utterance, we can repeat the analysis at regular intervals, e.g., every 10 ms, which gives a sequence of F0 values.
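
Picking the highest non-zero-lag peak within a search range can be sketched as follows for a single analysis frame. This is plain autocorrelation without any pre-processing or post-processing; the default search range of 60 to 400 Hz and the function name are assumptions.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) of one frame from the highest autocorrelation peak
    whose lag lies inside the search range."""
    min_lag = int(sample_rate / f0_max)  # shortest pitch period considered
    max_lag = int(sample_rate / f0_min)  # longest pitch period considered
    if max_lag >= len(frame):
        raise ValueError("frame is too short for the requested f0_min")

    def autocorr(lag):
        return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

    # The lower limit keeps the zero-lag peak out of the search.
    best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
    return sample_rate / best_lag


if __name__ == "__main__":
    fs = 16000
    frame = [math.sin(2 * math.pi * 100 * n / fs) for n in range(640)]  # 40 ms at 100 Hz
    print(estimate_f0(frame, fs))
```

Sliding this over the utterance in 10 ms steps gives one F0 value per step.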

Not always as easy as that sounds:

real signals are not perfectly periodic
formants will lead to some waveform self-similarity at lags other than exact multiples of the pitch period
choose the search range carefully:
- if upper limit too high, we may choose a peak at too great a lag:
  overestimate the pitch period = underestimate F0 by a factor (e.g., pitch halving)
- if lower limit too low, we may choose the zero-lag peak

Therefore, essentially all pitch estimation algorithms build on autocorrelation or cross-correlation and add various pre-processing and post-processing steps.

1.3 pre-process

low-pass filtering: removes vocal tract information (e.g., formants) and unvoiced sounds; usually combined with downsampling (to reduce computational cost)
spectral flattening: flatten the spectral envelope, e.g., by inverse filtering

1.4 post-processing

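There may be several F0 candidates per frame, and we need to choose the best sequence of candidates.

dynamic programming: e.g., YAAPT

So the general recipe is pre-processing + autocorrelation + post-processing, which usually involves a large number of parameters.

Alternatives to autocorrelation:

cepstral domain methods
comb filtering: an adaptive filter that eliminates the harmonics (at multiples of F0)
probabilistic methods: supervised training

The cepstrum can be viewed as information about the rate of change across the different frequency bands.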

1.5 Evaluation

ground truth can come from: hand-labelled F0 contours; laryngograph (EGG) recordings; some publicly available datasets

error types: voicing status errors (detecting voiced vs. unvoiced); F0 errors (in voiced speech)

F0 estimation algorithms generally assume periodicity, so they perform poorly on creaky voice. Epoch detection (pitch marking) performs more evenly.

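2. Spectral envelope estimation

Remove the fine detail from the spectrum to obtain the envelope.

When the window used to compute the spectrum is an integer number of pitch periods T0 (i.e., comparable in length to the pitch period), the power spectrum varies periodically in time.
When the window used to compute the spectrum spans several pitch periods, the power spectrum shows periodicity in frequency.

This is shown in the spectrogram (time-frequency representation) in the figure above.

The STRAIGHT vocoder: needs an envelope estimation method like the above that reduces the influence of the harmonics (a self-adjusting window).

In the STRAIGHT analysis stage, an F0-adaptive window is used to minimise interference from the harmonics in the envelope, and the envelope is smoothed by interpolating across frequency.

The STRAIGHT envelope is smoother and has a little less detail, but is more independent (of F0)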

3. Speech signal modelling (parametric speech synthesis)

speech parameters representation + regression function -> waveform

The speech parameters include F0, the spectral envelope and aperiodic energy.

The parameters need to be converted into a representation suitable for modelling:

fixed in number (per frame), and low-dimensional
at a fixed frame rate
a good separation of prosodic and segmental identity aspects of speech, so that they can be modelled independently
well behaved and stable when we perturb them (e.g., by averaging)
for statistical modelling, uncorrelated params (can avoid having to model covariance)

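How does the STRAIGHT vocoder work? First comes the analysis stage, which yields three sets of parameters:

1. smooth spectral envelope: high resolution (same as FFT); highly correlated parameters, because the filterbanks they pass through are highly correlated.

To make the envelope better suited to statistical modelling, it is represented as a Mel-cepstrum. (The motivation is the same as for MFCCs, but the two are fundamentally different.) The steps are:

warp the frequency scale: instead of a lossy discrete filterbank, use a continuous function (an all-pass filter)
decorrelate: convert the spectrum to the cepstrum
reduce dimensionality: truncate the cepstrum; ASR keeps 12 coefficients, whereas synthesis keeps more, around 40 to 60

2. aperiodic energy: the ratio between periodic and aperiodic energy at each frequency. The periodic energy is the envelope through the peaks of the harmonics, and the aperiodic energy is the envelope through the troughs between the harmonics. This is also high resolution (same as FFT) with highly correlated parameters.

reduce dimensionality: reduce resolution by averaging across broad frequency bands (5 to 25 bands on a mel scale)

3. F0 analysis, plus a voiced / unvoiced decision

In the synthesis stage of the STRAIGHT vocoder, the F0 from analysis is used to generate a periodic pulse train (the voiced energy, i.e., the excitation); the aperiodic energy from analysis generates a non-periodic component (such as noise); and the envelope from analysis acts as the filter.

The STRAIGHT vocoder: the analysis stage takes a waveform as input and converts it into parameters (the middle part), which then go to the synthesis stage.
The missing part in the middle is what SPSS will provide: synthesising new, unseen speech from parameters.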

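Module 7 – Statistical Parametric Speech Synthesis (SPSS)

Comparison of three approaches:

unit selection: concatenates units using a target cost based on the linguistic specification and a join cost based on acoustic mismatch
Speech signal modelling: a source-filter model; first separate the excitation and the spectral envelope, then reconstruct the waveform
Statistical Parametric Synthesis: predict the parameters from the linguistic specification, which is a regression problem.

1. TTS as a sequence-to-sequence regression problem

Solution: regression tree + Hidden Markov Model (HMM); one generates the sequence and the other makes the predictions.

As a regression task, two things need to be done. The first is to process the phonetic sequence, decide the durations and create a sequence of frames (an HMM can do this). The second is prediction (regression): predict each frame from its features (a regression tree such as CART can do this).
This is implemented with context-dependent models, which must cope with types that have few training samples and types that have none, and share parameters between similar models. The method: grouping contexts according to phonetic features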
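
The two sub-tasks can be illustrated with a toy example: expand a phone sequence into frames using given durations, then predict one value per frame with a tree of yes/no questions. This is not a trained model: the hand-written tree stands in for a learned regression tree, the durations are given rather than predicted, and all feature names and leaf values are made up.

```python
def expand_to_frames(phones, durations_in_frames):
    """Repeat each phone's linguistic features once per frame, adding the
    relative position of the frame within the phone."""
    frames = []
    for features, n_frames in zip(phones, durations_in_frames):
        for k in range(n_frames):
            frame = dict(features)
            frame["position_in_phone"] = k / n_frames
            frames.append(frame)
    return frames


def toy_regression_tree(frame):
    """Hand-written stand-in for a trained regression tree that predicts F0 (Hz).
    The questions and leaf values are invented for illustration."""
    if not frame["voiced"]:
        return 0.0    # unvoiced: no F0
    if frame["stressed"]:
        return 130.0  # leaf value for voiced, stressed contexts
    return 110.0      # leaf value for voiced, unstressed contexts


if __name__ == "__main__":
    phones = [
        {"phone": "s", "voiced": False, "stressed": False},
        {"phone": "iy", "voiced": True, "stressed": True},
    ]
    frames = expand_to_frames(phones, durations_in_frames=[2, 3])
    print([toy_regression_tree(f) for f in frames])
```

Contexts that reach the same leaf share the same prediction, which is one way to picture parameter sharing and the handling of contexts never seen in training.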

Linguistic processing:

from text to linguistic features using the front end (same as in unit selection)
attach linguistic features to phonemes: “flatten” the linguistic structures
we then create one context-dependent HMM for every unique combination of linguistic features

Training the HMMs:

need labelled speech data, just as for ASR (supervised learning)
need models for all combinations of linguistic features, including those unseen in the training data (by parameterising the models using a regression tree)

Synthesising from the HMMs:

use the front end to predict required sequence of context-dependent models (the regression tree provides the parameters for these models)
use those models to generate speech parameters
use a vocoder to convert those to a waveform

Module 7 bonus – hybrid speech synthesis

Comparison:

SPSS (with HMMs or DNNs): flexible, robust to labelling errors, but naturalness is limited by the vocoder
Unit selection: potentially excellent naturalness but strongly affected by labelling errors; hard to optimise on new data (for both the target and join costs).
Hybrid synthesis: robust statistical model + waveform concatenation; an ASF target cost function!

To recap:

Signal processing: ways of parameterising the speech signal, such as MFCCs
Unit selection: suffers from sparsity in linguistic and/or acoustic space.
SPSS: sequence-to-sequence regression using HMMs/DNNs

Case study: Trajectory Tiling

使用

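Module 8 – Deep Neural Networks (DNNs)

The first paper to use DNNs: Statistical parametric speech synthesis using deep neural networks (2013)

DNNs can be applied to acoustic modelling for speech generation to overcome the limitations mentioned earlier, giving a better input-to-cluster and/or cluster-to-feature mapping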

Module 9 – sequence-to-sequence models (encoder-decoder)

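These models solve three problems:

1. regression from input to output
2. alignment during training (similar to the ASR task)
3. duration prediction during inference (synthesis)

All models solve problem 1 in a similar way. They differ mainly in how they handle 2 and 3.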

One class of models (e.g., Tacotron 2) attempts to jointly solve 2 and 3 using a single neural architecture. Another class of models (e.g., FastPitch) uses separate mechanisms for 2 and 3.

Whilst it appears elegant to solve two problems with a single architecture, we know that the problem of alignment is actually very different from the problem of duration prediction. Alignment is very similar to Automatic Speech Recognition (ASR), so we might want to take advantage of the best available ASR models to do that. In contrast, duration prediction is a straightforward regression task.

Reading

Digital Processing of Speech Signals (L. R. Rabiner), 403 pages
Speech Signal Processing (162 pages)