

Université Paris-Sud
Ecole Doctorale 427: Informatique de Paris-Sud
Laboratoire d’Informatique pour la Mécanique et les Sciences de l’Ingénieur

Specialty: Computer Science

Doctor of Science defense on Thursday, 24 September 2015

by

Thi Thu Trang NGUYEN

HMM-based Vietnamese Text-To-Speech: Prosodic Phrasing Modeling, Corpus Design, System Design, and Evaluation

Committee:

Advisors:
Christophe D’ALESSANDRO, Directeur de recherche CNRS (LIMSI)
Do Dat TRAN, Professeur (Institut Polytechnique de Hanoi, Vietnam)

Reviewers:
Philippe MARTIN, Professeur émérite (Université Paris-Diderot 7)
Yannis STYLIANOU, Professeur (Université de Crète, Grèce)

Examiners:
Laurent BESACIER, Professeur (Université Joseph Fourier, Grenoble)
Sophie ROSSET, Directeur de recherche CNRS (LIMSI)

Groupe Audio et Acoustique
LIMSI-CNRS
Rue John von Neumann - Campus Universitaire d’Orsay - Bât 508
F-91405 Orsay Cedex, France

ED 427 - Université Paris-Sud
UFR Sciences Orsay
Bâtiment 650, rue Noetzlin
91405 Orsay Cedex, France

This dissertation is dedicated to:
My son Teddy, who was six months old when I started,
My parents and my husband, for their love, endless support and encouragement.

Acknowledgements

Foremost, I would like to express my most sincere and deepest gratitude to my thesis advisors M. Christophe d’ALESSANDRO (Directeur de Recherche at LIMSI-CNRS, France), Prof. TRẦN Đỗ Đạt and Prof. PHẠM Thị Ngọc Yến (MICA-CNRS, Vietnam) for their continuous support and guidance during my PhD program, and for providing me with such a serious and inspiring research environment. I am truly grateful to Christophe for his excellent mentorship, caring, patience, and immense knowledge of Text-To-Speech (TTS). His advice helped me at every stage of the research and writing of this thesis. He also helped me greatly in completing the joint-program administration, in applying for the Eiffel Excellence scholarship, and in obtaining funding for travel and conferences. I am very thankful to Prof. Đạt, M. Eric CASTELLI and Prof. Yến for shaping my thesis at the beginning, for their support in applying for the Évariste Galois scholarship, and for their enthusiasm and encouragement. Prof. Đạt substantially facilitated my PhD research with his valuable comments on Vietnamese TTS, especially when I was a newcomer to speech processing and TTS.

I was fortunate to have the opportunity to work with Albert RILLIARD (LIMSI). He brought me great joy and crucial encouragement during my PhD, and he taught me essential topics such as prosody, statistical analysis, and perceptual evaluation. This had a great impact on steering this thesis and led to considerable results in my work. I am very grateful to Albert for his caring and his advice on research, writing and presentation.

It is my pleasure to thank my thesis reviewers, Prof. Philippe MARTIN (Université Paris-Diderot 7) and Prof. Yannis STYLIANOU (Toshiba’s Cambridge Research Laboratory), for accepting to read my thesis and for their valuable feedback. I would also like to thank Mme. Sophie ROSSET (LIMSI) and Prof. Laurent BESACIER (LIG) for accepting to be on my defense committee.

I would like to thank Prof. Jacqueline VAISSIÈRE (LPP) for her caring and support during my first three-month internship in France as well as during my PhD. I highly appreciate the opportunity to know and work with M. Alexis MICHAUD (MICA); I am sincerely indebted to Alexis for his suggestions and valuable comments on linguistics and writing.

I take this opportunity to extend my heartfelt gratitude to my dear friends and colleagues: especially Marc, for his constructive discussions and cooperation; Areti, Hảo, Chi, Hải Anh and Thuỳ, for their encouragement and enthusiasm in revising the English of my dissertation; and, together with Olivier, David, Samuel, anh Cường, Khoa, Diệp, anh Sơn and Xuân, for their support and comments on my PhD, and for much fun and a friendly working ambiance at LIMSI and MICA. I wish to thank the students Lan, Thắng and Tùng, and the test subjects, for their efforts in conducting and participating in the perception tests at MICA; my Vietnamese friends in Paris, Khánh, Ngọc Anh and Bình, for their enthusiastic support during the recording sessions at LIMSI; and anh Bắc and Hiếu for their helpful suggestions.

The present research would not have been feasible without financial support from the French government through two scholarships: Évariste Galois and Eiffel Excellence. I would also like to acknowledge the funding from the Région Ile-de-France through the FUI ADN-TR project (2011-2014), and the Vietnamese NAFOSTED fund for conference participation. I also take this opportunity to express my gratitude to Prof. Nicole BIDOIT, Director, and Stéphanie DRUETTA, Assistant, of the Ecole Doctorale d’Informatique de Paris-Sud for their support during my research.

Last but not least, I would like to dedicate this moment to my son Teddy and my husband Chí, who have given me much courage to accomplish this thesis, and to my parents for their endless love and support throughout my PhD.

Contents

Notations and Abbreviations . . . 13
List of Tables . . . 17
List of Figures . . . 19
Lists of Media files . . . 23

1 Vietnamese Text-To-Speech: Current state and Issues . . . 25
  1.1 Introduction . . . 27
  1.2 Text-To-Speech (TTS) . . . 28
    1.2.1 Applications of speech synthesis . . . 28
    1.2.2 Basic architecture of TTS . . . 29
    1.2.3 Source/filter synthesizer . . . 31
    1.2.4 Concatenative synthesizer . . . 32
  1.3 Unit selection and statistical parametric synthesis . . . 33
    1.3.1 From concatenation to unit-selection synthesis . . . 33
    1.3.2 From vocoding to statistical parametric synthesis . . . 34
    1.3.3 Pros and cons . . . 36
  1.4 Vietnamese language . . . 38
  1.5 Current state of Vietnamese TTS . . . 40
    1.5.1 Unit selection Vietnamese TTS . . . 41
    1.5.2 HMM-based Vietnamese TTS . . . 42
  1.6 Main issues on Vietnamese TTS . . . 43
    1.6.1 Building phone and feature sets . . . 43
    1.6.2 Corpus availability and design . . . 44
    1.6.3 Building a complete TTS system . . . 45
    1.6.4 Prosodic phrasing modeling . . . 45
    1.6.5 Perceptual evaluations with respect to lexical tones . . . 46
  1.7 Proposition and structure of dissertation . . . 46

2 Hanoi Vietnamese phonetics and phonology: Tonophone approach . . . 49
  2.1 Introduction . . . 51
  2.2 Vietnamese syllable structure . . . 51
    2.2.1 Syllable structure . . . 52
    2.2.2 Syllable types . . . 55
  2.3 Vietnamese phonological system . . . 56
    2.3.1 Initial consonants . . . 56
    2.3.2 Final consonants . . . 56
    2.3.3 Medials or Pre-tonal sounds . . . 58
    2.3.4 Vowels and diphthongs . . . 58
  2.4 Vietnamese lexical tones . . . 60
    2.4.1 Tone system . . . 60
    2.4.2 Phonetics and phonology of tone . . . 61
    2.4.3 Tonal coarticulation . . . 63
  2.5 Grapheme-to-phoneme rules . . . 63
    2.5.1 X-SAMPA representation . . . 64
    2.5.2 Rules for consonants . . . 64
    2.5.3 Rules for vowels/diphthongs . . . 65
  2.6 Tonophone set . . . 66
    2.6.1 Tonophone . . . 66
    2.6.2 Tonophone set . . . 67
    2.6.3 Acoustic-phonetic tonophone set . . . 67
  2.7 PRO-SYLDIC, a pronounceable syllable dictionary . . . 69
    2.7.1 Syllable-orthographic rules . . . 69
    2.7.2 Pronounceable rhymes . . . 70
    2.7.3 PRO-SYLDIC . . . 71
  2.8 Conclusion . . . 72

3 Corpus design, recording and pre-processing . . . 75
  3.1 Introduction . . . 77
  3.2 Raw text . . . 78
    3.2.1 Rich and balanced corpus . . . 78
    3.2.2 Raw text from different sources . . . 78
  3.3 Text pre-processing . . . 79
    3.3.1 Main tasks . . . 79
    3.3.2 Sentence segmentation . . . 80
    3.3.3 Tokenization into syllables and NSWs . . . 80
    3.3.4 Text cleaning . . . 81
    3.3.5 Text normalization . . . 81
    3.3.6 Text transcription . . . 82
  3.4 Phonemic distribution . . . 83
    3.4.1 Di-tonophone . . . 83
    3.4.2 Theoretical speech unit sets . . . 83
    3.4.3 Real speech unit sets . . . 84
    3.4.4 Distribution of speech units . . . 84
  3.5 Corpus design . . . 86
    3.5.1 Design process . . . 86
    3.5.2 The constraint of size . . . 88
    3.5.3 Full coverage of syllables and di-tonophones . . . 89
    3.5.4 VDTS corpus . . . 90
  3.6 Corpus recording . . . 91
    3.6.1 Recording environment . . . 91
    3.6.2 Quality control . . . 92
  3.7 Corpus preprocessing . . . 93
    3.7.1 Normalizing margin pauses . . . 93
    3.7.2 Automatic labeling . . . 93
    3.7.3 The VDTS speech corpus . . . 95
  3.8 Conclusion . . . 95

4 Prosodic phrasing modeling . . . 99
  4.1 Introduction . . . 101
  4.2 Analysis corpora and Performance evaluation . . . 103
    4.2.1 Analysis corpora . . . 103
    4.2.2 Precision, Recall and F-score . . . 105
    4.2.3 Syntactic parsing evaluation . . . 106
    4.2.4 Pause prediction evaluation . . . 107
  4.3 Vietnamese syntactic parsing . . . 107
    4.3.1 Syntax theory . . . 107
    4.3.2 Vietnamese syntax . . . 110
    4.3.3 Syntactic parsing techniques . . . 114
    4.3.4 Adoption of parsing model . . . 115
    4.3.5 VTParser, a Vietnamese syntactic parser for TTS . . . 117
  4.4 Preliminary proposal on syntactic rules and breaks . . . 119
    4.4.1 Proposal process . . . 119
    4.4.2 Proposal of syntactic rules . . . 120
    4.4.3 Rule application and analysis . . . 121
    4.4.4 Evaluation of pause detection . . . 123
  4.5 Simple prosodic phrasing model using syntactic blocks . . . 125
    4.5.1 Duration patterns of breath groups . . . 126
    4.5.2 Duration pattern of syllable ancestors . . . 128
    4.5.3 Proposal of syntactic blocks . . . 132
    4.5.4 Optimization of syntactic block size . . . 133
    4.5.5 Simple model for final lengthening and pause prediction . . . 134
  4.6 Single-syllable-block-grouping model for final lengthening . . . 137
    4.6.1 Issue with single syllable blocks . . . 137
    4.6.2 Combination of single syllable blocks . . . 137
  4.7 Syntactic-block+link+POS model for pause prediction . . . 139
    4.7.1 Proposal of syntactic link . . . 139
    4.7.2 Rule-based model . . . 141
    4.7.3 Predictive model with J48 . . . 143
  4.8 Conclusion . . . 145

5 VTED, a Vietnamese HMM-based TTS system . . . 147
  5.1 Introduction . . . 149
  5.2 Typical HMM-based speech synthesis . . . 149
    5.2.1 Hidden Markov Model . . . 149
    5.2.2 Speech parameter modeling . . . 151
    5.2.3 Contextual features . . . 152
    5.2.4 Speech parameter generation . . . 154
    5.2.5 Waveform reconstruction with vocoder . . . 155
  5.3 Proposed architecture . . . 156
    5.3.1 Natural Language Processing (NLP) part . . . 157
    5.3.2 Training part . . . 158
    5.3.3 Synthesis part . . . 158
  5.4 Vietnamese contextual features . . . 158
    5.4.1 Basic Vietnamese training feature set . . . 158
    5.4.2 ToBI-related features . . . 160
    5.4.3 Prosodic phrasing features . . . 161
  5.5 Development platform and configurations . . . 163
    5.5.1 Mary TTS, a multilingual platform for TTS . . . 163
    5.5.2 Mary TTS workflow of adding a new language . . . 163
    5.5.3 HMM-based voice training for VTED . . . 164
  5.6 Vietnamese NLP for TTS . . . 167
    5.6.1 Word segmentation . . . 167
    5.6.2 Text normalization (vted-normalizer) . . . 168
    5.6.3 Grapheme-to-phoneme conversion (vted-g2p) . . . 171
    5.6.4 Part-of-speech (POS) tagger . . . 171
    5.6.5 Prosody modeling . . . 172
    5.6.6 Feature Processing . . . 173
  5.7 VTED training voices . . . 173
  5.8 Conclusion . . . 174

6 Perceptual evaluations . . . 177
  6.1 Introduction . . . 179
  6.2 Evaluations of ToBI features . . . 180
    6.2.1 Subjective evaluation . . . 180
    6.2.2 Objective evaluation . . . 181
  6.3 Evaluations of general naturalness . . . 184
    6.3.1 Initial test . . . 184
    6.3.2 Final test . . . 185
    6.3.3 Discussion on the two tests . . . 187
  6.4 Evaluations of general intelligibility . . . 187
    6.4.1 Measurement . . . 187
    6.4.2 Preliminary test . . . 188
    6.4.3 Final test with Latin square . . . 189
  6.5 Evaluations of tone intelligibility . . . 191
    6.5.1 Stimuli and paradigm . . . 191
    6.5.2 Initial test . . . 192
    6.5.3 Final test . . . 194
    6.5.4 Confusion in tone intelligibility . . . 196
  6.6 Evaluations of prosodic phrasing model . . . 197
    6.6.1 Evaluations of model using syntactic rules . . . 198
    6.6.2 Evaluations of model using syntactic blocks . . . 199
  6.7 Conclusion . . . 200

7 Conclusions and perspectives . . . 203
  7.1 Contributions and conclusions . . . 205
    7.1.1 Adopting technique and performing literature reviews . . . 205
    7.1.2 Proposing a new speech unit – tonophone . . . 207
    7.1.3 Designing and building a new corpus . . . 207
    7.1.4 Proposing a prosodic phrasing model . . . 209
    7.1.5 Designing and constructing VTED . . . 211
    7.1.6 Evaluating the TTS system . . . 211
  7.2 Perspectives . . . 213
    7.2.1 Improvement of synthetic voice quality . . . 213
    7.2.2 TTS for other Vietnamese dialects . . . 214
    7.2.3 Expressive speech synthesis . . . 215
    7.2.4 Voice reader . . . 215
    7.2.5 Reading machine . . . 215

List of publications . . . 217

A Vietnamese syntax parsing . . . 219
  A.1 Syntax theory . . . 221
    A.1.1 Syntax and grammar . . . 221
    A.1.2 Parts Of Speech (POS) . . . 222
    A.1.3 Phrase structure grammar . . . 223
    A.1.4 Dependency structure grammar . . . 225
  A.2 Syntactic parsing techniques . . . 227
    A.2.1 Treebank corpus . . . 228
    A.2.2 Generative models . . . 228
    A.2.3 Discriminative models . . . 230
    A.2.4 Perceptron . . . 231
    A.2.5 Advanced parsing methods . . . 234
  A.3 Vietnamese classifiers . . . 234

B Corpus design and prosodic phrasing modeling . . . 237
  B.1 Semi-automatic correction of breath noise labeling . . . 239
  B.2 VNSP-ThuTrang . . . 240
  B.3 Syntactic rules . . . 241
    B.3.1 Formal symbols representing syntactic rules . . . 241
    B.3.2 Proposal of syntactic rules . . . 242
  B.4 Breath groups and syllable ancestors . . . 244
  B.5 Syntactic blocks . . . 249
  B.6 Algorithm of syntactic-block division . . . 250
  B.7 Syntactic-block+link+POS model . . . 251

C VTED design, construction and perceptual evaluations . . . 255
  C.1 The ToBI transcription model . . . 257
  C.2 Mary TTS platform . . . 258
  C.3 Examples of test GUI screens . . . 259
  C.4 Test corpus examples . . . 261

Bibliography . . . 267

Abstract / Résumé . . . 283

Notations and Abbreviations

A: Adjective
ADT: ADjuncT
ANOVA: Analysis Of Variance
AP: Adjective Phrase
C: subordinate Conjunction
CALM: Causal-Anticausal Linear filter Model
CART: Classification And Regression Tree
CC: Coordinate Conjunction
CD-HMM: Continuous Distribution HMM
CFG: Context-free Grammars
DFKI: German Research Center for Artificial Intelligence
DRT: Diagnostic Rhyme Test
DSP: Digital Signal Processing
E: prEposition
EHMM: a labeler included in the festvox project (http://festvox.org/)
EM: Expectation Maximization
F0: Fundamental Frequency
G2P: Grapheme-To-Phoneme
GUI: Graphical User Interface
H: Head (head element of a syntactic phrase)
HMM: Hidden Markov Model
HTK: Hidden Markov Model ToolKit (a portable toolkit for building and manipulating hidden Markov models)
HTS: HMM-based speech synthesis system
I: Interjection
IoIT: Institute of Information Technology
IPA: International Phonetic Alphabet
J48: the Java implementation of the C4.5 algorithm
JAWS: Job Access With Speech (the world’s most popular screen reader)
L: Determiner
LCA: Lowest Common Ancestor
LMA: Log Magnitude Approximation
LP: Linear Prediction
LPCFG: Lexical Probabilistic Context-free Grammars
LTAG: Lexicalized Tree-Adjoining Grammars
M: nuMeral
MARY (TTS): Modular Architecture for Research on speech sYnthesis
MEASYLDIC: MEAningful SYLlable DICtionary (a pronunciation dictionary including all meaningful Vietnamese syllables extracted from a huge raw text)
MFCC: Mel Frequency Cepstral Coefficients
ML: Maximum Likelihood
MLSA: Mel Log Spectrum Approximation
MOS: Mean Opinion Score
MSDHMM: Multi-Space probability Distribution HMM
N: Noun
NLP: Natural Language Processing
NP: Noun Phrase
Np: Pronoun
NSW: Non-Standard Word
Nu: Unit noun
OBJ: OBJect (primary object)
OBJ2: OBJect 2 (secondary object)
OBL: OBLique
OCR: Optical Character Recognition
PCFG: Probabilistic Context-free Grammars
PDF: Probability Density Function
POS: Part-Of-Speech (word class or lexical category)
PP: Prepositional Phrase
PRD: PReDicate
PRO-SYLDIC: PROnounceable SYLlable DICtionary (a pronunciation dictionary including all pronounceable Vietnamese syllables)
PSOLA: Pitch Synchronous OverLap and Add
R: adveRb
S: main or independent clause
SAMPA: Speech Assessment Methods Phonetic Alphabet
SBAR/SB: Subordinate or dependent clause
SPTK: Speech signal Processing ToolKit
SSML: Speech Synthesis Markup Language
SUB: SUBject
T: Auxiliary/modal words
TDPSOLA: Time-Domain Pitch Synchronous OverLap and Add
ToBI: Tones and Break Indices (a set of conventions for transcribing and annotating the prosody of speech)
TTS: Text-To-Speech
UCP: a phrase including two or more head elements in different categories, connected by a coordinating conjunction
V: Verb
VCL: Vietnam Lexicography Centre
VDTO: Vietnamese Di-Tonophone and Others (the analysis corpus including VDTS and other recorded sentences)
VDTS: Vietnamese Di-Tonophone Speech (the final training corpus designed for VTED)
VEVA: VTED EVAluation tool
VNSP: VNSpeechCorpus for synthesis
VOS: Voice Of Southern Vietnam (a Vietnamese TTS system, http://www.ailab.hcmus.edu.vn/)
VP: Verb Phrase
VSYL: Vietnamese SYLlable (a newly designed corpus with 100% syllable coverage)
VTed/VTED: Vietnamese TExt-to-speech Development system
VTParser: an adopted Vietnamese parser using an averaged perceptron and a shift-reduce parsing algorithm
WEKA: Waikato Environment for Knowledge Analysis (a collection of machine learning algorithms for data mining tasks)
WinPitch: a Windows speech analysis program optimized for intonation research
X-SAMPA: Extended Speech Assessment Methods Phonetic Alphabet
XML: eXtensible Markup Language
XP: Unclassified phrases
Z: Bound morphemes

List of Tables

1.1 Unit-selection and HMM-based speech synthesis . . . 37

2.1 Structure-based types of Vietnamese syllables . . . 55
2.2 Hanoi Vietnamese initial consonants . . . 56
2.3 Hanoi Vietnamese final consonants . . . 57
2.4 Hanoi Vietnamese vowels and diphthongs . . . 58
2.5 Hanoi Vietnamese tones (Ferlus, 2001, p. 298) . . . 61
2.6 Vietnamese tones . . . 63
2.7 Hanoi Vietnamese initial/final consonants: grapheme (orthography) to phoneme . . . 64
2.8 Hanoi Vietnamese vowels/diphthongs: grapheme (orthography) to phoneme . . . 65
2.9 Vietnamese tonophone set . . . 67
2.10 Hanoi Vietnamese acoustic-phonetic tonophones – consonants . . . 68
2.11 Hanoi Vietnamese acoustic-phonetic tonophones – vowels . . . 69
2.12 Hanoi Vietnamese pronounceable rhymes (*: does not exist but is pronounceable). The medial orthography “o” (e.g. “oanh”) is changed to “u” if the initial is /k/ (whose orthography must be “q”), e.g. “loanh quanh” (to go around); some rhymes do not exist yet are pronounceable: (q)uec, (q)ueng, (q)uep, (q)uem, (q)uap, (q)uam . . . 71

3.1 The final raw data for Vietnamese corpus design . . . 81
3.2 Number of speech units in theory and in the raw text . . . 85
3.3 Distribution of top 9 frequent (p1-9) and rare (r9-1) speech units of the raw text . . . 85
3.4 Number of di-phones/di-tonophones having small frequencies . . . 86
3.5 New corpora designed with the same size as the old one, VNSP. SAME: candidate sentences containing the most frequent uncovered unit; SAME-B: candidate sentences containing the rarest one . . . 89
3.6 VSYL – the corpus with a complete syllable coverage, and VDTS – the corpus with a complete di-tonophone coverage . . . 90

4.1 VDTO analysis and test corpus . . . 105
4.2 VietTreebank corpus (Nguyen et al., 2009, p. 14) . . . 110
4.3 Vietnamese POS tag set (Le et al., 2010, p. 14) . . . 111
4.4 F-score of the adopted parsing system on the English test set, compared with state-of-the-art parsers . . . 117
4.5 Results of the experiment comparing different Vietnamese parsers . . . 118
4.6 Experimental results of three syntax parsing types for Vietnamese . . . 119
4.7 ANOVA results of syntactic rules and break indices on pause length and final lengthening . . . 122
4.8 Detailed precision of syntactic rules in VNSP-Broadcaster and VDTO-Analysis . . . 123
4.9 Evaluation of syntactic rules in the VNSP-Broadcaster and VDTO-Analysis corpora . . . 125
4.10 Summary of syllable counts of breath groups and different-level ancestors . . . 132
4.11 Different limits for syntactic blocks (n=6..17;27) . . . 134
4.12 Improvement of pause prediction with rules using syntactic link and POS predictors . . . 142
4.13 Performance of three rule-based models for pause prediction using syntactic block, syntactic link and POS. T1 pau: pauses predicted by blocks having at least 5 syllables; T2 pau: pauses predicted by blocks having from 2 to 4 syllables . . . 143
4.14 Performance of pause predictive models with J48 using different contextual features, and of the rule-based model . . . 144

5.1 Prosodic phrasing rules . . . 160
5.2 ToBI boundary tones (i.e. intonation rules) for phrases . . . 161
5.3 New training features for final lengthening from syntactic blocks . . . 162
5.4 Some HMM configuration values for VTED . . . 165
5.5 Vietnamese Non-Standard Word categorization for VTED . . . 170
5.6 HoaSung TTS and different versions of VTED . . . 174

6.1 ANOVA results of the pair-wise comparison test . . . 181
6.2 ANOVA results of the MOS test for initial VTED versions . . . 185
6.3 ANOVA results of the MOS test for initial and final VTED versions . . . 186
6.4 The Latin square design for three voices (1, 2, 3) . . . 189
6.5 Coverage of tone pairs of the test corpus for the tone intelligibility test . . . 192
6.6 ANOVA results of the initial tone intelligibility test: first version of VTED and natural speech . . . 194
6.7 ANOVA results of the final tone intelligibility test . . . 196
6.8 Tone confusion of the initial version VTED1 in the first tone intelligibility test . . . 196
6.9 Tone confusion of the last version VTED5 and natural speech in the final tone intelligibility test . . . 197
6.10 ANOVA results of the MOS test and pair-wise comparison . . . 199

B.1 VNSP corpus with Broadcaster and ThuTrang voices . . . 241
B.2 Constituent syntactic rules . . . 243
B.3 Functional rules . . . 244

C.1 Test corpus examples of the tone intelligibility test . . . 261
C.2 Test corpus examples of the MOS test . . . 262
C.3 Test corpus examples of the intelligibility test . . . 263
C.4 Test corpus examples of the pair-wise preference test using syntactic rules . . . 264
C.5 Test corpus examples of the pair-wise preference test using syntactic blocks, links and POS . . . 265

List of Figures

1.1  Basic architecture of a TTS system (NLP: Natural Language Processing, DSP: Digital Signal Processing)
1.2  General and clustering-based unit-selection scheme: solid lines represent target costs and dashed lines represent concatenation costs (Zen et al., 2009)
1.3  Core architecture of an HMM-based speech synthesis system (Yoshimura, 2002)
1.4  General HMM-based synthesis scheme (Zen et al., 2009, p. 5)
2.1  The position of "medial" /w/ in Vietnamese syllables: (a) Thompson (1987) and (b) Vogel et al. (2004)
2.2  The hierarchical structure of Vietnamese syllables by Doan (1977)
2.3  The concluded hierarchical structure of Vietnamese syllables
2.4  Location of Vietnamese diphthong centroids (Kirby, 2011)
2.5  Eight tone templates of Vietnamese tones (Michaud, 2004): A1 (level tone 1), A2 (falling tone 2), C2 (broken tone 3), C1 (curve tone 4), B1 (rising tone in sonorant-final syllables – 5a), D1 (rising tone in obstruent-final syllables – 5b), B2 (drop tone in sonorant-final syllables – 6a) and D2 (drop tone in obstruent-final syllables – 6b)
3.1  Main tasks in raw text pre-processing
3.2  Corpus design: repetitions of selection processes
3.3  Soundproof vocal booth. The iPad screen was put in a suitable and straight position for the speaker. The anti-pop filter was in front of the microphone
3.4  An example of transcription files (TextGrid) for the sentence "Lão muốn gì lão làm cho bằng được" [law-4 mu@n-5a zi-2 law-4 lam-2 tCO-1 á˘aN-2 âW@k-6b] in (a) the old speech corpus by a broadcaster (manually labeled) and (b) the new speech corpus by ThuTrang (automatically labeled)
4.1  An example of a syntax tree using constituent parsing with grammar-functional labels: (a) hierarchical tree and (b) XML format
4.2  Classification of clausal elements (Kroeger, 2005, p. 62)
4.3  An example of a sentence annotated in VietTreebank using: (a) brackets and (b) a hierarchical tree
4.4  General approach for prosodic phrasing modeling using syntactic rules
4.5  An example of rule application to a syntax tree and transcription file. This process was automatically performed by our program
4.6  Final lengthening (ZScore) and Log(Pause) of predicted break indices
4.7  Distributions of pause length of predicted boundaries by break indices using syntactic rules in (a) VNSP-Broadcaster and (b) VDTO-Analysis
4.8  Distributions of non-final/final syllable duration (ZScore) of breath groups in the VDTO-Analysis corpus, factored by syllable numbers of breath groups. Breath groups having more than 24 syllables were excluded
4.9  Distributions of syllable durations (ZScore) by syllable positions in breath groups in the VTDO-Analysis corpus, factored by syllable numbers of breath groups. Breath groups having more than 18 syllables were excluded
4.10 Highest (1st, 2nd and 3rd level) and lowest ancestors of the syllable "Phương" (Phuong) in the syntax tree
4.11 Distributions of syllable durations (ZScore) by syllable positions in lowest ancestors in the VTDO-Analysis corpus, factored by syllable numbers of these ancestors. Breath groups having more than 18 syllables were excluded
4.12 Distributions of syllable durations (ZScore) by syllable positions in second-level ancestors, factored by syllable numbers of these ancestors. Last syllables of higher-level ancestors and syllables with subsequent pauses were excluded. Ancestors having more than 18 syllables were excluded
4.13 Distributions of duration (ZScore) and subsequent pause length (log scale) of final syllables of syntactic blocks with a maximum of 17 syllables, factored by syllable numbers of these blocks. If there was no subsequent pause, log(pause) was set to 0 for ease of representation
4.14 Distributions of duration (ZScore normalization) of final syllables of syntactic blocks with a maximum of 6 syllables, factored by syllable numbers of these blocks
4.15 Distributions of pause length of final syllables of syntactic blocks with a maximum of 10 syllables, factored by syllable numbers of these blocks
4.16 Examples of combining single-syllable syntactic blocks with the next block
4.17 Examples of combining single-syllable syntactic blocks with the previous block
4.18 Exception cases for combining single-syllable syntactic blocks
4.19 Distributions of normalized duration (ZScore) of final syllables of combined syntactic blocks with a maximum syllable number of 6, factored by syllable numbers of these blocks
4.20 Example of syntactic links in syntax trees
4.21 Distributions of pause length after the last syllables of syntactic blocks having (a) at least 5 syllables; (b) from 2 to 4 syllables. The x-axis shows syntactic links of these last syllables, factored by their next syntactic links
5.1  Examples of HMM structure (Masuko, 2002)
5.2  Output distributions (PDF: Probability Density Function)
5.3  Basic structure of a feature vector modeled by HMM (Yoshimura, 2002)
5.4  Unified framework of HMM (Yoshimura, 2002)
5.5  Decision trees for context clustering (Yoshimura, 2002)
5.6  Relation between probability density function and generated parameters for the Japanese phrase "unagi" (top: static, middle: delta, bottom: delta-delta) (Yoshimura, 2002). A smooth trajectory is generated from a discrete sequence of distributions by taking the statistical properties of the delta and delta-delta coefficients into account
5.7  Traditional excitation model
5.8  Mixed excitation model (Yoshimura et al., 2005)
5.9  Proposed architecture of the HMM-based TTS system for Vietnamese


5.10 HMM-based voice training in Mary TTS (Würgler, 2011)
5.11 Overlap ambiguity of Vietnamese word segmentation (graph representation) (Le et al., 2008)
5.12 Normalization model for Vietnamese (NSWs: Non-Standard Words)
6.1  Preference rate of VNSP-VTed1 (With ToBI) and VNSP-VTed2 (Without ToBI)
6.2  Preference rate by lexical tones and boundary modes with a 3-point scale: (−1) VNSP-VTed2 (Without ToBI), (0) The same, and (+1) VNSP-VTed1 (With ToBI)
6.3  Discontinuity in spectrum and F0 in (b) VNSP-VTed1 (With ToBI) compared to (a) VNSP-VTed2 (Without ToBI) for "tốt" [tot-5b] (good) in "… càng nhiều càng tốt" [kaN-1 ñiew-2 kaN-1 tot-5b] (as much as possible)
6.4  Unexpected voice quality in (b) With ToBI compared to (a) Without ToBI for "mét" (meter) [mEt-5b] in "bao nhiêu mét" [baw-1 ñiew-1 mEt-5b] (how many meters)
6.5  Naturalness score (MOS test) of the initial HMM-based TTS system VTED and the non-uniform unit-selection TTS system HoaSung, with a natural speech reference
6.6  Naturalness score (MOS test) of initial and final versions of VTED, and two natural voices
6.7  Error rates of intelligibility in utterance elements
6.8  Error rates of initial and final VTED and natural speech at phoneme, tone and syllable levels. The test was designed based on a 3x3 Latin square matrix
6.9  Edit operations of initial and final VTED and natural speech at phoneme, tone and syllable levels. The test was designed based on a 3x3 Latin square matrix
6.10 Correct rates of tone intelligibility of the initial system
6.11 Correct rates by tone types of tone intelligibility
6.12 Correct rates of the final tone intelligibility test
6.13 Correct rates by tone types of the final tone intelligibility test
6.14 Pair-wise comparison of VTED-VNSP with/without prosodic phrasing model using syntactic rules (manual)
6.15 MOS score of VTED-VNSP with/without prosodic phrasing model using syntactic rules (manual)
6.16 Pair-wise comparison of VTED-VDTS with/without prosodic phrasing model using syntactic blocks (automatic)
A.1  Language as a correlation between gestures and meaning (Valin, 2001, p. 3)
A.2  Classification of clausal elements (Kroeger, 2005, p. 62)
A.3  The original perceptron learning algorithm
A.4  The voted perceptron algorithm
A.5  The averaged perceptron learning algorithm
B.1  Breath noises wrongly labeled as part of the previous segments [j k i] in the carrying text: (a) do nghị sĩ Chang Young Dal, và đoàn Nga [], (b) nghị sĩ klyus viktor - alexandrovich, (c) …
B.2  Breath noises wrongly labeled as part of the next segments [v a b]
B.3  Distribution of Breath Group length in VTDO-Analysis


B.4  Distributions of syllables' durations (ZScore) by positions in highest ancestors in the VTDO-Analysis corpus, factored by syllable numbers of highest ancestors. Ancestors having more than 24 syllables were excluded
B.5  Distributions of syllable duration differences (Delta ZScore) by positions in lowest ancestors in the VTDO-Analysis corpus, factored by syllable numbers of these ancestors. Last syllables of higher-level ancestors and syllables with subsequent pauses were excluded. Ancestors having more than 24 syllables were excluded
B.6  Distributions of syllable durations (ZScore) by positions in syntactic blocks with a maximum of 27 syllables, factored by syllable numbers of these blocks. Syllables with subsequent pauses were excluded. Ancestors having more than 18 syllables were excluded
B.7  Distributions of duration (ZScore normalization) of final syllables of syntactic blocks with a maximum of 17 syllables, factored by syllable numbers of these blocks
B.8  Distributions of pause length of final syllables of syntactic blocks with a maximum of 17 syllables, factored by syllable numbers of these blocks
B.9  Distributions of pause length after the last syllables of syntactic blocks having (a) at least 5 syllables (ambiguous cases with next and current syntactic links of 2-2, 2-3, 3-2, 3-4); (b) from 2 to 4 syllables (ambiguous cases with next and current syntactic links of 2-1, 2-2, 2-h2, 3-1, 3-l1, 4-2). The x-axis shows next POSs of these last syllables
B.10 Distributions of pause length after the last syllables of syntactic blocks having at least 2 syllables. The x-axis shows the next syntactic link of these last syllables
B.11 Distributions of pause length after the last syllables of syntactic blocks having (a) at least 5 syllables (ambiguous cases with next POSs of "CC"); (b) from 2 to 4 syllables (ambiguous cases with next POSs of "L" or "M"). The x-axis shows the POS of these last syllables, factored by next POSs
C.1  Overall process to add a new language to Mary TTS
C.2  GUI of MOS test (naturalness)
C.3  GUI of Intelligibility test
C.4  GUI of Tone intelligibility test
C.5  GUI of Pair-wise preference test


Lists of Media files
• The PhD student page, including the thesis introduction, list of publications with softcopies, demo voices, etc., is available at https://perso.limsi.fr/trangntt.
• The online demonstration of the VTED system is available at https://perso.limsi.fr/trangntt/online-demo.
• The demo voices are available at https://perso.limsi.fr/trangntt/demo-voices or via the "Demo voices" menu of the PhD student page. This webpage includes several samples for the different perception tests:
– MOS test
– Intelligibility test
– Tone intelligibility test
– Pair-wise comparison test.

Chapter 1

Vietnamese Text-To-Speech: Current state and Issues

Contents
1.1  Introduction
1.2  Text-To-Speech (TTS)
     1.2.1  Applications of speech synthesis
     1.2.2  Basic architecture of TTS
     1.2.3  Source/filter synthesizer
     1.2.4  Concatenative synthesizer
1.3  Unit selection and statistical parametric synthesis
     1.3.1  From concatenation to unit-selection synthesis
     1.3.2  From vocoding to statistical parametric synthesis
     1.3.3  Pros and cons
1.4  Vietnamese language
1.5  Current state of Vietnamese TTS
     1.5.1  Unit selection Vietnamese TTS
     1.5.2  HMM-based Vietnamese TTS
1.6  Main issues on Vietnamese TTS
     1.6.1  Building phone and feature sets
     1.6.2  Corpus availability and design
     1.6.3  Building a complete TTS system
     1.6.4  Prosodic phrasing modeling
     1.6.5  Perceptual evaluations with respect to lexical tones
1.7  Proposition and structure of dissertation

1.1 Introduction

Building systems that mimic human capabilities in understanding, generating or coding speech, for a range of human-to-human and human-to-machine interactions, has attracted increasing interest in recent years. One important task in building such systems is to produce human speech artificially. This field of study is known both as speech synthesis, i.e. the generation of synthetic speech, and Text-To-Speech (TTS), i.e. the conversion of written text to machine-generated speech. A TTS system is one that reads text out loud through the computer's sound card or other speech synthesis devices. Vietnamese, the official language of Vietnam, is a tonal language, in which pitch is lexically contrastive: the tone can change the meaning of a word/syllable. Although Vietnamese TTS has recently received research attention across a number of synthesis techniques, a complete, high-quality TTS system for this language, built on an appropriate corpus, is still needed. The initial motivation of this work was to build a high-quality TTS system assisting Vietnamese blind people in accessing written text. The main objective of this research was then narrowed to building a high-quality TTS system with unlimited vocabulary. The Hidden Markov Model (HMM)-based speech synthesis technique 1, a statistical parametric approach that will be discussed in this chapter, was chosen for developing a Vietnamese TTS system due to its advantages in overall quality, footprint and robustness.
The initial tasks to build a high-quality TTS system for Vietnamese were outlined as follows:
• Studying Vietnamese phonetics and phonology to find a way to model the lexical tones in phonemes;
• Designing and recording a new corpus, covering both phonemic and tonal contexts, for Vietnamese TTS systems;
• Proposing a novel prosodic model to improve the quality of an HMM-based TTS system for Vietnamese;
• Designing a complete architecture and a contextual feature set for, and building, an HMM-based Vietnamese TTS system;
• Designing, carrying out and analyzing various perceptual evaluations of synthetic voices with respect to the lexical tones.
This chapter presents the current state and issues of Vietnamese TTS, from which the propositions of this research are derived. Section 1.2 presents the main applications and the basic architecture of TTS systems. This section also describes the two basic families of speech synthesis techniques: (i) source/filter synthesizers and (ii) concatenative synthesizers. Statistical parametric and unit-selection synthesis, the two prominent state-of-the-art speech synthesis techniques, are discussed in Section 1.3. Based on our initial motivation and their pros and cons, HMM-based speech synthesis, one of the best-known techniques in the statistical parametric approach, was chosen to develop VTED, a Vietnamese TTS system. Section 1.4 introduces the main characteristics of the Vietnamese language. Section 1.5 presents the current state of Vietnamese TTS, including existing real-life TTS software applications as well as related research. The main issues in Vietnamese TTS, which constitute the final motivation of this work, are discussed in Section 1.6.
1. A statistical Markov model for representing probability distributions over sequences of observations. The system being modeled is assumed to be a Markov process with unobserved (hidden) states.
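To make the footnote's definition concrete, here is a minimal sketch (not part of the thesis; all probabilities are invented toy values) that computes the likelihood of an observation sequence under a two-state discrete HMM with the forward algorithm. Real synthesis systems use continuous Gaussian emissions over acoustic features, but the recursion is the same in spirit.

```python
# Toy HMM: 2 hidden states, 2 observation symbols.
# All numbers below are illustrative, not from the thesis.
init = [0.6, 0.4]                    # initial state probabilities
trans = [[0.7, 0.3], [0.4, 0.6]]     # trans[i][j] = P(next state j | state i)
emit = [[0.9, 0.1], [0.2, 0.8]]      # emit[i][o]  = P(observation o | state i)

def forward_likelihood(obs):
    """P(observation sequence), summing over all hidden-state paths."""
    alpha = [init[s] * emit[s][obs[0]] for s in range(2)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(2)) * emit[s][o]
                 for s in range(2)]
    return sum(alpha)

print(forward_likelihood([0, 1, 0]))  # ≈ 0.1089
```

A sanity check on such a model: the likelihoods of all possible observation sequences of a fixed length sum to 1.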


1.2 Text-To-Speech (TTS)

1.2.1 Applications of speech synthesis

Over the last few decades, speech synthesis, or TTS, has drawn considerable attention and resources from both researchers and industry. The field has progressed remarkably in recent years, and it is no longer the case that state-of-the-art systems sound overtly mechanical and robotic. The concept of high-quality TTS synthesis appeared in the mid-eighties, as a result of important developments in speech synthesis and natural language processing techniques, mostly due to the emergence of new technologies. In recent years, considerable advances in quality have made TTS systems common in various domains and numerous applications. The first real-life use of TTS systems appears to have been helping blind people read, by converting text from a book into speech. Although the quality of these initial systems was very robotic, they were nevertheless adopted by blind people due to their availability compared to other options, such as reading braille or having a real person do the reading (Taylor, 2009). Nowadays, a number of TTS systems help blind users interact with computers. One of the most important and longest-standing applications for people with visual impairment is the screen reader, in which TTS helps users navigate an operating system. Blind people have also benefited widely from TTS systems combined with a scanner and Optical Character Recognition (OCR) software, which give them access to written information (Dutoit and Stylianou, 2003). More recently, TTS systems have come into common use by people with reading disorders (i.e. dyslexia) and other reading difficulties, as well as by preliterate children. These systems are also frequently employed to aid those with severe speech impairment, usually through a voice output communication aid (Hawley et al., 2013). TTS techniques have also widely aided disabled people in mass transit.
Nowadays, there exist a large number of talking books and toys that use speech synthesis technologies. High-quality TTS synthesis can be coupled with "a computer aided learning system, and provide a helpful tool to learn a new language". Speech synthesis techniques are also used in entertainment productions such as games and animation; for instance, NEC Biglobe 2 announced a web service that allows users to create phrases using the voices of characters from Code Geass 3, a Japanese anime series. TTS systems are also essential to other research fields, for instance as laboratory tools for linguists, or for vocal monitoring. Beyond this, TTS systems have been used for reading messages, electronic mail, news, stories, weather reports, travel directions and a wide variety of other applications. One of the main applications of TTS today is call-center automation, where textual information can be accessed over the telephone. In such systems, a user pays an electricity bill or books travel, conducting the entire transaction through an automatic dialogue system (Taylor, 2009; Dutoit, 1997). Another important use of TTS is in speech-based question answering systems (e.g. Yahoo!, 2009 4) or voice-search applications (e.g. Microsoft, 2009 5; Google, 2009 6), where speech recognition and the retrieval system are tightly coupled. In such systems, users can pose their information need in a natural input modality, i.e. spoken language, and then receive a collection of answers that potentially address the information need directly. On smartphones, some typical and well-known voice interactive applications

2. http://www.biglobe.co.jp/en/
3. http://www.geass.jp/
4. http://answers.yahoo.com/
5. http://www.live.com
6. http://www.google.com/mobile


whose main component is multi-lingual TTS are Google Now, Apple Siri, AOL, Nuance Nina, Samsung S-Voice, etc. In these applications, a virtual assistant allows users to perform a number of personalized commands and services effortlessly via a human-like conversational interface, such as authenticating, navigating menus and screens, querying information, or performing transactions.

1.2.2 Basic architecture of TTS

The basic architecture of a TTS system, illustrated in Figure 1.1, has two main parts (Dutoit and Stylianou, 2003) comprising four components (Huang et al., 2001). The first three – Text Processing, Grapheme-to-Phoneme (G2P) Conversion and Prosody Modeling – belong to the high-level speech synthesis, or Natural Language Processing (NLP), part of a TTS system. The low-level speech synthesis, or Digital Signal Processing (DSP), part – the fourth component – generates the synthetic speech using information from the high-level synthesis. The input of a TTS system can be either raw or tagged text. Tags can be used to assist text, phonetic, and prosodic analysis.

Figure 1.1 – Basic architecture of a TTS system (NLP: Natural Language Processing, DSP: Digital Signal Processing).

The Text Processing component transforms the input text into an appropriate form so that it becomes speakable. The G2P Conversion component converts orthographic lexical symbols (i.e. the output of the Text Processing component) into the corresponding phonetic sequence, i.e. a phonemic representation with possible diacritical information (e.g. position of the accent). Prosody Modeling attaches appropriate pitch, duration and other prosodic parameters to the phonetic sequence. Finally, the Speech Synthesis component takes the parameters from the fully tagged phonetic sequence to generate the corresponding speech waveform (Huang et al., 2001, p. 682). Depending on how much is known about the structure and content of the text an application wishes to speak, some components can be skipped. For instance, certain broad requirements such as rate and pitch can be indicated with simple command tags appropriately located in the text. An application that can extract much information about the structure and content of the text to be spoken can considerably improve the quality of the synthetic speech. If the input of the system already contains the phonetic form, the G2P Conversion module can be absent. In some cases,


an application may have F0 contours pre-calculated by some other process, e.g. transplanted from a real speaker's utterance. The quantitative prosodic controls in these cases can be treated as a "special tagged field and sent directly along with the phonetic stream to speech synthesis for voice rendition" (Huang et al., 2001, p. 6).

Text Processing. This component is responsible for "indicating all knowledge about the text or message that is not specifically phonetic or prosodic in nature". Its basic function is to convert non-orthographic items into speakable words. This is called text normalization: converting the variety of symbols, numbers, dates, abbreviations and other non-orthographic entities in a text into a "common orthographic transcription" suitable for subsequent phonetic conversion. It is also necessary to analyze white space, punctuation and other delimiters to determine the document structure. This information provides context for all later processes. Moreover, some elements of document structure, e.g. sentence breaking and paragraph segmentation, may have direct implications for prosody. Sophisticated syntactic and semantic analysis can be performed, if necessary, for further processes, e.g. to obtain the syntactic constituency and semantic features of words, phrases, clauses, and sentences (Huang et al., 2001, p. 682).

G2P Conversion. The task of this component is to "convert lexical orthographic symbols to phonemic representation" (i.e. phonemes, the basic units of sound) along with "possible diacritic information (e.g. stress placement)", or lexical tones in tonal languages. "Even though future TTS systems might be based on word sounding units with increasing storage technologies, homograph disambiguation and G2P conversion for new words (either true new words being invented over time or morphologically transformed words) are still necessary for systems to correctly utter every word.
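As a toy illustration of the text normalization step, the following sketch (hypothetical, not the thesis's normalizer, and using English items rather than Vietnamese for readability) expands an abbreviation and spells out digits as speakable words; the tiny lexicons are invented for the example.

```python
# Toy text normalization: expand a few non-orthographic items into
# speakable words. The lexicons below are illustrative only.
ABBREVIATIONS = {"Dr.": "Doctor", "etc.": "et cetera"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize(text):
    # Expand known abbreviations first.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    tokens = []
    for tok in text.split():
        if tok.isdigit():
            # Read numbers digit by digit; a real system would also
            # handle full numbers, dates, currency, and so on.
            tokens.extend(DIGITS[int(d)] for d in tok)
        else:
            tokens.append(tok)
    return " ".join(tokens)

print(normalize("Dr. Smith paid 42 dollars"))
# → Doctor Smith paid four two dollars
```

A production normalizer would instead be rule- or model-based per token class (NSW categories), as discussed later for VTED, but the input/output contract is the same: non-orthographic text in, orthographic words out.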
G2P conversion is trivial for languages where there is a simple relationship between orthography and phonology. Such a simple relationship can be well captured by a handful of rules. Languages such as Spanish and Finnish belong to this category and are referred to as phonetic languages. English, on the other hand, is far from a phonetic language, because English words often have many distinct origins". Letter-to-sound conversion can then be done by general letter-to-sound rules (or modules) together with a dictionary lookup to produce accurate pronunciations of any arbitrary word (Huang et al., 2001, p. 683).

Prosody Modeling. This component provides prosodic information (i.e. "an acoustic representation of prosody") to the parsed text and phone string derived from linguistic information. First, it is necessary to break a sentence into prosodic phrases, possibly separated by pauses, and to assign labels, such as emphasis, to different syllables or words within each prosodic phrase. Duration, measured in centiseconds (cs) or milliseconds (ms), is then predicted using rule-based (e.g. Klatt) or machine-learning (e.g. CART) methods. Pitch, a perceptual correlate of fundamental frequency (F0) in speech perception, expressed in Hz or fractional tones (semitones, quarter tones...), is then generated. F0, responsible for the perception of melody, is probably the most characteristic of all the prosodic dimensions; hence the generation of pitch contours is an extremely complicated, language-dependent problem. Intensity, expressed in decibels (dB), can also be modeled. Moreover, prosody depends not only on the linguistic content of a sentence, but also on the speaker and their mood/emotions. Different speaking styles can be used for a prosody generation system, and different prosodic representations can then be obtained (Huang et al., 2001).

Speech Synthesis.
This final component, the only one in the low-level synthesis part, takes the predicted information from the fully tagged phonetic sequence to generate the corresponding speech waveform. In general, there are currently two basic approaches to speech synthesis: (i) source/filter synthesizers, which produce "completely synthetic" voices using a source/filter model from a parametric representation of speech; and (ii) concatenative synthesizers, which concatenate pre-recorded human speech units in order to construct the utterance. The first approach has issues in generating speech parameters from the input text as well as generating good-quality speech from the parametric representation. In the second approach, signal processing modifications and several algorithms/strategies need to be employed to make the speech sound smooth and continuous, especially at join sections. Details on these approaches are presented in the next subsections.
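The four-component split described above can be sketched as a simple pipeline. Every function below is a hypothetical stub invented for illustration (the real components are far more complex); the point is only to show how information flows from raw text through the NLP stages to the DSP stage.

```python
# Hypothetical skeleton of the NLP (high-level) + DSP (low-level) pipeline.
# Each stage is a stub standing in for a real component.

def text_processing(raw):
    """Normalize and tokenize the input text (stub)."""
    return raw.lower().split()

def g2p_conversion(words):
    """Map each word to a phoneme string via a toy lexicon (stub)."""
    lexicon = {"hello": "h @ l ou", "world": "w @r l d"}
    return [lexicon.get(w, "<unk>") for w in words]

def prosody_modeling(phones):
    """Attach dummy duration (ms) and pitch (Hz) targets to each unit (stub)."""
    return [(p, 80, 120.0) for p in phones]

def speech_synthesis(tagged):
    """DSP stage: would generate a waveform; here we only describe it."""
    return f"waveform for {len(tagged)} tagged units"

def tts(raw):
    return speech_synthesis(prosody_modeling(g2p_conversion(text_processing(raw))))

print(tts("Hello world"))  # → waveform for 2 tagged units
```

Skipping a component, as discussed above, corresponds to dropping one stage of this chain when its output is already supplied with the input (e.g. pre-computed F0 targets).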

1.2.3 Source/filter synthesizer

The main idea of this type of synthesizer is to reproduce speech from its parametric representation using a source/filter model. This approach makes use of the classical acoustic theory of speech production, based on vocal tract models. "An impulse train is used to generate voiced sounds and a noise source to generate obstruent sounds. These are then passed through the filters to produce speech" (Taylor, 2009, p. 410). Formant synthesis and classical Linear Prediction (LP) are the basic techniques in this approach. Formant synthesis uses individually "controllable formant filters which can be set to produce accurate estimations of the vocal tract transfer function". The parameters of the formant synthesizer are determined by a set of rules which examine the phone characteristics and phone context. Very natural speech can be generated so long as the parameters are set very accurately; unfortunately, it is extremely hard to do this automatically. The inherent difficulty and complexity of designing formant rules by hand has led to this technique being largely abandoned for engineering purposes. In general, formant synthesis produces intelligible, often "clean"-sounding, but far from natural speech. The reasons for this are: (i) the "too simplistic" source model, and (ii) the "too simplistic" target and transition model, which misses many of the subtleties really involved in the dynamics of speech. While the shapes of the formant trajectories are measured from a spectrogram, the underlying process is one of motor control and muscle movement of the articulators (Taylor, 2009, p. 410). Classical Linear Prediction adopts the "all-pole vocal tract model", which is similar to formant synthesis with respect to the source and to the production of vowels. It differs in that all sounds are generated by an all-pole filter, whereas parallel filters are common in formant synthesis.
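The impulse-train-through-a-filter idea quoted above can be demonstrated in a few lines. The sketch below (an illustration under invented parameter values, not code from any cited system) feeds a 100 Hz impulse train through a single second-order digital resonator, the basic building block of a formant filter; real formant synthesizers cascade or parallel several such resonators and add a noise branch for obstruents.

```python
import math

def impulse_train(n_samples, period):
    """Voiced source: unit impulses every `period` samples."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def resonator(x, freq, bandwidth, fs):
    """Second-order all-pole resonator (one 'formant filter').

    Pole radius is set from the bandwidth, pole angle from the
    center frequency: y[n] = x[n] + a1*y[n-1] + a2*y[n-2].
    """
    r = math.exp(-math.pi * bandwidth / fs)   # pole radius
    theta = 2 * math.pi * freq / fs           # pole angle
    a1, a2 = 2 * r * math.cos(theta), -r * r
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = xn + a1 * y1 + a2 * y2
        y.append(yn)
        y1, y2 = yn, y1
    return y

fs = 16000
source = impulse_train(1600, period=fs // 100)              # 100 Hz pitch
speech = resonator(source, freq=700, bandwidth=130, fs=fs)  # one "formant"
```

The output is a damped oscillation near 700 Hz re-excited at every pitch pulse; the rule-based difficulty mentioned above lies in choosing `freq` and `bandwidth` trajectories for every phone in context.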
Its main strength is that the vocal tract parameters can be determined automatically from speech. Despite its ability to faithfully mimic the target and transition patterns of natural speech, standard LP synthesis has a significantly unnatural quality to it, often impressionistically described as “buzzy” or “metallic” sounding. While the vocal tract model parameters can be measured directly from real speech, an explicit impulse/noise model is still used for the source; the buzzy nature of the speech may be caused by this “overly simplistic” sound source (Taylor, 2009, p. 411). The main limitations of these techniques concern “not so much the generation of speech from the parametric representation, but rather the generation of these parameters from the input specification which is created by the text analysis process. The mapping between the specification and the parameters is highly complex, and seems beyond what we can express in explicit human derived rules, no matter how “expert” the rule designer” (Taylor, 2009, p. 412). Furthermore, acquiring data is fundamentally difficult, and improving naturalness often necessitates a considerable increase in the complexity of the synthesizer. The classical linear prediction technique can be considered “a partial solution to the complexities of specification to parameter mapping”, where the issue of generating the vocal tract parameters explicitly is bypassed by measuring them from data. The source parameters, however, are still “specified by an explicit model, which was identified as the main source of the unnaturalness” (Taylor, 2009, p. 412).
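As a concrete illustration of the classical LP scheme described above, the following sketch (not from the thesis; the resonator frequency, pole radius and F0 are arbitrary toy values) passes a pulse-train excitation through a single second-order all-pole filter:

```python
import math

def impulse_train(n_samples, f0, fs):
    """Pulse-train excitation: one unit impulse per pitch period."""
    period = int(fs / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def all_pole_filter(excitation, a):
    """Classical LP synthesis filter: y[n] = x[n] - sum_k a[k] * y[n-k]."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y

# Single resonance ("formant") near 500 Hz at fs = 8 kHz:
# poles at r * exp(+/- j*theta) give a[1] = -2r*cos(theta), a[2] = r^2.
fs, f0 = 8000, 100
r, theta = 0.97, 2 * math.pi * 500 / fs
a = [-2 * r * math.cos(theta), r * r]
speech = all_pole_filter(impulse_train(800, f0, fs), a)  # 0.1 s of "voiced" sound
```

Swapping the impulse train for white noise through the same filter yields obstruent-like sounds; the “buzzy” quality of classical LP discussed above stems from exactly this over-simple excitation. A real LP synthesizer estimates a higher-order coefficient set from recorded speech, frame by frame.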


Chapter 1. Vietnamese Text-To-Speech: Current state and Issues

A new type of glottal flow model, the Causal-Anticausal Linear filter Model (CALM), was proposed by Doval et al. (2003). The main idea was to establish a link between two seemingly incompatible approaches to voice source modeling: spectral modeling and time-domain modeling. Both approaches can be envisaged in a unified framework, where time-domain models can be represented, or at least approximated, by a mixed CALM. The “source/filter” model can then be considered an “excitation/filter” model: the non-linear part of the source model is associated with the excitation (i.e., quasi-periodic impulses), and the mixed causal-anticausal linear part of the model is associated with the filter component, without loss of rigor.

1.2.4 Concatenative synthesizer

This type of synthesizer is based on the idea of concatenating pieces of pre-recorded human speech to construct an utterance. The approach can be viewed as an extension of the classical LP technique, with a noticeable increase in quality, largely arising from the abandonment of the over-simplistic impulse/noise source model. It differs from classical linear prediction in that the source waveform is generated from templates/samples (i.e., instances of speech units). The input to the source, however, is “still controlled by an explicit model”, e.g. “an explicit F0 generation model of the type that generates an F0 value every 10ms” (Taylor, 2009, p. 412). During database creation, each recorded utterance is segmented into individual phones, di-phones, half-syllables, syllables, morphemes, words, phrases or sentences. The choice of speech unit considerably affects a TTS system: a system that stores phones or di-phones provides the largest output range, but may lack clarity, while for specific (limited) domains, storing entire words, phrases or sentences allows high-quality output. Di-phones are the most popular type of speech unit, so a di-phone system is the typical concatenative synthesis system. The synthesis specification is a list of items, each with a verbal specification, one or more pitch values, and a duration. The prosodic content is generated by explicit algorithms, while signal processing techniques are used to modify the pitch and timing of the di-phones to match the specification. Pitch-Synchronous OverLap and Add (PSOLA), a traditional method for synthesis, operates in the time domain: it separates the original speech into frames pitch-synchronously and performs modification by overlapping and adding these frames onto a new set of epochs, created to match the synthesis specification. Other techniques developed to modify pitch and timing can be found in Taylor (2009).
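The overlap-and-add core of TD-PSOLA can be sketched in a few lines (a toy illustration under strong assumptions: the epochs are given rather than detected, and the frame length is fixed at one pitch period):

```python
import math

def hann(n):
    """Symmetric Hann window of n samples."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def td_psola(signal, epochs, new_epochs, frame_len):
    """Cut two-period windowed frames around each analysis epoch and
    overlap-add them onto a new set of synthesis epochs."""
    out = [0.0] * (max(new_epochs) + frame_len)
    w = hann(2 * frame_len + 1)
    for src, dst in zip(epochs, new_epochs):
        for k in range(-frame_len, frame_len + 1):
            if 0 <= src + k < len(signal) and 0 <= dst + k < len(out):
                out[dst + k] += w[k + frame_len] * signal[src + k]
    return out

# Lower the pitch of a 100 Hz tone by spreading the epochs apart.
fs = 8000
signal = [math.sin(2 * math.pi * 100 * i / fs) for i in range(800)]
epochs = list(range(0, 800, 80))            # one epoch per pitch period
new_epochs = [e * 5 // 4 for e in epochs]   # target period 100 samples (80 Hz)
out = td_psola(signal, epochs, new_epochs, frame_len=80)
```

Duration is modified in the same framework, by duplicating or deleting frames before the overlap-add.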
While this is successful to a certain extent, it is not a perfect solution. We can “never collect enough data to cover all the effects we wish to synthesize, and often the coverage we have in the database is very uneven. Furthermore, the concatenative approach always limits us to recreating what we have recorded; in a sense all we are doing is reordering the original data” (Taylor, 2009, p. 435). Another obvious issue is how to successfully join sections of a waveform so that the joins cannot be heard and the final speech sounds smooth, continuous, and not obviously concatenated. The quality of these techniques is considerably higher than that of classical, impulse-excited linear prediction. They all have roughly similar quality, so the choice among them is mostly made on other criteria, such as speed and storage.

1.3 Unit selection and statistical parametric synthesis

Building on the two basic approaches to speech synthesis, many improvements have been proposed toward a high-quality TTS system. Statistical parametric speech synthesis and unit-selection techniques are considered the two prominent state-of-the-art techniques, and have hence been widely discussed by researchers, with differing judgments. This section describes and compares these techniques.

1.3.1 From concatenation to unit-selection synthesis

In a concatenative TTS system, the pitch and timing of the original waveforms are modified by a signal processing technique to match those of the specification. Taylor (2009, p. 474) identified two assumptions underlying a di-phone system: (i) “within one type of di-phone, all variations are accountable by pitch and timing differences” and (ii) “the signal processing algorithms are capable of performing all necessary pitch and timing modifications without incurring any unnaturalness”. These assumptions are “overly strong, and are limiting factors on the quality of the synthesis. While work still continues on developing signal processing algorithms, even an algorithm which changed the pitch and timing perfectly would still not address the problems that arise from the first assumption. The problem here is that it is simply not true that all the variation within a di-phone is accountable by pitch and timing differences”. These observations about the weakness of concatenative synthesis led to the development of “a range of techniques collectively known as unit-selection. These use a richer variety of speech, with the aim of capturing more natural variation and relying less on signal processing”. The idea is that for each basic linguistic type, there are a number of units, which “vary in terms of prosody and other characteristics” (Taylor, 2009, p. 475).

Figure 1.2 – General (a) and clustering-based (b) unit-selection schemes: solid lines represent target costs and dashed lines represent concatenation costs (Zen et al., 2009).

In the unit-selection approach, new natural-sounding utterances can be synthesized by selecting appropriate sub-word units from a database of natural speech (Zen et al., 2009), according to how well a chosen unit matches a specification/target unit (the target cost) and how well two chosen units join together (the concatenation cost). During synthesis, an algorithm selects one unit from the possible choices, in an attempt to find the best overall sequence of units that matches the specification (Taylor, 2009). The specification and the units are entirely described by a feature set including both linguistic features and speech features. A Viterbi-style search is performed to find the sequence of units with the lowest total cost, computed from the feature set. According to the review of Zen et al. (2009), there are two basic techniques in unit-selection synthesis, even though they are theoretically not very different: (i) the selection model (Hunt and Black, 1996), illustrated in Figure 1.2a, and (ii) the clustering method, which allows the target cost to effectively be pre-calculated (Donovan et al., 1998), illustrated in Figure 1.2b. The difference is that, in the second approach, units of the same type are clustered into a decision tree that asks questions about features available at synthesis time (e.g., phonetic and prosodic contexts).
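The Viterbi-style search over target and concatenation costs can be sketched as follows (the numeric “units” and the two cost functions below are toy stand-ins for real feature-based costs):

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi-style search: pick one candidate unit per target so that the
    summed target costs plus concatenation (join) costs are minimal."""
    # best[i][j] = (lowest cost ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, c) + tc, k)
                for k, p in enumerate(candidates[i - 1]))
            row.append((cost, back))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(len(targets))]

# Toy run: "units" are just F0 values; both costs are absolute differences.
targets = [100, 110, 120]
candidates = [[95, 130], [105, 200], [118, 50]]
chosen = select_units(targets, candidates,
                      target_cost=lambda t, u: abs(t - u),
                      join_cost=lambda a, b: 0.1 * abs(a - b))  # [95, 105, 118]
```

In a real system each candidate is a speech segment, the target cost compares its linguistic/prosodic context against the specification, and the join cost compares spectral and F0 continuity at the boundary.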

1.3.2 From vocoding to statistical parametric synthesis

As mentioned earlier, the main limitation of source/filter synthesizers is generating speech parameters from the input specification created by text analysis. The mapping between the specification and the parameters is highly complex, and seems beyond what we can express in explicit human-derived rules, no matter how “expert” the rule designer is (Taylor, 2009). A “complex model”, i.e. rules trainable from speech itself, is hence necessary for that purpose. Part of the solution can be found in the idea of vocoding, in which a speech signal is converted into a (usually more compact) representation so that it can be transmitted. In speech synthesis, the parameterized speech is stored instead of transmitted, and those speech parameters are later used to generate the corresponding speech waveform. Statistical parametric synthesis is thus based on the idea of vocoding for extracting and generating speech parameters. Most importantly, it provides statistical machine learning techniques to automatically train the specification-to-parameter mapping from data, thus bypassing the problems associated with hand-written rules. Extracted speech parameters are aligned with contextual features to build “trained models”. In a typical statistical parametric speech synthesis system, parametric representations of speech, including spectral and excitation parameters (i.e. vocoder parameters, used as inputs to the vocoder), are extracted from a speech database and then modeled by a set of generative models. The Maximum Likelihood (ML) criterion is usually used to estimate the model parameters. Speech parameters are then generated, for a given word sequence to be synthesized, from the set of estimated models so as to maximize their output probabilities. Finally, a speech waveform is reconstructed from the parametric representations of speech (Zen et al., 2009).
Although any generative model can be used, HMMs are particularly well known. In HMM-based speech synthesis (HTS, http://hts.sp.nitech.ac.jp/) (Yoshimura et al., 1999), the speech parameters of a speech unit, such as the spectrum and excitation parameters (e.g., the fundamental frequency F0), are statistically modeled and generated by context-dependent HMMs. Training and synthesis are the two main processes in the core architecture of a typical HMM-based speech synthesis system, as illustrated in Figure 1.3 (Yoshimura, 2002). In the training process, ML estimation is performed using the Expectation-Maximization (EM) algorithm, much as in speech recognition. The main difference is that both spectrum (e.g., mel-cepstral coefficients and their dynamic features) and excitation (e.g., log F0 and its dynamic features) parameters are extracted from a database of natural speech and modeled by a set of multi-stream context-dependent HMMs. Another difference is that linguistic and prosodic contexts are taken into account in addition to phonetic ones (together called contextual features). Each HMM also has its own state-duration distribution to model the temporal structure of speech; common choices are the Gaussian and Gamma distributions, estimated from statistics obtained at the last iteration of the forward-backward algorithm.

Figure 1.3 – Core architecture of HMM-based speech synthesis system (Yoshimura, 2002).

In the synthesis process, the inverse of speech recognition is performed. First, a given word sequence is converted into a context-dependent label sequence, and the utterance HMM is constructed by concatenating the context-dependent HMMs according to that label sequence. Second, the speech parameter generation algorithm generates the sequences of spectral and excitation parameters from the utterance HMM. Finally, a speech waveform is synthesized from the generated spectral and excitation parameters using excitation generation and a speech synthesis filter (Zen et al., 2009, p. 4), i.e. a vocoder with a source-excitation/filter model. Figure 1.4 illustrates the general scheme of HMM-based synthesis (Zen et al., 2009, p. 5).

In an HMM-based TTS system, a feature system is defined and a separate model is trained for each unique feature combination. Spectrum, excitation, and duration are modeled simultaneously in a unified HMM framework because each has its own context dependency. Because of the combinatorial explosion of contextual features, their parameter distributions are clustered independently using phonetic decision trees. Parameter generation amounts to concatenating the models corresponding to the full-context label sequence, itself predicted from text. Before generating parameters, a state sequence is chosen using the duration model; this determines how many frames will be generated from each state. Emitting simply the mean of each state's output distribution would yield piecewise-constant parameter tracks: “this would clearly be a poor fit to real speech where the variations in speech parameters are much smoother”, which is why the generation algorithm also exploits the dynamic features to produce smooth trajectories.
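A deliberately naive sketch of this generation step follows (the context-dependent label names and all numbers are invented toy values; a real system runs the parameter generation algorithm over means, variances and dynamic features instead of emitting state means directly):

```python
STATE_MODELS = {  # hypothetical context-dependent models; toy values, not trained
    # per-state (duration mean in frames, mean log-F0)
    "a-b+c": [(4, 4.8), (6, 4.9), (5, 4.7)],
    "b-c+d": [(3, 4.7), (7, 4.6), (4, 4.5)],
}

def generate_parameters(label_sequence):
    """Concatenate the context-dependent models for a label sequence and,
    for each state, emit duration-mean frames of the state's mean log-F0.
    The result is piecewise constant -- exactly the "poor fit" discussed
    above, which dynamic-feature-based generation smooths out."""
    track = []
    for label in label_sequence:
        for dur_mean, f0_mean in STATE_MODELS[label]:
            track.extend([f0_mean] * dur_mean)
    return track

track = generate_parameters(["a-b+c", "b-c+d"])  # 29 frames, piecewise constant
```

The duration means play the role of the state-duration distributions described above: they decide how many frames each state contributes before the next state's statistics take over.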


Figure 1.4 – General HMM-based synthesis scheme (Zen et al., 2009, p. 5).

1.3.3 Pros and cons

Statistical parametric speech synthesis offers an alternative that overcomes the limitations of the classical parametric approach by using statistical machine learning techniques to infer the specification-to-parameter mapping from data. This technique can be simply described as “generating the average of some sets of similarly sounding speech segments”. It directly contrasts with unit-selection synthesis, which “retains natural unmodified speech units”; using parametric models, however, offers other benefits (Zen et al., 2009). Unit-selection synthesis is a sub-type and natural extension of concatenative synthesis, and deals with the issues of “how to manage large numbers of units, how to extend prosody beyond just F0 and timing control, and how to alleviate the distortions caused by signal processing” (Taylor, 2009, p. 474). While both techniques mainly depend on data, in the concatenative approach the data is effectively memorized, whereas in the statistical approach the general properties of the data are learned (Taylor, 2009, p. 447). While many approaches to statistical synthesis are possible, most work has focused on hidden Markov models (HMMs). The main differences between unit-selection and HMM-based speech synthesis are summarized in Table 1.1. Both approaches use features of speech units, but in different ways. In HMM-based speech synthesis, contextual features (phonetic, linguistic and prosodic) are used in both training and synthesis: (i) in training, contextual features are force-aligned with speech parameters to build context-dependent HMMs; (ii) in synthesis, contextual features are used to build a context-dependent label sequence, from which an utterance HMM is constructed by concatenating the context-dependent HMMs. Unit-selection synthesis, in contrast, uses both text features (typically phonetic and prosodic contexts) and speech features (i.e. spectral and acoustic features) to calculate and minimize the target cost (the best units) and concatenation cost (the best sequence) for an utterance.

Table 1.1 – Unit-selection and HMM-based speech synthesis

Criteria               | Unit-selection synthesis                          | HMM-based synthesis
Approach               | Data-driven: memorize the data (natural speech)   | Parameter-driven: learn the properties of the data
Idea                   | Multi-template: retain natural unmodified units by selecting appropriate sub-word units | Statistics: generate the average of some sets of similarly sounding segments
Preferred applications | Limited domain                                    | Open domain
Techniques             | Target cost, concatenation cost; single tree      | Machine learning; multiple trees (spectral, F0, duration)
Quality                | Discontinuity at the joins; high quality at the waveform level; less preferred; best examples are better | Smooth; vocoded speech (buzzy); more understandable; best examples are worse
Footprint              | Large run-time data                               | Small run-time data
Robustness             | Hit or miss (with spurious errors, quality is severely degraded) | Stable
Voice modification     | Extremely difficult; fixed voice                  | Flexible in voice characteristics, speaking styles or emotion; various voices

Both techniques have received considerable attention and resources for improving the synthetic voice. Each has pros and cons, and has hence been adopted in different applications and domains. As a result, the difference in naturalness between the synthetic voices of both approaches and the human voice has become small enough for real-life applications. Unit-selection synthesis tends to be more suitable for limited-domain applications requiring a high-quality voice, such as transportation announcement systems (e.g., train stations, airports) or 24/7 call centers. HMM-based speech synthesis, on the other hand, works well in any application, especially open-domain ones such as SMS/email/e-newspaper reading systems, question-answering systems, or speech translation systems. According to the review of Zen et al. (2009), in both the 2005 and 2006 Blizzard Challenges, where “common speech databases were provided to participants to build synthetic voices, the results from subjective listening tests revealed that HMM-based synthetic voice was more preferred (through mean opinion scores) and more understandable (through word error rates)”. The best examples of unit-selection synthesis, however, are better than those of HMM-based synthesis. HMM-based speech synthesis systems need less memory to store the model parameters (the statistics of the acoustic models), hence smaller run-time data, than unit-selection systems, which memorize the data (multiple templates of speech units). HMM-based systems can therefore be constructed from a small amount of training data.
This favors the HMM-based approach for supporting multiple languages, since only the contextual features used depend on the language. In unit-selection synthesis, when a required sentence happens to need phonetic and prosodic contexts that are under-represented in the database, the quality of the synthesizer can be severely degraded. Even though this may be a rare event, a single bad join in an utterance can ruin the listener's flow, and it is impossible to guarantee that bad joins and/or inappropriate units will never occur, simply because of the vast number of possible combinations. HMM-based synthesis, by contrast, is more “robust” than unit-selection synthesis, for instance to noise/fluctuations due to the recording conditions or to the lack of some speech units. This is because adaptive training can be viewed as “a general version of several feature-normalization techniques such as cepstral mean/variance normalization, stochastic matching, and bias removal” (Zen et al., 2009). The main advantage of statistical parametric synthesis, including the HMM-based approach, is its flexibility in changing voice characteristics, speaking styles, and emotions. This is still problematic for unit-selection synthesis, even combined with voice-conversion techniques, whereas in statistical parametric synthesis voice characteristics, speaking styles, and emotions can easily be changed by transforming the model parameters. Four major techniques accomplish this: adaptation, interpolation, eigenvoice, and multiple regression (cf. Zen et al., 2009). Besides, unit-selection synthesis usually requires various control parameters to be manually tuned, while statistical parametric synthesis has few tuning parameters because all the modeling and synthesis processes are based on mathematically well-defined statistical principles (Zen et al., 2009). The major disadvantage of statistical parametric synthesis compared with unit-selection synthesis is the quality of the synthesized speech. The three degrading factors are: (i) vocoding (the synthetic speech of a basic HMM-based TTS system sounds buzzy with a mel-cepstral vocoder using simple periodic pulse-train or white-noise excitation), (ii) acoustic modeling accuracy (speech parameters are generated directly from the acoustic models), and (iii) over-smoothing (detailed characteristics of the speech parameters are removed in the modeling part and cannot be recovered in the synthesis part). Many research groups have contributed various refinements to achieve state-of-the-art performance of HMM-based speech synthesis (cf. Zen et al. (2009) for details).

1.4 Vietnamese language

Vietnamese, the official language of Vietnam, belongs to the Mon-Khmer branch of the Austroasiatic family. Most speakers of Vietnamese live in Southeast Asia, with overseas communities predominantly in France, Australia, and the United States (Kirby, 2011). The pronunciation of educated speakers from Hanoi, the capital of Vietnam, is generally the most widely accepted as a sort of standard (Thompson, 1987). Vietnamese is written in a variant of the Latin alphabet (chữ quốc ngữ) with additional diacritics for tones and certain letters. This script has existed in its current form since the 17th century and has been the official writing system since the beginning of the 20th (Le et al., 2010). However, a number of characteristics of Vietnamese distinguish it from occidental languages. First, Vietnamese is a tonal language, in which pitch is used lexically: it changes the meaning of a word/syllable. There are six different lexical tones in the writing system, and each tone can contribute to the formation of a morpheme and the meaning of a word/syllable, e.g. “ba” (father), level tone; “bà” (grandmother), falling tone; “bã” (residue), broken tone; “bả” (bait), curve tone; “bá” (aunt), rising tone; “bạ” (strengthen), drop tone. The tones give Vietnamese a musical character, making sentences rhythmic and melodious (Nguyen, 2007).
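Because the five non-level tones are written with Unicode combining marks, the lexical tone of a written syllable can be read off its decomposed form; a minimal sketch (the English tone labels follow the examples above):

```python
import unicodedata

# Unicode combining marks of the five marked Vietnamese tones
TONE_MARKS = {
    "\u0300": "falling (huyền)",
    "\u0301": "rising (sắc)",
    "\u0303": "broken (ngã)",
    "\u0309": "curve (hỏi)",
    "\u0323": "drop (nặng)",
}

def tone_of(syllable):
    """Lexical tone of a written Vietnamese syllable, found by scanning
    its canonically decomposed (NFD) combining marks."""
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return "level (ngang)"  # no tone mark present
```

For instance, tone_of("bà") returns "falling (huyền)" while tone_of("ba") returns "level (ngang)"; a full TTS front-end would of course also check that the mark sits on the syllable nucleus.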


Second, Vietnamese is an “inflectionless language in which its word forms never change”. Vietnamese lacks morphological markers of grammatical case, number, gender, and tense, and hence has no finite/non-finite distinction. In other words, Vietnamese words do not change with grammatical categories, e.g. “bạn” is the same for singular and plural (in contrast to “student” and “students” in English) and the same for male and female (in contrast to “ami” and “amie” in French). This inflectionless characteristic gives rise to “a special linguistic phenomenon common in Vietnamese: type mutation, where a given word form is used in a capacity that is not its typical one (a verb used as a noun, a noun as an adjective. . . ) without any morphological change. For example, the word “yêu” may be a noun (the devil) or a verb (to love) depending on context” (Le et al., 2010). Third, Vietnamese is a non-affixing language: in contrast to English or French, which generate antonyms with prefixes such as “im-”, “ir-”, “un-” (e.g. “impolite”, “unreadable”, “irregular”), Vietnamese word structure does not use affixes (prefixes, suffixes or infixes) (Nguyen, 2007). Fourth, Vietnamese is an isolating language, the most extreme case of an analytic language, in which syllable and morpheme boundaries coincide: each morpheme is a single syllable. Each syllable usually has an independent meaning in isolation, and polysyllables can be analyzed as combinations of monosyllables (Doan, 1977). Hence, a syllable in Vietnamese is not only a phonetic unit but also a grammatical unit (Doan, 1999b). Lexical units may be formed of one or several syllables, which always remain separate in writing. Although dictionaries contain a majority of compound words, monosyllabic words actually account for a wide majority of word occurrences. This is in contrast to synthetic languages, like most Western ones, where, although compound words exist, most words are composed of one or several morphemes assembled so as to form a single token (Le et al., 2010). Examples in different languages are given in Example 1.

Example 1 – Syllables, morphemes and words in English, French and Vietnamese (Nguyen, 2007)
• In English: the word “unladylike” has three morphemes (un)(lady)(like) and four syllables (un)(la)(dy)(like), while the word “dogs” has two morphemes (dog)(s) and one syllable.
• In French: the word “école” (school) has two syllables (é)(cole) and one morpheme, while the word “vendeur” (seller) has two syllables (ven)(deur) and two morphemes (vend)(eur).
• In Vietnamese: the sentence “Đẹp vô cùng tổ quốc ta ơi!” (How beautiful our country is!) has seven morphemes and seven syllables, (Đẹp)(vô)(cùng)(tổ)(quốc)(ta)(ơi), and five words, including three single words, (đẹp), (ta), (ơi), and two compound words, (vô cùng), (tổ quốc).
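Because syllables are always written separately while words may span several syllables, Vietnamese word segmentation amounts to grouping whitespace tokens; a toy greedy longest-match sketch over a hypothetical two-entry lexicon (real segmenters use large dictionaries and statistical models):

```python
VN_LEXICON = {"vô cùng", "tổ quốc"}  # hypothetical toy compound-word lexicon

def segment_words(sentence):
    """Greedy longest-match grouping of whitespace syllables into words."""
    syllables = sentence.split()
    words, i = [], 0
    while i < len(syllables):
        pair = " ".join(syllables[i:i + 2])
        if pair in VN_LEXICON:
            words.append(pair)   # two syllables form one compound word
            i += 2
        else:
            words.append(syllables[i])
            i += 1
    return words

# The sentence of Example 1: seven syllables, five words.
words = segment_words("Đẹp vô cùng tổ quốc ta ơi")
```

On the Example 1 sentence this recovers the five words, including the two compounds “vô cùng” and “tổ quốc”.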

Fifth, Vietnamese has borrowed many words from foreign languages, notably Chinese (tiếng Hán) and French. For example, the words “đấu tranh” (struggle), “giai cấp” (class), and “nhân nghĩa” (benevolent and righteous) come from Chinese, while “nhà ga” (gare), “xà phòng” (savon), and “cà phê” (café) come from French (Nguyen, 2007). And finally, Vietnamese is a “quite fixed order language, with the general word order SVO (subject-verb-object)”. As for most languages with relatively restrictive word orders, Vietnamese relies on the order of constituents to convey important grammatical information (Le et al., 2010).

1.5 Current state of Vietnamese TTS

Vietnamese TTS has recently been receiving more attention due to its necessity in real-life applications. The Sao Mai Vietnamese reader (or ‘Sao Mai voice’ for short), from the Sao Mai Vocational and Assistive Technology Center for the Blind (http://www.saomaicenter.org/vi/tts/) in Ho Chi Minh City, is considered the first (2004) and most common such software on Windows for the blind, owing to its ease of use. The project came from a World Bank prize in the Vietnam 2003 Innovation Day competition (Tran, 2007a, 2013). Concatenative synthesis was adopted for the synthesis engine of the ‘Sao Mai voice’, with syllables as speech units. Its syllable corpus comprised about 16,000 syllables recorded in isolation: more than 7,000 Vietnamese words and nearly 9,000 loanwords. The main disadvantages of the Sao Mai voice are (i) the low quality of the synthetic speech and (ii) different voices for different text encodings. The two main reasons for its low quality are (i) the low-quality recording environment of the corpus (from the 1990s) and (ii) the discontinuity at join points between syllables recorded in isolation. Since Vietnamese is a tonal language, the synthetic voice has additional discontinuity issues with intonation (e.g. sounding “out of tune”). Although its quality is not good, it is still the software mainly used by blind Vietnamese users on Windows, for the following reasons: (i) it is compatible with virtually any Windows application; (ii) it can be integrated with JAWS (http://www.freedomscientific.com/Products/Blindness/JAWS), the most popular (English) screen reader, for Vietnamese blind users; (iii) it helps blind users follow the text content on screen or in applications, e.g. reading by characters or by syllables; and (iv) there is currently no better Vietnamese TTS system targeting the blind (Tran, 2013). There have been a few other works on Vietnamese speech synthesis using formant or concatenative synthesis techniques (Do and Takara, 2003, 2004; Nguyen et al., 2004). The Hanoi Vietnamese dialect was chosen for both studies. In the work of Đỗ Trọng Tú (Do and Takara, 2003, 2004), a Vietnamese TTS system (VieTTS) was built as a parametric, rule-based speech synthesis system. Its fundamental speech units were half-syllables with the level tone. VieTTS uses a source-filter model for speech production with a Log Magnitude Approximation (LMA) filter as the vocal tract filter, and Vietnamese tone synthesis was implemented through F0 pattern and power pattern control. The second work (Nguyen et al., 2004) integrated the Fujisaki model (Fujisaki, 1984) into VnVoice, a concatenative Vietnamese TTS system, based on a set of rules to control the F0 contour. In general, the quality of the synthetic voices of VieTTS and VnVoice was acceptable, but still showed the limitations of formant and concatenative synthesis techniques. The work of Nguyen et al. (2004) had better quality, but still had problems controlling F0 and duration. Since Vietnamese is a tonal language with a number of lexical tones, building a high-quality TTS system is a great challenge. The following subsections present state-of-the-art work on Vietnamese TTS using unit-selection and HMM-based synthesis techniques.

1.5.1 Unit selection Vietnamese TTS

Trần Đỗ Đạt (Tran, 2007b) built a unit-selection TTS system for Hanoi Vietnamese using di-phones and half-syllables as speech units, called ‘HoaSung’ (means “water-lily flower”). The corpus of HoaSung, called VNSpeechCorpus (hereafter called “VNSP” for short), was collected and filtered by different resources (e.g. stories, books, and web documents) from websites. It included various types of data: words with six lexical tones, figures and numbers, dialog sentences and short paragraphs. It comprised about 630 sentences in 37 minutes, recorded by a TV broadcaster from Hanoi. The CART model was chosen to construct a duration template (Tran et al., 2007), and the Time-Domain PSOLA (TD-PSOLA) algorithm was adopted for manipulating the pitch and timing of speech units. A linguistic feature set built for modeling units duration included: (i) phonetic features, e.g. articulation place/manner, positions of phonemes, (ii) context-based features: e.g. preceding/succeeding phoneme, positions of phoneme in the current syllable. An intonation model was proposed to generate the F0 contour of synthetic utterances. This model was built based on the results of the analysis on relations among factors that influenced the intonation in Vietnamese: (i) the tones that make up the sentence, (ii) the register of each tone, (iii) the influence of tonal coarticulation phenomena and (iv) the duration of the syllable (Tran and Castelli, 2010). The subjective results of HoaSung indicated that the system using half-syllables as speech units gave a better quality than the one using di-phones. HoaSung can be considered a quite complete TTS system with a rather high quality synthetic speech. However, we have found that the phone set of HoaSung did not cover the latest phonology system of modern Hanoi Vietnamese, such as the merge of [s] and [ù] or the appearance of new phonemes in loanwords. 
The main limitations reported in the work of Tran (2007b) were: (i) the inability to reproduce the important changes in fundamental frequency due to glottalization phenomena; (ii) the small corpus, which made it impossible to synthesize syllables composed of half-syllables absent from the corpus; (iii) issues in the automatic analysis of text, such as text normalization, word segmentation, and POS tagging; and (iv) the lack of F0 modeling for sentence modes. The work of Le et al. (2011) partly addressed the last problem of HoaSung for yes/no questions without auxiliary verbs. Compared to declarative intonation, in this type of question the whole F0 contour was raised by a certain percentage of the F0 mean (the normalized register ratio), and the contour of the final syllable was raised by a further percentage of the F0 mean (the increasing slope) (Le et al., 2011). This model was applied to HoaSung and gave some positive results. However, it did not work well for some final-syllable tones (e.g. falling tones and curve tones) due to the small analysis corpus. Moreover, although duration is related to the F0 contour, it was not studied or modeled in that work. HoaSung was extended using the non-uniform unit selection technique (Do et al., 2011) to build a second version. The same speech corpus was used, but annotated at the syllable level with the necessary information, such as phonemic elements, tone, duration, energy, and other contextual features. The sentences in the text corpus were parsed into syntactic phrases, i.e. phrase trees. In this work, speech units were not determined in advance, but varied according to the availability of the speech corpus. The main idea was to minimize the number of join points, giving higher priority to longer available speech units. If no samples at the syllable level or above (e.g. words, phrases) were available when searching for units, the half-syllable corpus of HoaSung was used.
The preliminary perceptual results showed improved quality of the synthetic speech of the new system. However, the test corpus was not well designed to cover all instances of combined speech units. Moreover, there was no connection between the process of choosing speech units at the syllable level or above and the process of choosing half-syllable units when calculating target and concatenation costs. As a result, the total cost was not optimized for utterances needing half-syllables.

‘Voice Of Southern Vietnam’ (VOS) was developed by Vũ Hải Quân and his team at the AILab 10, Ho Chi Minh University of Science. This system was first built using concatenative synthesis with phrases as speech units (Vu and Cao, 2010). The latest version of VOS used non-uniform unit selection synthesis, with speech units at the syllable level or above, and a very large corpus of typical Southern-dialect speech. The quality is better in limited domains (e.g. football commentaries), which involve only a small number of concatenation points. However, transitions between joined sections of the waveforms did not sound smooth or continuous, especially for utterances synthesized from many speech units. This system aimed to provide a new voice reader for blind Vietnamese users; however, it has not been used in real life due to its usability limitations. To the best of our knowledge, there is no publication related to the latest version of the system. A few other Vietnamese TTS systems (e.g. eSpeak 11, vietTalk, vnVoice) have been built to support Vietnamese blind people in using personal computers or smartphones. However, most of these systems have rarely been used by blind users because of their drawbacks in quality and usability. Since 2014, vnSpeak 12 has been available as a TTS engine for the Android platform. This system adopted the unit selection technique and provides a number of supporting functions for users to interact with smartphones. It has received positive feedback from Vietnamese blind users.
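The target-cost/concatenation-cost search underlying the unit selection systems above can be sketched as a dynamic-programming (Viterbi) search over candidate units. This is a generic, minimal illustration; the cost functions and units are hypothetical and not those of HoaSung or VOS:

```python
# Minimal sketch of a unit-selection search: each position has candidate
# units, a target cost scores how well a candidate matches the target
# specification, and a concatenation cost scores the join between
# consecutive units. The search minimizes the total cost.

def select_units(targets, candidates, target_cost, concat_cost):
    """targets: one target spec per position; candidates: one list of
    candidate units per position. Returns the minimum-cost unit sequence."""
    # best[i][j] = (cumulative cost, back-pointer) for candidate j at position i
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(prev, u) + tc, k)
                for k, prev in enumerate(candidates[i - 1]))
            row.append((cost, back))
        best.append(row)
    # backtrack from the cheapest final candidate
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```

With suitable target and concatenation cost functions (e.g. weighted feature mismatches and spectral join distances), the returned sequence minimizes the total cost that the systems discussed above approximate.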

1.5.2 HMM-based Vietnamese TTS

Many works on HMM-based speech synthesis for tonal languages have been published, covering not only standard synthetic speech but also different speaking styles and expressive speech. For instance, for Mandarin, the work of Duan et al. (2010), Guan et al. (2010), Hsia et al. (2010), Qian et al. (2006), Yu et al. (2013), Zhiwei Shuang (2010) focused on basic problems of improving the naturalness of HMM-based synthetic speech. Mixed-language or bilingual speech synthesis was studied in the work of Qian and Soong (2012), Qian et al. (2008), while Li et al. (2010, 2015) worked on expressive speech. HMM-based speech synthesis for Thai has paid attention to improving tone correctness, as in the work of Chomphan (2011), Chomphan and Chompunth (2012), Chomphan and Kobayashi (2007, 2008), Moungsri et al. (2014), or to speaker-dependent/independent modeling in Chomphan (2009), Chomphan and Kobayashi (2009). For Vietnamese HMM-based speech synthesis, to the best of our knowledge, there are only the two following main groups: (i) the Institute of Information Technology (IoIT, which belongs to the Vietnamese Academy of Science and Technology) (Dinh et al., 2013, Phan et al., 2013a, 2012, 2013b, 2014, Vu et al., 2009) and (ii) Yunnan University, China (He et al., 2011, Kui et al., 2011). Both groups followed the core architecture of HTS to develop TTS systems for Hanoi Vietnamese.

The first publication on Vietnamese HMM-based speech synthesis appears to be the work of IoIT (Vu et al., 2009). This system simply applied HTS to Vietnamese with a training corpus of 3000 phonetically-rich sentences, semi-automatically labeled at the phoneme level. Hanoi Vietnamese phonetics, phonology, and tonal aspects were considered when building the phone and feature sets for the system. Features at the phoneme, syllable, word (including Part-Of-Speech, POS), phrase, and utterance levels were chosen. Compared to the feature set for English, there were additional features for the tone types of the preceding, current, and succeeding syllables. This work reported that the intelligibility of the synthetic utterances was approximately 100%, and that the quality of the synthetic speech ranged from fair to good (3.23 on a 5-point MOS scale) in preliminary evaluations (the number of subjects was not mentioned). The group from Yunnan University (He et al., 2011, Kui et al., 2011) also appears to have simply adopted the HMM-based synthesis technique for Vietnamese, using the STRAIGHT synthesizer 13. About 600 labeled sentences were used to train the HMM models. A preliminary evaluation (10 subjects) was carried out, reaching the same conclusion on the synthetic voice as the work of Vu et al. (2009). To our knowledge, there have been no further studies or experiments toward analysis or improvement. Several other publications of the first group, IoIT, presented a detailed implementation of the HMM-based approach to Vietnamese TTS with a 400-sentence training corpus (Phan et al., 2013a, 2012). The preliminary evaluation only aimed at observing the similarity of the spectrograms and pitch contours of natural and synthetic speech signals. It seems that those publications did not provide any new work compared to the first one (Vu et al., 2009). Further research of IoIT (Dinh et al., 2013, Phan et al., 2013b, 2014) targeted the same goal, focusing on the importance of prosodic features. Additional intonation features adopted from English using the ToBI model were used in these studies, including: (i) phrase-final intonation, and (ii) pitch accent. In the evaluation phase, a MOS test was performed with a natural reference and two TTS voices: (i) without POS and intonation features, and (ii) with POS and intonation features.

10. http://www.ailab.hcmus.edu.vn/
11. http://espeak.sourceforge.net/. This is an open-source speech synthesizer that Google Translate has used to support Vietnamese since 2010, whereas well-known multilingual TTS systems (cf. Section 1.2) do not support Vietnamese.
12. http://www.vnspeak.com/
Results showed that the voice with prosodic features scored about 0.7 points higher on the 5-point MOS scale than the one without those features. However, there was no further study on the impact of individual POS or intonation features on the synthetic voice.

1.6 Main issues on Vietnamese TTS

The initial motivation of this work was to build a high-quality TTS system to help Vietnamese blind people access written text. The scope of the work was then narrowed to building a high-quality TTS system with unlimited vocabulary. Based on all the above analyses of the two state-of-the-art TTS techniques, the HMM-based approach was chosen to build a TTS system for Vietnamese. Besides its advantages in general quality, footprint, and robustness, there exist a core implementation from HTS and a number of supporting platforms for building an HMM-based TTS system. This section presents the main issues that we encountered during the realization of our TTS system, and introduces the general solutions this research proposes for them.

1.6.1 Building phone and feature sets

In HMM-based speech synthesis, many contextual features (e.g., phone identity, locational features) are used to build context-dependent HMMs. However, due to the exponential number of contextual feature combinations, decision-tree-based context clustering is the most common technique used to cluster HMM states and share model parameters among the states in each cluster. Each node (except leaf nodes) in a decision tree carries a context-related question. Acoustic attributes of phonemes or contextual features are used to build the questions, such as “Is the current phoneme a semivowel?” or “Is the previous phoneme voiced?”. Hence, there is a need to build a proper acoustic-phonetic unit set to develop an HMM-based TTS system for a specific language. This unit set is also essential for the automatic labeling of a training corpus, in which it is used to model and identify clear acoustic events that an expert phonetician would mark as boundaries in a manual segmentation session. Due to the automation of HMM clustering, the semantics of the contextual features (e.g. the importance or weight of these features) may not be well considered. Hence, some crucial features of Vietnamese, such as lexical tones, may not receive proper priority when building the decision trees, which may lessen the impact of these features on the quality of the synthetic speech. To the best of our knowledge, in other work on tonal languages, lexical tones may be explicitly modeled in a TTS system (Shih and Kochanski, 2000; Do and Takara, 2003; Tran and Castelli, 2010). For Thai, tone correctness of the synthetic speech may be improved by investigating the structure of the decision tree with tone information in the tree-based context clustering process of HMM-based training (Chomphan and Kobayashi, 2008; Chomphan and Chompunth, 2012; Moungsri et al., 2014). To address the above issues for Vietnamese TTS, in this work, we built an acoustic-phonetic unit set in tonal context, in which a new speech unit, called the “tonophone”, was proposed for an allophone with respect to lexical tones. Lexical tones could hence be “modeled” with the highest priority in the context clustering process of HMM-based training. Furthermore, previous Vietnamese TTS systems were mostly developed for Hanoi Vietnamese, the standard dialect.
13. http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html
However, the phonetic analyses of those works did not cover the latest phonology of modern Hanoi Vietnamese, such as the merger of [s] and [ʂ] or the appearance of new phonemes in loanwords, such as the initial consonant [p]. In this research, a complete acoustic-phonetic tonophone set was built on the basis of a literature review of the latest phonetics and phonology of modern Hanoi Vietnamese.
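The tree-based context clustering described above can be illustrated with a toy sketch. The question names, feature encoding, and splitting criterion below are simplified assumptions; real HTS clustering splits states by the gain in model likelihood, not by split balance:

```python
# Hypothetical sketch of decision-tree context clustering: contexts
# (feature dicts) are split by yes/no questions, choosing at each node the
# question that best separates the data. Here "gain" is simply how evenly
# a question splits the contexts, used as a stand-in for likelihood gain.

def build_tree(contexts, questions, min_size=2):
    if len(contexts) < min_size or not questions:
        return {"leaf": contexts}
    def gain(q):
        _, test = q
        yes = [c for c in contexts if test(c)]
        return min(len(yes), len(contexts) - len(yes))  # balance as a proxy
    name, test = max(questions, key=gain)
    yes = [c for c in contexts if test(c)]
    no = [c for c in contexts if not test(c)]
    if not yes or not no:
        return {"leaf": contexts}
    rest = [q for q in questions if q[0] != name]
    return {"question": name,
            "yes": build_tree(yes, rest, min_size),
            "no": build_tree(no, rest, min_size)}

# Tone questions can sit alongside segmental ones (labels are made up):
questions = [
    ("C-Tone==level",  lambda c: c["tone"] == 1),
    ("C-Semivowel",    lambda c: c["phone"] in {"w", "j"}),
    ("L-Voiced",       lambda c: c["prev_voiced"]),
]
contexts = [
    {"phone": "a", "tone": 1, "prev_voiced": True},
    {"phone": "a", "tone": 2, "prev_voiced": False},
    {"phone": "w", "tone": 1, "prev_voiced": True},
    {"phone": "m", "tone": 3, "prev_voiced": True},
]
tree = build_tree(contexts, questions)
```

In this toy data the tone question splits the contexts most evenly, so it ends up at the root; ensuring that tone questions receive high priority in the real clustering is precisely the motivation for the tonophone unit proposed here.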

1.6.2 Corpus availability and design

Several works on Vietnamese speech corpora have been presented, but mainly for speech recognition (Le et al., 2004, 2005; Vu and Schultz, 2009, 2010; Vu et al., 2005). These works did not focus on designing the text corpus, but on collecting/recording the speech corpus, as well as on selecting speakers or on automatic alignment. In the work of Tran (2007b), a corpus for a unit selection TTS system was collected from different resources on the Internet (e.g. stories, books, web documents) and manually selected by experts. It included 630 sentences with various types of data: words with the six lexical tones, figures and numbers, dialogue sentences, and short paragraphs. However, due to its small size, this system was not able to synthesize a number of syllables composed of half-syllables absent from the corpus. The work of Vu et al. (2009) reported a 3000-sentence training corpus comprising phonetically-rich sentences for spoken Vietnamese. Moreover, that corpus is not available to other researchers. Other studies reported several-hundred-sentence corpora without any investigation or design (Tran, 2007b; Dinh et al., 2013; Kui et al., 2011). As a result, to the best of our knowledge, there has been no work on the analysis and design of a text corpus for Vietnamese TTS. Since Vietnamese is a tonal language, the training corpus should cover not only phonemic context (e.g. di-phones) but also tonal context. Building on other works on corpus design, we investigated the design of phonetically rich and balanced corpora for Vietnamese TTS using a huge raw text collection crawled from various sources. A training corpus for an HMM-based TTS system was designed to cover 100% of di-tonophones (i.e. adjacent pairs of “tonophones”).
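A common way to design such a coverage corpus is greedy sentence selection (set cover). The sketch below is a generic illustration with made-up di-tonophone labels; it is not the exact selection procedure used for the corpus in this thesis:

```python
# Greedy sentence selection for unit coverage: each sentence is represented
# by the set of di-tonophones it contains; we repeatedly pick the sentence
# covering the most still-uncovered units until everything is covered.

def greedy_cover(sentences):
    """sentences: dict mapping sentence id -> set of di-tonophones.
    Returns a list of sentence ids covering the union of all units."""
    uncovered = set().union(*sentences.values())
    chosen = []
    while uncovered:
        best = max(sentences, key=lambda s: len(sentences[s] & uncovered))
        gained = sentences[best] & uncovered
        if not gained:          # remaining units are unreachable
            break
        chosen.append(best)
        uncovered -= gained
    return chosen

# Toy example: "a-1+b-1" stands for a hypothetical adjacent tonophone pair.
sents = {
    "s1": {"a-1+b-1", "b-1+c-2"},
    "s2": {"b-1+c-2", "c-2+a-3"},
    "s3": {"a-1+b-1"},
}
subset = greedy_cover(sents)    # s1 and s2 already cover every unit
```

The greedy heuristic does not guarantee the smallest possible corpus, but it is the standard practical approach for phonetically rich/balanced corpus design over large raw text collections.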

1.6.3 Building a complete TTS system

The work of Tran (2007b) presented a complete architecture for Vietnamese concatenative TTS, comprising both high-level and low-level speech synthesis. However, there were still numerous issues in the automatic analysis of text, such as text normalization, word segmentation, and POS tagging. To the best of our knowledge, most previous research on HMM-based Vietnamese TTS (Vu et al., 2009; Kui et al., 2011; Dinh et al., 2013; Phan et al., 2013b) adopted the HTS (HMM-based Speech Synthesis System) 14 framework for its experiments. These works presented only the core architecture from HTS (Zen et al., 2007), which mainly covers the training and synthesis parts. All processes in these two parts can be performed using existing tools from HTS or other frameworks. However, the text analysis or natural language processing part was not investigated in detail. Although Vietnamese uses an alphabetic script, issues remain in high-level automatic text analysis, such as text normalization, word segmentation, and POS tagging, due to a number of characteristics that distinguish the language from occidental languages. In occidental languages, spaces and punctuation can be used as the main predictors of word segmentation, yet Vietnamese has no word delimiter or specific marker that indicates the boundaries between words. Blanks are used not only to separate words, but also to separate the syllables that make up words. Moreover, Vietnamese creates complex words by combining syllables that, most of the time, possess an individual meaning. As a result, there are ambiguities in word segmentation that need to be addressed. Real Vietnamese texts often include many Non-Standard Words (NSWs) whose pronunciation cannot be derived using “letter-to-sound” rules (e.g. numbers, abbreviations, dates).
In addition, there is a high degree of ambiguity in their pronunciation (higher than for ordinary words), so that many items have more than one plausible pronunciation, and the correct one must be disambiguated from context. This raises a real problem in text normalization. Vietnamese is an “inflectionless language in which its word forms never change”, regardless of grammatical categories, which leads to a special linguistic phenomenon common in Vietnamese, called “type mutation”, where “a given word form is used in a capacity that is not its typical one (a verb used as a noun, a noun as an adjective. . . ) without any morphological change” (Le et al., 2010). This property introduces a huge ambiguity into POS tagging. In this work, we present a complete architecture for an HMM-based TTS system, comprising three parts: natural language processing, training, and synthesis. The constituent modules of the natural language processing part were investigated and constructed. As a result, a complete HMM-based TTS system for Vietnamese was built in this work.
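The word segmentation ambiguity discussed above can be seen with a greedy longest-matching baseline. The toy lexicon and the greedy strategy are illustrative assumptions; the segmenter actually used in this work is more sophisticated:

```python
# Greedy longest-matching word segmentation for a syllable sequence:
# repeatedly take the longest syllable run that forms a lexicon word,
# falling back to a single syllable when no multi-syllable word matches.

def segment(syllables, lexicon, max_len=4):
    """Greedily group syllables into the longest words found in the lexicon."""
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            cand = " ".join(syllables[i:i + n])
            if n == 1 or cand in lexicon:   # single syllables always accepted
                words.append(cand)
                i += n
                break
    return words

lexicon = {"học sinh", "sinh học"}           # toy multi-syllable words
print(segment("học sinh học sinh học".split(), lexicon))
```

For the syllable sequence “học sinh học sinh học”, the greedy matcher outputs “học sinh | học sinh | học”, whereas “học | sinh học | sinh học” is also a plausible reading: exactly the kind of ambiguity that spaces alone cannot resolve.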

1.6.4 Prosodic phrasing modeling

HMM-based speech synthesis provides a statistical, machine-learning approach in which speech parameters and contextual features are force-aligned to build trained models. Each HMM also has its own state-duration distribution to model the temporal structure of speech. As a result, prosodic cues such as intonation and duration can be learned well in context. This considerably increases the naturalness of the synthetic voice. The remaining problem in prosodic analysis is prosodic phrasing, including pause insertion and lower-level grouping of syllables. In an HMM-based TTS system, a pause is treated as a phoneme, so its duration can be modeled. However, the occurrence of pauses cannot be predicted by the HMMs themselves. Lower phrasing levels above words may not be completely well modeled with basic features.
14. http://hts.sp.nitech.ac.jp/


As mentioned above, the “type mutation” property of Vietnamese introduces a huge ambiguity into Part-Of-Speech (POS) tagging, and hence into automatically identifying function and content words. As a result, although function words in occidental languages are good candidates for predicting the boundaries of prosodic phrasing, they may not be used effectively in automatic Vietnamese TTS. Moreover, punctuation cannot be used as the only clue for pauses or breaks when reading Vietnamese text. Both syllables and words in Vietnamese are separated by spaces, so it is not easy to determine word boundaries: Vietnamese input text is simply a sequence of syllables separated by spaces. This poses a major problem for prosodic phrasing in Vietnamese TTS, which may require higher-level information from the text: syntax. Due to the constraints imposed by lexical tones, utterance-level intonation in Vietnamese might play a smaller role in prosodic phrasing than it does in intonational languages (e.g. English, French). In this research, we aimed at prosodic phrasing for Vietnamese TTS using durational cues alone.
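As a minimal illustration of length-based phrasing over syntactic groups, the sketch below inserts a pause at a block boundary when the current phrase would exceed a length threshold. The blocks, threshold, and rule are hypothetical simplifications, not the syntactic-block model proposed in this thesis:

```python
# Rule-based pause insertion at syntactic block boundaries: a pause is
# placed before a block when the accumulated phrase would otherwise grow
# beyond `max_syllables`.

def insert_pauses(blocks, max_syllables=8):
    """blocks: list of syntactic blocks, each a list of syllables.
    Returns the syllable sequence with "<pause>" markers inserted."""
    out, run = [], 0
    for block in blocks:
        if run and run + len(block) > max_syllables:
            out.append("<pause>")
            run = 0
        out.extend(block)
        run += len(block)
    return out

# Toy input: four hypothetical syntactic blocks of a Vietnamese sentence.
blocks = [["hôm", "nay"], ["trời", "đẹp"],
          ["chúng", "tôi", "đi", "chơi"],
          ["ở", "công", "viên", "Thống", "Nhất"]]
print(insert_pauses(blocks, max_syllables=6))
```

Because pauses are only placed at block boundaries, the quality of the phrasing depends directly on the quality of the underlying syntactic analysis, which motivates the syntactic-block model developed in Chapter 4.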

1.6.5 Perceptual evaluations with respect to lexical tones

We assumed that lexical tones are important not only in building but also in evaluating TTS systems for Vietnamese. However, most related works carried out MOS and/or intelligibility tests, which are applicable to any language, to evaluate the quality of Vietnamese TTS systems. Perceptual evaluations with respect to lexical tones have been lacking. In this work, besides some traditional perception tests (e.g. the MOS test), a tone intelligibility test was designed and performed to evaluate the continuous synthetic speech of our TTS system. This test asked subjects to identify the most likely syllable they heard among a group of syllables bearing different tones in an utterance. Tone confusion patterns were also discussed to examine the relations between tones. The intelligibility test was carried out with a Latin square design, which eliminated the issue of duplicate stimulus content across subjects. The error rate at the tone level was also investigated.
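The Latin square rotation used to avoid duplicate stimulus content can be generated with a simple cyclic construction (a generic design sketch; the actual test layout may differ in detail):

```python
# Cyclic Latin square: condition (i + j) mod n for group i at position j,
# so every condition occurs exactly once per row and once per column.

def latin_square(n):
    """n x n Latin square: row i gives the condition order for group i."""
    return [[(i + j) % n for j in range(n)] for i in range(n)]

# 4 subject groups x 4 stimulus lists: each list appears once per group
# and once per presentation position.
square = latin_square(4)
for row in square:
    print(row)
```

Row i gives the order of stimulus lists for subject group i; because each list occurs exactly once in each row and each column, no subject hears the same content twice and list effects are balanced across groups.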

1.7 Proposition and structure of dissertation

As stated above, this work aims to design and implement a high-quality Vietnamese TTS system using the HMM-based approach. The major contributions are the following:
• Proposing a new approach to building a tonophone set (i.e. allophones with respect to lexical tones) for Vietnamese TTS, based on a literature review of Vietnamese phonetics and phonology;
• Designing and recording a new corpus, called VDTS (Vietnamese Di-Tonophone Speech), covering both phonemic and tonal contexts for Vietnamese TTS systems;
• Designing an entire architecture (including a complete text analysis/natural language processing phase with text normalization, word segmentation, POS tagging, and G2P conversion) and a contextual feature set for an HMM-based Vietnamese TTS system;
• Building VTED (Vietnamese TExt-to-speech Development system), an HMM-based Vietnamese TTS system, following the proposed design and the new corpus VDTS;
• Proposing and evaluating a novel prosodic phrasing model using syntactic blocks (with an automatic Vietnamese syntactic parser) to improve the rhythm of the synthetic voice of VTED;


• Designing, carrying out, and analyzing various perceptual evaluations of VTED, including a MOS test, an intelligibility test, a tone intelligibility test, and a pair-wise comparison test.

The rest of this dissertation is organized as follows. Chapter 2 presents our literature review on the phonetics and phonology of the modern Hanoi dialect of Northern Vietnamese (Hanoi Vietnamese). Different opinions on Vietnamese syllable structure are discussed, and the finally chosen hierarchical structure with its elements is derived. The phonology and tone system of modern Hanoi Vietnamese are then described in this chapter. Four main results essential for building and evaluating a Vietnamese TTS system, as well as for designing the corpus, are described: (i) a set of grapheme-to-phoneme rules; (ii) the Vietnamese phone set with respect to lexical tones, in which a new speech unit, the tonophone, is proposed; (iii) the acoustic-phonetic tonophone set, which provides acoustic attributes for each segment; and (iv) PRO-SYLDIC, an e-dictionary with transcriptions of all pronounceable syllables in the language. The proposed corpus design for Vietnamese TTS, with respect to phonemic and tonal contexts, is presented in Chapter 3. This corpus was recorded in a controlled, well-equipped studio and pre-processed for a TTS system. In Chapter 4, a novel prosodic phrasing model using syntactic blocks, syntactic links, and POS is described. The chapter then describes the evaluation of pause prediction performance using Precision, Recall, and F-score. Due to the importance of syntax for prosodic phrasing, syntax theory, Vietnamese syntax, and Vietnamese syntactic parsing are also covered in this chapter (details are given in Appendix A). Automatic syntactic parsing approaches and several parsing types for the adopted Vietnamese syntax parser are also discussed in that appendix.
This chapter also introduces another proposed prosodic phrasing model, based on syntactic rules, which is presented in detail in Appendix B. Chapter 5 first gives an introduction to HMM-based speech synthesis and its main processes, i.e. parameter modeling, parameter generation, and vocoding. The design, as well as the phone and feature sets, of a complete Vietnamese HMM-based TTS system is then presented. The chapter then describes the implementation of such a system, VTED, on the Mary TTS platform. Several synthetic voice versions of VTED, using different training corpora and/or feature sets, are provided for the perceptual evaluations. Chapter 6 presents our design and implementation of different perception tests on VTED. The perceptual results are then statistically analyzed for each test. Some GUI test screens and examples of the test corpus are illustrated in Appendix C. A summary of the work is given in Chapter 7, which also discusses several perspectives of this research.

Chapter 2

Hanoi Vietnamese phonetics and phonology: Tonophone approach

Contents
2.1 Introduction  51
2.2 Vietnamese syllable structure  51
2.2.1 Syllable structure  52
2.2.2 Syllable types  55
2.3 Vietnamese phonological system  56
2.3.1 Initial consonants  56
2.3.2 Final consonants  56
2.3.3 Medials or Pre-tonal sounds  58
2.3.4 Vowels and diphthongs  58
2.4 Vietnamese lexical tones  60
2.4.1 Tone system  60
2.4.2 Phonetics and phonology of tone  61
2.4.3 Tonal coarticulation  63
2.5 Grapheme-to-phoneme rules  63
2.5.1 X-SAMPA representation  64
2.5.2 Rules for consonants  64
2.5.3 Rules for vowels/diphthongs  65
2.6 Tonophone set  66
2.6.1 Tonophone  66
2.6.2 Tonophone set  67
2.6.3 Acoustic-phonetic tonophone set  67
2.7 PRO-SYLDIC, a pronounceable syllable dictionary  69
2.7.1 Syllable-orthographic rules  69
2.7.2 Pronounceable rhymes  70
2.7.3 PRO-SYLDIC  71
2.8 Conclusion  72

2.1 Introduction

Vietnamese, the official language of Vietnam, is spoken natively by over seventy-five million people in Vietnam and greater Southeast Asia as well as by some two million overseas, predominantly in France, Australia, and the United States. The genetic affiliation of Vietnamese has been at times the subject of considerable debate (. . . ). Scholars (. . . ) maintained a relation to Chinese, while Maspero (1912), despite noting similarities to Mon-Khmer, argued for an affiliation with Tai. However, at least since the work of Haudricourt (1953), most scholars now agree that Vietnamese and related Vietic 1 languages belong to the Mon-Khmer branch of the Austroasiatic family. (Kirby, 2011, p. 381)

Studying the Vietnamese language, especially its phonetics and phonology, is necessary for understanding the language as a means of communication between people, and hence plays an important role in speech processing. This chapter recapitulates the phonetics and phonology of modern Hanoi Vietnamese, which is widely considered the standard variety of the language. Section 2.2 presents the discussions of different scholars and finally gives our conclusion on Vietnamese syllable structure. The phonological system of Hanoi Vietnamese is described in Section 2.3, while the lexical tones are discussed in Section 2.4. Based on this literature review, the main grapheme-to-phoneme rules are provided in Section 2.5. In Section 2.6, the tonophone, a new speech unit (i.e. a phone associated with a corresponding lexical tone), is introduced, and the construction of the Vietnamese tonophone set with acoustic attributes is described. Section 2.7 gives an analysis of Hanoi Vietnamese rhymes and syllable-orthographic rules in order to build an e-dictionary with transcriptions of pronounceable syllables. These results were used for building our TTS system as well as for designing the corpora. The convention of phonetic notation in this work is adopted from Laver (1994). In order to distinguish phonetic transcription from orthographic and other symbols, phonetic symbols are enclosed in square brackets; e.g. the orthographic representation “trăng” in Vietnamese (enclosed in double quotes) is transcribed phonetically as [tɕăŋ] (without a lexical tone) or [tɕăŋ-1] (with a lexical tone: level tone 1). This notation actually transcribes the phonetic realization of phonemes, i.e. allophones, “members of a given phoneme” (Laver, 1994, p. 42), and is called “allophonic transcription”. In contrast, the choice of symbols in a phonemic transcription, enclosed in slant brackets, is limited to one symbol per phoneme (e.g. /a/) (Laver, 1994, p. 550).
To better illustrate the examples, the English meanings of Vietnamese syllables/words are provided in round brackets and in italics, e.g. “trăng” (moon). Phonemes, allophones, and phones are represented using symbols of the International Phonetic Alphabet (IPA).

2.2 Vietnamese syllable structure

The analysis of syllable structure has a direct bearing on the analysis of the phonemic system: it concerns the numerous nuclei/vowels and glide-vowel combinations, and is also central to the tone system.
1. The Vietic branch is sometimes referred to as Việt-Mường, although this latter term is also used to refer exclusively to a sub-branch of Vietic containing Vietnamese and Mường.


In this section, several discussions of Vietnamese syllable structure are reviewed, and the structure we adopt for Vietnamese syllables is presented.

2.2.1 Syllable structure

In Vietnamese, as presented in the previous section, the boundaries of the phonetic unit (syllable) and the grammatical unit (morpheme) coincide. This characteristic is not found in inflectional languages, e.g. the Indo-European languages. In addition, each syllable in Vietnamese has a stable and complete structure composed of perceptually distinct units of sound, i.e. phonemes. As a result, the role of syllables in Vietnamese is quite different from that in Indo-European languages. Syllable structure in Vietnamese permits “only single consonants in onset and coda positions, and a single vowel or a diphthong in the nucleus” (Vogel et al., 2004). Most researchers have preferred a hierarchical structure for Vietnamese syllables (Thompson, 1987; Doan, 1977; Vogel et al., 2004; Doan, 1999a); however, there are different ideas about the component parts and their relationships. Topical issues in this section include (i) the presence of the “medial” part, (ii) the presence of the rhyme, and (iii) the role of lexical tones in the Vietnamese syllable structure.


Figure 2.1 – The position of “medial” /w/ in Vietnamese syllables: (a) Thompson (1987) and (b) Vogel et al. (2004) Medial. There is one apparent complication the so-called “medial”, i.e. a glide, a /w/ that may appear between an onset and a nucleus. Scholars such as Vogel et al. (2004) provided an analysis on the basis of a moraic approach for Vietnamese syllables to argue that the glide /w/ is preferred to be in nucleus, not in onset – in contrast with the work of Thompson (1987), illustrated in Figure 2.1. Moreover, Vogel reinforced the proposal by giving some examples of a type of a word game: “Nói lái”, which “exchanges different parts of syllables in word sequence to form a sort of spoonerism”. On the basis of possibilities manifested in this game, the “medial” is always actually exchanged along with the nucleus, not the initial (Example 2). Vogel’s arguments, however, are strong supports for a claim that the “medial” is not part of the onset, but not enough to affirm that it belongs to the nucleus or another crucial part in the hierarchical structure of Vogel – rhyme. A good reason to analyze the medial /w/ as a part of rhyme is that it can appear in onset-less syllables in Vietnamese, such as “oan” [wan] being victim of a glaring injustice, or “uyên” [wi@n] in "uyên bác" – erudite. Đoàn Xuân Kiên (Doan, 1999a) also considered that the glide /w/ is not in onset, but controverted the existence of “medial” in the structure. The scholar argued that the medial should be considered a “semi-vowel” instead of a part of the structure, and adopted the hierarchical structure of Vietnamese syllable, without rhyme. Đoàn Xuân Kiên concluded that there is no persuasive argument on phonetics and phonology for the big role of rhyme


2.2. Vietnamese syllable structure

Example 2 The game “Nói lái” with the original word “Tuyên bố” [twien áo] (Vogel et al., 2004)
• Switching the onset node: [twien áo] ⇒ [áwien to]
• Switching the rhyme node: [twien áo] ⇒ [to áwien]

on the structure of Vietnamese syllables, and that there exist four parts in this structure: initial, nucleus, ending, and tone (no rhyme). However, the rhyme does exist in the hierarchical structure of Vogel et al. (2004) and, as noted above, plays an important role in the “Nói lái” game (switching elements of syllables).

Figure 2.2 – The hierarchical structure of Vietnamese syllables by Doan (1977).

Rhyme. Đoàn Thiện Thuật (Doan, 1977) presented the main parts of the structure of Vietnamese syllables as illustrated in Figure 2.2. This scholar also preferred the hierarchical structure and affirmed the importance of the rhyme in the structure of Vietnamese syllables, with analyses based on a moraic approach to the syllable and on word games such as “nói lái”, “-iếc hoá”, “láy” or “gieo vần”, illustrated in Example 3. In the game “-iếc hoá”, a variant of a word in oral conversation is created by adding a new syllable composed of the initial consonant of the original syllable and the rhyme /iek/, e.g. original syllable “toán” [twan] ⇒ new word “toán tiếc” [twan tiek] (math). “Láy” (reduplication) is the process of creating a new word (called “từ láy”, a reduplicated word) by repeating either a whole syllable or part of a syllable. Reduplication in Vietnamese can apply to initial consonants, e.g. “nhạt nhẽo” [ñat ñEw] (insipid); to rhymes, e.g. “lung tung” [luN tuN] (in utter disorder); or to both, e.g. “tẻo teo” [tEw tEw] (tiny). In Vietnamese poems, it is common to find repeated rhymes (strictly following prosody) across the syllables of verses, i.e. “gieo vần”. The analysis of Đoàn Thiện Thuật mentioned the equivocal character of the medial, which raises the question of whether the medial is part of the initial or of the rhyme. However, instances suggesting that the medial belongs to the initial are uncommon. For instance, in “-iếc hoá”, “toán tuyếc” [twan twiek] (math), in which the repeated part is initial+medial /tw/, is much less common than “toán tiếc” [twan tiek], in which only the initial /t/ is repeated. The reduplicated word “lẩn quẩn” [l˘7n kw˘7n] (hover about), in which only the nucleus and the final consonant – not the whole rhyme – are repeated, is less common than “luẩn quẩn” [lw˘7n kw˘7n], in which the whole rhyme is repeated. In “gieo vần”, rhyming “qua” [kwa] with “mà” [ma], where the repeated part is only the nucleus – not the whole rhyme – is a rare example. Based on these analyses, the medial is finally concluded to be part of the rhyme.

Lexical tones. The last, but important and typical, issue of the Vietnamese syllable concerns


Chapter 2. Hanoi Vietnamese phonetics and phonology: Tonophone approach

Example 3 The games “-iếc hoá”, “láy” and “gieo vần”
• “-iếc hoá”: the rhyme is replaced by “-iếc” /iek/ to make a variant of a word in oral conversation, e.g. “toán” [twan] ⇒ variant “toán tiếc” [twan tiek] (math); “khoan” [xwan] ⇒ variant “khoan khiếc” [xwan xiek] (to drill)
• “láy”: reduplication of a whole syllable or part of a syllable
  – Reduplication of the initial, e.g. “nhạt nhẽo” [ñat ñEw] (insipid), “mênh mông” [meñ moN] (spacious).
  – Reduplication of the rhyme, e.g. “lung tung” [luN tuN] (in utter disorder), “loắt choắt” [lw˘at tCw˘at] (little).
  – Reduplication of both initials and rhymes (with or without tone), e.g. “tẻo teo” [tEw tEw] (tiny), “xinh xinh” [siñ siñ] (cute).
• “gieo vần”: repeating rhymes of the syllables of verses in a poem (strictly following prosody):
  Ao thu lạnh lẽo nước trong veo ([vEw])
  Một chiếc thuyền câu bé tẻo teo ([tEw])
  Sóng biếc theo làn hơi gợn tí
  Lá vàng trước gió sẽ đưa vèo ([vEw])
  (The poem “Thu điếu” by Nguyễn Khuyến)

lexical tones and their interaction with the other parts of the structure. Some researchers (Le, 1948)(Emeneau, 1951)(Vogel et al., 2004) either did not mention Vietnamese tones or did not consider them to be constituents of syllable structure, since they are not segments and hence cannot be treated as phonemes. However, most scholars, such as Doan (1977) and Doan (1999a), emphasized the role of lexical tones in Vietnamese syllable structure. Tones were then treated on a different level from that of segments. In Vietnamese, tones are a mandatory part of syllables and are crucial for distinguishing them. For instance, “ba” (father) – level tone, “bà” (grandmother) – falling tone, “bã” (residue) – broken tone, “bả” (bait) – curve tone, “bá” (aunt) – rising tone and “bạ” (strengthen) – drop tone are distinct syllables with different meanings. There are two opinions on the role of tones in syllable structure: (i) a tone is a prosodic feature, i.e. it carries the melody of the syllable but is not a component part of it; (ii) a tone is a non-linear part of the syllable, alongside the other, linear parts. The first view reflects a typical characteristic of polysyllabic languages, where all phonemes are sequentially combined and the melody of syllables can change with context. A Vietnamese syllable, however, can only carry one stable tone, which distinguishes it from other syllables. Each tone contributes to the morpheme and meaning of the syllable: arbitrarily modifying the tone of a syllable may destroy or falsify it. We conclude that, regarding the relationship among the main parts of the structure, tones are non-linear or suprasegmental, i.e. covering and adhering to the whole or a part of the syllable, while the other parts of the syllable are “linear” or segmental, i.e. continuously sequenced distinct segments (Doan, 1999a). Tones appear simultaneously with segmental phonemes to construct the complete structure of a syllable.
Tones in Vietnamese syllable structure play a typical and



distinguishing role, perfectly expressing the syllable as a fully-constituted entity, in contrast with intonational languages, e.g. Indo-European languages. Đoàn Xuân Kiên discussed whether the tone bears on the nucleus or on the whole syllable. Some scholars, such as Le (1948), believed that Vietnamese tones mostly adhere to the nucleus, while others held that they adhere to the whole syllable (Cao, 1975)(Doan, 1977). Trần Đỗ Đạt and his team (Tran et al., 2005) carried out a perception test using the Diagnostic Rhyme Test (DRT) method to investigate the effect of the tone on Vietnamese syllables. The study affirmed that the initial consonant does not carry tonal information and does not take part in the construction of the tone of the syllable. As a result, it was concluded that Vietnamese tones bear only on the rhyme of syllables.


Figure 2.3 – The concluded hierarchical structure of Vietnamese syllables.

From all the analyses of previous research on the structure of Vietnamese syllables, we adopt the hierarchical structure shown in Figure 2.3. There are two main parts in a syllable: an initial consonant and a rhyme. A tone appears simultaneously with the three segmental elements of the rhyme, i.e. medial, nucleus and ending. The nucleus and tone are compulsory, while the other elements are optional. As a result, the syllabic structure is (C1)(w)V(C2)+T, where C1 is an initial consonant, w is the semi-vowel /w/, V is a vowel or a diphthong, C2 is a final consonant or a semi-vowel /w/ or /j/, and T is a tone (1-4, 5a, 5b, 6a, 6b).
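The template (C1)(w)V(C2)+T lends itself to a simple mechanical check. The sketch below is an illustration only, not the thesis implementation; the phoneme lists are a small assumed subset of the inventory described later in this chapter, written in X-SAMPA-like ASCII symbols:

```python
import re

# Illustrative subset of the phoneme inventory (an assumption, not the
# full set adopted in this work), in X-SAMPA-like ASCII symbols.
INITIALS = ["ts\\", "t_h", "b", "d", "t", "k", "l", "m", "n", "s", "z", "h"]
NUCLEI = ["i@", "u@", "a", "e", "i", "o", "u", "E", "O"]
FINALS = ["p", "t", "k", "m", "n", "N", "w", "j"]
TONES = ["1", "2", "3", "4", "5a", "5b", "6a", "6b"]

def alt(items):
    # Longest-first alternation so multi-character symbols match greedily.
    return "|".join(re.escape(x) for x in sorted(items, key=len, reverse=True))

# The (C1)(w)V(C2)+T template: optional initial, optional medial /w/,
# compulsory nucleus, optional final, compulsory tone after a hyphen.
SYLLABLE = re.compile(
    rf"^(?P<initial>{alt(INITIALS)})?"
    rf"(?P<medial>w)?"
    rf"(?P<nucleus>{alt(NUCLEI)})"
    rf"(?P<final>{alt(FINALS)})?"
    rf"-(?P<tone>{alt(TONES)})$"
)

def parse_syllable(s):
    """Return the structural parts of a transcribed syllable, or None."""
    m = SYLLABLE.match(s)
    return m.groupdict() if m else None
```

The compulsory groups are exactly the nucleus and the tone; initial, medial and final may be absent, mirroring the structure above (e.g. an onset-less syllable such as “oan” [wan] parses with an empty initial).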

2.2.2 Syllable types

Based on the concluded syllable structure, there are 8 structure-based types of Vietnamese syllables, illustrated in Table 2.1. The nucleus is mandatory and combines with the other, optional elements to form 8 groups: (i) nucleus alone; (ii) initial+nucleus; (iii) medial+nucleus; (iv) nucleus+ending; (v) initial+medial+nucleus; (vi) initial+nucleus+ending; (vii) medial+nucleus+ending; (viii) initial+medial+nucleus+ending.

Table 2.1 – Structure-based types of Vietnamese syllables

  Initial  Medial  Nucleus  Ending  Examples
  -        -       v        -       “a” /a-1/ (ah), “ủ” /u-3/ (keep warm)
  v        -       v        -       “lá” /la-5a/ (leaf), “chờ” /tC7-2/ (wait)
  -        v       v        -       “oẹ” /we-6a/ (retch), “uy” /wi-1/ (prestige)
  -        -       v        v       “ích” /ik-6b/ (helpful), “ấy” /˘7j-5a/ (this)
  v        v       v        -       “loé” /lwe-6a/ (flash), “quá” /kwa-5a/ (over)
  v        -       v        v       “trẹo” /tCew-6a/ (sprain), “nhãn” /ñan-4/ (longan)
  -        v       v        v
  v        v       v        v       “nhuyễn” /ñwien-4/ (fine), “soát” /swat-5b/ (check)


2.3 Vietnamese phonological system

Vietnamese phonology has been the subject of lively debate and has drawn the attention of many researchers (Doan, 1977)(Thompson, 1987)(Nguyen and Edmondson, 1998)(Doan, 1999a)(Michaud, 2004)(Michaud et al., 2006)(Haudricourt, 2010)(Kirby, 2011). This section reviews the phonological system of modern Hanoi Vietnamese.

2.3.1 Initial consonants

Table 2.2 presents the 19 initial consonants we adopt for modern Hanoi Vietnamese. According to Kirby (2011, p. 382), although some previous treatments such as Thompson (1987) recognized an unaspirated, unaffricated palatal stop /c/, in the speech of many younger Vietnamese native speakers from Hanoi this segment is consistently realized as an affricate [tC]. In the initial position, during the production of both the palatal nasal [ñ] and the palatal affricate [tC], the “tongue body contacts the alveolar or post-alveolar region”. Numerous instances of initial consonants in modern Hanoi Vietnamese are presented in Example 4. In this dialect, the phonemes “ch-” /c/ and “tr-” /ú/ (the first item in Example 4) are “pronounced alike” (Thompson, 1987) and “completely merged in modern Hanoi Vietnamese” to [tC], although some varieties of Vietnamese maintain a distinction in the phonetic realizations of orthographic “ch-” and “tr-” (Kirby, 2011). The same merger applies to two other sets of phonemes: “x-” /s/ and “s-” /ù/ merge to [s] (the second item in Example 4), and “d-” /z/, “gi-” /ü/ and “r-” /r/ are all pronounced [z] (the third item in Example 4). The examples in the four last items illustrate the rest of the initial consonants in Hanoi Vietnamese.

Table 2.2 – Hanoi Vietnamese initial consonants

  Manner \ Place       Bilabial  Labiodental  Dental/Alveolar  Palatal  Velar  Glottal
  Stop/Plosive         p á                    t th â                    k
  Nasal                m                      n                ñ        N
  Affricate                                                    tC
  Fricative                      f v          s z                       x G    h
  Lateral-approximant                         l

In a smaller number of loanwords (mainly from French, or proper names from various languages), /p r ù/ occur, e.g. “pê-đan” [pe âan] (pédale in French), “ga-ra” [ga ra] (garage in French). Hence /p/ is adopted as a new phoneme in our work. However, /r/ is mostly realized as [z], although some speakers, especially young ones who speak foreign languages, maintain [r] in those loanwords. As a result, /r/ is not in the initial consonant set of our TTS system.

2.3.2 Final consonants

Hanoi Vietnamese allows eight phonemes in the final position: three unreleased voiceless obstruents /p t k/, three nasals /m n N/, and two approximants /j w/ 2 (Kirby, 2011). In the final position, /t n/ are “canonically alveolar”.

2. “Whether these segments are transcribed as final approximants /j w/ or as semivowels is largely a matter of analytic perspective. From a phonological standpoint, these segments may be regarded as approximants (consonants) on the grounds that they may not be followed by another consonant. However, these segments are articulated somewhat differently from the initial approximants, with a lesser degree of closure” (Kirby, 2011).

Example 4 Initial consonants in the modern Hanoi Vietnamese
• “ch” and “tr”: “cha” [tCa] (father), “tra” [tCa] (look up)
• “x” and “s”: “xa” [sa] (far), “sa” [sa] (fall)
• “d”, “gi” and “r”: “da” [za] (leather), “gia” [za] (increase), “ra” [za] (out)
• “bi” [ái] (marble), “mi” [mi] (eyelashes), “fi” [fi] (gallop), “vi” [vi] (tiny)
• “ti” [ti] (breast), “thi” [thi] (compete), “ni” [ni] (this), “ly” [li] (glass)
• “đi” [âi] (go), “nhi” [ñi] (pioneer), “nghi” [Ni] (doubt), “ky” [ki] (stingy)
• “khi” [xi] (when), “ghi” [Gi] (write), “hy” [hi] (in “hy hữu” – seldom)

Table 2.3 – Hanoi Vietnamese final consonants

  Manner \ Place  Bilabial  Alveolar  Palatal  Velar
  Stop            p         t                  k
  Nasal           m         n                  N
  Approximant     w                   j

Some final consonants have variations in phonetic realization, as described in Kirby (2011, p. 383). Although the stops /N k/ following /i e ˘E/ have sometimes been phonetically described as palatal [ñ c], they are actually pre-velar [N] and [k], with no point of alveolar contact. Following the back rounded vowels /u o O/, the velar stops /k N/ are produced as doubly articulated labial-velars [kp Nm]. In our work, the orthographies “-anh, -ách” are transcribed as [˘EN], [˘Ek], since the vowel is shortened from /E/ to [˘E]. Hence, the velar stops /N k/ are realized as pre-velar [N], [k] when they follow /i e ˘E/. There do exist a few instances of true velars following /E/, e.g. “xẻng” [sEN] (shovel) (Kirby, 2011).

Example 5 Final consonants in the modern Hanoi Vietnamese
• “chích” [tCik] (inject), “trách” [tC˘Ek] (blame)
• “chậc” [tC˘7k] (well), “chốc” [tCukp] (instant)
• “chanh” [tC˘EN] (lemon), “chênh” [tC˘EN] (tilted)
• “trang” [tCaN] (page), “chung” [tCuNm] (common), “trăng” [tC˘aN] (moon)
• “chan” [tCan] (souse), “chao” [tCaw] (swing), “chai” [tCaj] (bottle)
• “chiếc” [tCi@k] (a unit of), “chớp” [tC7p] (flash), “chum” [tCum] (jar)

For a better illustration of those final consonants with several variations, some instances can be found in Example 5. The finals of “chích”, “trách”, “chanh”, “chênh” in the first and third items of this example are realized as pre-velar: [tCik], [tC˘Ek], [tC˘EN], [tC˘EN]. The phonetic realizations of the finals of “chốc” and “chung” in the second and fourth items are labial-velars: [tCukp], [tCuNm]. The examples in the other items have the standard realizations given in Table 2.3.
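These context-dependent realizations amount to a small lookup keyed on the nucleus. The sketch below is purely illustrative (not the thesis code); the ASCII vowel labels and the allophone strings are assumptions:

```python
# Context-dependent realization of final velar consonants described above:
# pre-velar after the front vowels /i e/ and short /E/, doubly articulated
# labial-velar after the back rounded vowels /u o O/. "E_X" is an ASCII
# stand-in for the short vowel; allophone strings are illustrative labels.
FRONT_TRIGGERS = {"i", "e", "E_X"}
BACK_ROUNDED_TRIGGERS = {"u", "o", "O"}

def realize_final(final, nucleus):
    """Return the allophone of a final consonant given the nucleus vowel."""
    if final not in ("k", "N"):
        return final                              # only /k N/ alternate
    if nucleus in FRONT_TRIGGERS:
        return {"k": "k_+", "N": "N_+"}[final]    # pre-velar (advanced)
    if nucleus in BACK_ROUNDED_TRIGGERS:
        return {"k": "kp", "N": "Nm"}[final]      # doubly articulated
    return final                                  # plain velar elsewhere
```

So “chích” yields the pre-velar variant while “chốc” yields the labial-velar one, matching the pattern of Example 5.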

2.3.3 Medials or Pre-tonal sounds

A medial is the sound between the initial sound and the nucleus. This element is a lingual, semi-vowel segment, which affects the timbre of the syllable but has no syllabic character (Doan, 1977). Vietnamese has only one medial, transcribed /w/. It has the same structure as the vowel /u/, but does not serve as a nucleus in syllables. For instance, in “chót” [tCOt] (final), [O] is the nucleus of the syllable. Meanwhile, in “choắt” [tCw˘at] (small), as opposed to “chắt” [tC˘at] (decant), [˘a] is the nucleus and /w/ is the medial of the syllable [tCw˘at]. The nucleus of a syllable is mandatory and carries the lexical tone mark in the orthography (“-ó-”, “-ắ-”), while the medial is optional and stands right before the tone-bearing nucleus; it is hence also called a “pre-tonal” sound. The medial /w/ increases the volume of the syllable and also contributes to its tone (Nguyen, 2007). With a medial glide /w/, the rhyme is produced with a rounding of the nucleus. This rounding is transcribed as a superscript [ʷ] (Michaud et al., 2015). The medial therefore never precedes the rounded vowels /u o O ˘O/, and never follows the labial consonants /b m f/ except in loanwords. For instance, in “mua” [mu@] (buy), the nucleus is the diphthong /u@/ and there is no medial; meanwhile in “qua” [kʷa] (to pass), there is a medial /w/ and the vowel /a/ as nucleus. The orthography of the medial is “-u-” if it follows the initial consonant /k/, e.g. “quê” [kʷe] (hometown), “quay” [kʷ˘Ej] (whirl), “quyền” [kʷi@n] (power), or if it precedes the close or close-mid vowels /i i@ e/, as in “tuy” [tʷi] (however), “tuyến” [tʷi@n] (line), “huề” [hʷe] (draw). Its orthography is “-o-” if it precedes the open or open-mid vowels /a ˘a E ˘E/, as in “hoa” [hʷa] (flower), “xoăn” [sʷ˘an] (curly), “hoe” [hʷE] (reddish), “oanh” [w˘EN] (oriole).
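The spelling rule for the medial is a two-way choice on the following context. A purely illustrative sketch (not the thesis implementation; the ASCII vowel labels are assumptions):

```python
# Spelling rule for the medial /w/ stated above: written "u" after the
# initial /k/ (orthographic "qu-") or before close/close-mid vowels, and
# "o" before open/open-mid vowels. "a_X"/"E_X" stand in for short vowels.
CLOSE_OR_CLOSE_MID = {"i", "i@", "e"}
OPEN_OR_OPEN_MID = {"a", "a_X", "E", "E_X"}

def medial_letter(initial, nucleus):
    """Return the letter spelling the medial /w/, or None if it cannot occur."""
    if initial == "k" or nucleus in CLOSE_OR_CLOSE_MID:
        return "u"          # e.g. "quê", "tuy", "huề"
    if nucleus in OPEN_OR_OPEN_MID:
        return "o"          # e.g. "hoa", "hoe", "xoăn"
    return None             # /w/ never precedes rounded vowels /u o O/
```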

2.3.4 Vowels and diphthongs

The nucleus is the main and compulsory sound of the syllable. In Vietnamese, the nucleus is realized by vowels or diphthongs. Hanoi Vietnamese distinguishes nine long vowels /i e E a W 7 u o O/, four short vowels /˘E ˘a ˘7 ˘O/ (Doan, 1977) and three falling diphthongs /i@ W@ u@/ (Kirby, 2011). Diphthongs have the same function as vowels in the syllable (Doan, 1977)(Nguyen, 2007).

Table 2.4 – Hanoi Vietnamese vowels and diphthongs

  Elevation \ Position  Front        Central/Back  Back
                        (unrounded)  (unrounded)   (rounded)
  Close (high)          i, i@        W, W@         u, u@
  Close-mid             e            7, ˘7         o
  Open-mid              E, ˘E                      O, ˘O
  Open (low)            a, ˘a

Table 2.4 illustrates the attributes of Vietnamese vowels and diphthongs. All vowels are



unrounded except for the four back rounded vowels /u, o, O, ˘O/. The vowel written “ư” is considered close back unrounded /W/ (Doan, 1977, Thompson, 1987), although other researchers have indicated that it is more central than back (Brunelle, 2003, Han, 1966). The four short vowels /˘E ˘a ˘7 ˘O/, together with their corresponding “long” vowels /E a 7 O/, reflect an “interpretation as a vowel pair distinguished by phonemic length” in spite of their small spectral differences. It has “not been established that these differences are perceptually or psychoacoustically salient”, hence they are transcribed as instances of the same vowel quality (Kirby, 2011). It is economical to use a diacritic for the four short vowels and leave the nine long vowels unmarked (Michaud et al., 2015). The two obvious short vowels /˘a ˘7/ are easily identified from the orthography, “ă â” respectively. The /˘7/ is a close vowel, while the /˘a/ is an open one. One special case of the orthography “a” is that it is transcribed as /˘a/ in syllables with the rhymes “-ay -au”. Another notable feature of the system is the presence of two other short vowels, /˘E/ and /˘O/, corresponding to the orthography “a” in the rhymes “-anh, -ách” and “o” in the rhymes “-ong, -óc”. Instances of long-short vowel pairs are “xẻng” [sEN] (shovel) and “sảnh” [s˘EN] (hall), or “xoong” [soN] (sauce pan) and “xong” [sONm] (finish). The three diphthongs /i@ W@ u@/ are actually centralizing (Kirby, 2011, Michaud et al., 2015), which brings out the coherence of the system better, as illustrated in Figure 2.4. Doan (1977) and Haudricourt (2010) considered them front /ie/ or back /W7 uo/.

Figure 2.4 – Location of Vietnamese diphthong centroids (Kirby, 2011).

Many instances of vowels were also presented in the previous examples. In Example 6, the first two items illustrate the three diphthongs with (first item) and without (second item) finals. Some exceptional cases of vowels are presented in the last two items. The orthographic “-a-” is normally transcribed as /a/, as in “rang” [zaN] and “gian” [zan]. However, “-a-” followed by “-nh” or “-ch” is pronounced [˘E], e.g. “ranh” [z˘EN], “rách” [z˘Ek]. Similarly, “-a-” followed by “-u” or “-y” is pronounced [˘a], e.g. “rau” [z˘aw] (vegetables), “ray” [z˘aj] (rail).

Example 6 Vowels and diphthongs in the modern Hanoi Vietnamese
• Diphthongs: “tuyết” [tʷi@t] (snow), “tuốt” [tu@t] (pluck off), “tướt” [tW@t] (long)
• Diphthongs: “tia” [ti@] (ray), “tua” [tu@] (fringe), “tưa” [tW@] (fur)
• Vowels: “rang” [zaN] (roast), “ranh” [z˘EN] (mischievous), “rách” [z˘Ek] (torn)
• Vowels: “gian” [zan] (disloyal), “rau” [z˘aw] (vegetables), “ray” [z˘aj] (rail)
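This context-dependent reading of orthographic “-a-” can be stated as a small rule. A purely illustrative sketch (not the thesis G2P module; “E_X” and “a_X” are ASCII stand-ins for the short vowels):

```python
# Context rule for orthographic "-a-": /a/ by default, the short vowel
# written [E_X] here before "-nh"/"-ch", and the short [a_X] before
# "-u"/"-y", as described in the text above.
def transcribe_a(following_letters):
    """Phoneme for orthographic 'a' given the letters that follow it."""
    if following_letters in ("nh", "ch"):
        return "E_X"        # "ranh", "rách"
    if following_letters in ("u", "y"):
        return "a_X"        # "rau", "ray"
    return "a"              # "rang", "gian"
```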



2.4 Vietnamese lexical tones

2.4.1 Tone system

The Vietnamese tone system belongs to the pitch-plus-voice quality type, i.e. the tone is not defined solely in terms of pitch: it is a complex bundle of pitch contour and voice quality characteristics (...). The length of the vowel, and the presence or absence of a final nasal, have no influence on which tones the syllable can bear: there is no need to distinguish “heavy” syllables and “light” syllables, or to posit a division of the rhyme into morae. In contrast to the tone systems of African languages (e.g., typically, Niger-Congo family), in Vietnamese, as in many Asian languages, there are neither tone spreading and floating tones nor downstep. Unlike in some varieties of Chinese (...), there is no tone sandhi in Vietnamese. –Michaud (2004, p. 121)–

As presented in Section 2.2, each syllable carries one lexical tone, mainly affecting all elements of its rhyme. In the orthography, Hanoi Vietnamese has six different lexical tones, whose marks adhere to the nucleus element in the writing script. In our work, following previous studies and for ease of transcription, we adopt a numeric representation of the tones: (1) level tone, e.g. “ta” [ta-1] (we); (2) falling tone, e.g. “tà” [ta-2] (declining); (3) curve tone, e.g. “tả” [ta-3] (describe); (4) broken tone, e.g. “tã” [ta-4] (diaper); (5) rising tone, e.g. “tá” [ta-5] (dozen); and (6) drop tone, e.g. “tạ” [ta-6] (quintal). In spite of the six lexical tones of the writing system, Vietnamese actually distinguishes eight tones in phonetic realization: (i) six tones for sonorant-final syllables, and (ii) two tones for obstruent-final syllables, whose final consonants end in an unreleased oral stop /p t k/. “The historical developments that led to the complex tone system of present-day Vietnamese are by now a textbook example of tonogenesis, the various stages of the process” (Michaud, 2004, p. 121). The obstruent-final syllables may carry either of two tones: the rising or the drop tone.
For ease of comparison, the rising tone in sonorant-final syllables, i.e. syllables not ending in /p t k/, is represented as “5a”, while the rising tone in obstruent-final syllables is represented as “5b”. The same convention applies to the drop tone: “6a” for the drop tone in sonorant-final syllables and “6b” for the drop tone in obstruent-final syllables. Michaud (2004) adopted a representation with four categories: tones A, B and C for the three distinctive tones of sonorant-final syllables, while obstruent-final syllables constituted a fourth set of syllables, without distinctive tone: category D (see Table 2.5). “At a later stage, a second tonal split (bipartition) involving the disappearance of the opposition between the voiced and unvoiced initial consonants created the current paradigm of six tones for syllables with final sonorants (tones A1 through C2) and two tones for syllables with final obstruents (tones D1 and D2; more strictly speaking, two architonemes)”. The tone system of Hanoi Vietnamese, with the two naming conventions, is summarized in Table 2.5. Some constraints between tones and the other elements of syllables are described as follows (Ferlus, 2001, p. 298).
• Within syllables ending in vowels, all six tones can occur.
• Within syllables with nasal finals (-m -n -nh/-ng) and the ancient lateral final, only tones derived from the level (1), falling (2), rising (5a) and drop (6a) tones can occur in genuine



Table 2.5 – Hanoi Vietnamese tones (Ferlus, 2001, p. 298)

  Initials \ Finals   Voiced finals                                         Voiceless finals
  Voiceless initials  level (1 or A1)    rising (5a or B1)  curve (3 or C1)   rising (5b or D1)
  Voiced initials     falling (2 or A2)  drop (6a or B2)    broken (4 or C2)  drop (6b or D2)

Vietic words. Tones corresponding to the curve (3) and broken (4) tones only exist in words borrowed from Chinese, or in words of expressive origin.
• The curve (3) and broken (4) tones occur in syllables that are either vowel-final or carried the ancient final fricative.
• The tones in syllables with final plosives (-p -t -ch/-c) are realized with the same contour as the rising (5a) and drop (6a) tones, but they constitute a subsystem that contrasts, as a whole, with the subsystem of voiced-final syllables.
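The 5a/5b and 6a/6b convention described above is a deterministic function of the final phoneme. A minimal sketch (an illustration, not the thesis code):

```python
# Tone-labelling convention above: orthographic tones 5 (rising) and
# 6 (drop) split into "a" (sonorant-final syllables) and "b"
# (obstruent-final syllables ending in /p t k/); tones 1-4 are unchanged.
OBSTRUENT_FINALS = {"p", "t", "k"}

def tone_label(tone_number, final):
    """Map an orthographic tone (1..6) and final phoneme to the 8-way label."""
    if tone_number in (5, 6):
        suffix = "b" if final in OBSTRUENT_FINALS else "a"
        return f"{tone_number}{suffix}"
    return str(tone_number)
```

For instance, “bát” (final /t/, rising tone) is labelled 5b, while “bá” (no obstruent final) is labelled 5a, matching the convention of Table 2.5.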

2.4.2 Phonetics and phonology of tone

A schematic representation of the eight tone templates of Hanoi Vietnamese, based on data from one speaker of Michaud (2004), is illustrated in Figure 2.5. The widely cited descriptions by Doan (1977), Kirby (2011), Michaud (2004), Michaud et al. (2006), Nguyen and Edmondson (1998) and Thompson (1987) give the following account, which is also summarized in Table 2.6. The terms “glottal stop”, “glottal constriction”, “creaky voice/laryngealization” and “glottalization”, used below to describe the voice qualities of Vietnamese tones, are adopted and characterized phonetically from the work of Michaud (2004, p. 120).
• Glottal stop is a gesture of closure that has limited coarticulatory effects on the voice quality of the surrounding segments,
• Glottal constriction (also referred to here as glottal interrupt) is a tense gesture of adduction of the vocal folds that extends over the whole of a syllable rhyme,
• Laryngealization (i.e. lapse into creaky voice), resulting in irregular vocal fold vibration, is not tense in itself,
• Glottalization is used as a cover term for laryngealization and glottal constriction.
–Michaud (2004, p. 120)–

Tone 1 (A1), level tone (“ngang”), is symbolized in the writing system by the absence of any tone mark, e.g. “ba” [áa-1] (three), “khuya” [xwie-1] (late). Its contour is “nearly level in non-final syllables not accompanied by heavy stress, although even in these cases it probably trails downward slightly. It starts just slightly higher than the mid point of the normal speaking voice range” (Thompson, 1987). This tone “today in the capital is not often lax but usually modal in voice quality” (Nguyen and Edmondson, 1998).



Tone 2 (A2), falling tone (“huyền”), is represented by the grave accent ( ` ), e.g. “bà” [áa-2] (grandmother), “tuần” [tw˘7n-2] (week). It “starts quite low and trails downward towards the bottom of the voice range. This tone is lax and often accompanied by a kind of breathy voicing (voiceless + modal), reminiscent of a sigh” (Thompson, 1987). For some speakers, tone 2 is “even lax to the point of breathiness with somewhat lowered sub-glottal air pressure” (Nguyen and Edmondson, 1998).

Figure 2.5 – Eight tone templates of Vietnamese tones (Michaud, 2004): A1 (level tone 1), A2 (falling tone 2), C1 (curve tone 3), C2 (broken tone 4), B1 (rising tone in sonorant-final syllables – 5a), D1 (rising tone in obstruent-final syllables – 5b), B2 (drop tone in sonorant-final syllables – 6a) and D2 (drop tone in obstruent-final syllables – 6b).

Tone 3 (C1), curve tone (“hỏi”), is marked by the hook accent, made of the top part of a question mark ( ? ), e.g. “bả” [áa-3] (bait), “chuyển” [tCwien-3] (to move). It starts on the lowest F0 value among the 8 tones and “varies between a High-Falling-Rising realization and a Falling realization with final laryngealization” (Michaud, 2004). “In final syllables, and especially in citation forms, this is followed by a sweeping rise at the end, and for this reason it is often called the ‘dipping’ tone. However, non-final syllables seem only to have a brief level portion at the end, and this is exceedingly elusive in rapid speech” (Thompson, 1987). Although tone 3 is usually described as a low falling and then rising tone, not all Vietnamese speakers produce the rising part. The curve tone starts with modal phonation, which moves increasingly toward tense voice with accompanying harsh voice (although the harsh voice seems to vary with the speaker).

Tone 4 (C2), broken tone (“ngã”), is written as a tilde ( ˜ ), e.g. “bã” [áa-4] (residue), “quẫn” [kw˘7n-4] (muddle). It starts higher than the falling tone (2), and even than the level tone (1), and rises. The broken tone has “medial glottal constriction and ends on a high F0 value” (Michaud, 2004). It is accompanied by the rasping voice quality occasioned by tense glottal stricture. The curve and broken tones are “both tense but their tension is not alike and is not distributed across the syllable in the same way” (Nguyen and Edmondson, 1998).

As for the rising tones (“sắc”), they are symbolized by the acute accent ( ´ ). Tone 5a (B1), the rising tone in sonorant-final syllables, e.g. “bá” [áa-5a] (aunt), “choáng” [tCwaN-5a] (swanky), starts higher than the falling tone (2) but lower than the level tone (1). It trails



Table 2.6 – Vietnamese tones

  Tone   Name           Register  F0 contour        Duration  Phonation
  1  A1  Ngang Level    High-Mid  Level             Long      Modal
  2  A2  Huyền Falling  Low       Slightly Falling  Long      Lax
  3  C1  Hỏi Curve      Low       Falling           Long      Tense
  4  C2  Ngã Broken     High      Falling-Rising    Long      Glottal
  5a B1  Sắc Rising     High      Rising            Long      Modal
  5b D1  Sắc Rising     High      Sharply Rising    Short     Tense
  6a B2  Nặng Drop      Low       Dropping          Short     Glottal
  6b D2  Nặng Drop      Low       Sharply Dropping  Short     Tense

upward, rising from the middle of the syllable. Phonologically, tone 5a is produced with modal voice (Michaud, 2004). Tone 5b (D1) is the rising tone in obstruent-final syllables, e.g. “bát” [áat-5b] (bowl), “bắp” [b˘ap-5b] (muscle), “bách” [b˘Ek-5b] (cypress), “bác” [bak-5b] (elder uncle). This tone starts on the highest F0 value among the 8 tones and rises sharply. Tone 5b is tense and much shorter than the other tones (Thompson, 1987).

Tone 6a (B2) and tone 6b (D2), the drop tone (“nặng”), are represented by a subscript dot ( . ). The drop tone is much shorter than the other tones, with a tendency to go lower. Tone 6a (B2) is the drop tone in sonorant-final syllables, e.g. “bạ” [áa-6a] (strengthen), “thường” [thuoN-6a] (frequent). It also starts high, rises slightly at the beginning, then drops very sharply and is almost immediately cut off by a strong glottal “constriction that is distinct from creaky voice” (Michaud, 2004). Syllables bearing tone 6a have the same rasping voice quality as the broken tone 4. Tone 6b (D2) is the drop tone in syllables with final stops /p t k/, e.g. “bạt” [áat-6b] (canvas), “bẹp” [bep-6b] (crushed), “bịch” [bik-6b] (basket), “bạc” [bak-6b] (silver). This tone drops a little more sharply than tone 2, but it is never accompanied by the breathy quality of that tone. It is also tense (Thompson, 1987).

From an experimental point of view, the study of Michaud (2004) confirmed that “precise and reliable information on phonation type (voice quality) can be obtained from electroglottography”, as well as that “voice quality is a robust correlate of tone in Vietnamese, showing less variability than F0 across reading conditions”. His experiment warranted the conclusion that tones 5b and 6b (i.e. the tones of syllables ending in /p t k/) are “not glottalized, either in final or non-final position”.

2.4.3 Tonal coarticulation

The above discussion of Vietnamese lexical tones mostly concerned their static characteristics. Tones are more complicated in dynamic conditions, when syllables are produced in continuous speech with phonetic coarticulation effects. Tran and Castelli (2010) analysed the influence of coarticulation on the variations of tones in continuous speech, and proposed F0 contour generation based on the influence of both adjacent tones. Brunelle (2003, 2009) reported that although tonal height coarticulation is bidirectional, progressive tonal coarticulation is much stronger than anticipatory coarticulation in Northern Vietnamese.

2.5 Grapheme-to-phoneme rules

Based on the literature review on Vietnamese phonetics and phonology and some further studies, we present here the main G2P rules needed for building a pronunciation dictionary for Vietnamese (Section 2.7) and for building the G2P conversion module of our TTS system (Chapter 5).

2.5.1 X-SAMPA representation

Since IPA symbols are not appropriate representations for computer-based processing, SAMPA (Speech Assessment Methods Phonetic Alphabet) is commonly adopted to work around the inability of text encodings to represent IPA symbols in TTS. In this work, the X-SAMPA (Extended SAMPA) inventory (Gibbon et al., 1997), an extension of SAMPA that covers the entire range of IPA characters, was adopted for coding the phonemes of our Vietnamese TTS system. Table 2.7 and Table 2.8 show the Vietnamese phonemes in both IPA and X-SAMPA for ease of comparison. These mappings were developed from the work of Tran (2007b), with a number of adaptations and extensions.
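For illustration, a small subset of this IPA-to-X-SAMPA mapping can be written as a lookup table. This is a sketch in Python; the dictionary and function names are ours, and the coverage is deliberately partial (cf. Tables 2.7 and 2.8 for the full inventory):

```python
# Illustrative subset of the IPA -> X-SAMPA mapping for Vietnamese phonemes.
IPA_TO_XSAMPA = {
    "ɓ": "b",     # initial "b"
    "tʰ": "t_h",  # aspirated initial "th"
    "tɕ": "ts\\", # initial "tr", "ch"
    "ɲ": "J",     # initial "nh"
    "ŋ": "N",     # "ng", "ngh"
    "ŋ͡m": "Nm",   # final "ng" after rounded vowels
    "k͡p": "kp",   # final "c" after rounded vowels
    "ɯ": "M",     # vowel "ư"
    "ɤ": "7",     # vowel "ơ"
    "ɛ": "E",     # vowel "e"
}

def to_xsampa(ipa_phones):
    """Convert a sequence of IPA phones to X-SAMPA symbols (identity if unmapped)."""
    return [IPA_TO_XSAMPA.get(p, p) for p in ipa_phones]

print(to_xsampa(["ɓ", "a", "k͡p"]))  # ['b', 'a', 'kp']
```

A plain dictionary suffices here because the mapping is one-to-one at the phone level; ambiguity only arises earlier, at the grapheme-to-phoneme stage.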

2.5.2 Rules for consonants

Vietnamese consonants have a set of well-defined grapheme-to-phoneme rules for both initial and final positions. Table 2.7 shows the initial and final consonants with their graphemes (orthography) and respective phonemes. Most graphemes convert directly to their corresponding phonemes: “b, đ, x, s, g, gh, kh, l, v, th, d, gi, r, ph, tr, h, q, k” in initial position, and “t, p, n, m, nh” in both initial and final positions.

Table 2.7 – Hanoi Vietnamese initial/final consonants: grapheme (orthography) to phoneme

No.  Grapheme   Position              IPA   X-SAMPA
1    b          initial only          ɓ     b
2    đ          initial only          ɗ     d
3    x, s       initial only          s     s
4    g, gh      initial only          ɣ     G
5    kh         initial only          x     x
6    l          initial only          l     l
7    v          initial only          v     v
8    th         initial only          tʰ    t_h
9    d, gi, r   initial only          z     z
10   ph         initial only          f     f
11   tr, ch     initial only          tɕ    ts\
12   h          initial only          h     h
13   c, k, q    initial only          k     k
14   t          initial, final        t     t
15   p          initial, final        p     p
16   n          initial, final        n     n
17   ch         final after i, ê, a   k̟     k_+
18   c          final after u, o, ô   k͡p    kp
19   ch, c      final except 17, 18   k     k
20   m          initial, final        m     m
21   nh         final after i, ê, a   ŋ̟     N_+
22   nh         initial only          ɲ     J
23   ng, ngh    initial only          ŋ     N
24   ng         final after u, o, ô   ŋ͡m    Nm
25   ng         final except 24       ŋ     N

The remaining graphemes have well-defined rules with a few exceptional cases:
• For the grapheme “gi”: in initial position, if it is followed by a consonant, by “ê”, or by nothing, “gi” is converted to /zi/; otherwise to /z/.
• For the graphemes “ng, ngh”:
  – In initial position, “ng” and “ngh” are converted to [ŋ];
  – In final position, if the nucleus is a back rounded vowel /u o ɔ/, “ng” is converted to [ŋ͡m]; otherwise to [ŋ].


• For the grapheme “nh”:
  – In initial position, “nh” is converted to [ɲ];
  – In final position, “nh” is converted to [ŋ̟] (“nh” is in final position if and only if the nucleus is “i”, “ê” or “a”).
• For the grapheme “ch”:
  – In initial position, “ch” is converted to /tɕ/;
  – In final position, “ch” is converted to [k̟] (“ch” is in final position if and only if the nucleus is “i”, “ê” or “a”).
• For the grapheme “c”:
  – In initial position, “c” is converted to /k/;
  – In final position, if the nucleus is a back rounded vowel /u o ɔ/, “c” is converted to [k͡p]; otherwise to [k].
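The context-dependent cases for the final consonants can be sketched as follows. The function and set names are ours (not the thesis's actual implementation), and the output uses X-SAMPA:

```python
# Sketch of the context-dependent G2P rules for Vietnamese final consonants.
ROUNDED_BACK = {"u", "o", "O"}  # back rounded nuclei /u o ɔ/ (X-SAMPA)

def final_consonant(grapheme, nucleus):
    """Map a final-consonant grapheme to X-SAMPA, given the nucleus phoneme."""
    if grapheme == "ng":
        # labial-velar [Nm] after rounded back vowels, plain [N] otherwise
        return "Nm" if nucleus in ROUNDED_BACK else "N"
    if grapheme == "c":
        # labial-velar [kp] after rounded back vowels, plain [k] otherwise
        return "kp" if nucleus in ROUNDED_BACK else "k"
    if grapheme == "ch":  # occurs finally only after i, ê, a: pre-velar stop
        return "k_+"
    if grapheme == "nh":  # occurs finally only after i, ê, a: pre-velar nasal
        return "N_+"
    raise ValueError(f"unknown final grapheme: {grapheme}")

print(final_consonant("ng", "o"))  # Nm  (as in "ong")
print(final_consonant("c", "a"))   # k   (as in "ac")
```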

2.5.3 Rules for vowels/diphthongs

Most vowels and diphthongs also have direct G2P rules, illustrated in Table 2.8. The graphemes “e”, “ê”, “i, y”, “oo”, “ô”, “ơ”, “u”, “ư”, “ă” and “â” are respectively converted to the vowels [ɛ], [e], [i], [ɔ], [o], [ɤ], [u], [ɯ], [ă] and [ɤ̆]. The diphthong [iə] can be written with any of the orthographies “ia”, “iê”, “yê” and “ya”. The graphemes “ua” and “uô” are converted to [uə], while “ưa” and “ươ” are the orthographies of the diphthong [ɯə].

Table 2.8 – Hanoi Vietnamese vowels/diphthongs: grapheme (orthography) to phoneme

Vowel type    Grapheme (orthography)   IPA   X-SAMPA
Long vowel    a                        a     a
Long vowel    e                        ɛ     E
Long vowel    ê                        e     e
Long vowel    i, y                     i     i
Long vowel    o, oo                    ɔ     O
Long vowel    ô, ôô                    o     o
Long vowel    ơ                        ɤ     7
Long vowel    u                        u     u
Long vowel    ư                        ɯ     M
Short vowel   ă, a (au, ay)            ă     a_X
Short vowel   â                        ɤ̆     7_X
Short vowel   a (anh, ach)             ɛ̆     E_X
Short vowel   o (ong, oc)              ɔ̆     O_X
Diphthong     ia, iê, yê, ya           iə    i@
Diphthong     ua, uô                   uə    u@
Diphthong     ưa, ươ                   ɯə    M@

Only two vowel graphemes, “o” and “a”, have more complicated G2P rules:
• For the grapheme “o”: if it is followed by “ng” or “c”, the phoneme is [ɔ̆]; otherwise [ɔ].
• For the grapheme “a”: if it is followed by “nh” or “ch”, the phoneme is [ɛ̆]; if followed by “u” or “y”, the phoneme is [ă]; otherwise [a].
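These two exceptions only need a one-token lookahead at what follows the vowel. As a sketch (function names are ours; the output uses X-SAMPA, where `_X` marks the short vowels):

```python
# Sketch of the two context-sensitive vowel graphemes of Vietnamese G2P.
def vowel_o(following):
    # "o" before "ng"/"c" is the short [O_X] (ɔ̆), otherwise the long [O] (ɔ)
    return "O_X" if following in ("ng", "c") else "O"

def vowel_a(following):
    if following in ("nh", "ch"):
        return "E_X"   # short [ɛ̆], as in "anh", "ach"
    if following in ("u", "y"):
        return "a_X"   # short [ă], as in "au", "ay"
    return "a"         # long [a] otherwise

print(vowel_a("nh"))  # E_X
print(vowel_o("ng"))  # O_X
```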

2.6 Tonophone set

In HMM-based speech synthesis, many contextual features (e.g. phone identity, locational features) are used to build context-dependent HMMs. However, due to the exponential increase of contextual feature combinations, model parameters cannot be estimated accurately with limited training data. Furthermore, it is impossible to prepare a speech database that includes all combinations of contextual features, and the frequency of appearance of each context-dependent unit varies greatly. To alleviate these problems, decision-tree based context clustering is the most common technique for clustering HMM states and sharing model parameters among the states of each cluster. Each node (except leaf nodes) in a decision tree holds a context-related question. Acoustic attributes of phonemes and contextual features are used to build these questions, such as “Is the current part of speech a noun?” or “Is the previous phoneme voiced?”. Nevertheless, because the HMM clustering is automatic, the semantics of the contextual features (e.g. their importance or weight) may not be well considered. Hence, some crucial features of Vietnamese, such as lexical tones, may not receive proper priority when building the decision tree, which may lessen their impact on the quality of the synthetic speech. To the best of our knowledge, in other work and for other tonal languages, lexical tones may be explicitly modeled in a TTS system (Shih and Kochanski, 2000; Do and Takara, 2003; Tran and Castelli, 2010). For Thai, tone correctness of the synthetic speech may be improved by investigating the structures of the decision tree with tone information in the tree-based context clustering of HMM training (Chomphan and Kobayashi, 2008; Chomphan and Chompunth, 2012; Moungsri et al., 2014).

Given the complexity of the eight lexical tones of the Vietnamese tone system (cf. Section 2.4), tone modeling for continuous speech has been a challenging problem. In this work, due to the importance of Vietnamese lexical tones, we proposed a new speech unit, the “tonophone”, which takes the lexical tone into account. We assumed that this new speech unit could model the tonal contexts of allophones at the high-level synthesis stage. The unit was also essential for corpus design and for automatic labeling, since both tasks require a basic speech unit.
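To make the clustering idea concrete, here is a hypothetical sketch of how tone-related questions can be asked of a context-dependent unit during tree building. The feature layout and question names are ours for illustration; this is not the actual HTS question-file format:

```python
# Hypothetical context-clustering questions including lexical tone
# (in the spirit of Chomphan & Kobayashi, 2008). A real system evaluates
# such questions on every state to choose the best split.
context = {"phone": "E", "tone": "5b", "prev_phone": "w", "next_phone": "n"}

questions = {
    "C-Tone_Rising": lambda c: c["tone"] in ("5a", "5b"),
    "C-Tone_Drop":   lambda c: c["tone"] in ("6a", "6b"),
    "L-Is_Sonorant": lambda c: c["prev_phone"] in ("w", "j", "m", "n", "N"),
}

# Each tree node splits the data on the answer to its question:
answers = {name: q(context) for name, q in questions.items()}
print(answers)  # {'C-Tone_Rising': True, 'C-Tone_Drop': False, 'L-Is_Sonorant': True}
```

Exposing the tone as an explicit feature (rather than leaving it implicit in the phone label) is what lets the tree give tone questions the priority the text argues for.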

2.6.1 Tonophone

In this work, to build a phone set for the TTS system, allophones – phonetic realizations of phonemes – were used. For example, in final position, the velar stops /ŋ k/ are produced as the doubly articulated labial-velars [ŋ͡m k͡p] after the back rounded vowels /u o ɔ/, and as the pre-velars [ŋ̟] and [k̟] after /i e ɛ̆/. As a result, the six phones for the two phonemes /ŋ k/ are [ŋ ŋ̟ ŋ͡m k k̟ k͡p].

As aforementioned, lexical tones in the Vietnamese syllable structure play a distinctive role that sets the language apart from intonational languages, e.g. Indo-European languages. The experimental results of Tran et al. (2005) affirmed that the initial consonant does not take part in the construction of the tone of the syllable; hence the Vietnamese tone affects only the rhyme of the syllable. In conclusion, the Vietnamese tone is non-linear or suprasegmental, i.e. covering and adhering to the rhyme of the syllable, while the other parts of the syllable are “linear” or segmental, i.e. continuously sequenced distinct segments. The lexical tone appears simultaneously with the segmental phonemes of the rhyme to construct the complete structure of the syllable.


Due to the crucial role of Vietnamese lexical tones, not only for the bearing syllable but also for the phonemes of its rhyme, a new speech unit, the “tonophone”, was proposed: an allophone considered together with its lexical tone. To construct tonophones, the lexical tone is attached to all allophones of the rhyme, while the initial consonant keeps its form without any tone information. Tonophones thus emphasize the role of lexical tones and reflect the corresponding allophones in their tonal contexts. We believed that this new speech unit might give more precise analysis/design and better synthetic speech. For instance, the syllable “ngoèo” [ŋwɛw-2] (in “ngoằn ngoèo” – zigzagging), carrying the broken tone 2, is composed of [ŋ] as the initial, [w] as the medial, [ɛ] as the nucleus and [w] as the final. Represented in tonophones, its transcription is [ŋ w2 ɛ2 w2-2].
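The construction rule is simple enough to state as code. This is a sketch with our own naming, using X-SAMPA symbols and tone suffixes as in the example above:

```python
# Sketch of tonophone construction: the tone attaches to every allophone
# of the rhyme, while the initial consonant stays tone-less.
def to_tonophones(initial, rhyme, tone):
    units = []
    if initial:
        units.append(initial)               # no tone on the initial consonant
    units += [f"{p}{tone}" for p in rhyme]  # tone adheres to each rhyme phone
    return units

# "ngoèo" [NwEw-2]: initial [N], rhyme [w E w], broken tone 2
print(to_tonophones("N", ["w", "E", "w"], "2"))  # ['N', 'w2', 'E2', 'w2']
```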

2.6.2 Tonophone set

Table 2.9 shows how the Vietnamese allophones combine with the possible lexical tones to build the tonophone set, based on the literature review in the previous sections. As aforementioned, since the initial consonant does not carry the lexical tone, the 19 initial consonants do not adhere to any tone when forming their tonophones. Meanwhile, the medial [w] and the 16 nuclei (including the 3 diphthongs [iə uə ɯə]) were combined with all 8 tones, as these elements appear in both sonorant- and obstruent-ending syllables.

Table 2.9 – Vietnamese tonophone set

Syllable element    Allophones                                 Lexical tones         Tonophone #
Initial consonant   p ɓ t tʰ ɗ k m n ɲ ŋ tɕ f v s z x ɣ h l   (not adhering)        19 × 1 = 19
Medial              w                                          1-4, 5a, 5b, 6a, 6b   1 × 8 = 8
Nucleus             i e ɛ a u o ɔ ɯ ɤ ă ɛ̆ ɔ̆ ɤ̆ iə uə ɯə        1-4, 5a, 5b, 6a, 6b   16 × 8 = 128
Final consonant     p t k k͡p k̟                                 5b, 6b                5 × 2 = 10
Final consonant     w j m n ŋ ŋ͡m ŋ̟                             1-4, 5a, 6a           7 × 6 = 42
Total                                                                                207

Obstruent-final syllables may carry only one of two tones: the rising tone 5b or the drop tone 6b. Hence, only these two tones were attached to the 5 allophones that are unreleased final stops [p t k k͡p k̟]. The other 7 allophones in final position (including the 2 semi-vowels [w j]) were combined with the 6 tones 1-4, 5a and 6a. As a result, there are 48 Vietnamese allophones (without considering the lexical tones), and a total of 207 tonophones were constructed for the tonophone set of our work.
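The inventory size of Table 2.9 can be recomputed directly:

```python
# Recomputing the tonophone inventory of Table 2.9.
initials = 19 * 1   # initial consonants, no tone attached
medial   = 1 * 8    # the glide [w], all eight tones
nuclei   = 16 * 8   # 13 monophthongs + 3 diphthongs, all eight tones
stops    = 5 * 2    # unreleased final stops, tones 5b/6b only
finals   = 7 * 6    # sonorant finals and semi-vowels, six tones
total = initials + medial + nuclei + stops + finals
print(total)  # 207
```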

2.6.3 Acoustic-phonetic tonophone set

An acoustic-phonetic unit set of the target language is an important input for a TTS system, especially for the HMM-based approach. It is intended to represent every speech segment that is clearly bounded from an acoustic point of view, as well as every speech segment that is phonetically significant, even if it is not clearly bounded. The two main usages of this set are in (i) HMM clustering using phonetic decision trees, and (ii) automatic labeling, i.e. automatic segmenting and force-aligning the speech corpus with the orthographic transcriptions.


In HMM-based speech synthesis, as aforesaid, context clustering is an important process in the training phase to address the problem of limited training data. Acoustic attributes of phones are crucial information for building the questions at tree nodes, and the construction of the decision trees contributes greatly to the quality of the synthetic speech. With the increasing size of speech databases, manual phonemic segmentation and labeling of every utterance became unfeasible. Thus, automatic labeling was an important task in building an annotated corpus for TTS systems. The acoustic-phonetic unit set is used to model and to identify clear acoustic events, which an expert phonetician would mark as boundaries in a manual segmentation session. It is therefore essential for obtaining a good labeling.

Table 2.10 – Hanoi Vietnamese acoustic-phonetic tonophones – consonants

No.  Consonant          Manner                Place          Voicing
1    ɓ                  Plosive               Bi-labial      Voiced
2    ɗ                  Plosive               Alveolar       Voiced
3    tʰ                 Plosive               Dental         Voiceless
4    p, p{5b, 6b}       Plosive               Bi-labial      Voiceless
5    t                  Plosive               Dental         Voiceless
6    t{5b, 6b}          Plosive               Alveolar       Voiceless
7    k, k{5b, 6b}       Plosive               Velar          Voiceless
8    k̟{5b, 6b}          Plosive               Pre-velar      Voiceless
9    k͡p{5b, 6b}         Plosive               Labial-velar   Voiceless
10   f                  Fricative             Labio-dental   Voiceless
11   v                  Fricative             Labio-dental   Voiced
12   h                  Fricative             Glottal        Voiceless
13   x                  Fricative             Velar          Voiceless
14   ɣ                  Fricative             Velar          Voiced
15   s                  Fricative             Alveolar       Voiceless
16   z                  Fricative             Alveolar       Voiced
17   tɕ                 Affricative           Palatal        Voiceless
18   m, m{1-4,5a,6a}    Nasal                 Bi-labial      Voiced
19   n                  Nasal                 Dental         Voiced
20   n{1-4,5a,6a}       Nasal                 Alveolar       Voiced
21   ŋ, ŋ{1-4,5a,6a}    Nasal                 Velar          Voiced
22   ŋ̟{1-4,5a,6a}       Nasal                 Pre-velar      Voiced
23   ŋ͡m{1-4,5a,6a}      Nasal                 Labial-velar   Voiced
24   ɲ                  Nasal                 Palatal        Voiced
25   l                  Lateral-approximant   Dental         Voiced
26   w{1-4,5a,6a}       Approximant           Labial-velar   Voiced
27   j{1-4,5a,6a}       Approximant           Palatal        Voiced
Based on the phonetic and phonological system of Vietnamese, an acoustic-phonetic unit set for Vietnamese was built with the main phonetic attributes of both consonants and vowels. For consonants, the attributes were: (i) place of articulation, i.e. labial, labio-dental, alveolar, retroflex, palatal, labial-velar, dental, and velar; (ii) manner of articulation, i.e. nasal, stop/plosive, fricative, affricative, approximant, and lateral-approximant; and (iii) voicing, i.e. voiced and voiceless. The phonetic attributes for vowels included: (i) position of the tongue, i.e. front, central, and back; (ii) height, i.e. close (high vowels), close-mid, open-mid, and open (low vowels); (iii) length, i.e. short, long, and diphthong; and (iv) roundedness, i.e. rounded


Table 2.11 – Hanoi Vietnamese acoustic-phonetic tonophones – vowels. Each vowel combines with all eight tones {1-4, 5a, 5b, 6a, 6b}.

No.  Vowel    Position   Height      Length      Roundedness
1    i        Front      Close       Long        Unrounded
2    ɯ        Back       Close       Long        Unrounded
3    u        Back       Close       Long        Rounded
4    e        Front      Close-mid   Long        Unrounded
5    ɤ        Back       Close-mid   Long        Unrounded
6    ɤ̆        Back       Close-mid   Short       Unrounded
7    o        Back       Close-mid   Long        Rounded
8    ɛ        Front      Open-mid    Long        Unrounded
9    ɛ̆        Front      Open-mid    Short       Unrounded
10   a        Front      Open        Long        Unrounded
11   ă        Front      Open        Short       Unrounded
12   ɔ        Back       Open-mid    Long        Rounded
13   ɔ̆        Back       Open-mid    Short       Rounded
14   iə, ɯə   Central    Close       Diphthong   Unrounded
15   uə       Central    Close       Diphthong   Rounded
and unrounded. We assumed that the phonetic features of phones and tonophones were similar. Based on the literature review in Section 2.3, a complete acoustic-phonetic tonophone set of Hanoi Vietnamese was built and is illustrated in Table 2.10 (consonants) and Table 2.11 (vowels).

2.7 PRO-SYLDIC, a pronounceable syllable dictionary

There was a need to build a syllable e-dictionary whose entries are syllables with their transcriptions. This dictionary was used in the natural language processing (high-level synthesis) part of our TTS system. Its main purpose was to transcribe Vietnamese text. It could also be used to filter pronounceable syllables in order to extract non-standard words (i.e. tokens that cannot be directly transcribed to phonemes, e.g. numbers, dates, abbreviations). Therefore, pairs of syllable orthography and transcription were automatically generated, mainly based on (i) the G2P rules (Section 2.5), (ii) the syllable-forming orthographic rules, and (iii) the list of rhymes in Table 2.12. Tonophones were used as the speech unit in the transcriptions, e.g. “nhuyễn” /ɲ w4 iə4 n4-4/ (fine). However, for simplicity, the transcriptions in this section are represented with allophones, without regard to lexical tones, e.g. “nhuyễn” /ɲwiən-4/ (fine).
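As a minimal sketch of this generation process, initials can be crossed with rhymes and their compatible tones. The data below is a tiny illustrative subset (not the real rhyme list), the orthographic adjustment rules of Section 2.7.1 are omitted, and all names are ours:

```python
# Minimal sketch of PRO-SYLDIC entry generation: cross initials with rhymes
# and the tones each rhyme type may bear.
initials = {"b": "b", "t": "t", "": ""}   # grapheme -> X-SAMPA ("" = no initial)
rhymes = {"an": ("an", "sonorant"),       # sonorant-final rhyme
          "at": ("at", "obstruent")}      # obstruent-final rhyme
tones = {"sonorant": ["1", "2", "3", "4", "5a", "6a"],
         "obstruent": ["5b", "6b"]}

entries = []
for g_ini, p_ini in initials.items():
    for g_rhy, (p_rhy, kind) in rhymes.items():
        for t in tones[kind]:
            # pair: (orthographic syllable base, allophone transcription + tone)
            entries.append((g_ini + g_rhy, f"{p_ini}{p_rhy}-{t}"))

print(len(entries))  # 3 initials x (1x6 + 1x2 tones) = 24
```

The real dictionary follows the same cross-product logic over 19 initials and 772 tone-bearing rhymes, plus the orthographic rules (I1, M-rules, F-rules) to spell each syllable correctly.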

2.7.1 Syllable-orthographic rules

The following syllable-orthographic rules of the Vietnamese language were essential for building the list of rhymes and PRO-SYLDIC.
• For initial consonants:
  – I1: “ngh” vs. “ng” (/ŋ/), “gh” vs. “g” (/ɣ/), and “k” vs. “c” (/k/): if the nucleus is /i e ɛ/, the initial consonant is written “ngh”, “gh”, or “k” respectively; otherwise it is written “ng”, “g”, or “c”;

70

Chapter 2. Hanoi Vietnamese phonetics and phonology: Tonophone approach

  – I2: The labial onsets are never accompanied by a secondary labial articulation [w]; for example, “hoa” [hwa] (flower) exists but “boa” [ɓwa] does not.
• For the medial (pre-tonal) sound /w/:
  – M1: The orthography “u” either follows the grapheme “q” or precedes a narrow or quite narrow nucleus, i.e. /i e ɤ ɤ̆ iə/ (“i”, “y”, “ê”, “ơ”, “â”, “yê”, “ya”), e.g. “(q)uang”, “uyêt”, “uya”, “(q)uyt”, “uơ”, “uân”;
  – M2: The orthography “o” always precedes open or open-mid vowels, i.e. /ɛ a ă/, e.g. “oe”, “oan”, “oăt”.
• For final consonants:
  – F1: The semi-vowel /j/ never follows the front nuclei /i iə e ɛ/, while /w/ never follows the rounded nuclei /u uə o ɔ/;
  – F2: The orthography of the semi-vowel /w/ is “o” if the nucleus is /a/ or /ɛ/, and “u” otherwise, e.g. “ao”, “eo”, “âu”, “iu”;
  – F3: The orthography of the semi-vowel /j/ is “y” if the nucleus is /ă/ or /ɤ̆/, and “i” otherwise, e.g. “ay”, “ây”, “ai”, “ui”;
  – F4: The orthography of the stop /k/ is “ch” if the nucleus is /i/, /e/ or /ɛ̆/, and “c” otherwise, e.g. “ích”, “ếch”, “ách”, “ác”, “ấc”, “iếc”.
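The final-consonant rules F2-F4 can be sketched directly (function names are ours; nuclei are written in X-SAMPA, with `_X` marking short vowels):

```python
# Sketch of the final-orthography rules F2-F4 of Section 2.7.1.
def semivowel_w(nucleus):   # F2: final /w/ is spelled "o" after /a/ or /E/
    return "o" if nucleus in ("a", "E") else "u"

def semivowel_j(nucleus):   # F3: final /j/ is spelled "y" after short /a_X/, /7_X/
    return "y" if nucleus in ("a_X", "7_X") else "i"

def final_k(nucleus):       # F4: final /k/ is spelled "ch" after /i/, /e/, /E_X/
    return "ch" if nucleus in ("i", "e", "E_X") else "c"

print(semivowel_w("a"))   # o   (as in "ao")
print(semivowel_j("a_X")) # y   (as in "ay")
print(final_k("e"))       # ch  (as in "ếch")
```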

2.7.2 Pronounceable rhymes

As presented in Section 2.2, the eight structure-based types of syllables are: (i) nucleus alone, (ii) initial+nucleus, (iii) medial+nucleus, (iv) nucleus+ending, (v) initial+medial+nucleus, (vi) initial+nucleus+ending, (vii) medial+nucleus+ending, and (viii) initial+medial+nucleus+ending. Rhymes thus consist of medial, nucleus and ending, and support four types: (i) nucleus alone, (ii) medial+nucleus, (iii) nucleus+ending, and (iv) medial+nucleus+ending. A syllable may optionally contain an obstruent, nasal, or approximant coda. The structure of rhymes is (w)V(C), where w is the glide [w], V is a vowel or diphthong, and C is a final consonant, which can be one of /p t k ŋ m n/ or a semi-vowel /j w/; each rhyme additionally bears a tone T (1-4, 5a, 5b, 6a, 6b). Table 2.12 presents a phonetic analysis of the pronounceable rhymes in Hanoi Vietnamese. The table was created following the idea of Michaud et al. (2015) for the Phong Nha dialect of Vietnamese, and was developed using our review of Hanoi Vietnamese phonetics and phonology. For ease of representation, the phonetic realizations of /k/ and /ŋ/, i.e. [k̟ k͡p] and [ŋ̟ ŋ͡m], are located in the same lines as /k/ and /ŋ/ respectively. Due to limited space, half of the nuclei (i.e. /i e ɛ ɛ̆ iə ɯ ɤ ɤ̆/) are shown in the first half of the table, while the others (i.e. /u o ɔ ɔ̆ uə ɯə a ă/) appear in the second half. The first column shows the structure of the rhyme, in which “V” denotes a vowel or diphthong. Each row presents the possible rhymes without (left) and with (right) the glide medial [w] for each structure, differentiated by the final consonant (no final, or one of /k ŋ t n p m j w/). For example, the rhyme “oen” [wɛn] appears on the right (because of the medial [w]) of the “ɛ” column in the first half of the table, in the “Vn | wVn” row.
As stated in the orthographic rule M2, the medial orthography “o” always precedes open or open-mid vowels, i.e. /ɛ a ă/, e.g. “oe”, “oac”, “oăn”. However, if the initial consonant is /k/, its orthography must be “q”, and the medial [w] is then written “u” instead of “o”. For instance, from the rhyme “oac”, a syllable with the initial consonant /k/ is “quác” [kwak-5b] (quack), while one with the initial “t” is “toác” [twak-5b] (cleave). The following rhymes exist in Vietnamese: “(q)ue, (q)uet, (q)uen, (q)ueo, (q)ua, (q)uac, (q)uang, (q)uanh, (q)uach, (q)uat, (q)uan, (q)uay, (q)uăc, (q)uăng, (q)uăt, (q)uăn, (q)uăp, (q)uăm, (q)uau, (q)uao, (q)uai”, whereas “(q)uec, (q)ueng, (q)uep, (q)uem, (q)uap, (q)uam” are pronounceable but do not exist in the language. The main difference of this table from previous studies is that it includes some nonexistent yet pronounceable rhymes (marked with a star *). For example, the rhyme “oep” does not appear in any meaningful Vietnamese syllable or word; however, based on the Vietnamese G2P rules of Section 2.5, it can be transcribed to [wɛp]. The reason for keeping these rhymes is that the input of a Vietnamese TTS system may include numerous loanwords containing nonexistent but pronounceable syllables, as well as newly coined words from teenagers or Internet users. For instance, the Vietnamese-style pronunciation of the word “website” may be “goép sai” [ɣwɛp-5b saj-1] or sometimes “oép sai” [wɛp-5b saj-1], depending on the speaker. Neither “goép” nor “oép” exists in Vietnamese. As aforesaid, the list of rhymes was used to generate the syllable dictionary for transcribing Vietnamese text input, which may come from many sources (e.g. the Internet, stories). These rhymes provide the means to transcribe all pronounceable syllables for a TTS system. In Table 2.12, a dash (–) indicates that the combination at issue is not pronounceable in the language.

Table 2.12 – Hanoi Vietnamese pronounceable rhymes; *: nonexistent but pronounceable. The medial orthography “o” (e.g. “oanh” [wɛ̆ŋ̟]) is changed to “u” if the initial is /k/ (whose orthography must then be “q”), e.g. “loanh quanh” [lwɛ̆ŋ̟ kwɛ̆ŋ̟] (to go around).

First half – nuclei /i e ɛ ɛ̆ iə ɯ ɤ ɤ̆/ (each cell: without | with the medial [w]):

Structure   /i/          /e/ (ê)      /ɛ/ (e)       /ɛ̆/          /iə/          /ɯ/ (ư)   /ɤ/ (ơ)    /ɤ̆/ (â)
V | wV      i | uy       ê | uê       e | oe        – | –        ia | uya      ư | –     ơ | uơ     – | –
Vk | wVk    ich | uych   êch | uêch   ec | oec*     ach | oach   iêc | –       ưc | –    – | –      âc | –
VN | wVN    inh | uynh   ênh | uênh   eng | oeng*   anh | oanh   iêng | –      ưng | –   ơng* | –   âng | uâng
Vt | wVt    it | uyt     êt | uêt     et | oet      – | –        iêt | uyêt    ưt | –    ơt | –     ât | uât
Vn | wVn    in | uyn     ên | uên     en | oen      – | –        iên | uyên    ưn | –    ơn | –     ân | uân
Vp | wVp    ip | uyp     êp | uêp*    ep | oep*     – | –        iêp | uyêp*   ưp* | –   ơp | –     âp | uâp*
Vm | wVm    im | uym*    êm | uêm*    em | oem*     – | –        iêm | uyêm*   ưm | –    ơm | –     âm | uâm*
Vj | wVj    – | –        – | –        – | –         – | –        – | –         ưi | –    ơi | –     ây | uây
Vw | wVw    iu | uyu     êu | –       eo | oeo      – | –        iêu | –       ưu | –    – | –      âu | uâu*

Second half – nuclei /u o ɔ ɔ̆ uə ɯə a ă/:

Structure   /u/       /o/ (ô)   /ɔ/ (o)    /ɔ̆/      /uə/ (ua)   /ɯə/ (ưa)   /a/          /ă/
V | wV      u | –     ô | –     o | –      – | –    ua | –      ưa | –      a | oa       – | –
Vk | wVk    uc | –    ôc | –    ooc | –    oc | –   uôc | –     ươc | –     ac | oac     ăc | oăc
VN | wVN    ung | –   ông | –   oong | –   ong | –  uông | –    ương | –    ang | oang   ăng | oăng
Vt | wVt    ut | –    ôt | –    ot | –     – | –    uôt | –     ươt | –     at | oat     ăt | oăt
Vn | wVn    un | –    ôn | –    on | –     – | –    uôn | –     ươn | –     an | oan     ăn | oăn
Vp | wVp    up | –    ôp | –    op | –     – | –    uôp | –     ươp | –     ap | oap     ăp | oăp
Vm | wVm    um | –    ôm | –    om | –     – | –    uôm | –     ươm | –     am | oam     ăm | oăm
Vj | wVj    ui | –    ôi | –    oi | –     – | –    uôi | –     ươi | –     ai | oai     ay | oay
Vw | wVw    – | –     – | –     – | –      – | –    – | –       – | –       ao | oao     au | oau*

2.7.3 PRO-SYLDIC

The total number of rhymes in Table 2.12 is 170, of which 62 end in /p t k/ (obstruent-final rhymes). As aforementioned, obstruent-final syllables can only carry the rising and drop tones (5b, 6b), so the 62 obstruent-final rhymes bear these two tones, making a total of 124 obstruent-final rhymes with tones. The remaining 108 sonorant-final or open rhymes can carry the six tones (1-4, 5a, 6a), making a total of 648 rhymes with tones. The purpose of PRO-SYLDIC (PROnounceable SYLlable DICtionary) was to provide transcriptions for all pronounceable syllables in Vietnamese; hence all 19 initial consonants were combined with the total of 772 tone-bearing rhymes. Many of the resulting combinations are nonexistent syllables, but they remain useful for transcribing loanwords or newly coined syllables. For example, the loanword “boa” [ɓwa] in “tiền boa”, from French “pourboire” (tip), sometimes appears in real input text although it does not exist in the language, since it violates rule I2 (“the labial onsets are never accompanied by a secondary labial articulation [w]”).

As a result, there are in total 21,648 syllables with tones (orthography) in PRO-SYLDIC, including rhymes without initial consonants. Some of the orthographic complexities of the language are also covered. For instance, consider combinations of the initial consonant /k/ with some rhymes. Due to rule I1 (the orthography of /k/ is “c” if the nucleus is not “i”, “e”, “ê”, or “y”), the rhyme “ua” [uə] (second half of the table: the “uə” column, “V | wV” row) combines with “c” to give the syllable “cua” [kuə-1] (crab). Meanwhile, the syllable with the initial /k/ for the rhyme “oa” [wa] (second half of the table: the “a” column, “V | wV” row) is “qua” [kwa-1] (to pass).
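The rhyme-with-tone count can be recomputed directly:

```python
# Recomputing the tone-bearing rhyme inventory of Section 2.7.3.
obstruent_final = 62 * 2    # rhymes ending in /p t k/, tones 5b and 6b only
sonorant_final  = 108 * 6   # open and sonorant-final rhymes, tones 1-4, 5a, 6a
total = obstruent_final + sonorant_final
print(total)  # 772
```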

2.8 Conclusion

This chapter presented our literature review on (i) the syllable structure, (ii) the phonological system, and (iii) the lexical tones of Hanoi Vietnamese, regarded as standard Vietnamese. In the hierarchical structure of the Vietnamese syllable, the lexical tone, a non-linear or suprasegmental part, appears simultaneously with the segmental elements of the rhyme, i.e. medial, nucleus and ending. There are 19 initial consonants and 12 phones in final position. Hanoi Vietnamese distinguishes one medial rounding glide, nine long vowels, four short vowels, and three falling diphthongs. The Vietnamese tone system, which belongs to the pitch-plus-voice-quality type, has (i) a six-tone paradigm for sonorant-final syllables: level tone 1 (A1), falling tone 2 (A2), curve tone 3 (C1), broken tone 4 (C2), rising tone 5a (B1), and drop tone 6a (B2); and (ii) a two-tone paradigm for obstruent-final syllables: rising tone 5b (D1) and drop tone 6b (D2). The broken tone 4 has a medial glottal constriction, while the drop tone 6a falls very sharply and is almost immediately cut off by a strong glottal constriction at the end. The two tones 5b and 6b are not glottalized, either in final or non-final position.

Based on this literature study, several tasks were performed for building our TTS system and designing a new corpus. First, grapheme-to-phoneme rules were developed for transcribing Vietnamese consonants and vowels/diphthongs. Many graphemes can be directly converted to phones without ambiguity, such as “b-” to [ɓ], “ch-, tr-” to [tɕ], “-m” to [m], or “ê” to [e]. Well-defined rules were established for the more complicated cases and variants. For instance, for the grapheme “a”: if followed by “nh” or “ch”, the phoneme is [ɛ̆]; if followed by “u” or “y”, the phoneme is [ă]; otherwise it is [a]. The full G2P rules were used both for transcribing the raw text in corpus design and for building the G2P conversion module of our TTS system.
Second, due to the great importance of lexical tones, a “tonophone” – an allophone in tonal context – was proposed as a new speech unit for our work. To build the tonophone set of the system, the lexical tone was attached to all allophones of the rhyme, while the initial consonant kept its form without any tone information. As a result, a tonophone set of 207 tonophones was constructed from 48 Vietnamese allophones. This unit set comprises: (i) 19 initial consonants without tone information, (ii) the medial and 16 nuclei adhering to all eight tones, (iii) unreleased final stops adhering to the two tones 5b and 6b, and (iv) the other final consonants adhering to the six tones 1-4, 5a and 6a. An acoustic-phonetic tonophone set of Vietnamese was also built for (i) HMM clustering using phonetic decision trees, and (ii) automatic labeling, i.e. automatically segmenting and force-aligning the speech corpus with the orthographic transcriptions. Based on the literature review, the main phonetic attributes were specified for both consonants and vowels in this acoustic-phonetic unit set, such as place and manner of articulation for consonants, or tongue position and height for vowels. Finally, PRO-SYLDIC, a Vietnamese syllable transcription e-dictionary, was constructed for filtering pronounceable syllables in text normalization as well as for transcribing texts. Pairs of syllable orthography and transcription in the dictionary were automatically generated, mainly based on (i) the G2P rules, (ii) the syllable-orthographic rules, and (iii) the list of rhymes. A table of 170 Vietnamese rhymes was designed, covering not only the existent rhymes but all pronounceable ones, since the input of a Vietnamese TTS system may include numerous loanwords with nonexistent but pronounceable syllables, as well as newly coined words from teenagers or Internet users. PRO-SYLDIC was then constructed by combining the 19 initial consonants with the 772 tone-bearing rhymes, yielding a total of 21,648 pronounceable syllables (orthography).

Chapter 3

Corpus design, recording and pre-processing

Contents
3.1 Introduction
3.2 Raw text
    3.2.1 Rich and balanced corpus
    3.2.2 Raw text from different sources
3.3 Text pre-processing
    3.3.1 Main tasks
    3.3.2 Sentence segmentation
    3.3.3 Tokenization into syllables and NSWs
    3.3.4 Text cleaning
    3.3.5 Text normalization
    3.3.6 Text transcription
3.4 Phonemic distribution
    3.4.1 Di-tonophone
    3.4.2 Theoretical speech unit sets
    3.4.3 Real speech unit sets
    3.4.4 Distribution of speech units
3.5 Corpus design
    3.5.1 Design process
    3.5.2 The constraint of size
    3.5.3 Full coverage of syllables and di-tonophones
    3.5.4 VDTS corpus
3.6 Corpus recording
    3.6.1 Recording environment
    3.6.2 Quality control
3.7 Corpus preprocessing
    3.7.1 Normalizing margin pauses
    3.7.2 Automatic labeling
    3.7.3 The VDTS speech corpus
3.8 Conclusion

3.1 Introduction

Over the last two decades, with the rise of corpus-based speech synthesis (e.g. unit selection and HMM-based speech synthesis), speech databases have greatly contributed to the quality of synthetic voices. This makes it necessary to provide a proper speech corpus for TTS system training and testing. Several works on Vietnamese speech corpora have been presented, but mainly for speech recognition (Le et al., 2004, 2005; Vu and Schultz, 2009, 2010; Vu et al., 2005). These works did not focus on designing the text corpus, but rather on collecting and recording the speech corpus, on selecting speakers, or on automatic alignment. The VNSP corpus (short for "VNSpeechCorpus for synthesis") of the unit selection TTS system of Tran (2007b) was collected from various Internet resources (e.g. stories, books, web documents) and manually chosen by experts. It included various types of data: words with the six lexical tones, figures and numbers, dialog sentences, and short paragraphs. It comprised about 630 sentences in 37 minutes, recorded by a TV broadcaster from Hanoi. Vu et al. (2009) reported a 3,000-sentence training corpus constructed from a set of phonetically rich sentences for spoken Vietnamese. However, to the best of our knowledge, there is no thorough work on the analysis and design of a text corpus for Vietnamese TTS, especially for the HMM-based approach.

A number of papers have targeted text corpus design for speech processing in various languages. Many researchers proposed methods to design a phonetically balanced corpus, such as Uraga and Gamboa (2004) for Mexican Spanish, Oh et al. (2011) for Korean (speech coding), or Abushariah et al. (2012) for Arabic. Some used a phonotactic approach to design a text corpus with full coverage of phonemes and allophones in every possible context (Uraga and Gamboa, 2004), or even used an enormous phonetically rich and balanced source from the Web (Villaseñor-Pineda et al., 2004).
Due to the special requirements of speech coding, Oh et al. (2011) proposed a method based on a similarity measure, which calculated how close the phoneme distribution of natural conversation was to that of the designed text corpus. Most of these works adopted the greedy search algorithm to select the best candidate sentences. Others considered the design of speech databases for TTS systems as a set-covering problem (Chevelu et al., 2008; Francois and Boëffard, 2001). It seems to us that the greedy algorithm is robust and reliable enough to design a corpus for a general TTS system with unlimited vocabulary, which was our initial motivation. Other tonal languages, such as Mandarin Chinese, have also been investigated with respect to corpus design for data-driven TTS systems (Chou et al., 2002; Tao et al., 2008; Zhu et al., 2002). The corpus of Tao et al. (2008) was delivered to the Blizzard Challenge 2008 as the common corpus for the Mandarin speech synthesis evaluation among all participants. 5,000 phonetic-context-balanced sentences were finally chosen from raw text by automatic prompt selection using the greedy search algorithm and several criteria. Syllables were considered as the speech unit in that design, with 12 factors for the prompt selection, including the previous, current, and next lexical tones.

This chapter describes our proposal for corpus design for Vietnamese TTS systems, including the HMM-based approach. The initial motivation was to design a phonologically rich and balanced corpus in both phonemic and tonal contexts. However, manually building such a corpus would require great human effort for the selection task. Section 3.2 describes a huge raw text, which was crawled from the Web and other sources. Section 3.3 presents the processing of the raw text (e.g. text cleaning, text normalization, text transcription) to prepare it for the subsequent tasks.

Chapter 3. Corpus design, recording and pre-processing

As aforesaid, since Vietnamese is a tonal language, speech units in this work were considered not only in phonemic context but also in tonal context. Section 3.4 provides an analysis of two new speech units: (i) a "tonophone": an allophone adhering to a lexical tone when possible (cf. Section 2.6 in Chapter 2), and (ii) a "di-tonophone": an adjacent pair of tonophones. Our work aims to design a phonetically rich and balanced corpus in terms of tonophones and di-tonophones. To provide a reference for the design process, the phonetic distributions of the raw text are described in this section. Section 3.5 presents the design process and the results for several corpora, using the greedy algorithm and all the information from the above preparations. The recording environment and quality control of the resulting corpora are described in Section 3.6. Some treatments of the speech corpus for TTS systems (e.g. automatic labeling, correcting breath noises) are finally given in Section 3.7.

3.2 Raw text

3.2.1 Rich and balanced corpus

In a TTS system, the speech corpus plays an important role in generating good acoustic models, and hence in producing a high-quality synthesizer. In some systems, such as concatenative ones, if essential acoustic units needed to synthesize a specific sentence are missing, the quality of the synthetic speech is degraded. Although HMM-based speech synthesis is more robust when the phonetic balance of the text is not ideal, a poor training corpus may still cause bad or even unintelligible synthetic speech. As a result, the speech corpus for a TTS system, especially one with unlimited vocabulary, must be phonetically rich and balanced (Villaseñor-Pineda et al., 2004; Abushariah et al., 2012).

Regarding the first criterion, a speech corpus can be considered phonetically rich if it contains all the phones and has a good coverage of other speech units (e.g. di-phones, triphones) of the language. In other words, it should provide a good coverage of one or several speech units. Full coverage of phones ensures their availability for the synthesis process, which helps a TTS system produce intelligible synthetic speech. A speech corpus with good coverage of bigger speech units, e.g. di-phones or triphones, provides more suitable templates in different contexts. This improves the quality of the synthetic speech: more intelligible, more natural, etc.

A phonetically balanced corpus maintains the phonetic distribution of the language. In other words, if phone a has a higher frequency than phone b in the language, it should appear more often than phone b in the speech corpus. With such a corpus, a more common sentence will be synthesized with better quality than a less common one. As a result, the quality of the TTS system is improved, at least for common cases, in spite of the finite size of the corpus.

3.2.2 Raw text from different sources

Manually building a phonetically rich and balanced corpus would require great human effort for the selection task. In this work, a big raw text was instead crawled from the Web and other sources. This resource could be considered a phonetically rich and balanced source, and might represent the Vietnamese language in terms of phonetic distribution due to its variety and large size. A variety of sentence types, modes, lengths, etc. may introduce a vast range of contexts into the corpus, hence improving the quality of TTS systems. Five major sources were therefore used to build the big raw text: (i) e-newspapers,


(ii) e-stories, (iii) existing sources, (iv) the Vietnamese e-dictionary (VCL), and (v) special design.

E-newspapers. The two main sources of e-newspapers were: (i) the project for the blind "Tâm hồn Việt Nam" (Vietnamese souls), and (ii) the training corpus of a Vietnamese-French statistical machine translation system (Do et al., 2009). One of the most important tasks in the "Vietnamese souls" project was to automatically crawl different e-newspapers into a unique source 1 for the Vietnamese blind. The final resource for our work from this project was extracted from different topics of some well-known e-newspapers (such as http://dantri.com.vn, http://vnexpress.net, http://vietnamnet.vn). There was a total of 3,795 articles with 132,514 sentences. Do et al. (2009) proposed a document alignment method for mining a comparable Vietnamese-French corpus. The first result contained about 12,100 parallel document pairs and 50,300 parallel sentence pairs. In the end, we obtained about 142,305 Vietnamese sentences from the final bilingual corpus.

E-stories and existing sources. Seven e-stories collected from web pages provided about 13,856 sentences in paragraphs and 20,523 sentences in dialogs. Existing resources included the 630 sentences of the VNSP corpus (i.e. VNSpeechCorpus for synthesis, the old one), 10,368 sentences of the VietTreebank, and 5,000 sentences from the Vietnamese Wikipedia.

The VCL dictionary and special design. The VCL dictionary is a Vietnamese e-dictionary for natural language processing from the VLSP project, hosted by the Vietnam Lexicography Centre (Vietlex). It comprises about 35,000 Vietnamese words with their definitions and examples, in an XML-based structure for ease of use. The examples in VCL are (usually short) sentences containing the target words. Some words either had examples that were not complete sentences (i.e. words or phrases) or lacked examples; special design was done for these cases.

3.3 Text pre-processing

3.3.1 Main tasks

As mentioned in Section 3.2, we design a synthesis corpus by selecting the richest and most balanced sentences from a raw text, which may represent the language in terms of phonetic distribution. This resource was huge and collected from various sources, hence it needed pre-processing to make it suitable for the design process. Figure 3.1 illustrates the procedure of raw text pre-processing, including five main tasks: (i) sentence segmentation, (ii) tokenization, (iii) text cleaning, (iv) text normalization, and (v) text transcription. First, texts were segmented into sentences for further treatment. These sentences were tokenized into syllables or Non-Standard Words (NSWs, which cannot be directly transcribed into phonemes, e.g. numbers, dates, abbreviations, currency). Each sentence was then examined, and "unclean" or "too long" sentences were removed. The first three tasks of the text pre-processing were performed during text collection, for ease of storage and management. The next task was text normalization, in which NSWs were processed and expanded into speakable syllables. Finally, the normalized text was transcribed into tonophones to provide a suitable input for the next steps. The details of these tasks are described in the following subsections.

1. http://tamhonvietnam.net
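The five tasks above can be summarized as a simple pipeline. The sketch below is illustrative only: the stage functions are placeholders for the real components described in Sections 3.3.2 to 3.3.6, and the `clean`, `normalize`, and `transcribe` callables are hypothetical stand-ins injected by the caller.

```python
import re

def segment(text):
    """Naive sentence segmentation on terminal punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Split a sentence into syllable/NSW tokens on whitespace
    (Vietnamese text is a sequence of space-separated syllables)."""
    return sentence.split()

def preprocess(raw_text, clean, normalize, transcribe):
    """Run the five pre-processing tasks in order; clean/normalize/transcribe
    are injected so the sketch stays independent of any particular tool."""
    records = []
    for sent in segment(raw_text):
        tokens = tokenize(sent)
        if not clean(tokens):            # drop "unclean" or too-long sentences
            continue
        normalized = [normalize(t) for t in tokens]
        records.append((sent, normalized, transcribe(normalized)))
    return records
```

A caller would supply, for example, a length-based `clean` predicate and a dictionary-based `transcribe` function in place of the placeholders.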


Figure 3.1 – Main tasks in raw text pre-processing (Start → Sentence segmentation → Tokenization → Text cleaning → Text normalization → Text transcription → Stop).

3.3.2 Sentence segmentation

Text from these resources needed to be segmented into sentences for ease of management. Regular expressions were mainly used for segmenting sentences. Some ambiguous cases were treated separately with particular strategies or heuristics; some had to be corrected manually. Each sentence was then split into tokens (syllables, abbreviations, etc.) by spaces, punctuation marks, etc. Each sentence was assigned a unique code consisting of two parts: (i) four letters indicating the source: NEWS (e-newspapers), STOR (e-stories), VCLD (VCL dictionary), VLSP (VNSpeech corpus for synthesis), WIKI (Vietnamese wiki), SPEC (special design); and (ii) six digits indicating its position in the source, starting from 1: 000001, 000002, etc. The number of sentences in each source of the raw text presented in Section 3.2 was calculated after this task. The raw text totaled 349,095 sentences from the five major sources.
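The two-part sentence code described above can be generated with a one-line formatting rule; this is a sketch of the scheme, not the actual tool used in the thesis.

```python
def sentence_code(source, position):
    """Build the unique sentence code: a four-letter source tag
    (NEWS, STOR, VCLD, VLSP, WIKI, SPEC) followed by a
    six-digit, zero-padded, 1-based position within that source."""
    return f"{source}{position:06d}"
```

For example, the first e-newspaper sentence receives the code NEWS000001.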

3.3.3 Tokenization into syllables and NSWs

As aforementioned, Vietnamese is an isolating language, in which syllable and morpheme boundaries coincide: each morpheme is a single syllable. Each syllable usually has an independent meaning in isolation, and polysyllables can be analyzed as combinations of monosyllables (Doan, 1977). Hence, a syllable in Vietnamese is not only a phonetic unit but also a grammatical unit (Doan, 1999b). Besides, Vietnamese text is actually a sequence of syllables separated by spaces. As a result, each sentence had to be tokenized into syllables


and NSWs. Spaces and punctuation marks were used as the delimiters for this task.

3.3.4 Text cleaning

Since the raw text was collected from various sources, mainly from the Web, it contained "unclean" or unsuitable sentences that could not be used in corpus design. Sentences with more than 70 syllables were considered "very long" and were removed from the raw text. Sentences containing unreadable (e.g. control) symbols or wrongly encoded characters were also removed. After text cleaning, the raw text bank of 349,095 sentences was reduced to 323,934 "clean" sentences; in other words, about 7% were unsuitable and removed.

Table 3.1 – The final raw data for Vietnamese corpus design

Source          | Sentence # in paragraphs | Sentence # in dialogs | Syllable #  | Mean length (syllables/sentence)
E-newspapers    | 255,145                  | 0                     | 9,432,669   | 37.0
E-stories       | 13,601                   | 20,402                | 450,655     | 13.3
VCL dictionary  | 15,600                   | 3,900                 | 155,274     | 8.0
VNSP            | 433                      | 197                   | 8,930       | 14.1
VietTreebank    | 7,723                    | 1,836                 | 216,725     | 22.7
Wiki            | 4,146                    | 793                   | 109,373     | 22.1
Special design  | 78                       | 117                   | 2,905       | 14.9
TOTAL           | 296,704                  | 27,230                | 10,377,903  | 32.0

Table 3.1 shows information on the various sources in the final raw text. E-newspapers occupied the highest proportion (about 91%) of the raw text. Their mean sentence length (in syllables) was also the largest, about 37 syllables per sentence, and this source contained only sentences in paragraphs, not in dialogs. Example sentences (for Vietnamese words) in the VCL dictionary were generally short, averaging 8 syllables per sentence. E-stories included about 1.5 times more dialogs than paragraphs, hence about 13.3 syllables per sentence on average (dialogs generally consist of short sentences). The other resources ranged from 14 to 23 syllables per sentence and contained more paragraphs than dialogs. The minimum sentence length of the raw text was 1 syllable, and the average sentence length was 32 syllables. In total, the raw text contained more than 10 million syllables.
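The two filtering criteria of Section 3.3.4 (dropping sentences longer than 70 syllables and sentences containing unreadable control symbols) can be sketched as a simple predicate; the exact control-character range is an assumption of this sketch.

```python
import re

MAX_SYLLABLES = 70  # sentences longer than this were considered "very long"

# ASCII control characters except tab, newline, and carriage return (assumed range).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def is_clean(sentence):
    """Keep a sentence only if it is short enough and free of control characters."""
    if len(sentence.split()) > MAX_SYLLABLES:
        return False
    if CONTROL_CHARS.search(sentence):
        return False
    return True
```

Applying this predicate to the 349,095 collected sentences would leave only the "clean" ones for the design step.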

3.3.5 Text normalization

Real Vietnamese text includes Non-Standard Words (NSWs), which cannot be directly transcribed into phonemes: their pronunciation cannot be found by applying "letter-to-sound" rules. Such NSWs include numbers; digit sequences (such as telephone numbers, dates, times, and codes); abbreviations (e.g. "ThS" for "Thạc sĩ"); words, acronyms and letter sequences in all capitals (e.g. "GDP"); foreign proper names and place names (such as "New York"); Roman numerals; and URLs and email addresses. The normalization of such words, called text normalization, is the process of generating normalized orthography from text containing NSWs. The text normalization process adopted the main idea from our previous work (Nguyen et al., 2010), which normalized NSWs to an appropriate form so that they became speakable. However, due to the vast and varied sources, only basic processing tasks were performed on the raw text.


These NSWs were first identified by filtering tokens using the PRO-SYLDIC dictionary (cf. Section 2.6 in Chapter 2). The candidates were then classified into their corresponding categories using regular expressions, and expanded to full text according to their categories. Numbers, dates, times, currencies, measures, etc. were expanded by well-defined rules. For example, a date "13/04/1994" in Vietnamese (format "dd/mm/yyyy") was expanded to "ngày mười ba tháng tư năm một nghìn chín trăm chín mươi tư" (day thirteen, month four, year one thousand nine hundred ninety-four) by the following rules:
• A date starts with the word "ngày" (day);
• The day is expanded to a normal number; if the day is smaller than 11, it is preceded by "mùng";
• The month is expanded to a normal number, except that "4" or "04" is expanded to "tư" (fourth);
• The year is expanded to a normal number.
Abbreviations were expanded by looking them up in an abbreviation dictionary (435 entries); for example, "ĐHBKHN" was expanded to "Đại học Bách Khoa Hà Nội" (Hanoi University of Science and Technology), and "CLB" to "câu lạc bộ" (club). We also built a loanword dictionary (2,821 entries), which comprised pairs of loanwords and the corresponding Vietnamese words, e.g. "London" as "Luân-đôn" and "Ronaldo" as "Rô-nan-đô". The remaining cases, whose full text could not be found by any explicit expansion rule or in any dictionary, were expanded to a list of words, one per letter or character (i.e. a character sequence); e.g. "WTO" was expanded to "vê-kép tê ô" [ve-1 kep-5b te-1 o-1], and "NT320" to "nờ tê ba hai không" [n7-2 te-1 áa-1 haj-1 xoNm-1].
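The date-expansion rules above can be sketched as follows. This is a minimal illustration of the stated rules only: `num_to_words` is a deliberately small Vietnamese number expander covering the ranges needed by the example (a full expander would also handle cases such as zero tens after hundreds), and the "tư"/"lăm"/"mốt" unit variants are common conventions assumed here, not an exhaustive treatment.

```python
DIGITS = {0: "không", 1: "một", 2: "hai", 3: "ba", 4: "bốn", 5: "năm",
          6: "sáu", 7: "bảy", 8: "tám", 9: "chín"}

def num_to_words(n):
    """Tiny Vietnamese number expander (illustrative, up to 9999)."""
    parts = []
    if n >= 1000:
        parts.append(DIGITS[n // 1000] + " nghìn")
        n %= 1000
    if n >= 100:
        parts.append(DIGITS[n // 100] + " trăm")
        n %= 100
    tens, units = divmod(n, 10)
    if tens:
        parts.append("mười" if tens == 1 else DIGITS[tens] + " mươi")
    if units:
        if tens > 1 and units == 4:
            parts.append("tư")      # e.g. "chín mươi tư" (94)
        elif tens >= 1 and units == 5:
            parts.append("lăm")     # e.g. "mười lăm" (15)
        elif tens > 1 and units == 1:
            parts.append("mốt")     # e.g. "hai mươi mốt" (21)
        else:
            parts.append(DIGITS[units])
    return " ".join(parts)

def expand_date(date_str):
    """Expand a 'dd/mm/yyyy' date by the four rules in Section 3.3.5."""
    day, month, year = (int(x) for x in date_str.split("/"))
    words = ["ngày"]
    if day < 11:
        words.append("mùng")        # days 1-10 are preceded by "mùng"
    words.append(num_to_words(day))
    words.append("tháng")
    words.append("tư" if month == 4 else num_to_words(month))
    words.extend(["năm", num_to_words(year)])
    return " ".join(words)
```

Running `expand_date("13/04/1994")` reproduces the worked example from the text.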

3.3.6 Text transcription

After the raw text was cleaned and normalized, the syllables in each sentence were transcribed using tonophones as the speech unit. As aforesaid in Section 2.7 in Chapter 2, the PRO-SYLDIC dictionary was constructed to cover all Vietnamese pronounceable syllables with tones. Its entries include both the orthographic form and the transcribed form using tonophones. Hence, the PRO-SYLDIC dictionary was used for this text transcription task. The dictionary covers not only meaningful syllables in Vietnamese, such as "hoa" (flower), "cua" (crab), and "qua" (pass away), but also nonexistent yet pronounceable syllables, such as the loanword "boa" in "tiền boa", from French "pourboire" (tip), or the newly appeared orthographic syllable "iêu" instead of "yêu" (to love).

Since the sources of the raw text were mostly pulled from the Web, they came in different representations. For ease of use with a common format, each source was finally stored in a text file with the following structure: (i) each sentence on one line; (ii) each line with four columns separated by the special symbol "~": sentence code ~ original text ~ normalized text ~ transcribed text.
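A record in this storage format can be read back with a simple split on the "~" separator. This sketch assumes "~" never occurs inside the text fields, as the format description implies; the example values in the test are invented for illustration.

```python
def parse_record(line):
    """Parse one stored sentence record of the form
    'sentence code ~ original text ~ normalized text ~ transcribed text'."""
    code, original, normalized, transcribed = (f.strip() for f in line.split("~"))
    return {"code": code, "original": original,
            "normalized": normalized, "transcribed": transcribed}
```

Iterating `parse_record` over the lines of a source file recovers one dictionary per sentence.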

3.4 Phonemic distribution

As presented above, the raw text could be considered a phonetically rich and balanced source, and might represent the Vietnamese language in terms of phonetic distribution due to its variety and large size. This section presents a new and important speech unit in our work, the di-tonophone: an adjacent pair of tonophones. "Theoretical" speech units, constructed from MEA-SYLDIC (a dictionary of meaningful syllables), are also presented, while "real" speech units were extracted from the raw text.

3.4.1 Di-tonophone

In continuous speech, units are produced together with, and affected by, the preceding and succeeding ones. To cover the transition between two phones, di-phones are usually used in speech synthesis. With di-phones as a base speech unit, the pronunciation of each phone varies according to the surrounding phones. As a result, di-phones are usually analyzed and play an important role in corpus design.

As presented in Section 2.6 in Chapter 2, due to the importance of Vietnamese lexical tones, a new speech unit concerning the lexical tone of the bearing syllable, the "tonophone", was proposed as an allophone regarding that tone. To build tonophones, all allophones in rhymes were adhered to the lexical tone, while the initial consonant did not carry any tone information. Tonophones were used to emphasize the role of lexical tones and to reflect the corresponding allophones in tonal contexts. The tonophone set of the Vietnamese language was constructed with 207 units (cf. Section 2.6 in Chapter 2).

In the corpus design, we use the "di-tonophone", defined as an adjacent pair of tonophones, as a basic speech unit. Both phonemic and tonal contexts can thus be "modeled" in di-tonophones. For instance, in the sentence "Trời đẹp quá!" (What a beautiful day!), assuming that there are two empty phones (#) at the beginning and at the end of the sentence, the following di-tonophones are found: [#-tC], [tC-72], [72-j2], [j2-â], [â-E6b], [E6b-p6b], [p6b-k], [k-w5a], [w5a-a5a], and [a5a-#].
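Extracting di-tonophones from a transcribed sentence reduces to pairing adjacent tonophones after padding the utterance with the empty phone. In this sketch, tonophones are plain string labels (ASCII approximations of the notation used in the text).

```python
def di_tonophones(tonophones):
    """Return the adjacent tonophone pairs of an utterance, with the
    empty phone '#' padded at both utterance boundaries."""
    seq = ["#"] + list(tonophones) + ["#"]
    return [f"{a}-{b}" for a, b in zip(seq, seq[1:])]
```

Applied to the three tonophones of the first syllable of the example sentence, it yields the boundary pair, the two internal transitions, and the closing pair.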

3.4.2 Theoretical speech unit sets

As presented in Section 2.7 in Chapter 2, the PRO-SYLDIC dictionary includes all Vietnamese pronounceable syllables with tones, although many of them do not exist in the language. Such entries are useful for a number of loanwords or newly appeared syllables. We assumed that we could build the theoretical di-tonophone set for our work from a dictionary that includes only meaningful (i.e. existent) Vietnamese syllables.

MEA-SYLDIC – a MEAningful SYLlable DICtionary. A preliminary analysis was done on the VCL dictionary (cf. Section 3.2) and the raw text. A total of 7,043 meaningful distinct orthographic syllables (and 5,792 distinct transcriptions) were found in the VCL dictionary. The other sources of the raw text provided more loanwords, so a total of 7,355 meaningful distinct orthographic syllables were collected for a new dictionary, MEA-SYLDIC. The entries of this dictionary included meaningful,


hence existent, Vietnamese syllables in pairs: 7,355 distinct orthographies corresponding to 6,074 distinct transcriptions. This dictionary was used for building the speech unit sets (e.g. the tonophone and di-tonophone sets) and their distributions, which can be considered a reference for the corpus design.

Theoretical unit sets. The theoretical di-phone/di-tonophone sets in this work were built from the MEA-SYLDIC dictionary. Each syllable in the dictionary was combined in pairs with the others, e.g. [a1 a2 a3]–[b1 b2], to generate the theoretical speech units. Before the first phone of the left syllable [a1 a2 a3] and after the last phone of the right syllable [b1 b2], we considered an empty phone [#] starting or ending an utterance. For the syllable pair [a1 a2 a3]–[b1 b2], the di-phones were: [#-a1], [a1-a2], [a2-a3], [a3-b1], [b1-b2], [b2-#]. For instance, for the syllable pair "gần – quên" (nearly – forget):
• the di-phones were [#-G], [G-˘7], [˘7-n], [n-k], [k-w], [w-e], [e-n], [n-#]
• the di-tonophones were [#-G], [G-˘72], [˘72-n2], [n2-k], [k-w1], [w1-e1], [e1-n1], [n1-#]
Following this method of building di-phones/di-tonophones from the dictionary, 1,139 theoretical di-phones and 18,507 theoretical di-tonophones were extracted.
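The pairwise generation of theoretical units can be sketched as follows, with each syllable represented as a tuple of tonophone labels and the empty phone added at both ends of each two-syllable utterance. This is an illustrative sketch; the real input would be the 6,074 MEA-SYLDIC transcriptions.

```python
from itertools import product

def theoretical_units(syllables):
    """Build the theoretical di-tonophone set from all ordered syllable
    pairs, padding '#' before the first and after the last phone."""
    units = set()
    for left, right in product(syllables, repeat=2):
        seq = ["#"] + list(left) + list(right) + ["#"]
        units.update(f"{a}-{b}" for a, b in zip(seq, seq[1:]))
    return units
```

With a single syllable ("a", "b") as input, the only pair is the syllable with itself, yielding the boundary units, the internal transition, and the cross-boundary transition.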

3.4.3 Real speech unit sets

Since the di-tonophone set was automatically generated from all combinations of syllable pairs in the MEA-SYLDIC dictionary, many theoretical units do not exist in the raw text (i.e. in real Vietnamese text). As presented, we assumed that the raw text could represent the language in terms of phonetic richness and balance. Therefore, in this work, the speech unit sets and the phonemic distribution of the raw text were considered the real ones, and could be used as a reference for the corpus design process.

Table 3.2 presents the unit counts of the different sets in theory (using the MEA-SYLDIC dictionary) and in the raw text (real Vietnamese text). Since a number of syllable combinations (or combinations with the empty phone #) do not exist in the language, the number of theoretical di-tonophones (by dictionary) was nearly twice that found in the raw text. The other speech units in the raw text had the same counts as in the dictionary. The total number of transcribed syllables was 6,074 2.

3.4.4 Distribution of speech units

Based on the speech unit sets built above, we calculated the distribution of units for further tasks. Table 3.3 lists the top 9 frequent (p1-p9) and rare (r9-r1) phones, tonophones, di-phones and di-tonophones in the raw text. In the raw text, [G] was the rarest phone, while the rarest tonophone was [˘O] with the broken tone 4. The phones [kp ˘O E ˘E] were also among the rarest. The broken tone 4, the curve tone 3 and the drop tones (6a, 6b) seemed to be rare, especially in combination with [u@ ˘O e]. The phone [a] was the most frequent, and [j] the fifth most common in the raw text. Disregarding the lexical tones, the phones [k n t] (and also [m N tC]), which can be both initial and final consonants, were also the most popular. As for tonophones, since the lexical tones were adhered only to the rhymes, the initial consonants [k tC â t] and some other initials (e.g. [v h z th m l á s n]) were the most common in the raw text. The vowel [a] with the level tone was the fifth most frequent. A more detailed distribution showed that the level tone 1 and the falling tone seemed

2. The number of orthographic syllables was 7,355.


Table 3.2 – Number of speech units in theory and in the raw text

#  | Factor                            | Dictionary (theory) | Raw text (real)
1  | Number of sentences               | -                   | 323,934
2  | Number of distinct phones         | 48                  | 48
3  | Number of distinct tonophones     | 207                 | 207
4  | Number of phones                  | -                   | 28,329,368
5  | Number of distinct initials/rhymes| 674                 | 674
6  | Number of initials/rhymes         | -                   | 20,400,713
7  | Number of distinct di-phones      | 1,139               | 1,139
8  | Number of di-phones               | -                   | 28,653,194
9  | Number of distinct di-tonophones  | 18,507              | 10,339
10 | Number of distinct syllables      | 6,074               | 6,074
11 | Number of syllables               | -                   | 10,377,903

Table 3.3 – Distribution of top 9 frequent (p1-p9) and rare (r9-r1) speech units of the raw text

#  | Phone | Freq.     | Tonophone | Freq.     | Di-phone | Freq.   | Di-tonophone | Freq.
p1 | a     | 2,367,636 | k         | 1,120,216 | a-j      | 406,534 | o1-Nm1       | 210,693
p2 | k     | 1,831,035 | tC        | 978,603   | o-Nm     | 363,465 | ă1-m1        | 196,935
p3 | n     | 1,828,931 | â         | 888,552   | i@-n     | 339,192 | v-a2         | 187,790
p4 | t     | 1,396,160 | t         | 795,718   | a-n      | 318,623 | a5b-k5b      | 167,220
p5 | j     | 1,311,072 | a1        | 714,008   | i-J      | 291,065 | a1-j1        | 157,150
p6 | m     | 1,131,790 | h         | 622,791   | ˘7-n     | 280,005 | k-a5b        | 121,863
p7 | o     | 1,003,581 | th        | 565,268   | a-k      | 267,868 | h-a1         | 121,820
p8 | N     | 980,619   | m         | 551,041   | w-a      | 262,602 | a6a-j6a      | 121,506
p9 | i     | 950,371   | a2        | 544,490   | a-m      | 257,528 | n-ă1         | 115,870
r9 | f     | 292,890   | 74        | 5,305     | 7-˘O     | 3       | j5a-w2       | 1
r8 | z     | 271,039   | ă4        | 4,707     | E-˘O     | 3       | j3-W@1       | 1
r7 | x     | 259,775   | 75b       | 4,566     | u@-u@    | 3       | n6a-˘72      | 1
r6 | p     | 258,420   | E6b       | 3,968     | i@-W     | 2       | j6a-a6a      | 1
r5 | ˘E    | 243,998   | e6b       | 2,315     | i@-O     | 2       | t5b-E2       | 1
r4 | E     | 218,326   | u@6a      | 1,946     | u-˘O     | 2       | kp5b-i2      | 1
r3 | ˘O    | 193,806   | ˘O3       | 1,690     | W@-˘O    | 2       | kp5b-i3      | 1
r2 | kp    | 109,232   | u@4       | 612       | u@-˘O    | 1       | n5a-E6a      | 1
r1 | G     | 76,905    | ˘O4       | 229       | i@-˘O    | 1       | m2-u@3       | 1

to be the most popular, especially combined with popular phones (e.g. [a n j Nm N o i]). The final consonants [p t k] were also common (with the rising tone 5b or the drop tone 6b). The most frequent di-tonophones in the raw text were popular rhymes, such as "ông" [o1-Nm1], "am" [ă1-m1], "ác" [a5b-k5b], and "ai" [a1-j1]. The di-tonophone "và" [v-a2] was the third most common. Even disregarding the lexical tones, the di-phones [a-j] and [o-Nm] were still the most frequent. Some combinations of two nuclei, vowels or diphthongs (especially the rare ones) were rare, such as [i@-˘O], [u@-˘O], and [W@-˘O]. The unusual combinations of lexical


tones provided rare di-tonophones, such as [m2-u@3], [n5a-E6a], [kp5b-i3], etc.

A number of di-tonophones had small frequencies. Table 3.4 shows the numbers of di-phones and di-tonophones with frequencies from one to six. Nearly 1,200 di-tonophones appeared only once, and 615 twice, in the raw text. The numbers of di-tonophones with frequencies from three to six ranged from 381 down to 148. Only two to six di-phones had such small frequencies.

Table 3.4 – Number of di-phones/di-tonophones having small frequencies

Frequency   | Di-phone # | Di-tonophone #
Once        | 2          | 1,199
Twice       | 4          | 615
Three times | 6          | 381
Four times  | 2          | 257
Five times  | 2          | 216
Six times   | 3          | 148

3.5 Corpus design

Due to its simplicity and effectiveness, the greedy algorithm was adopted to search for the best candidate among subsets of the raw data. This section describes the whole corpus design as a number of iterations of a selection process, whose output at each iteration was the best candidate in terms of phonetic richness and balance given the current state of the uncovered units and their distributions. The selection process stopped when an expected constraint was reached. Three corpora were then constructed by our proposed design process with different speech units and targets: (i) SAME: a new corpus with the same size as the old VNSP corpus in terms of syllable number, using di-tonophones as speech units; (ii) VSYL: a new corpus with 100% syllable coverage (i.e. complete syllable coverage); and (iii) VDTS: a new corpus with 100% di-tonophone coverage (i.e. complete di-tonophone coverage). The purpose of the first corpus was to examine the performance of the algorithm by comparing the distributions of different speech units in the old corpus and a new corpus of the same size. The second was designed for non-uniform unit selection, in which syllables can be used as the best speech unit in terms of both quality and system/corpus size. The last was designed for our TTS system, in which tonophones were used as speech units. We believed that with 100% di-tonophone coverage, the transitions between any two tonophones would be completely covered, and hence the quality would be greatly improved.

3.5.1 Design process

The speech unit sets and their distributions in the raw text (cf. Section 3.4) were used in the selection process, as well as for weighting candidate sentences. Figure 3.2 illustrates the process of corpus design, which consists of a number of selection iterations to build a target corpus under an expected constraint.

The input data of the selection process were: (i) e: the expected coverage (e.g. 100% for full coverage) or condition (e.g. a maximum size of the target corpus); (ii) R: the raw text including transcribed sentences; and (iii) U: the uncovered speech unit set with frequencies. The initial value of U was the whole speech unit set, with frequencies, of the raw text as mentioned in


Figure 3.2 – Corpus design: repetitions of selection processes. (Flowchart summary: pre-select candidate sentences containing the rarest/most frequent uncovered unit; compute each candidate's weight and keep the one with the maximum weight; move the chosen sentence to the corpus; remove all units it covers from the uncovered set; repeat while uncovered units remain and the given coverage or condition has not been reached.)


the previous section. The output of each selection process was a chosen sentence, considered the best candidate in terms of phonetic richness and balance. The criteria for the whole design process were to obtain: (i) the highest possible coverage of a given speech unit (e.g. di-phone) with (ii) the smallest possible corpus (i.e. the smallest possible number of chosen sentences). The output of the design was a set of chosen sentences T, the target corpus, with its distribution.

First, from the raw text R, sentences that included the rarest speech unit of the uncovered unit set U were chosen as a set of n candidate sentences C = S1, S2, ..., Sn. We assumed that, in the phonetically rich and balanced raw text, sentences containing rare speech units were likely to also include more common ones, whereas the reverse was much less likely. However, if the target coverage is not 100% (e.g. 70%), the most frequent unit may be chosen instead, to optimize the corpus size. The corpus design may need to be performed twice to determine whether choosing the rarest or the most frequent uncovered speech unit gives the better solution.

The weight of each candidate sentence Si in C was then calculated to choose the most phonetically rich sentence among the n candidates. At the time of selection, the richness of a sentence could be represented by its number of uncovered distinct units. However, longer sentences generally contain more speech units, hence more distinct ones. Therefore, the weight of a sentence was normalized by its total number of distinct units, as shown in Equation 3.1. The sentence with the maximum weight, Sc, was considered the richest one and moved to the target corpus T.

Weight(Si) = Nui / Nai    (3.1)

where
• Si: the sentence i among the n candidate sentences C
• Nui: number of uncovered distinct units appearing in the sentence Si
• Nai: number of all distinct units in the sentence Si

After the best candidate sentence Scj had been chosen, the uncovered speech unit set U (with frequencies) was updated: all distinct speech units in Scj were removed from U. If this uncovered unit set still had elements and the given coverage/condition was not reached, the selection process was repeated to build a new set of candidate sentences C, and so on. The process stopped when U was empty or when the target corpus T, comprising m sentences Sc1, Sc2, ..., Scm, satisfied the given condition or coverage c. If U was empty at the end of the selection, T had full coverage (100%), meaning that it covered the whole speech unit set of the raw text.
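The selection loop described above is in essence a greedy set-cover procedure. The following Python sketch illustrates one plausible implementation for the full-coverage setting with the rarest-unit preselection; the data structures (`sentences` mapping a sentence identifier to its list of units) are hypothetical, and this is not the exact code used in the thesis.

```python
from collections import Counter

def design_corpus(sentences, coverage=1.0):
    """Greedy corpus selection: repeatedly pick the sentence that is
    richest in still-uncovered units (Equation 3.1) until the target
    coverage of the raw text's distinct-unit set is reached."""
    # Frequency of every distinct unit in the raw text.
    unit_freq = Counter(u for units in sentences.values() for u in units)
    uncovered = set(unit_freq)          # U: units not yet in the corpus
    total_units = len(unit_freq)
    corpus = []                         # T: chosen sentences

    while uncovered and (total_units - len(uncovered)) / total_units < coverage:
        # Preselection: candidates containing the rarest uncovered unit.
        rarest = min(uncovered, key=lambda u: unit_freq[u])
        candidates = [s for s, units in sentences.items()
                      if rarest in units and s not in corpus]

        # Weight(Si) = Nui / Nai  (Equation 3.1)
        def weight(s):
            distinct = set(sentences[s])
            return len(distinct & uncovered) / len(distinct)

        best = max(candidates, key=weight)
        corpus.append(best)
        uncovered -= set(sentences[best])   # update U
    return corpus
```

With `coverage=0.7`, the loop stops once 70% of the distinct units of the raw text are covered, mirroring the non-full-coverage variant described above.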

3.5.2 The constraint of size

To examine the performance of the proposed design process, we carried out a design that considered di-tonophones as speech units, with a constraint on the target corpus size. The output was a new text corpus with a size similar (i.e. 24,164 phones) to the old one, VNSP (630 sentences, 8,930 syllables). Since the bounded size of the target corpus was small and the number of di-tonophones appearing only once in the raw data was considerable (1,199), we assumed that in the selection process, sentences containing the most frequent speech unit should be chosen as candidates for weight calculation. If the rarest speech unit were considered first, sentences including those single-occurrence di-tonophones would


Table 3.5 – New corpora designed with the same size as the old one, VNSP. SAME: candidate sentences containing the most frequent uncovered unit; SAME-B: candidate sentences containing the rarest one.

#   Factor                              VNSP (old)   SAME (new)   SAME-B (new)
1   Number of sentences                 630          983          334
2   Mean length (syllables/sentence)    17.1         9.6          27.1
3   Coverage of phones                  100.0%       100.0%       100.0%
4   Coverage of tonophones              95.1%        100.0%       100.0%
5   Number of phones                    24,164       24,117       24,021
6   Coverage of initials/rhymes         65.3%        84.0%        73.4%
7   Number of initials/rhymes           17,504       17,778       17,433
8   Coverage of di-phones               74.5%        91.7%        88.5%
9   Coverage of di-tonophones           29.6%        52.4%        37.2%
10  Number of di-phones                 24,686       25,100       24,355
11  Coverage of syllables               24.8%        44.6%        33.0%
12  Number of syllables                 8,930        9,478        9,048

have been the only candidates and hence would have been chosen. This would have reduced the coverage of the target corpus. To confirm this assumption, we ran the corpus design twice with two different preselection conditions for candidate sentences: (i) the rarest speech unit, and (ii) the most frequent one. Table 3.5 provides the total numbers and coverages of the different speech units for the two new corpora, “SAME” and “SAME-B”, and the old one, “VNSP”. The selection process was iterated as long as the target corpus (“SAME” or “SAME-B”) did not exceed the size of “VNSP” in terms of phones (i.e. 24,164 phones). Since one sentence was chosen in each iteration, the phone counts of the two new corpora were slightly smaller than that of the old one (by 0.2-0.6%), while the syllable counts were somewhat larger (by 1.3-6.1%). The coverage of the new corpus “SAME” was much higher than that of the old one: there was little or no difference in phone or tonophone coverage, but wide gaps (about 17-22%) for the other unit coverages. The di-tonophone coverage of the new corpus reached 52.4%, while that of the old one was only 29.6%. The coverage of the “SAME-B” corpus was only moderately higher than the old one's (by about 7-14%).

3.5.3 Full coverage of syllables and di-tonophones

The corpus of a high-quality TTS system should have good coverage of speech units. With unlimited vocabulary, some TTS systems even require a corpus with full coverage (100%) of the target speech unit. For instance, in a non-uniform unit-selection TTS system for Vietnamese, HoaSung (Do et al., 2011), the corpus was small (VNSP, 630 sentences), so the half-syllable corpus of Tran (2007b) was used when syllable or supra-syllable units were missing during unit search. However, the use of half-syllable units sometimes degraded the quality of the synthetic speech at discontinuity points. Moreover, even with the half-syllable corpus, many instances of that unit were still missing, leading to a failure of the synthesis process or to unintelligible speech. As a result, there was a need to design a corpus having a complete


syllable coverage (i.e. 100% syllable coverage) to ensure the stability and quality of the synthetic voice. For VTED, a Vietnamese HMM-based TTS system (Nguyen et al., 2013b, 2014a,b), tonophones were used as the speech unit for training and synthesis. This system needed a corpus with good coverage in both phonemic and tonal contexts. To record all the transitions between any two tonophones, it was necessary to design a corpus with 100% di-tonophone coverage. Based on the requirements of the two systems above, the proposed design process was performed to build two corpora for different speech units: (i) the VSYL (Vietnamese SYLlable speech) corpus with 100% syllable coverage, and (ii) the VDTS (Vietnamese DiTonophone Speech) corpus with 100% di-tonophone coverage. As presented in Section 3.4, the number of di-tonophones appearing only once in the raw data was considerable (1,199). This means that, in order to cover the complete di-tonophone set, all sentences containing these once-occurrence di-tonophones had to be included in the target corpus. Hence, the rarest uncovered speech unit was considered in the preselection of candidate sentences [3]. A similar process was run for the VSYL corpus.

3.5.4 VDTS corpus

As presented above, the VDTS corpus was designed with full coverage of di-tonophones. Consequently, the VDTS corpus also had 100% coverage of phones, tonophones, and di-phones. Its coverages of initials/rhymes and syllables were 95.1% and 70.2% respectively. VDTS was the target corpus that we used as the new training corpus for our TTS system. We expected that, with a corpus providing complete di-tonophone coverage, the quality of the synthetic speech would be much improved, since the transitions between any two tonophones were completely recorded.

Table 3.6 – VSYL, the corpus with complete syllable coverage, and VDTS, the corpus with complete di-tonophone coverage.

#   Factor                              VSYL corpus       VDTS corpus
                                        (100% syllable)   (100% di-tonophone)
1   Number of sentences                 2,297             3,947
2   Mean length (syllables/sentence)    14.4              21.5
3   Coverage of phones                  100.0%            100.0%
4   Coverage of tonophones              100.0%            100.0%
5   Number of phones                    90,219            223,806
6   Coverage of initials/rhymes         100.0%            95.1%
7   Number of initials/rhymes           59,978            161,897
8   Coverage of di-phones               92.3%             100.0%
9   Coverage of di-tonophones           57.0%             100.0%
10  Number of di-phones                 92,516            227,753
11  Coverage of syllables               100.0%            70.2%
12  Number of syllables                 33,033            84,769

Table 3.6 shows the results for these two corpora with full coverage of syllables/di-tonophones. The VSYL corpus had 100% coverage of phones/tonophones, syllables and initials/rhymes. However, its di-tonophone coverage was only about 57.0%. The VSYL corpus had nearly 2,300 sentences, while the VDTS corpus (100% di-tonophone coverage) contained nearly 4,000 sentences. The number of phones in VDTS was about three times that of VSYL. The VDTS corpus also had good syllable coverage (70.2%) and initials/rhymes coverage (95.1%).

3. We ran the design process twice; the corpus designed with the most frequent uncovered unit for the preselection of candidate sentences turned out bigger than the one designed with the rarest unit.

3.6 Corpus recording

In this work, besides the more than 4,700 sentences comprising the VDTS corpus (the new training corpus for our TTS system) and some sentences for special purposes or for the evaluation phase, the text content of the VNSP corpus (the old corpus from previous studies) was also recorded for comparison. A total of 5,338 sentences were recorded at LIMSI, France by a female non-professional native speaker from Hanoi, aged 31 (named Nguyen Thi Thu Trang; Thu-Trang for short). The speaker had left Hanoi two months before the recording. Although she was not a professional speaker, she had a natural and quite pleasant reading style with suitable prosody. As a lecturer, she was able to maintain voice quality throughout the recording sessions.

3.6.1 Recording environment

The recordings took place in a studio at the LIMSI-CNRS laboratory, Orsay, France. The recording studio comprised a soundproof vocal booth and a control station. The soundproof booth contained the following equipment: (i) a condenser microphone with an omnidirectional polar pattern, (ii) a Glottal Enterprises EG2 glottograph, (iii) an iPad allowing the speaker to access the text content of the sentences, and (iv) a loudspeaker. The speaker position was controlled at the beginning of each session by measuring a fixed distance (about 30 cm) from the speaker's mouth to the microphone. A round anti-pop filter was placed in front of the microphone.

Figure 3.3 – Soundproof vocal booth. The iPad screen was put in a suitable, straight position for the speaker. The anti-pop filter was in front of the microphone.

The control station, operated by a recording supervisor, included: (i) a computer (iMac 21.5-inch, Mid 2011), (ii) a high-quality sound card (RME Fireface 400), and (iii) a headphone. The audio and EGG [4] signals were recorded directly into the computer with the software Pro Tools 10 [5], through the sound card, at a sampling rate of 48,000 Hz and 24-bit quantization. They were eventually converted to 48,000 Hz, 16-bit PCM files. For privacy reasons, only the audio files were used in this work.

3.6.2 Quality control

The recordings were done in one-hour sessions with a 5-minute break every half hour, so that the speaker's throat could rest enough to preserve the quality of the recorded speech. The speaker recorded two, or rarely four, sessions per day. Each recording session produced, on average, 200 utterances, hence about 17 minutes of recorded speech. About 7.7 hours of speech (462 minutes), corresponding to 5,338 utterances, were recorded over 27 recording sessions. The speech quality was controlled during and after the recording sessions. To facilitate quality control and further analysis, the recordings were made sentence by sentence. To provide references for voice level and quality, at the beginning of each session the speaker listened to some recordings from previous sessions. The audio feedback and the supervisor's instructions were routed to the speaker's loudspeaker and the supervisor's headphones. The speaker and supervisor could also communicate by gestures through a glass window between the booth and the control station.

Supervision during the recording sessions. The supervisor was a Vietnamese native speaker, trained to use the recording software and aware of the constraints for well-recorded speech. The supervisor operated the recording software to monitor the sound level, to start and stop the recorder, and to erase erroneous utterances. It was also the supervisor's responsibility to verify that the speaker kept producing the right sound level, which could drift if she moved away from the predefined position or started to become tired. The supervisor also had to check that the speaker read all the words in the sentence with the proper pronunciation, an adequate rhythm, and appropriate intonation. When there was a problem, he stopped the recording and explained the issue to the speaker. They could listen to the recording again to clarify the problem and confirm the right way to read the sentence. The supervisor then canceled the erroneous take and replaced it with a new utterance. Among the most challenging material were loan words, rare words, and long sentences.

Verification after the recording sessions. To reduce errors in the speech corpus as much as possible, the audio files were periodically checked after several sessions. The speech rate, voice level and quality were first compared between unchecked and checked sessions. This could detect changes or errors in the recording conditions across sessions. Errors could also be caused by the supervisor, when he forgot to cancel an erroneous utterance, started the recorder too late, or stopped it too early; this could leave audio files with too-short margin pauses or improperly cut speech. Utterances with errors were discarded, and their sentences were re-scheduled for future recording sessions.

4. ElectroGlottoGraph, also called laryngograph.
5. The industry-standard audio production platform: http://www.avid.com/US/products/family/pro-tools

3.7 Corpus preprocessing

Corpus preprocessing had to be done to build a “clean” and annotated speech corpus for TTS systems. This section presents the three major post-recording tasks: (i) normalizing the beginning and ending pauses, (ii) labeling the continuous speech according to the phonetic transcription, and (iii) correcting wrongly labeled breath noises.

3.7.1 Normalizing margin pauses

Each recording file, corresponding to one sentence, was initially named by an incrementing counter. The first step was therefore to rename these recording files with their sentence codes (cf. Section 3.3). Each file was then trimmed so that the pauses at the beginning and at the end were at most 200 ms long. These tasks were done automatically using a Praat script. Since the recordings and quality control were made by humans, some erroneous audio files remained. For instance, the verbal content of some utterances did not match the corresponding text content, with missing, redundant, or even wrong syllables. The margin pauses of some audio files were too short.

– Fβ: attaches β > 0 times as much importance to Recall as to Precision
– F0.5: weights Precision twice as much as Recall
– F2: weights Recall twice as much as Precision
– F1 (or F score, F for short): the same weight for Precision and Recall
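The Fβ family listed above can be computed directly from Precision and Recall with the standard formula Fβ = (1 + β²)·P·R / (β²·P + R); a small Python helper (an illustration, not code from the thesis) makes the weighting explicit:

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure attaching beta times as much importance to recall
    as to precision; beta=1 gives the balanced F1 score."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

For P = 0.8 and R = 0.4, F0.5 (favoring precision) comes out higher than F2 (favoring recall), matching the definitions above.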

4.2.3 Syntactic parsing evaluation

In syntactic parsing, the evaluation technique currently most widely used was proposed by the Grammar Evaluation Interest Group (Grishman et al., 1992) and is often known as “PARSEVAL”. It basically relaxes the success criterion from full identity to a measure of the similarity of an analysis to a test corpus analysis. The original version of the scheme utilised only phrase-structure bracketing information from the annotated corpus and compared the bracketings produced by the parser with the bracketings in the annotated corpus. For ease of comparison, the evaluation method of Collins (1999) and Bikel (2004) is adopted in this work; it measures how much the elements (constituents or dependents) in the hypothesis parse tree look like the constituents in a hand-labeled gold reference parse. In other words, the method compares elements produced by the parser


with elements in the annotated corpus (treebank): it computes the number of matched elements ME with respect to the number of elements PE returned by the parser (expressed as Precision, Formula 4.7) and with respect to the number of elements CE in the corpus (expressed as Recall, Formula 4.8), per sentence.

Precision = ME / PE    (4.7)

Recall = ME / CE    (4.8)

where
• ME: Number of Matched Elements
• PE: Number of Elements returned by the Parser
• CE: Number of Elements in the Corpus
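As a sketch of Formulas 4.7 and 4.8, labelled constituents can be represented as (label, start, end) spans and compared as multisets. This is a plausible implementation of the scoring step, not the thesis code:

```python
from collections import Counter

def parseval(parser_elements, gold_elements):
    """PARSEVAL-style scoring of one sentence: count matched elements
    (ME) between the parser output (PE elements) and the gold corpus
    analysis (CE elements); Precision = ME/PE, Recall = ME/CE."""
    pe = Counter(parser_elements)
    ce = Counter(gold_elements)
    me = sum((pe & ce).values())   # multiset intersection = matched elements
    precision = me / sum(pe.values()) if pe else 0.0
    recall = me / sum(ce.values()) if ce else 0.0
    return precision, recall
```

For a parser that proposes [("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)] against a gold analysis [("NP", 0, 2), ("VP", 3, 5), ("S", 0, 5)], two of three elements match, giving Precision = Recall = 2/3.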

4.2.4 Pause prediction evaluation

In pause prediction, Precision (P) was the probability that a (randomly selected) predicted pause was an actual (correct) pause in the corpus, i.e. the fraction of correctly predicted pauses over the total number of predicted pauses (Formula 4.9). Recall (R), the probability that a (randomly selected) actual pause in the corpus was predicted, was calculated as the number of correctly predicted pauses over the total number of actual pauses in the corpus (Formula 4.10) (Taylor, 2009).

Precision = CP / PP    (4.9)

Recall = CP / AP    (4.10)

where
• PP: Number of Predicted Pauses
• CP: Number of Correctly Predicted pauses
• AP: Number of Actual Pauses in the middle of utterances in the corpus
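If pauses are identified by their positions (say, the index of the word each pause follows; an illustrative convention, not one fixed by the thesis), Formulas 4.9 and 4.10 reduce to a simple set comparison:

```python
def pause_scores(predicted, actual):
    """Pause-prediction evaluation: CP = correctly predicted pauses,
    Precision = CP / PP (Formula 4.9), Recall = CP / AP (Formula 4.10)."""
    predicted = set(predicted)
    actual = set(actual)
    cp = len(predicted & actual)
    precision = cp / len(predicted) if predicted else 0.0
    recall = cp / len(actual) if actual else 0.0
    return precision, recall
```

For example, predicting pauses after words 3, 7, 12 and 20 when the corpus has actual pauses after words 3, 12 and 15 gives CP = 2, hence Precision = 2/4 and Recall = 2/3.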

4.3 Vietnamese syntactic parsing

This section recapitulates syntax theory, Vietnamese grammatical categories and syntactic structure, and the current state of Vietnamese syntactic parsing. The adoption of a state-of-the-art technique for Vietnamese syntactic parsing is then described, with a parser, VTParser, built for the TTS system. Details are presented in Appendix A.

4.3.1 Syntax theory

Grammar, comprising syntax and morphology, helps us analyze and describe the word and sentence patterns of a language by formulating a set of rules with respect to those patterns. Morphology is the study of the form and structure of a given language's morphemes and


Chapter 4. Prosodic phrasing modeling

other linguistic units. Syntax, in turn, tells us how sentences are constructed and what arrangements of the elements within sentences are possible (Kroeger, 2005). Two major aspects of sentence syntactic structure are phrase structure grammar and dependency grammar. The first concerns the organization of the units that constitute sentences, and is hence also referred to as constituency structure grammar, e.g. Sentence → Prepositional phrase + Noun phrase + Verb phrase. The second, dependency grammar, concerns the function of elements (i.e. dependency relations) in a sentence, such as subject, predicate or object, which have traditionally been referred to as grammatical relations or relational structure.

Grammatical categories. Classifying words into “grammatical categories” is a natural first step toward allowing grammars to capture generalizations. The term “grammatical category” now covers not only the Parts Of Speech (POS), e.g. nouns, verbs, prepositions, but also types of phrase, e.g. noun phrases, verb phrases, prepositional phrases. Parts of speech are termed “lexical categories” in contemporary linguistics, or traditionally “word classes”, whereas “non-lexical categories” or “phrasal categories” denote types of phrase (Valin, 2001; Kroeger, 2005). The most important lexical categories are nouns, verbs (V), adjectives (A), adverbs (R) and prepositions (E). Nouns can be categorized in numerous ways, e.g. proper nouns (Np, i.e. proper names) and common nouns (N, which do not refer to unique individuals or entities). Pronouns (P) are closely related to nouns, and are “traditionally characterized as substitutes for nouns or as standing for nouns”.

Phrase structure grammar. A sentence does not consist simply of a string of words; it is not the case that “each word is equally related to the words adjacent to it in the string” (Valin, 2001). Words in a sentence may be grouped into grammatical units of various sizes. One crucial unit is the clause, “the smallest grammatical unit which can express a complete proposition”. A sentence may consist of just one clause or of several. A single clause may contain several phrases, another important unit. A single phrase may contain several words, which may in turn contain several morphemes. “Each well-formed grammatical unit (e.g. a sentence) is made up of constituents which are themselves well-formed grammatical units”, such as clauses, phrases, etc. Only a limited number of basic types of units is needed, and is adequate for a large number of languages: sentence, clause, phrase, word, and morpheme. This kind of structural organization is called a part–whole hierarchy: each unit is entirely composed of smaller units (Kroeger, 2005, p. 32-33). There are two basic ways in which one clause can be embedded within another: coordination vs. subordination. In a coordinate structure, two constituents belonging to the same category are conjoined to form another constituent of that category. In a coordinate sentence, two (or more) main clauses (or independent clauses, S) occur as daughters and co-heads of a higher clause. A dependent clause (or subordinate clause, SBAR, i.e. complement clauses, adjunct or adverbial clauses, and relative clauses) is one that functions as a dependent rather than a co-head. Such a combination of words cannot stand alone as a complete sentence, but provides additional information to finish the thought (Kroeger, 2005). The term “phrase” in linguistics has a more precise meaning than “any group of words”: a group of words that functions as a constituent (i.e. a unit for purposes of


word order) within a simple clause. Phrases may be classified into different categories, such as noun phrases (NP) and verb phrases (VP). In most phrases there is one word that is the most important element, called the head (H) of the phrase. The category of the phrase head generally gives its name to the phrase. The following example illustrates this hierarchical structure in Vietnamese: [S [NP [N Cô giáo (The teacher)] [Np tiếng Anh (English)] [SBAR mà (who) [NP [N anh (you)]] [VP đã [V gặp (met)]] [NP [N hôm qua (yesterday)]]]] [VP đang [V đọc (is reading)] [NP [N sách (books)]] [PP [P trong (in)] [NP [N thư viện (the library)]]]]].

Dependency structure grammar. Another important aspect of sentence structure needs to be considered, namely “grammatical relations”: the syntactic functions of elements such as subjects or objects in a sentence. This type of syntax is therefore referred to as “relational structure”, and also as “dependency structure”, since it encompasses the dependency relations. Aside from the predicate itself, the elements of a simple clause, i.e. clausal dependents, can be classified as either adjuncts or arguments, as illustrated in Figure 4.2. Adjuncts (ADT) are elements that are “not closely related to the meaning of the predicate but which are important to help the hearer understand the flow of the story, the time or place of an event, the way in which an action was done, etc”. Adjuncts can be omitted without creating any sense of incompleteness. Arguments are those elements that are “selected by the verb”; they are “required or permitted by certain predicates, but not by others”. In order to be expressed grammatically, arguments must be assigned a grammatical relation within the clause. There are two basic classes of grammatical relations: obliques (or indirect arguments) vs. terms (or direct arguments). Terms (i.e. subject–SUB, primary object–OBJ, secondary object–OBJ2) “play an active role in a wide variety of syntactic constructions”, while obliques (OBL) are “relatively inert” (Kroeger, 2005, p. 62).

Figure 4.2 – Classification of clausal elements (Kroeger, 2005, p. 62).

Some clausal dependents are illustrated in the following Vietnamese example: [S [ADT Tối qua (last night)] [SUB Kiên (Kien)] [PRD đã tặng (gave) [OBJ một bó hoa hồng (a bouquet of roses)] [OBL cho mẹ của anh ấy (to his mother)]]].


4.3.2 Vietnamese syntax

In order to address the problems of syntactic parsing (i.e. syntactic analysis, cf. Section A.2, Appendix A), a common approach is to construct a treebank. A treebank is simply a collection of sentences (normally a large sample of sentences, also called a corpus of text), where each sentence is provided with a complete syntactic analysis. A treebank solves the knowledge acquisition problem (i.e. designing a grammar to cover all syntactic analyses of natural language) by letting the grammar underlying the syntactic analyses be found from data. There is no explicit set of syntactic rules or linguistic grammar in a treebank, nor any explicit list of syntactic constructions. In fact, a parser can infer a set of implicit grammar rules that covers a large number of syntactic analyses not present in the treebank. Concerning the problem of the explosion of rule combinations: since each sentence in a treebank has been given its most plausible syntactic analysis, supervised learning methods can be used to train a scoring function over all possible syntactic analyses of a sentence. For a given sentence not seen in the training data, a statistical parser can use this scoring function to return the syntactic analysis with the highest score, which is taken to be the most plausible analysis for that sentence. The syntactic parse of each sentence should be annotated by human experts to guarantee the most plausible analysis. Before the annotation process, an annotation guideline is typically written in order to ensure a consistent annotation scheme throughout the treebank. This section presents VietTreebank, a Vietnamese treebank (Nguyen et al., 2009), and the Vietnamese syntax that VietTreebank followed for its annotation.

Vietnamese TreeBank. The Vietnamese treebank (VietTreebank) (Nguyen et al., 2009) was constructed as a result of a national project in Vietnam, VLSP (Vietnamese Language and Speech Processing) [2].
The construction of this corpus included five major phases: (i) investigation, (ii) guideline preparation, (iii) tool building, (iv) raw text collection, and (v) annotation. Raw texts were collected from the Youth online daily newspaper, covering a number of topics including society and politics. To the best of our knowledge, despite various remaining issues, VietTreebank has so far been the only corpus used in natural language processing for Vietnamese.

Table 4.2 – VietTreebank corpus (Nguyen et al., 2009, p. 14)

Data set                 Sentences #   Words #     Syllables #
POS tagged               10,368        210,393     255,237
Syntactically labeled    9,633         208,406     251,696

POS tag set. Since Vietnamese word order is quite fixed, a phrase-structure representation was chosen for syntactic structures in VietTreebank. There were three annotation levels: word segmentation, POS tagging, and syntactic labeling. Word segmentation identified word boundaries in sentences. POS tagging assigned the correct POS tags to words. Syntactic labeling recognized both phrase-structure tags and functional tags. Table 4.2 shows the sizes of the two data sets in this corpus: (i) the data set tagged with POS: 10,368 sentences, and (ii) the data set annotated with syntactic labels: 9,633 sentences. In this work, we adopted the Vietnamese POS tag set from the work of Le et al. (2010),

2. http://vlsp.vietlp.org:8080/


illustrated in Table 4.3. This complete tag set was designed for annotating the Vietnamese treebank (Nguyen et al., 2009).

Table 4.3 – Vietnamese POS tag set (Le et al., 2010, p. 14)

No.  Category  Description      No.  Category  Description
1.   Np   Proper noun           10.  M    Numeral
2.   Nc   Classifier            11.  E    Preposition
3.   N    Common noun           12.  C    Subordinating conjunction
4.   P    Pronoun               13.  CC   Coordinating conjunction
5.   Nu   Unit noun             14.  I    Interjection
6.   V    Verb                  15.  T    Auxiliary, modal words
7.   A    Adjective             16.  Y    Abbreviation
8.   R    Adverb                17.  Z    Bound morpheme
9.   L    Determiner            18.  X    Unknown

Major lexical categories in Vietnamese are nouns (including common nouns N, classifiers Nc, proper nouns Np, unit nouns Nu, and pronouns P), verbs (V), adjectives (A), adverbs (R) and prepositions (E). Minor ones are conjunctions (subordinating C, coordinating CC), determiners (L), numerals (M), interjections (I), auxiliary/modal words (T), and bound morphemes (Z). Proper nouns (Np) can be Vietnamese proper names, e.g. “Hà Nội”, “Nguyễn Khuyến”, or loanwords, e.g. “Luân-Đôn” (London), “Ê-li-da-bét” (Elizabeth). Examples of common nouns (N), which do not refer to unique individuals or entities, are “bàn” (table), “mèo” (cat), “ghế” (chair), etc.

Beyond the classical POS used in Western languages (noun, verb, ...), there are classifiers, which are commonly found in Asian languages. Classifiers are independent words considered as nouns, which “occupy a special position in the noun phrase, but do not seem to contribute to the meaning of the noun phrase in any definite way” (Kroeger, 2005). A classifier may categorize referents (normally nouns) based on attributes such as shape, function, or animacy. Unlike in European languages, Vietnamese common nouns are in general required to be accompanied by a classifier, and vice versa, since the meaning of a Vietnamese classifier cannot be specified in isolation. Vietnamese is one of several Asian languages with a complex numeral classifier system. In English, most nouns require a choice between singular and plural (e.g. table vs. tables), whereas Vietnamese nouns “do not in themselves contain any notion of number or amount. In this respect they are all somewhat like English mass nouns such as milk, water, flour, etc.” (Thompson, 1987, p. 193). Vietnamese classifiers can also be used in “anaphoric constructions where classifiers are considered as a pronoun to replace the omitted head noun” (meaning “one”), as in “cái lớn” (a big one). The two most commonly used classifiers in Vietnamese are “con” (for animate, non-human objects) and “cái” (for inanimate objects) (Dao, 2011). Major Vietnamese classifiers are presented in Appendix A.

Nouns can also be accompanied by a determiner (L), such as “mấy cái chìa khoá” (some keys), “nhiều cửa sổ” (many windows), “những ngôi nhà” (houses), “chút tiền” (a little money); or by a numeral (M), such as “ba chiếc kẹo” (three candies). Pronouns in Vietnamese may substitute for nouns, e.g. “đó”, “đấy”, “ấy”, “kia” (that), “đây”, “này” (this). First- and second-person pronouns are much more complicated than in European languages, since they depend on the relationship, gender and ages of speakers and listeners. For instance, the pronoun pair “I-you” in English can be “tôi-bạn” between two persons in a general relationship (i.e. ignoring gender, age, ...), “mẹ-con” between mothers and children, “ông-cháu”


between grandfathers and grandchilds, “mày-tao” between two persons with a close or negative relationship. The genders and ages also plays an important role to decide pronouns for the second-persons, such as “chị” (sister) for older females, “cô” aunt for much older females and “bà” grandmother for much much older females; while “anh” (brother), “chú” (uncle) and “ông" (grandfather) for males respectively. The last sub-type of nouns is unit noun, which shows a unit or a measure, such as “phút” (minute), “mét” (meter), “km/h”, ect. ‘Bound morphemes” (Z) designate syllables that are not supposed to appear alone and should only be seen as part of a compound word, and this tag is normally only ever used to deal with cases when the segmentation of the corpus has been done improperly. Some examples of adjectives in Vietnamese include “to” (big), “dài” (long) or “mỏng” (thin) for sizes; “tròn” (circle) or “vuông” (square) for shapes; “đắng” (bitter), “tươi” (fresh) or “cay” (spicy) for tastes; “xấu” (ugly), “mềm” (soft) or “chính xác” (correct) for qualities. Some common Vietnamese adverbs are “vẫn” (still), “chưa”, “không” (not), “quá” (too), “rất” (very), “thật” (really). Syntactic structure. Two types of syntactic structure, i.e. constituency and dependency structure grammar, were annotated in VietTreebank. However, the constituency representation, i.e. phrase structure, was chosen as the main structure using brackets since Vietnamese has a quite fix word order. Dependency relations were annotated by functional labels for corresponding constituents. Independent clauses, i.e. main clauses, were labeled as “S” whereas “SBAR” was annotated for dependent clauses, i.e. subordinate clause (including complement clauses, adjunct or adverbial clauses and relative clauses). A phrase includes one or more heads (phrase head–H, a functional label generally giving name to the phrase), preceding and succeeding supplement elements. 
For instance, the common noun “người” (person), which determines the phrase name (noun phrase – NP), is the phrase head of “một người cao lớn” (a tall and big person). The preceding supplement element, “một” (a), is a numeral (M), while the succeeding one, “cao lớn” (tall and big), is an adjective (A). Phrases whose head words are common nouns, classifiers, proper nouns, unit nouns, or pronouns are noun phrases. Some other main phrasal categories are PP (prepositional phrase), VP (verb phrase), AP (adjective phrase) and RP (adverb phrase). In addition, QP was adopted for numeral phrases; UCP refers to a phrase including two or more head elements in different categories, connected by a coordinating conjunction (CC). Other phrases were labeled XP, such as expressions or other unclassified phrases. Some examples of Vietnamese phrases are shown below. The main functional labels, i.e. dependency relations, in VietTreebank are “SUB” for subjects, “PRD” for full predicates, “H” for phrase heads, “DOB” for direct objects, and “IOB” for indirect objects. Adjuncts are annotated with a list of labels showing their semantic functions: “TMP” for time, “LOC” for location, “MNR” for manner, “CND” for condition, “PRP” for purpose, etc. Other semantic functions of adjuncts are annotated as “ADV”. Phrase structure in VietTreebank is represented by brackets, which are straightforwardly converted to hierarchical trees. Functional labels are stored as properties of the constituent nodes. Figure 4.3 illustrates the sentence “Men theo con đường mòn, chúng tôi đến một khu đất trước dãy núi Sen” (Skirting a rut, we went to a piece of land before a row of the mountain Sen) using (a) brackets and (b) a hierarchical tree.


4.3. Vietnamese syntactic parsing

(S (VP-ADV (V-H Men) (PP-MNR (E theo) (NP (Nc-H con) (N đường mòn)))) (, ,) (NP-SUB (P-H chúng tôi)) (VP (V-H đến) (NP-DOB (M một) (N-H khu) (N đất) (PP-LOC (E-H trước) (NP (N-H dãy) (N núi) (Np Sen))))) (. .))

(a)

(b) [hierarchical tree rendering of the same parse, with English glosses: Men (skirting), theo (along), con đường mòn (a rut), chúng tôi (we), đến (went to), một (a), khu (piece of), đất (land), trước (before), dãy (a row of), núi (mountain), Sen (Sen)]
Figure 4.3 – An example of a sentence annotated in VietTreebank using: (a) brackets and (b) a hierarchical tree.
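The bracket-to-tree conversion mentioned above is straightforward to implement. The following is a minimal sketch, not the thesis implementation, using generic (label, children) tuples:

```python
# Minimal sketch: converting a VietTreebank-style bracketed parse into a
# nested tree of (label, children) tuples; bare tokens become leaf strings.

def parse_brackets(s):
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]          # constituent or POS label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                     # consume ")"
        return (label, children)

    return parse_node()

tree = parse_brackets("(NP-SUB (P-H chúng) (P tôi))")
# tree == ("NP-SUB", [("P-H", ["chúng"]), ("P", ["tôi"])])
```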


Example 7 Some examples of Vietnamese phrases
• NP: “những [Nc quả] bóng màu xanh” (green [classifier] balls)
• PP: “[E trên] mặt đất” (on the ground), “[E từ] năm 1990” (since 1990), “[E của] tổ quốc ta” (of our fatherland)
• VP: “hay [V đi chơi] với bạn bè” (often go out with friends), “[V bắt đầu] làm việc từ sớm” (start working early)
• AP: “rất [A đẹp]” (very beautiful), “[A giỏi] về thể thao” (good at sport)
• QP: “hơn [M 200]” (more than 200)
• RP: “[R vẫn] chưa” (still not)
• UCP: “vải [UCP [AP rẻ] và [NP chất tốt]]” (cheap and good quality clothes)
• XP: “ba cọc ba đồng” (fixed and modest–for income)

4.3.3 Syntactic parsing techniques

In natural language processing, syntactic analysis (hereafter called syntactic parsing) may vary from low to high levels. The lowest level can be referred to as simple part-of-speech tagging of each word in the sentence. Shallow parsing (also known as “chunking” or “light parsing”) decomposes the sentence into constituents but specifies neither their internal structure nor their role in the main sentence. The highest level, i.e. full parsing, recovers not only the phrase structure of a sentence but also the dependencies between each predicate in the sentence and its explicit and implicit arguments. In syntactic parsing, ambiguity is a particularly onerous issue, since the most probable analysis has to be chosen from an exponentially large number of alternatives. As a result, parsing algorithms play an important role in handling such ambiguity, and hence decide the quality of a parser at the different levels from tagging to full parsing. The main syntactic parsing techniques are detailed in Appendix A. Generative models. The main idea of generative models is that, in order to find the most plausible parse tree, the parser has to choose between the possible derivations, each of which can be represented as a sequence of decisions. The Probabilistic Context-Free Grammar (PCFG) model is the simplest classical instance of generative models: the chosen parse tree is the one with the highest joint probability with the input sentence. The most popular classical generative model is the Lexicalized Probabilistic Context-Free Grammar (LPCFG) of Collins (1999). Its idea is to extend the history of a parse tree by adding information about phrase head words. On the test set, i.e. section 23 of the English Penn Treebank (Marcus et al., 1993), the LPCFG parser reaches an F score of 88.2%, while the naive PCFG parser reaches 73%.
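The PCFG scoring idea can be sketched in a few lines; the grammar and rule probabilities below are invented for illustration only:

```python
# Hedged sketch of the PCFG idea: the probability of a parse tree is the
# product of the probabilities of the rewrite rules used in its derivation.

from functools import reduce

# Toy grammar: P(rhs | lhs) for each rule (probabilities are illustrative).
rules = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("P",)):       0.4,
    ("NP", ("M", "N")):   0.6,
    ("VP", ("V", "NP")):  0.7,
    ("VP", ("V",)):       0.3,
}

def tree_probability(derivation):
    """derivation: the list of (lhs, rhs) rules used in the tree."""
    return reduce(lambda p, r: p * rules[r], derivation, 1.0)

# Derivation of a toy tree: S -> NP VP, NP -> P, VP -> V NP, NP -> M N
p = tree_probability([
    ("S", ("NP", "VP")),
    ("NP", ("P",)),
    ("VP", ("V", "NP")),
    ("NP", ("M", "N")),
])
# p = 1.0 * 0.4 * 0.7 * 0.6 = 0.168
```

A PCFG parser then searches for the derivation maximizing this product jointly with the input sentence.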
The current state-of-the-art generative parsing model in terms of accuracy is the well-known Berkeley parser (Petrov and Klein, 2007). Its authors assumed that if each constituent label (even POS) in the Treebank can be split in a good manner, high accuracy can be obtained. This method, called “Latent Variable PCFG”, uses the Expectation-Maximization (EM) algorithm to find the best way to split the grammar,


reaching an F score of 90.1%. The Berkeley parser has been considered one of the strongest parsers because it needs no grammar information beyond the Treebank corpus, making it easy to apply to any language. Discriminative models. The definition of a PCFG means that the various rule probabilities have to be adjusted in order to obtain the right scoring of parses. Meanwhile, the independence assumptions in a PCFG, dictated by the underlying CFG, often lead to bad models: they cannot use information vital to scoring the rules that lead to high-scoring plausible parses. Such ambiguities can be modeled using arbitrary “features” of the parse tree, and discriminative methods provide such a class of models. Even in common machine learning, the performance of discriminative models is usually better than that of generative models. Collins (2002) created a simple framework, called a global linear model, that describes various discriminative approaches to training a parsing system (and also chunking or tagging). Commonly, a conditional random field (Lafferty et al., 2001) can be used to define the conditional probability with a linear score for each candidate and a global normalization term. However, a simpler global linear model can be obtained by ignoring the normalization term (and is thus much faster to train). Many experimental results in parsing have shown that this simpler model often provides the same or even better accuracy than the more expensively trained normalized models. Advanced parsing methods. Besides the above learning models, a number of advanced methods utilize external information to boost the performance of parsing systems. Socher et al. (2013) used a deep learning technique based on recurrent neural networks and reached an F score of 90.5%. Charniak and Johnson (2005) proposed a general framework called the re-ranking parser.
This framework first uses a baseline generative parser (such as the one in Collins (1999) or Petrov and Klein (2007)) to produce the top k-best candidate parse trees, and then uses a discriminative model with a set of strong, rich features to re-rank them and pick out the best one. This work used a maximum entropy model as the discriminative re-ranker for the baseline system, achieving a high F score of 91.5% on the English Treebank test set. Huang (2008) improved the re-ranking strategy by encoding more candidate parse trees in the first phase and using the averaged perceptron model for the re-ranking phase, reaching an F score of 91.8% on the English test set. However, that is not the whole story: McClosky et al. (2006) extended the idea of the re-ranking parser by injecting unsupervised features from a large external text corpus, making the parser a self-trained system that achieved an F score of 92.4% on the test set. The self-trained parser is currently considered the state-of-the-art parser in terms of F score on the English test set.

4.3.4 Adoption of parsing model

Averaged perceptron. A well-known discriminative model, the perceptron, was adopted for Vietnamese syntactic parsing in our TTS system. The perceptron (Rosenblatt, 1988) was originally introduced as a single-layer neural network. In structured prediction problems such as parsing, the perceptron can be considered the most widely used model due to its simplicity and efficiency. Compared to generative models or other discriminative models, it is much simpler while keeping a competitive accuracy (Carreras et al., 2008, Collins and Roark,


2004, Zhu et al., 2013). The perceptron can be trained by online learning, that is, processing examples one at a time, adjusting a weight parameter vector that is then applied to input data to produce the corresponding output. The weight adjustment process rewards features appearing in the truth and penalizes features not contained in the truth. After the update, the perceptron ensures that the current weight parameter vector correctly classifies the present training example. Detailed learning algorithms for the original perceptron, the voted perceptron and the averaged perceptron are described in Appendix A. Although the original perceptron learning algorithm is simple to understand and to analyze, the incremental weight updating suffers from over-fitting: it tends to classify the training data better at the cost of classifying unseen data worse. Also, the algorithm is not capable of dealing with training data that is linearly inseparable. Freund and Schapire (1999) proposed a variant of the perceptron learning approach, called the voted perceptron algorithm. Instead of storing and updating parameter values inside one weight vector, its learning process keeps track of all intermediate weight vectors, and these intermediate vectors are used in the classification phase to vote for the answer. The intuition is that good prediction vectors tend to survive for a long time and thus have larger weight in the vote. Compared with the original perceptron, the voted perceptron is more stable, thanks to maintaining the list of intermediate weight vectors for voting. Nevertheless, storing those weight vectors is space-inefficient, and the weight calculation, which uses all intermediate weight parameter vectors during the prediction phase, is time-consuming.
The averaged perceptron algorithm (Freund and Schapire, 1999) is an approximation to the voted perceptron that maintains the stability of the voted perceptron algorithm but significantly reduces its space and time complexities. It turns out that the perceptron, especially the averaged one, is one of the most powerful models for parsing and for resolving various problems in natural language processing. Collins and Roark (2004) reported that an incremental parser using the averaged perceptron could reach an F score (86.6%) comparable to that of the generative model (86.7%). Zhu et al. (2013) showed that a perceptron-based parser could achieve an F score of 90.4%, outperforming the state-of-the-art generative parser (Petrov and Klein, 2007) without using any latent variables. Carreras et al. (2008) proposed a way of using Tree Adjoining Grammars (TAG) with the perceptron algorithm, producing a parsing accuracy of 91.1%, certainly one of the state-of-the-art results in parsing. Shift-reduce parsing with averaged perceptron. We adopted the parsing model and the syntactic parser of Le et al. (2015) for constituency parsing in our TTS system. As with any practical problem, there are two criteria for a parser in speech synthesis: accuracy and parsing speed; it is therefore necessary to select a system that balances both. Experiments were performed in the work of Le et al. (2015) to compare the state-of-the-art parsers in terms of their performance on the test set of the English Treebank (Marcus et al., 1993).
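The averaging idea described above can be illustrated with a minimal binary-classification sketch; this is a simplification of the structured setting used in parsing, and the data below is invented for illustration:

```python
# Minimal averaged-perceptron sketch: keep a running sum of the weight
# vector after every step and use its average at prediction time.

def train_averaged_perceptron(data, epochs=10):
    dim = len(data[0][0])
    w = [0.0] * dim           # current weights
    w_sum = [0.0] * dim       # running sum over all steps (for averaging)
    n_steps = 0
    for _ in range(epochs):
        for x, y in data:     # y in {-1, +1}
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:                         # mistake: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
            w_sum = [s + wi for s, wi in zip(w_sum, w)]
            n_steps += 1
    return [s / n_steps for s in w_sum]                # averaged weights

# Toy linearly separable data (illustrative feature vectors).
data = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.0, 1.0], -1), ([0.1, 0.8], -1)]
w_avg = train_averaged_perceptron(data)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w_avg, x)) > 0 else -1
```

In the structured case the same update is applied to feature vectors of whole trees rather than of single examples.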

Table 4.4 presents the results of these experiments, including both averaged speed and F-score. The results show that the generative parsers had lower performance than the discriminative ones. Only Petrov and Klein (2007) achieved a high accuracy, but their speed was quite slow (6.1 sentences/s), especially for practical problems such as speech synthesis. The most accurate parsers belonged to the group using advanced models. However, due to the complexity of these models, their speeds were also very slow, and they


Table 4.4 – F-score of the adopted parsing system on the English test set compared with state-of-the-art parsers

  Model                  System                        Speed (sentences/s)   LR     LP     F1
  Advanced models        Charniak and Johnson (2005)   2.1                   91.2   91.8   91.5
                         McClosky et al. (2006)        1.2                   92.2   92.6   92.4
                         Socher et al. (2013)          3.3                   90.3   90.7   90.5
                         Huang (2008)                  N/A                   92.2   91.2   91.7
  Generative models      Petrov and Klein (2007)       6.1                   90.1   90.3   90.2
                         Collins (1999)                3.5                   88.1   88.3   88.2
  Discriminative models  Sagae and Lavie (2005)        3.7                   86.0   86.1   86.0
                         Sagae and Lavie (2006)        2.2                   88.1   87.8   87.9
                         Carreras et al. (2008)        N/A                   90.7   91.4   91.2
                         Zhu et al. (2013)             32.4                  90.2   90.7   90.4
                         Zhu et al. (2013) + semi      11.6                  91.1   91.5   91.3
                         Charniak (2000)               5.7                   89.5   89.9   89.5
                         Adopted parsing model         13.6                  90.9   91.2   91.1

even required external resources beyond the Treebank, which were expensive to prepare. In the work of Le et al. (2015), the parsing system of Zhu et al. (2013) was selected as the baseline system to extend, for the following reasons. First, that system achieves state-of-the-art accuracy with a fast parsing speed. Second, the parser is trained using an averaged perceptron, a global linear model, which is a simple yet fast training model and easily supports online training. In addition, the baseline system is based on the shift-reduce parsing algorithm of Zhang and Clark (2009), which runs in linear time with a richer feature set than traditional chart-based parsing algorithms (Carreras et al., 2008, Charniak, 2000, Collins and Roark, 2004). As a result, this system achieves a good balance between parsing speed and F-score. However, it relies on an inexact, beam-based search (Zhu et al., 2013) in both the training and parsing phases. This bottleneck may lead to search errors, causing some loss of accuracy. In order to reduce the search error, Le et al. (2015), our adopted parsing model, proposed a method in which an exact search is performed instead of the approximate one. The main idea of this work is to use dynamic programming (Huang and Sagae, 2010) and A* search (Klein and Manning, 2003) to guarantee optimality. The system of Le et al. (2015) was the first parsing system that could perform an exact search for shift-reduce parsing with the averaged perceptron algorithm. As shown in Table 4.4, the fastest model, at 32.4 sentences/s, has an F-score of 90.2%. The adopted system achieves a high F-score of 91.1% with a high parsing speed (13.6 sentences/s). Zhu et al.
(2013) achieve slightly better accuracy (91.3%) than the adopted system, but more external semi-supervised features had to be used to compensate for the search errors (hence a lower speed of 11.6 sentences/s).
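The shift-reduce mechanism underlying these parsers can be sketched as follows; this is a toy version driven by oracle actions, not the system of Le et al. (2015):

```python
# Illustrative shift-reduce sketch: a stack and a queue of words.
# SHIFT moves the next word onto the stack; REDUCE-X pops the top two
# items and replaces them with a new constituent labeled X.

def shift_reduce(words, actions):
    stack, queue = [], list(words)
    for act in actions:
        if act == "SHIFT":
            stack.append(queue.pop(0))
        else:                              # e.g. "REDUCE-NP"
            label = act.split("-", 1)[1]
            right = stack.pop()
            left = stack.pop()
            stack.append((label, [left, right]))
    assert len(stack) == 1 and not queue   # exactly one tree remains
    return stack[0]

# "một người" (a person) -> NP; "đến" (went to) + NP -> VP (toy actions)
tree = shift_reduce(
    ["đến", "một", "người"],
    ["SHIFT", "SHIFT", "SHIFT", "REDUCE-NP", "REDUCE-VP"],
)
# tree == ("VP", ["đến", ("NP", ["một", "người"])])
```

A real parser scores the candidate actions at each step (here with an averaged perceptron) instead of following a fixed action sequence.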

4.3.5 VTParser, a Vietnamese syntactic parser for TTS

Despite a considerable gap in quality compared to languages such as Chinese, Japanese or English (e.g. F-scores for English of 90%–92%), there have been at least three popular


syntactic parsers for Vietnamese: vnLTAGParser (Le et al., 2012), the LPCFG parser (Le et al., 2009) and the Berkeley parser for Vietnamese (Nguyen et al., 2013a). The vnLTAGParser, adopting Lexicalized Tree-Adjoining Grammars for both constituency and dependency parsing, had F-scores of 73.21% (dependency) and 69.33% (constituency). The best F-score in Le et al. (2009) was around 78% (constituency). The last system, which is the Berkeley parser applied to the Vietnamese language (Nguyen et al., 2013a), provided an F-score of 73.21% (dependency). There were two major reasons for such “less-than-stellar” performance. First, the Vietnamese Treebank corpus was not sufficient, stable and accurate enough to produce a good parsing model. Second, most of the existing Vietnamese parsers relied heavily on generative parsing models, which have been empirically proven to be less accurate than most state-of-the-art parsing models. VTParser, a Vietnamese syntactic parser for TTS, was built on the parsing model of Le et al. (2015) for constituency parsing, trained with the VietTreebank corpus, with some adaptations for the Vietnamese language. An experiment was performed to evaluate the effectiveness of VTParser for Vietnamese parsing, compared to the three well-known Vietnamese parsers above: (i) the Vietnamese Berkeley parser (Petrov and Klein, 2007), (ii) the LPCFG parser (Le et al., 2009), and (iii) the vnLTAGParser (Le et al., 2012). The four syntactic parsers in this experiment were trained on the VietTreebank corpus with the same train-test split as Le et al. (2009).

Table 4.5 – Results of the experiment comparing different Vietnamese parsers

  System           Speed (sentences/s)   F-score
  LPCFG parser     4.3                   78.2%
  Berkeley parser  6.2                   71.5%
  vnLTAGParser     N/A                   69.33%
  VTParser         13.6                  81.6%

Table 4.5 presents the experimental results, showing that VTParser outperformed all other Vietnamese parsers in both parsing accuracy and speed: it was about twice as fast as the next fastest parser, and its F-score was about 3.4 points higher than that of the most accurate one (the LPCFG parser). Based on the different approaches to prosodic phrasing modeling presented in this chapter, we propose three types of syntactic parsing for our TTS system, as follows.

• Standard constituency parsing: Sentences are parsed into syntax trees. The leaves of these trees are grammatical words labeled with POS categories, while ancestor nodes are syntactic phrases labeled with phrasal categories. This is standard phrase-structure parsing.

• Unnamed constituency parsing: Sentences are parsed into syntax trees. The leaves are grammatical words labeled with POS categories, while ancestor nodes are unnamed syntactic phrases. This type of parsing was proposed especially for our work on prosodic phrasing modeling using phrase structure but not phrase names (cf. syntactic blocks in Section 4.5).

• Constituency parsing with grammar-functional labels: Sentences are parsed into syntactic trees as in standard constituency parsing, but with additional information for some nodes: some phrase nodes or word leaves are assigned special grammar-functional labels: main clause (S), subordinate clause (SB), adjunct (ADT) for all semantic functions, head of phrase (H), subject (SUB) and predicate (PRD). The averaged perceptron algorithm was also trained on VietTreebank for this grammar-functional labeling. This type of parsing was used for our preliminary analysis of syntactic rules and break levels (cf. Section 4.4).

Table 4.6 – Experimental results of the three syntactic parsing types for Vietnamese

  Syntax parsing type                           P        R        F-score
  Standard constituency parsing                 81.14%   82.72%   81.61%
  Unnamed constituency parsing                  84.43%   85.40%   84.61%
  Constituency parsing with functional labels   70.56%   72.33%   71.11%

The experimental results for these parsing strategies are shown in Table 4.6. The unnamed constituency parsing had the highest precision (84.43%), recall (85.40%) and F-score (84.61%), while the quality of the constituency parsing with grammar-functional labels was lower than that of the unnamed one by about 14 points. There was a small gap (about 3 points) between the accuracy of the standard constituency parsing and the unnamed one.
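The unnamed constituency parsing described above amounts to erasing phrase names while keeping POS tags on pre-terminal nodes. A minimal sketch, where the (label, children) tree encoding is an assumption rather than the thesis format:

```python
# Sketch of "unnamed constituency parsing": keep POS tags (pre-terminal
# nodes, whose children are all word strings) but replace every internal
# phrase label with a generic "XP".

def unname(tree):
    label, children = tree
    new_children = [c if isinstance(c, str) else unname(c) for c in children]
    is_preterminal = all(isinstance(c, str) for c in children)
    return (label if is_preterminal else "XP", new_children)

named = ("NP", [("M", ["một"]), ("N", ["người"])])
# unname(named) == ("XP", [("M", ["một"]), ("N", ["người"])])
```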

4.4 Preliminary proposal on syntactic rules and breaks

This is our first analysis and proposal for predicting break indices using syntactic information for Vietnamese TTS systems. It is believed that there is an interface between syntactic and prosodic structure (Martin, 2010, Nespor and Vogel, 1983, 2007, Selkirk, 2011, 1980). In this preliminary study, however, we did not investigate the theory of the syntax-prosody interface in depth. Some syntactic rules for predicting break levels were proposed based on a preliminary study of the theory, and mostly on our observations of prosodic and syntactic structure. We assumed that break indices above the word and below the sentence range from “2” to “4” (cf. Section 5.4, Chapter 5 for the other levels).

4.4.1 Proposal process

In this work, the VNSP-Broadcaster corpus, recorded by a broadcaster – an existing small corpus of 630 sentences – was used for the analyses and the proposal. As presented in Section 4.2, the audio files of this corpus were manually transcribed, time-aligned at the syllable level, and annotated for perceived pauses in TextGrid files. The text files were manually parsed into constituency syntax trees with additional grammar-functional labels. The main tasks for proposing hypotheses linking syntactic rules to break indices are illustrated in Figure 4.4. Hypotheses, i.e. syntactic rules predicting prosodic boundaries with corresponding break indices, were proposed based on our observations of the corpora. These hypotheses were applied to the syntactic trees of the text corpus to specify prosodic boundaries, which were then automatically annotated in the TextGrid files of the audio corpus to identify prosodic phrases. The durations of the last syllables and following pauses of these predicted phrases were measured and analyzed. Final lengthening of the last syllables was calculated using Z-score normalization, linked to syllabic structures and tone types. Statistical analyses were carried out to find correlations between syntactic element boundaries and pause duration as well as final lengthening. This


[Flowchart: Start → proposal of syntactic rules linked to break indices → boundary prediction by applying hypotheses → measurement at boundaries → analysis and verification → if the precision is acceptable, stop; otherwise, return to the proposal step.]

Figure 4.4 – General approach for prosodic phrasing modeling using syntactic rules.

process was repeated with new or fine-tuned syntactic rules until an acceptable precision of boundary prediction was obtained. Constituents in the phrase structure grammar, or dependents in the relational structure, were used as the primary elements of the syntactic rules. Break indices for these rules were proposed based on the likelihood of a pause appearing, and on pause length, at the predicted boundaries. After several iterations, we discovered some features for fine-tuning, e.g. the number of syllables of dependents/constituents, and the children or parents of dependents/constituents.

4.4.2 Proposal of syntactic rules

Formal symbols were proposed to express the syntactic rules formally for further automatic processing in boundary prediction and fine-tuning. Details of these proposed symbols are presented in Appendix B. After studying the theory of the syntax-prosody interface and observing the relations between syntax and pause appearance in the corpus, we proposed some hypotheses on syntactic rules with corresponding break indices. Two types of rules were identified: (i) constituent syntactic rules between two constituents in the phrase structure grammar, and (ii) functional rules between two dependents in the relational structure. The proposed rules with syntactic constituents and with syntactic dependents are presented in Table B.2 and Table B.3 of Appendix B, respectively. The highest break level in the middle of sentences (“4”) was set if either the left constituent is or contains a clause (S, SB), i.e. rules HC1 and HC2, or both the left and right dependent elements are predicates (PRD), i.e. rule HD1, or head elements (H), i.e. rule HD2.


Other decisions were made on the basis of syntactic element names (e.g. adjuncts ADT) and/or the number of syllables in the left or right elements. Smaller break indices (“2” and “3”) may appear after some special POS or syntactic phrases, e.g. prepositional phrases (PP) or conjunctions (C). Syntactic rules were refined using the number of syllables, and the parents or children, of syntactic elements. For instance, we found that there is a boundary between a phrase having at least 7 syllables and a phrase having at least 4 syllables (HC3). These thresholds were optimized through several iterations from proposal to evaluation. For instance, the formal representation of the syntactic rule HC1 is “SB; .{1,}(child:S|SB)−”. It means that there is a boundary between a subordinate clause (SB), or any constituent having a clause child (“child:S|SB”), AND any constituent, as in “[Người đàn ông [mà bà gặp hôm qua ở nhà tôi]SB]NP − là một người rất tốt bụng” ([The man [you met yesterday at my home]SB]NP − is a really kind person). With the rule HC5, “PP>=3 − C;[ANV]P”, we assumed that there is a boundary between a prepositional phrase having at least 3 syllables (“PP>=3”) AND a conjunction (“C”) or an adjective/noun/verb phrase (“[ANV]P”), as in “Đó là kết quả của những buồn vui [trong tình yêu của riêng mình]PP − [và]C cả những tâm sự của khán giả dành cho tôi” (That is the result of the joys and sorrows [in their own love]PP − [and]C also of the confidings the audience gives to me). For the dependent rules, we only investigated some typical cases. For example, there is usually a boundary between two predicates (HD1: “PRD − PRD”), e.g. “Lão [có nhà ở ngoại ô]PRD − [có ô tô hạng sang và vài người giúp việc]PRD” (He [had a house in a suburban area]PRD − [had a luxury car and several housekeepers]PRD).
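A rule such as HC5 can be checked mechanically once each constituent carries its category and syllable count. A toy sketch, where the dictionary representation of constituents is an assumption, not the thesis format:

```python
# Toy sketch of applying the constituent rule HC5 ("PP>=3 - C;[ANV]P"):
# a break is predicted between a prepositional phrase of >= 3 syllables
# and a conjunction or an adjective/noun/verb phrase.

def hc5_applies(left, right):
    """left/right: dicts with 'cat' (category) and 'syl' (syllable count)."""
    left_ok = left["cat"] == "PP" and left["syl"] >= 3
    right_ok = right["cat"] in ("C", "AP", "NP", "VP")
    return left_ok and right_ok

# "[trong tình yêu của riêng mình]PP - [và]C ..." (cf. the example above)
print(hc5_applies({"cat": "PP", "syl": 6}, {"cat": "C", "syl": 1}))
```

In the actual procedure, each rule firing assigns its break index to the boundary between the two constituents.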
Another instance of the dependent rules we found is rule HD5 (cf. Table B.3, Appendix B). HC2: S>=6− means that there is a boundary between a main clause having at least 6 syllables (S>=6) AND any constituent, e.g. “[Mùa thu lá vàng rơi đầy trên từng góc phố]S − còn mùa xuân thì hoa nở muôn nơi” ([In autumn yellow leaves fall down all over the streets]S − and in spring flowers bloom everywhere). HC3: .{1,}P>=7 − .{1,}P>=4 means that there is a boundary between a phrase having at least 7 syllables AND a phrase having at least 4 syllables, e.g. “Nhưng [kết cấu của


B.3. Syntactic rules

Table B.2 – Constituent syntactic rules

  Break level   Rule code   Formal representation
  4             HC1         SB; .{1,}(child:S|SB)−
  4             HC2         S>=6−
  3             HC3         .{1,}P>=7 − .{1,}P>=4
  3             HC4         .{1,}P − C+.{1,}>=5
  3             HC5         PP>=3 − C;[ANV]P
  3             HC6         …

HC4: .{1,}P − C+.{1,}>=5 means that there is a boundary between a phrase AND a conjunction followed by a constituent having at least 5 syllables (C+.{1,}>=5). E.g. “Cậu [luôn ý thức bảo vệ mẹ mình]VP − [và]C [thường xuyên giúp đỡ bạn bè trong lúc khó khăn]VP” (He [always consciously protects his mother]VP − [and]C [usually helps his friends in difficulties]VP). HC5: PP>=3 − C;[ANV]P means that there is a boundary between a prepositional phrase having at least 3 syllables (PP>=3) AND a conjunction (C) or an adjective/noun/verb phrase ([ANV]P). E.g. “Đó là kết quả của những buồn vui [trong tình yêu của riêng mình]PP − [và]C cả những tâm sự của khán giả dành cho tôi” (That is the result of the joys and sorrows [in their own love]PP − [and]C also of the confidings the audience gives to me). H>=4−H means that there is a boundary between a head element having at least 4 syllables AND a head element. E.g. “Quán này có rất nhiều món ngon như [Cơm Tám Mễ Trì]NP−H − [cơm rang các món]NP−H − [các món nhậu về cá]NP−H” (This restaurant has many delicious dishes, such as [the Tam Me Tri rice]NP−H − [various fried-rice dishes]NP−H − [fish dishes for drinking]NP−H). “Chúng ta thiếu những cán bộ [năng lực chuyên môn cao]NP−H − [có uy tín quốc tế]VP−H để tiến hành những nghiên cứu lớn” (We lack researchers with [high professional competence]NP−H − [good international standing]VP−H to carry out major research projects).

Test corpus examples (Appendix C, continued):

5. … – To whom does that responsibility belong? And the unions also argue that they simply oppose the capital owners for ignoring all safety rules at work.
6. Nhà này rộng bao nhiêu mét? – How big is this house?
7. Xa quá! – Too far!
8. Thế em đã biết nấu chưa? – Do you know how to cook?
9. Trong những ngày vui này còn có tiết mục mừng tuổi, chúc Tết – During these happy days, there are also activities such as giving lucky money and exchanging Tet wishes.
10. Lần sau chị mua nữa nhé! – Buy this next time, please!
11. Do vậy, muốn nâng được một trọng lượng nhất định thì chỉ còn cách là làm rộng hết cỡ có thể diện tích cánh máy bay để có được lực nâng cần thiết. – Therefore, to lift a certain weight, the only way is to enlarge the wing area as much as possible to obtain the necessary lifting force.
12. Bởi trước khi rời nước, anh chị đã ký khế ước với quốc gia Việt Nam là sẽ về nước phục vụ sau khi tốt nghiệp. – Because before leaving the country, you had signed an agreement with the Vietnamese government that you would serve the country after graduation.


C.4. Test corpus examples

Table C.3 – Test corpus examples of the Intelligibility test
[Phonetic transcriptions omitted: the IPA column of the original table was garbled in text extraction.]

1. Ông là ếch ngồi đáy giếng.
   You are a frog at the bottom of a well (one who sees no further than his nose).
2. Loanh quanh một lát, bọ muỗm đã mệt phờ.
   After wandering around for a while, the bush cricket was worn out.
3. Chúng mọc um tùm và trở thành nơi trú ngụ khá tốt cho loài chim này.
   They grew thickly and became quite a good shelter for this bird species.
4. Thân nỏ được làm bằng gỗ nghiến với dạng thớ mịn nhưng quánh, dẻo.
   The crossbow's body is made of nghiến wood, whose grain is fine yet dense and pliant.
5. Đội đã liên hệ để đưa hai liệt sĩ này về an táng ở nghĩa trang liệt sĩ quê nhà.
   The team made arrangements to bring these two martyrs home for burial in the martyrs' cemetery of their native village.
6. Đây là những yếu tố gây áp lực tăng giá đối với thị trường trong nước.
   These are factors that put upward pressure on prices in the domestic market.
7. Hai mắt Mừng tự dưng nhoè ướt.
   Mừng's eyes suddenly grew wet and blurred.
8. Triển lãm trưng bày theo bốn mảng đề tài Cổ vật Phật giáo, kỷ lục Phật giáo, mỹ thuật Phật giáo, ảnh nghệ thuật về Phật giáo.
   The exhibition is arranged into four themes: Buddhist antiques, Buddhist records, Buddhist fine art, and art photography on Buddhism.
9. Nó thò đầu ra - vẻ mặt phớn phở, say bứ bừ nói: "chúng nó phắn hết rồi".
   He popped his head out, looking elated and dead drunk, and said: "They have all cleared off."
10. Thấy thế, thằng bé bệu bạo khóc, nhe hàm răng đầy bựa mám mồi đầy ngún vào lưỡi câu rồi đưa cho hắn.
   Seeing this, the boy blubbered, baring teeth covered in plaque, then baited the hook and handed it to him.
11. Nó dịch cái bụng phề phệ đến chuồng làm bọn heo kêu ìn ịt.
   He shifted his potbelly over to the sty, making the pigs oink.
12. Diệt xong họ Trịnh, Nguyễn Huệ tới yết kiến vua Hiển Tông.
   Having destroyed the Trịnh clan, Nguyễn Huệ went for an audience with King Hiển Tông.
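In intelligibility tests of this kind, listeners typically transcribe what they hear, and each response is scored against the reference sentence. The sketch below is illustrative only (not the scoring code used in the thesis); it computes a syllable-level error rate, assuming that whitespace-separated tokens approximate Vietnamese syllables.

```python
# Hypothetical intelligibility scoring: syllable-level edit distance between a
# reference sentence and a listener's typed response, normalized by reference
# length. Written Vietnamese is syllable-per-token, so split() is a fair proxy.

def syllable_error_rate(reference: str, response: str) -> float:
    """Levenshtein distance over syllables, divided by the reference length."""
    ref = reference.lower().split()
    hyp = response.lower().split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

reference = "Ông là ếch ngồi đáy giếng"
response = "Ông là ếch ngồi đáy giếng"  # a perfect response scores 0.0
print(syllable_error_rate(reference, response))
```

Averaging this rate over all test sentences and listeners gives one simple aggregate intelligibility score per synthesis condition.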


Appendix C. VTED design, construction and perceptual evaluations

Table C.4 – Test corpus examples of the Pair-wise preference test using syntactic rules
[Phonetic transcriptions omitted: the IPA column of the original table was garbled in text extraction.]

1. Khách nước ngoài đến Việt Nam ăn Tết thường cảm thấy thú vị
   Foreign visitors who come to Vietnam for the Tết holiday often find it fascinating.
2. Những ca khúc của ông mang tính tự sự lắng đọng
   His songs are narrative and deeply reflective.
3. Tôi nhận ra rằng lúc mình đang ở đỉnh cao không bao giờ thiếu người quan tâm
   I realized that when I was at the top there was never any lack of people interested in me.
4. Những giao dịch đó có thực hay không thì chưa ai có thể trả lời một cách chính xác
   Whether those transactions are real or not, no one can yet answer with certainty.
5. Cháu đã đeo một chiếc nhẫn rất cứng vào dương vật khiến nó bị thít chặt lại
   The child had put a very hard ring on his penis, causing it to be constricted.
6. Mãnh có cha luôn say rượu đánh đập vợ con
   Mãnh has a father who is always drunk and beats his wife and children.
7. Thời tiết xấu máy bay không xuống được
   The plane could not land because of the bad weather.
8. Sự vươn lên và thành công của những người ấy làm tôi nể phục.
   The rise and success of those people fill me with admiration.
9. Cậu luôn ý thức bảo vệ mẹ mình và thường giúp đỡ bạn bè trong lúc khó khăn
   He is always mindful of protecting his mother and often helps his friends in hard times.
10. Tại thành phố Hồ Chí Minh có nhiều đường dây cá độ bóng đá rất chuyên nghiệp.
   In Ho Chi Minh City there are many highly professional football-betting rings.
11. Điều này gây ra cảm giác đau ở cẳng chân sưng nề ở mắt cá chân.
   This causes pain in the lower legs and swelling around the ankles.
12. Ngoài ra tại đây còn nhận tổ chức liên hoan sinh nhật đám cưới.
   In addition, the venue also hosts birthday parties and weddings.



Table C.5 – Test corpus examples of the Pair-wise preference test using syntactic blocks, links and POSs
[Phonetic transcriptions omitted: the IPA column of the original table was garbled in text extraction.]

1. Lời chúc càng tự nhiên chân thành càng được yêu thích
   The more natural and sincere the wishes, the better they are liked.
2. Cậu hay ăn hiếp những bạn yếu thế hơn mình và biết cách luồn lách sao cho có lợi về mình
   He often bullies weaker classmates and knows how to wriggle things to his own advantage.
3. Cấm lưu thông tự do thảo quả là chủ trương của nhà nước hay của một ngành nào đó mà dẫn đến nông nỗi này
   Is banning the free trade of cardamom a policy of the state, or of some particular sector, that it has led to this plight?
4. Cặp bài trùng này đã cướp trắng của đồng bọn 170kg thuốc phiện
   This inseparable pair robbed their accomplices of 170 kg of opium.
5. Một lần nữa hồ sơ tội ác của Lượng dầy thêm
   Once again, Lượng's record of crimes grew thicker.
6. Người thứ ba cũng có chút ít lỗi vô ý chính là bạn đấy Hiên ạ
   The third person, who also bears a small unintentional fault, is none other than you, Hiên.
7. Còn anh?
   And you?
8. Ông làm ơn chỉ cho chúng tôi trường tiểu học Giảng Võ được không?
   Could you please show us the way to Giảng Võ primary school?
9. Cấm lưu thông tự do thảo quả là chủ trương của nhà nước hay của một ngành nào đó mà dẫn đến nông nỗi này.
   Is banning the free trade of cardamom a policy of the state, or of some particular sector, that it has led to this plight?
10. Riêng hai hồ nước lớn sẽ là nơi truyền bá cho môn thể thao mới nhất Việt Nam: đi bộ trên nước.
   The two large lakes will be where the newest sport in Vietnam, walking on water, is introduced.
11. Học trò tôi rất đông nhưng mở lớp dậy chính thức thì tôi chưa làm được.
   I have a great many students, but I have not yet been able to open an official class.
12. Anh thử tìm trong cặp hay trong túi xem?
   Why not try looking in your briefcase or your bag?

The Syntax-Phonology Interface. In Goldsmith John, Riggle Jason, and Yu Alan C. L., editors, The Handbook of Phonological Theory, pages 435–484. WileyBlackwell, 2011. ISBN 9781444343069. Selkirk Elisabeth O. On prosodic structure and its relation to syntactic structure. Indiana University Linguistics Club, 1980. Shih Chilin and Kochanski Greg. Chinese tone modeling with stem-ml. In INTERSPEECH, pages 67–70, 2000. Silverman Kim, Beckman Mary, and Pierrehumbert . TOBI: A standard scheme for labeling prosody. In Proceedings of the Second International Conference on Spoken Language Processing, 1992. Socher Richard, Bauer John, Manning Christopher D., and Andrew Y. Ng. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 455–465, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.

Bibliography

277

Soukoreff R. William and MacKenzie I. Scott. Measuring errors in text entry tasks: an application of the Levenshtein string distance statistic. In Proceedings of the CHI ’01 Extended Abstracts on Human Factors in Computing Systems, CHI EA ’01, pages 319– 320, New York, USA, 2001. ACM. ISBN 1-58113-340-5. Tao Jianhua, Dong Honghui, and Zhao Sheng. Rule learning based Chinese prosodic phrase prediction. In 2003 International Conference on Natural Language Processing and Knowledge Engineering, 2003. Proceedings, pages 425–432, October 2003. doi: 10.1109/NLPKE.2003.1275944. Tao Jianhua, Liu Fangzhou, Zhang Meng, and Jia Huibin. Design of Speech Corpus for Mandarin Text to Speech. In The Blizzard Challenge 2008 workshop, 2008. Taylor Paul. Text-to-Speech Synthesis. Cambridge University Press, Cambridge, UK ; New York, 1 edition edition, March 2009. ISBN 9780521899277. Thompson Laurence C. A Vietnamese Reference Grammar. University of Hawaii Press, 1987. ISBN 9780824811174. Tokuda K., Zen Heiga, and Black A.W. An HMM-based speech synthesis system applied to English. In Proceedings of the 2002 IEEE Workshop on Speech Synthesis, pages 227–230, California, USA, 2002. Tran Ba Thien. Handbook on accessing information technology for the Blinds. Technical report, Van-Lang University, Ho Chi Minh, Vietnam, February 2007a. Tran Ba Thien. Survey on requirements of the Blinds to Voice of Southern Vietnam. Technical report, Van-Lang University, Ho Chi Minh, Vietnam, April 2013. Tran Do Dat. Synthèse de la parole à partir du texte en langue vietnamienne. PhD thesis, Grenoble, INPG, January 2007b. Tran Do Dat and Castelli Eric. Generation of F0 contours for Vietnamese speech synthesis. In Communications and Electronics (ICCE), 2010 Third International Conference on, pages 158–162. IEEE, 2010. Tran Do Dat, Castelli Eric, Serignat Jean-François, Trinh Van Loan, and Lê Xuan Hung. Influence of F0 on Vietnamese Syllable Perception. 
In 9th European Conference on Speech Communication and Technology (Interspeech 2005), pages 1697–1700, 2005. Tran Do-Dat, Castelli Eric, Serignat Jean-François, and Le Viet-Bac. Analysis and Modeling of Syllable Duration for Vietnamese Speech Synthesis. In O-COSCODA2007, 2007. Uraga Esmeralda and Gamboa César. VOXMEX Speech Database: design of a phonetically balanced corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, volume 46, pages 1–34, Portugal, 2004. doi: 10.1007/ s10579-011-9166-8. Valin Robert D. Van. An Introduction to Syntax. Cambridge University Press, April 2001. ISBN 9780521635660. Villaseñor-Pineda Luis, Gómez Manuel Montes-y, Vaufreydaz Dominique, and Serignat JeanFrançois. Experiments on the Construction of a Phonetically Balanced Corpus from the

278

Bibliography

Web. In Gelbukh Alexander, editor, Computational Linguistics and Intelligent Text Processing, number 2945 in Lecture Notes in Computer Science, pages 416–419. Springer Berlin Heidelberg, 2004. ISBN 978-3-540-21006-1, 978-3-540-24630-5. Vogel Irene, Tseng I-Ju Elanna, and Yap Ngee-Thai. Syllable structure in Vietnamese. In Proceedings of the second Theoretical East Asian Linguistic (TEAL) Workshop, Taiwan, 2004. Vu Hai Quan and Cao Xuan Nam. Phrase-based concatenation for Vietnamese TTS (in Vietnamese). Journal on Information, Technologies, anh Communications (Vietnamese), V-1(Projects on research, development and application of Information Technology), 2010. Vu Minh Quang, Trân Ðô Ðat, and Castelli Eric. Prosody of interrogative and affirmative sentences in vietnamese language: Analysis and perceptive results. In Ninth International Conference on Spoken Language Processing, 2006. Vu Ngoc Thang and Schultz Tanja. Vietnamese large vocabulary continuous speech recognition. In IEEE Workshop on Automatic Speech Recognition Understanding, 2009. ASRU 2009, pages 333–338, November 2009. doi: 10.1109/ASRU.2009.5373424. Vu Ngoc Thang and Schultz Tanja. Optimization on Vietnamese Large Vocabulary Speech Recognition. In The 2nd International Workshop on Spoken Language Technologies for Under-resourced Languages, Malaysia, 2010. Vu Tat Thang, Nguyen Tien Dung, and Luong Chi Mai. Vietnamese large vocabulary continuous speech recognition. In Proceeding of 9th European Conference on Speech Communication and Technology, pages 1689–1692, Portugal, September 2005. Vu Thang Tat, Luong Mai Chi, and Nakamura S. An HMM-based Vietnamese speech synthesis system. In Proceedings of the Oriental COCOSDA International Conference on Speech Database and Assessments, pages 116–121, Beijing, China, 2009. Wu Zhizheng, Valentini-Botinhao Cassia, Watts Oliver, and King Simon. Deep Neural Networks employing multi-task learning and stacked bottleneck features for speech synthesis. 
In Proceedings of the 40th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Australia, 2015. Würgler Simon. Implementation and evaluation of an hmm-based speech generation component for the svox tts system. Master’s thesis, Swiss Federal Institute of Technology Zurich, 2011. Yoshimura Takayoshi. Simultaneous Modeling of Phonetic and Prosodic Parameters, and Characteristic Conversion for HMM-Based Text-To-Speech Systems. PhD thesis, Nagoya Institute of Technology, Japan, 2002. Yoshimura Takayoshi, Tokuda Keiichi, Kobayashi Takao, Masuko Takashi, and Kitamura Tadashi. Simultaneous Modeling Of Spectrum, Pitch And Duration In HMM-Based Speech Synthesis. In Sixth European Conference on Speech Communication and Technology, Budapest, Hungary, September 1999. Yoshimura Takayoshi, Tokuda Keiichi, Masuko Takashi, Kobayashi Takao, and Kitamura Tadashi. Incorporating a mixed excitation model and postfilter into hmm-based text-tospeech synthesis. Systems and Computers in Japan, 36(12):43–50, 2005.

Bibliography

279

Yu Yansuo, Li Dongchen, and Wu Xihong. Prosodic modeling with rich syntactic context in HMM-based Mandarin speech synthesis. In 2013 IEEE China Summit International Conference on Signal and Information Processing (ChinaSIP), pages 132–136, July 2013. doi: 10.1109/ChinaSIP.2013.6625313. Yuan Jiahong, Shih Chilin, and Kochanski Greg P. Comparison of declarative and interrogative intonation in chinese. In Speech Prosody 2002, an International Conference, France, 2002. Zen Heiga, Nose Takashi, Yamagishi Junichi, Sako Shinji, Masuko Takashi, Black Alan, and Tokuda Keiichi. The HMM-based speech synthesis system (HTS) version 2.0. In 6th ISCA Workshop on Speech Synthesis, pages 294–299, Bonn, Germany, 2007. ISCA. Zen Heiga, Tokuda Keiichi, and Black Alan W. Statistical parametric speech synthesis. Speech Communication, 51(11):1039–1064, 2009. ISSN 0167-6393. Zen Heiga, Senior Andrew, and Schuster Mike. Statistical parametric speech synthesis using Deep Neural Network. In Proceedings of the 38th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 7962–7966, Canada, 2013. Zeng Xiao-Li, Martin Philippe, and Boulakia Georges. Tones and intonation in declarative and interrogative sentences in mandarin. In International Symposium on Tonal Aspects of Languages: With Emphasis on Tone Languages, Beijing, China, 2004. Zhang Yue and Clark Stephen. Transition-based parsing of the chinese treebank using a global discriminative model. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09), pages 162–171, Paris, France, October 2009. Association for Computational Linguistics. Zhiwei Shuang Shiyin Kang. Syllable HMM based Mandarin TTS and comparison with concatenative TTS. In 20th IAPR International Conference on Pattern Recognition, pages 1767–1770, Turkey, August 2010. Zhu Muhua, Zhang Yue, Chen Wenliang, Zhang Min, and Zhu Jingbo. Fast and accurate shiftreduce constituent parsing. 
In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 434–443, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. Zhu Weibin, Zhang Wei, Shi Qin, Chen Fangxin, Li Haiping, Ma Xijun, and Shen Liqin. Corpus building for data-driven TTS systems. In Proceedings of 2002 IEEE Workshop on Speech Synthesis, 2002, pages 199–202, September 2002. doi: 10.1109/WSS.2002.1224408.

NGUYEN Thi Thu Trang HMM-based Vietnamese Text-To-Speech: Prosodic Phrasing Modeling, Corpus Design, System Design, and Evaluation

Abstract
Keywords: speech synthesis, text-to-speech, Vietnamese, tonal language, prosodic phrasing modeling.

The objective of this thesis is to design and build a high-quality Hidden Markov Model (HMM) based Text-To-Speech (TTS) system for Vietnamese, a tonal language. The system is called VTED (Vietnamese TExt-to-speech Development system). In view of the great importance of lexical tones, a "tonophone" (an allophone in a tonal context) was proposed as a new speech unit in our TTS system. A new training corpus, VDTS (Vietnamese Di-Tonophone Speech corpus), was designed for 100% coverage of di-phones in tonal contexts (i.e. di-tonophones), selected from a large raw text with a greedy algorithm. A total of about 4,000 VDTS sentences were recorded and pre-processed as the training corpus of VTED.

In HMM-based speech synthesis, although pause duration can be modeled as a phoneme, the appearance of pauses cannot be predicted by the HMMs themselves, and phrasing levels above the word may not be completely modeled with basic features. This research therefore aimed at automatic prosodic phrasing for Vietnamese TTS using durational cues alone, as it appeared too difficult to disentangle intonation from lexical tones. Syntactic blocks, i.e. syntactic phrases with a bounded number of syllables (n), were proposed for predicting final lengthening (n = 6) and pause appearance (n = 10). Final-lengthening prediction was further improved by strategies for grouping single syntactic blocks. The predictive J48 decision-tree model for pause appearance, using syntactic blocks combined with syntactic-link and POS (Part-Of-Speech) features, reached an F-score of 81.4% (Precision = 87.6%, Recall = 75.9%), much better than models using only POS (F-score = 43.6%) or syntactic links (F-score = 52.6%) alone.

The system architecture was proposed on the basis of the core architecture of HTS, extended with a Natural Language Processing part for Vietnamese. Pause appearance was predicted by the proposed model. The contextual feature set included phone identity features, locational features, tone-related features, and prosodic features (i.e. POS, final lengthening, break levels). Mary TTS was chosen as the platform for implementing VTED.

In the MOS (Mean Opinion Score) test, the first VTED, trained with the old corpus and basic features, was rated rather well: 0.81 points (on a 5-point MOS scale) higher than the previous system, HoaSung (non-uniform unit selection with the same training corpus), but still 1.2–1.5 points below natural speech. The final VTED, trained with the new corpus and the prosodic phrasing model, improved by about 1.04 points over the first VTED, and its gap with natural speech was much reduced. In the tone intelligibility test, the final VTED reached a high correct rate of 95.4%, only 2.6% below natural speech and 18% above the initial system. In the intelligibility test with the Latin-square design, the error rate of the first VTED was about 6–12% higher than that of natural speech, depending on the syllable, tone, or phone level; the final system diverged from natural speech by only about 0.4–1.4%.
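The pause-prediction scores quoted above are related by the standard F-measure, the harmonic mean of precision and recall. A minimal sketch in plain Python, using the reported figures (the 0.1-point difference from the quoted 81.4% comes from rounding of the underlying counts):

```python
def f_score(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Figures reported for the J48 pause-appearance model
precision, recall = 0.876, 0.759
print(f"F-score = {f_score(precision, recall):.1%}")  # → F-score = 81.3%
```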

Résumé
Keywords: speech synthesis, text-to-speech, Vietnamese, tonal language, prosodic phrasing modeling.

The objective of this thesis is to design and build a high-quality Text-To-Speech (TTS) system based on HMMs (Hidden Markov Models) for Vietnamese, a tonal language. The system is called VTED (Vietnamese TExt-to-speech Development system). In view of the great importance of lexical tones, a "tonophone" (an allophone in a tonal context) was proposed as the new speech unit of our TTS system. A new training corpus, VDTS (Vietnamese Di-Tonophone Speech corpus), was designed from a large raw text for 100% coverage of di-phones in tonal contexts (di-tonophones), using a greedy algorithm. A total of about 4,000 sentences were recorded and pre-processed as the VTED training corpus.

In HMM-based speech synthesis, although pause duration can be modeled as a phoneme, the appearance of pauses cannot be predicted by the HMMs, and phrasing levels cannot be completely modeled with basic features. This research aims at automatic segmentation into intonational groups using durational cues alone. Syntactic blocks, i.e. syntactic phrases with a bounded number of syllables (n), were proposed to predict final lengthening (n = 6) and pause appearance (n = 10). Final-lengthening prediction was improved by strategies for grouping single syntactic blocks. The predictive J48 decision-tree model for pause appearance, using syntactic blocks combined with syntactic-link and POS (Part-Of-Speech) features, reached an F-score of 81.4% (Precision = 87.6%, Recall = 75.9%), much better than the model with only POS (F-score = 43.6%) or syntactic links (F-score = 52.6%).

The system architecture was proposed on the basis of the HTS architecture, extended with a Natural Language Processing part for Vietnamese. Pause appearance was predicted by the proposed model. Contextual features included "tonophone" identity features, locational features, tone-related features, and prosodic features (POS, final lengthening, break levels). Mary TTS was chosen as the platform for implementing VTED. In the MOS (Mean Opinion Score) test, the first VTED, trained with the old corpus and basic features, was rated rather well: 0.81 points (on a 5-point MOS scale) higher than the previous system, HoaSung (which uses non-uniform unit selection with the same corpus), but still 1.2–1.5 points below natural speech. The quality of the final VTED, with the new corpus and the prosodic phrasing model, improved by about 1.04 points over the first VTED, and its gap with natural speech was markedly reduced. In the intelligibility test, the final VTED obtained a high correct rate of 95.4%, only 2.6% below natural speech and 18% above the first system. The error rate of the first VTED in the general intelligibility test with the Latin-square design was about 6–12% higher than that of natural speech, depending on the syllable, tone, or phone level. The final system diverged from natural speech by only 0.4–1.4%.
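The greedy corpus-design step described in the abstract (repeatedly picking the sentence that contributes the most still-uncovered units, until no sentence adds anything new) can be sketched as follows. This is an illustrative reconstruction, not the actual VDTS implementation; the toy `units` function at the end, which uses letter bigrams as stand-ins for di-tonophones, is hypothetical.

```python
def greedy_select(sentences, units_of):
    """Greedily pick sentences until no candidate adds new units.

    sentences: iterable of sentences (any hashable items)
    units_of:  function mapping a sentence to its set of units
               (e.g. the di-tonophones it contains)
    Returns (chosen sentences in selection order, covered unit set).
    """
    covered, chosen = set(), []
    remaining = list(sentences)
    while remaining:
        # Pick the sentence contributing the most still-uncovered units.
        best = max(remaining, key=lambda s: len(units_of(s) - covered))
        gain = units_of(best) - covered
        if not gain:  # no sentence adds anything new: coverage is complete
            break
        covered |= gain
        chosen.append(best)
        remaining.remove(best)
    return chosen, covered

# Toy illustration: letter bigrams stand in for di-tonophones.
units = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
chosen, covered = greedy_select(["abab", "abcd", "cdef"], units)
```

For a real corpus, `units_of` would be the output of the Vietnamese grapheme-to-tonophone front end, and the candidate pool the sentences of the raw text collection.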
