EE 519: Speech Recognition and Processing for Multimedia

Posted 2021-02-15 nighta


Out: Mar 30 2019
Due: Apr 13 2019
EE 519: Speech Recognition and Processing for Multimedia
Spring 2019
Homework 5
There are 2 problems in this homework, each with several questions. Please show your work in detail for
each question. Answers without justification may not receive credit.
Return:
A PDF file with your solutions, including any requested plots and any code you wrote (in an Appendix at the
end of the file).
Your Matlab source files, which we should be able to run.
Upload all your files to the corresponding dropbox on the D2L platform. If you make multiple submissions,
only the last one will be evaluated.
Problem 1
Solve questions (a) and (b) of this problem either by hand or by implementing the Viterbi algorithm in Matlab.
Consider a 5-state HMM with parameters λ used to model a sequence of voiced (V) and unvoiced (U) sounds. Assume
equal transition probabilities and equal initial state probabilities:
$$a_{ij} = 0.2, \quad i, j = 1, 2, 3, 4, 5 \qquad\qquad \pi_i = 0.2, \quad \forall i = 1, 2, 3, 4, 5$$
The observation probabilities are
        State 1   State 2   State 3   State 4   State 5
P(V)    0.2       0.6       0.8       0.25      0.2
P(U)    0.8       0.4       0.2       0.75      0.8
We observe the sequence O = VVUUUVVVUU.
(a) What is the most probable state sequence Q*?
(b) What is the joint probability of the observation sequence and the most probable state sequence, P = P(O, Q* | λ)?
(c) What is the probability that the observation sequence was generated only by State 1?
(d) What is the average number of consecutive steps the model spends in a given HMM state? Note that at every
"step" the model emits a single observation.
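Questions (a), (b), and (d) can be checked numerically. The following is a minimal Python sketch, not the Matlab implementation the handout asks for; the function names `viterbi` and `expected_duration` are illustrative. It assumes a log-domain Viterbi recursion with ties broken toward the lowest state index, and it uses the standard result that the time spent in a state with self-transition probability a_ii is geometrically distributed with mean 1/(1 - a_ii).

```python
import math

def viterbi(obs, pi, A, B):
    """Return (best_path, log_joint) for a discrete-emission HMM.

    obs: list of symbols; pi[i]: initial probabilities; A[i][j]: transition
    probabilities; B[i][symbol]: emission probabilities. States are 0-indexed.
    """
    n = len(pi)
    # delta[i] = best log-probability of any partial path ending in state i
    delta = [math.log(pi[i]) + math.log(B[i][obs[0]]) for i in range(n)]
    backptr = []
    for symbol in obs[1:]:
        prev, delta, psi = delta, [], []
        for j in range(n):
            # max() returns the first maximiser, so ties go to the lowest index.
            best_i = max(range(n), key=lambda i: prev[i] + math.log(A[i][j]))
            psi.append(best_i)
            delta.append(prev[best_i] + math.log(A[best_i][j])
                         + math.log(B[j][symbol]))
        backptr.append(psi)
    # Backtrack from the best final state.
    state = max(range(n), key=lambda i: delta[i])
    log_joint = delta[state]
    path = [state]
    for psi in reversed(backptr):
        state = psi[state]
        path.append(state)
    path.reverse()
    return path, log_joint

def expected_duration(a_ii):
    """Mean number of consecutive steps in a state with self-loop prob a_ii."""
    return 1.0 / (1.0 - a_ii)

# Parameters of Problem 1.
pi = [0.2] * 5
A = [[0.2] * 5 for _ in range(5)]
B = [{"V": 0.2, "U": 0.8}, {"V": 0.6, "U": 0.4}, {"V": 0.8, "U": 0.2},
     {"V": 0.25, "U": 0.75}, {"V": 0.2, "U": 0.8}]
O = list("VVUUUVVVUU")

path, log_joint = viterbi(O, pi, A, B)
states_1_indexed = [s + 1 for s in path]
print(states_1_indexed, math.exp(log_joint), expected_duration(0.2))
```

States 1 and 5 have identical emission probabilities, so the most probable sequence is not unique; this sketch reports the variant that uses State 1 wherever the two tie.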
Problem 2
For this problem, you will implement an ASR system for the recognition of isolated spoken digits following the HMM/GMM
paradigm. To that end, you will use the same dataset you used in HW4, together with the MFCCs you
created. For your convenience, the MFCCs are provided in the file mfccs_hw4.mat, but you are welcome to use
your own features from HW4. In mfccs_hw4.mat you will find four objects, each one containing the recordings of
one speaker. For example, mfcc_jackson is a cell array where mfcc_jackson{i}{j} is a 2D array with the MFCCs
of Jackson's j-th recording of the digit i−1. You will need to download the Matlab toolboxes found at
https://www.cs.ubc.ca/~murphyk/Software/HMM/hmm_download.html and add them to your path.
Note: A few functions in those toolboxes have the same names as built-in Matlab functions. This
may cause problems if you want to use the corresponding built-in function. For example, to use the built-in function
assert, you can simply rename the corresponding function in KPMtools to assert2.m.
You will run three experiments:
(i) Speaker-dependent ASR (single speaker): You will randomly partition Jackson's recordings into training (70%)
and test (30%) sets, keeping them balanced in terms of the different classes (digits). You will train the ASR system
using your training set and evaluate it on the test set.
(ii) Speaker-dependent ASR (multiple speakers): You will randomly partition all the recordings into training (70%)
and test (30%) sets, keeping them balanced in terms of the different classes (digits). You will train the ASR system
using your training set and evaluate it on the test set.
(iii) Speaker-independent ASR: You will use all of Jackson's recordings as your test set and all the other recordings as
your training set (leave-one-speaker-out scenario). You will train the ASR system using your training set and evaluate
it on the test set.
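One way to keep the 70/30 partition of experiments (i) and (ii) balanced across digits is to shuffle and split each digit's recordings separately. The sketch below is in Python rather than Matlab and uses only the standard library; the function name `stratified_split`, the dictionary layout, and the fixed seed are illustrative assumptions, not part of the handout.

```python
import random

def stratified_split(recordings_by_digit, train_fraction=0.7, seed=0):
    """Split each digit's recordings separately so both sets stay balanced.

    recordings_by_digit: dict mapping digit -> list of recordings.
    Returns two lists of (digit, recording) pairs: train and test.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for digit, recordings in recordings_by_digit.items():
        shuffled = list(recordings)
        rng.shuffle(shuffled)
        n_train = round(train_fraction * len(shuffled))
        train += [(digit, r) for r in shuffled[:n_train]]
        test += [(digit, r) for r in shuffled[n_train:]]
    return train, test

# Toy stand-in data: 10 placeholder recordings per digit (names are made up).
toy = {d: [f"rec_{d}_{k}" for k in range(10)] for d in range(10)}
train_set, test_set = stratified_split(toy)
print(len(train_set), len(test_set))
```

For experiment (iii) no random split is needed, since the test set is defined by the speaker.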
PART 1: Initialization
(a) Because of the limited vocabulary (only 10 words), we can use entire words, instead of phonemes or triphones,
as the fundamental linguistic units modeled by HMMs. So, 10 models need to be trained,
each one with N_d states, with d = 0, 1, ..., 9, corresponding to the digits 0, 1, ..., 9. Because the words differ
in acoustic complexity, N_d will not be the same for all d. For example, the word five sounds more complex than the
word two, so we will use more states to model the former. Using the CMU Pronouncing Dictionary [1], find the number
of phonemes that each word of your vocabulary is composed of. For example, the phonetic representation of the word
six is S IH K S, so the word is composed of 4 phonemes. Use 4 states for words with one phoneme (if such a
word exists in your vocabulary). For each additional phoneme, add one state. For example, the word six should be
modeled by an HMM with 4+3=7 states, so N_6 = 7. Report the values of N_d, d = 0, 1, ..., 9.
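The state-count rule above amounts to N_d = 4 + (number of phonemes − 1). A short Python sketch of the rule follows. Only the pronunciation of six comes from the handout; the other phoneme strings are written from memory of the dictionary's first listed pronunciation (stress markers dropped) and are an assumption that should be verified against the dictionary before use.

```python
# Phoneme strings: only "six" is given in the handout. The rest are
# ASSUMED from memory and must be checked against the CMU dictionary.
PRONUNCIATIONS = {
    0: ("zero", "Z IY R OW"),
    1: ("one", "W AH N"),
    2: ("two", "T UW"),
    3: ("three", "TH R IY"),
    4: ("four", "F AO R"),
    5: ("five", "F AY V"),
    6: ("six", "S IH K S"),
    7: ("seven", "S EH V AH N"),
    8: ("eight", "EY T"),
    9: ("nine", "N AY N"),
}

def num_states(phonemes, base_states=4):
    """4 states for a one-phoneme word, plus one state per extra phoneme."""
    return base_states + len(phonemes.split()) - 1

states = {d: num_states(p) for d, (_, p) in PRONUNCIATIONS.items()}
for d, (word, phonemes) in PRONUNCIATIONS.items():
    print(d, word, phonemes, states[d])
```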
(b) You will use left-right, linear HMMs, like the one shown in Figure 1.
Figure 1: 5-state linear HMM.
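Given this topology, what is a reasonable initialization of the transition matrix
$$\begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{bmatrix}$$
for a 4-state HMM? What about a 6-state HMM? For the initial state probabilities, you will use
$$\pi_i = \begin{cases} 1, & i = 1 \ \text{(1st state)} \\ 0, & \text{otherwise} \end{cases}$$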
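In a left-right, linear topology each state can only stay where it is or advance to the next state, so every other entry of the transition matrix is zero. The Python sketch below builds one such initialization; the 0.5/0.5 split between staying and advancing is an assumed starting value (EM re-estimates it), and the function names are illustrative.

```python
def left_right_transmat(n_states, stay_prob=0.5):
    """Transition matrix for a linear left-right HMM.

    Each state either stays (stay_prob) or moves to the next state;
    the last state only loops on itself. stay_prob = 0.5 is an assumption.
    """
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        A[i][i] = stay_prob
        A[i][i + 1] = 1.0 - stay_prob
    A[-1][-1] = 1.0
    return A

def initial_probs(n_states):
    """All initial probability mass on the first state."""
    return [1.0] + [0.0] * (n_states - 1)

A4 = left_right_transmat(4)
A6 = left_right_transmat(6)
for row in A4:
    print(row)
print(initial_probs(4))
```

Entries initialized to zero remain zero under Baum-Welch re-estimation, which is why the initialization fixes the topology.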
(c) A possible observation at a given state of the HMM is a vector of 13 MFCCs. The probability of an observation
at a given state will be modeled by a Gaussian Mixture Model (GMM) with three 13-dimensional Gaussians. In order
to initialize the models, you will need a set of data from which the sample means and covariances will be estimated.
Because of the topology used, it is not reasonable to assume that every feature vector of a recording is equally
likely to have been generated by any state of the HMM. Instead, the first few MFCC vectors (first few frames) are
expected to be generated by the first state of the HMM, the subsequent few vectors by the second state, etc. A
rather crude first estimate is to assume that each recording of the digit d is uniformly partitioned into N_d (almost)
equal parts, where each part has about N_f / N_d MFCC vectors (frames). Remember from the notation used in HW4
that N_f is the number of frames of a particular recording. Following this procedure for all the training recordings
and all the digits, you will have the data samples needed to initialize the distributions modeling all the
states of the 10 HMMs. Report the number of data samples you got through this process for the states 1, 2, ..., 7 of
the HMM corresponding to the digit 6. Report those 7 numbers for the speaker-independent setting. What about the
digit 3? Using a diagonal covariance matrix for each Gaussian, at this point you have everything you need to initialize
your models with the function mixgauss_init.
Note: If you cannot uniformly partition your data as described, you can instead use all the available
training data for a particular digit to initialize all of its states and continue to the next steps. That way, you may even
get a higher accuracy!
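The uniform partitioning described in (c) can be sketched as follows in Python. The sketch assumes a recording is stored as a list of frames (one MFCC vector per frame); the actual orientation of the arrays in the .mat file is not stated in the handout, and the function name `uniform_segments` is illustrative.

```python
def uniform_segments(frames, n_states):
    """Split a recording's frames into n_states contiguous, near-equal parts.

    Segment k is intended as initialization data for state k + 1.
    Integer boundaries k * N_f // N_d make segment sizes differ by at most 1.
    """
    n_frames = len(frames)
    bounds = [k * n_frames // n_states for k in range(n_states + 1)]
    return [frames[bounds[k]:bounds[k + 1]] for k in range(n_states)]

# Toy example: 30 placeholder frames split over the 7 states of the "six" HMM.
toy_frames = list(range(30))
segments = uniform_segments(toy_frames, 7)
sizes = [len(s) for s in segments]
print(sizes)
```

Pooling segment k over all training recordings of a digit gives the sample set for state k + 1 of that digit's HMM.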
[1] http://www.speech.cs.cmu.edu/cgi-bin/cmudict?
PART 2: Training
The parameters of the models are obtained as their Maximum Likelihood estimates. Since a closed-form solution
does not exist, in practice this is done through the Expectation-Maximization (EM) algorithm. Use the function mhmm_em
for that purpose, giving the right inputs, as explained in the comments of the function. You can leave the arguments
thresh and max_iter at their default values. Continue using diagonal covariance matrices. Running this function for
each HMM, you will obtain the log likelihoods, as well as the trained parameters of the HMMs and the GMMs.
(a) Plot, on the same graph, the log likelihood as a function of the iteration number (of the EM algorithm) for the
10 models for the speaker-independent setting.
(b) Here, you will verify that the GMMs have been suitably trained to model the true underlying distribution
of the training data in the 13-dimensional space of MFCCs. To do so, use the Viterbi algorithm to find the most probable
state sequence for all the recordings of the training set. For that step, you will need the functions mixgauss_prob and
viterbi_path. After this, for the digit 6, collect all the feature vectors (MFCCs) which have been mapped to the
i-th state of the corresponding HMM. Plot the histogram of the 2nd MFCC for the 3rd state of the corresponding
HMM. Additionally, plot the probability distribution corresponding to that state, as modeled by the trained GMM.
Of course, the GMMs you have used are composed of 13-dimensional Gaussians. How can you generate your plot
only for the 2nd MFCC (1-dimensional)? For your plot, you can implement your own functions or you can use the
Plot_GM function found here: https://www.mathworks.com/matlabcentral/fileexchange/8793-plot_gm. Your
plots should correspond to the speaker-independent setting. Does this probability distribution represent the underlying
sample distribution as visualized by the histogram?
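One possible approach to the 1-dimensional plot relies on the fact that the marginal of a multivariate Gaussian over one coordinate is a univariate Gaussian with that coordinate's mean and variance. The marginal of the GMM is therefore a 1-D mixture with the same weights. The Python sketch below evaluates such a marginal density; the parameter values are made up for illustration (and 2-dimensional for brevity), and the function name is hypothetical.

```python
import math

def gmm_marginal_pdf(x, weights, means, variances, dim):
    """Density at x of a diagonal-covariance GMM marginalised onto one dimension.

    weights[k]: mixture weight; means[k][dim] and variances[k][dim]: mean and
    variance of component k along the chosen dimension (0-indexed).
    """
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        m, v = mu[dim], var[dim]
        total += w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
    return total

# Made-up parameters for a 3-component GMM (NOT trained values).
weights = [0.5, 0.3, 0.2]
means = [[0.0, -1.0], [1.0, 2.0], [3.0, 0.5]]
variances = [[1.0, 0.5], [2.0, 1.5], [0.5, 1.0]]

# Evaluate the marginal over the 2nd coordinate (index 1) on a grid.
step = 0.01
grid = [-15.0 + step * k for k in range(3001)]
density = [gmm_marginal_pdf(x, weights, means, variances, dim=1) for x in grid]
area = sum(density) * step  # Riemann-sum check that the curve integrates to ~1
print(area)
```

To overlay this curve on a histogram of the collected samples, the histogram has to be normalized to unit area so that both are on the same scale.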
PART 3: Evaluation
(a) For each recording in the test set, find the probability that the sequence of observations (MFCCs) of that
recording was generated by each one of the 10 trained HMMs, using the function mhmm_logprob. Assign to the recording
the digit corresponding to the HMM which gave the maximum probability. For the three experiments (i), (ii), (iii),
report the accuracy of your system.
$$\text{Accuracy} = \frac{\text{number of correctly classified recordings}}{\text{total number of recordings}}$$
Additionally, for the experiment (iii), report the F1 score for each one of the 10 classes (digits).
$$F_1(d) = 2 \, \frac{\text{Precision}(d) \cdot \text{Recall}(d)}{\text{Precision}(d) + \text{Recall}(d)}$$
$$\text{Precision}(d) = \frac{\text{number of correctly classified recordings in class } d}{\text{number of recordings classified in class } d}$$
$$\text{Recall}(d) = \frac{\text{number of correctly classified recordings in class } d}{\text{total number of recordings in class } d}$$
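The metrics defined above can be computed directly from the lists of true and predicted digits. The following Python sketch implements them as written; returning 0 when a denominator is zero (for example, when no recording is assigned to a class) is an assumed convention that the handout does not specify, and the toy labels are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of recordings whose predicted digit matches the true digit."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1_per_class(y_true, y_pred, classes):
    """Per-class F1 from precision and recall as defined in the text."""
    scores = {}
    for d in classes:
        tp = sum(t == d and p == d for t, p in zip(y_true, y_pred))
        n_pred = sum(p == d for p in y_pred)   # recordings classified as d
        n_true = sum(t == d for t in y_true)   # recordings truly in class d
        # ASSUMPTION: undefined ratios (zero denominators) are reported as 0.
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true if n_true else 0.0
        denom = precision + recall
        scores[d] = 2 * precision * recall / denom if denom else 0.0
    return scores

# Toy labels for three classes (made up for illustration).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
f1 = f1_per_class(y_true, y_pred, classes=[0, 1, 2])
print(acc, f1)
```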
(b) (extra credit) For the experiment (iii), pick two digits d and d′. For d, pick 3 correctly classified recordings and
plot the most probable trajectory of the observations (MFCCs) in the space of HMM states (for the HMM modeling
d). In other words, your x-axis should be time (or frame index) and the y-axis should be the index of the HMM state
(1, 2, ...). To do so, you will need the functions mixgauss_prob and viterbi_path, as in Part 2, Question (b). Do
the same for 3 recordings of d′. Now, pick one of the 3 recordings of d and plot the most probable trajectory of the
observations (MFCCs) in the space of HMM states for the HMM modeling d′. (This question asks for 3+3+1=7 plots
in total.)
PART 4: Architecture and Parameter Tuning - Optional
If you want, try different parameters (e.g., number of states, number of Gaussians, full vs. diagonal covariance matrices,
ergodic vs. linear HMMs, etc.) and report the accuracy you obtained with your best configuration.
