Implementing
Speaker Recognition
Chase Zhou
Physics 406 - 11 May 2015
Introduction
Machines have come to replace much of human labor. They are faster, stronger, and
more consistent than any person, and they have surpassed us in most measurable ways.
Yet some of the most challenging problems facing modern computing involve making
machines more like us. For all their calculating power, computers struggle with tasks
humans find basic, such as identifying pictures and voices; special models and algorithms
must be designed for them to do so. For this project, I attempted to train a computer to
identify the speaker of a recorded utterance.
Algorithm
In order for the computer to recognize speech patterns, we must first transform the audio
files into something the computer can learn from. The most common and most effective
such transformation turns the sound file into a table of Mel-frequency cepstral
coefficients (MFCCs). These coefficients describe the power spectrum of the sound on the
Mel scale, and it is the computer's task to figure out which speaker is which from these numbers.
The first step in this transformation is to divide the sound file into short frames of around
20 ms in length, which splits each ~2 second sound file into roughly 100 frames. Over such a
short frame, the signal is approximately constant. We then pass each frame through a
windowing function to smooth the discontinuities at the beginning and end of each frame.
Many different window functions can be used, but the most common, and the one I used,
is the Hamming window.
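As a rough illustration, a minimal MATLAB sketch of this framing and windowing step might look like the following. The frame length, overlap, and function name are illustrative choices, not necessarily the exact parameters used in the project (the hamming function comes from the Signal Processing Toolbox).

function frames = frameSignal(x, fs)
    % Split a mono signal into overlapping, Hamming-windowed frames.
    % x: vector of samples; fs: sampling rate in Hz.
    x        = x(:);                          % force column orientation
    frameLen = round(0.020 * fs);             % ~20 ms frames
    hopLen   = round(0.010 * fs);             % 50% overlap (a common choice)
    win      = hamming(frameLen);             % Hamming window
    nFrames  = 1 + floor((length(x) - frameLen) / hopLen);
    frames   = zeros(frameLen, nFrames);
    for k = 1:nFrames
        start = (k - 1) * hopLen;
        frames(:, k) = x(start + 1 : start + frameLen) .* win;  % window each frame
    end
end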
For each of these frames, we must find its cepstral coefficients. Ordinary WAV files
store sound by measuring the amplitude of the signal at a fixed sampling rate. Taking the
Fourier transform of each frame gives us its frequency-domain representation. We
then pass these frequencies through a filter bank.
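Continuing the sketch above, and assuming frames is the matrix returned by frameSignal, the magnitude spectrum of each frame could be computed as follows; the FFT-length choice is an assumption for illustration.

% Magnitude spectrum of each windowed frame (columns of `frames`).
nfft    = 2^nextpow2(size(frames, 1));   % FFT length: next power of two above the frame length
spectra = abs(fft(frames, nfft));        % FFT applied column by column
spectra = spectra(1:nfft/2 + 1, :);      % keep only the non-negative frequencies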
The Mel-scale filter bank is composed of triangular band-pass filters of equal width on the
Mel scale. The Mel scale was developed in 1937 as a way to measure frequencies based on
how people perceive their pitch. Humans do not perceive pitch as a linear function of
frequency; the relationship is closer to logarithmic. The most commonly used conversion
from frequency f (in Hz) to mels is m = 2595 * log10(1 + f/700). Each filter corresponds to
one mel-frequency coefficient, and the magnitude of the signal passed through that filter
is the value of the coefficient. The result for each frame is an n-dimensional vector,
where n is the number of mel-frequency coefficients we choose to compute. After processing
every frame, we have an array of such n-dimensional vectors. Now the
machine must learn to differentiate the speakers based on these arrays.
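A hedged sketch of such a triangular Mel filter bank, continuing from the magnitude spectra computed above (spectra, nfft, and the sampling rate fs), is shown below. The number of filters is an assumed value, not necessarily the one used in the project.

% Triangular, Mel-spaced filter bank applied to the magnitude spectra.
nFilters = 20;                                % number of mel coefficients (assumed)
hz2mel = @(f) 2595 * log10(1 + f / 700);      % Hz -> mel
mel2hz = @(m) 700 * (10.^(m / 2595) - 1);     % mel -> Hz
edges  = mel2hz(linspace(hz2mel(0), hz2mel(fs/2), nFilters + 2));
bins   = floor((nfft + 1) * edges / fs) + 1;  % FFT bin of each filter edge

fbank = zeros(nFilters, nfft/2 + 1);
for m = 1:nFilters
    for k = bins(m):bins(m+1)                 % rising edge of triangle m
        fbank(m, k) = (k - bins(m)) / max(bins(m+1) - bins(m), 1);
    end
    for k = bins(m+1):bins(m+2)               % falling edge of triangle m
        fbank(m, k) = (bins(m+2) - k) / max(bins(m+2) - bins(m+1), 1);
    end
end
coeffs = fbank * spectra;                     % one nFilters-long vector per frame

Note that full MFCC implementations, such as the HTK-style mfcc.m listed in the bibliography, typically also take the logarithm of these filter outputs and apply a discrete cosine transform; this sketch stops at the filter outputs described above.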
At this point, many machine learning techniques can be used to distinguish speakers
based on their tables of MFCCs. The one I chose is called vector quantization; based on
my research, it seemed to be the most effective and easiest to implement of the candidate
learning algorithms. The idea is to treat each n-dimensional vector from each frame as a
point in n-dimensional space and then group these points into k clusters for some
number k of our choosing.
I used the Linde-Buzo-Gray (LBG) algorithm to determine the cluster centers. For each
speaker, take the array of MFCCs and find the centroid of all the points by averaging them;
this point is the first cluster center. We then split this cluster center into two new
centers: if X is the vector representing the current center, we define X_1 = X(1 - e) and
X_2 = X(1 + e) for some small e of our choosing. We then go through all the vectors again
and assign each one to the closest cluster center, so that every vector in the array belongs
to one of the two centers. For each cluster center, we recalculate its position as the mean
of the vectors assigned to it. These new cluster centers are then split again into four
centers, and the process of splitting and recomputing means is repeated until the specified
number of cluster centers is reached. The result is a collection of cluster centers called
a "codebook." This codebook represents the way a speaker "sounds" and is ultimately the
tool used to decide which speaker a new speech file belongs to.
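The following is a minimal MATLAB sketch of this splitting procedure, not the exact code used in the project (the project used a MathWorks File Exchange implementation). The split factor, iteration count, and function name are assumptions for illustration.

function codebook = lbgCodebook(vectors, k)
    % LBG codebook training.  `vectors` is an n-by-T matrix of MFCC vectors
    % (one column per frame); k is the desired number of centers (a power of 2).
    e = 0.01;                                      % split perturbation (assumed)
    codebook = mean(vectors, 2);                   % first center: global mean
    while size(codebook, 2) < k
        codebook = [codebook * (1 + e), codebook * (1 - e)];   % split each center
        for iter = 1:10                            % a few refinement passes (assumed)
            % Assign every vector to its nearest center (squared Euclidean distance).
            nCenters = size(codebook, 2);
            d = zeros(nCenters, size(vectors, 2));
            for c = 1:nCenters
                diffs   = bsxfun(@minus, vectors, codebook(:, c));
                d(c, :) = sum(diffs.^2, 1);
            end
            [~, nearest] = min(d, [], 1);
            % Move each center to the mean of the vectors assigned to it.
            for c = 1:nCenters
                members = vectors(:, nearest == c);
                if ~isempty(members)
                    codebook(:, c) = mean(members, 2);
                end
            end
        end
    end
end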
After generating a codebook for each speaker, classifying new sounds is straightforward. We
first generate the MFCCs for the new sound file the same way we generated them for
training; it is important to use the same windowing function, frame length, and number of
cepstral coefficients so that the new MFCCs are comparable to those from the training data.
For each MFCC vector of the new sound file, we calculate its distance to the nearest
cluster center in each codebook, and we sum these distances over all the vectors of the
new recording. The codebook with the smallest cumulative distance identifies the speaker
we choose.
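A sketch of this classification step in MATLAB, assuming codebooks built as above, might look like the following; the variable and function names are illustrative.

function [speaker, totalDist] = classifySpeaker(testVecs, codebooks)
    % Pick the speaker whose codebook gives the smallest cumulative distance
    % to the test MFCC vectors.  `testVecs` is n-by-T (one column per frame);
    % `codebooks` is a cell array with one codebook matrix per speaker.
    totalDist = zeros(1, numel(codebooks));
    for s = 1:numel(codebooks)
        cb = codebooks{s};
        for t = 1:size(testVecs, 2)
            % Distance from this frame to the nearest codeword of speaker s.
            diffs = bsxfun(@minus, cb, testVecs(:, t));
            totalDist(s) = totalDist(s) + min(sqrt(sum(diffs.^2, 1)));
        end
    end
    [~, speaker] = min(totalDist);       % index of the chosen speaker
end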
Process
A lot of the time spent on this project went into research. It took quite a while to read
through articles and make sense of the speaker-recognition process. After piecing
together the overall process, I attempted to write a MATLAB program that would generate the
MFCCs from WAV files. However, I quickly realized that doing so would take too much
time and was quite risky as well: generating MFCCs requires a lot of manipulation of the
WAV file data, and there would be no easy way to tell whether my program worked, since
the output is essentially a random-looking sequence of numbers. Ultimately, I found a
MATLAB file online that outputs the MFCCs of a WAV file and decided to use that.
Similarly, while researching vector quantization, I found code that generates codebooks
using the method described above, and I chose to use it instead of writing my own.
With both the codebook maker and the MFCC generator in hand, I wrote a program that takes
in two WAV files and generates a codebook for each, along with another function that
tests these codebooks against a test audio file.
I found sample files at http://minhdo.ece.illinois.edu/teaching/speaker_recognition/,
which contains clean audio for training and testing. I trained and tested several pairs of
these speakers, and the program successfully predicted the speaker in every instance.
However, these sound files were extremely clean, with little to no background noise. I
wanted to test the algorithm on more realistic audio of the kind one might expect in
everyday use. For this, I recorded three different people's voices, having each of them say
some phrase for around 1-2 seconds, twice.
Figure 1 - Bill's Cepstrum
Figure 2 - Duncan's Cepstrum
Figure 3 - Emily's Cepstrum
Here are the results of the testing; the first table uses the clean audio from the website,
and the second uses the audio I recorded. The column headers indicate which sound file was
used for testing. The number in each result box is the difference in vector distortion
between the correct and incorrect speaker, normalized to the length of the test file. This
number quantifies the "sureness" of the classifier: the larger the number, the larger the
gap, and the more certain we are that the classifier was correct.
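In terms of the quantities returned by the classification sketch above, this score could be computed roughly as follows; normalizing by the number of test frames is my reading of "length of the test file."

% Gap in cumulative distortion between the two codebooks, per test frame.
sureness = abs(totalDist(1) - totalDist(2)) / size(testVecs, 2);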
train\test   S1Test              S2Test              S3Test              S4Test
S1/S2        Correct (697.4417)  Correct (667.6604)  X                   X
S1/S3        Correct (466.6963)  X                   Correct (122.9370)  X
S1/S4        Correct (204.5936)  X                   X                   Correct (258.7675)
S2/S3        X                   Correct (743.2253)  Correct (336.1526)  X
S2/S4        X                   Correct (640.6071)  X                   Correct (546.5009)
S3/S4        X                   X                   Correct (221.2987)  Correct (446.1940)
train\test          Emily 1             Emily 2              Duncan 1            Duncan 2              Bill 1               Bill 2
Emily 1, Bill 1     X                   Incorrect (20.3377)  X                   X                     X                    Correct (145.5533)
Emily 2, Bill 2     Correct (97.2494)   X                    X                   X                     Correct (127.5489)   X
Emily 1, Duncan 1   X                   Correct (55.2156)    X                   Correct (67.1011)     X                    X
Emily 2, Duncan 2   Correct (213.3416)  X                    Correct (237.7819)  X                     X                    X
Bill 1, Duncan 1    X                   X                    X                   Incorrect (127.2159)  X                    Correct (251.8071)
Bill 2, Duncan 2    X                   X                    Correct (126.4453)  X                     Incorrect (58.3555)  X
The clean audio achieved a 100% success rate, while my recorded audio achieved a 75%
success rate. The discrepancy can most likely be attributed to background noise and to the
fact that a good portion of each recorded file was silence. Trimming the sound files so
that they contain only the voices should greatly increase the model's accuracy.
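As one possible pre-processing step, a minimal MATLAB sketch of energy-based silence trimming is shown below. The frame length, energy threshold, and file names are assumptions chosen for illustration, not values tuned for this project.

% Remove low-energy (silent) frames from a recording before feature extraction.
[x, fs]  = audioread('Test.wav');            % hypothetical input file
x        = mean(x, 2);                       % mix to mono if necessary
frameLen = round(0.020 * fs);                % ~20 ms frames, matching the MFCC step
nFrames  = floor(length(x) / frameLen);
keep     = [];
for k = 1:nFrames
    seg = x((k-1)*frameLen + 1 : k*frameLen);
    if sum(seg.^2) > 1e-4                    % assumed energy threshold
        keep = [keep; seg];                  % collect only the voiced frames
    end
end
audiowrite('Test_trimmed.wav', keep, fs);    % trimmed file for training/testing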
To train and test these or your own sound files, download the files and run the following in MATLAB:
TrainAndTest('Speaker1Train.wav', 'Speaker2Train.wav', 'Test.wav');
where the first two arguments are the WAV files used for training and the third is the file used for testing.
The program will output which of the two trained speakers it believes is speaking in the test file.
FILES: https://drive.google.com/folderview?id=0BwxRkJZ9bJhyfmpQRGVsS2pUclpGYlVKU3NGdFFid254N2N1bFRCOUlLbS05TkFiSWpFZEU&usp=sharing
Future
There are many ways to further test and build on this program. One of the original plans
was to attempt instrument recognition instead of speaker recognition; however, people were
more accessible to me than the large variety of instruments such a project would require.
In theory, the same process could be applied to musical instruments: by training the
program on two instruments playing the same note, it should be able to recognize which
instrument is playing when it hears that note again. I also want to extend the code to
handle more than two speakers, which should be fairly easy; I just need to modify the code
to train and test a variable number of speakers. Finally, I want to improve the classifier's
accuracy. I would like to find a way to clean up the audio in some kind of pre-processing
step before handing it off to the trainer/tester, and I would also like to train the
classifier on more audio files per speaker, which should make the codebooks more
representative of each speaker.
Bibliography
MFCC generator: http://www.mathworks.com/matlabcentral/fileexchange/32849-htk-mfcc-matlab/content/mfcc/mfcc.m
Vector quantization: http://www.mathworks.com/matlabcentral/fileexchange/10943-vector-quantization-k-means/content/qsplit.m
"Mel Frequency Cepstral Coefficient (MFCC) Tutorial." Practical Cryptography. Web. 15 May 2015. <http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/>.
Martinez, J., H. Perez, E. Escamilla, and M. M. Suzuki. "Speaker Recognition Using Mel Frequency Cepstral Coefficients (MFCC) and Vector Quantization (VQ) Techniques." 22nd International Conference on Electrical Communications and Computers (CONIELECOMP 2012), pp. 248-251, Feb. 2012.
Hasan, Rashidul, Mustafa Jamil, and Golam Rabbani. "Speaker Identification Using Mel Frequency Cepstral Coefficients." Proceedings of ICECE 2004, Dhaka, Bangladesh, 28-30 December 2004. Web. <http://www.buet.ac.bd/icece/pub2004/P141.pdf>.
Do, Minh. "DSP Mini-Project: Speaker Recognition." Web. 15 May 2015. <http://minhdo.ece.illinois.edu/teaching/speaker_recognition/>.
Soong, F., A. Rosenberg, L. Rabiner, and B. H. Juang. "A Vector Quantization Approach to Speaker Recognition." IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '85), vol. 10, pp. 387-390, April 1985.
"Mel Scale." Wikipedia. Wikimedia Foundation. Web. 15 May 2015. <http://en.wikipedia.org/wiki/Mel_scale>.