Gender and Language Classification for Voice Scoring System

Jen Burge
Vonetta Lewis

Dr. Ishwar K. Sethi

This work was completed during the Research Experience for Undergraduates (REU) program at Oakland University supported by the National Science Foundation (NSF), Daimler-Chrysler, and Ford Motor Company.


Our Project

One of the major challenges facing intelligence officials working on homeland security today is the sheer volume of audio data intercepted from telephone conversations. This volume puts enormous strain on human analysts, who must examine every clip in their search for important clues, and it limits the time and effort they can afford to spend on each piece of data. Our advisor, Dr. Sethi, proposed a system to automate part of this process by sifting through the clips to find those most likely to yield useful information, helping analysts focus their energy in the right places instead of listening to countless ordinary conversations. The system would use automatic speech recognition (ASR) to transcribe speech clips; each transcription would then be scored based on the words appearing in it, indicating whether the content of the clip warrants further analysis by a human expert. Our research explores methods for performing some of the integral functions of such a system.

A major issue in designing such a system is the accuracy of the transcription. Serious transcription errors could cause a normal conversation to be tagged for further investigation or, worse, cause one containing vital clues to be omitted. It is well known that speech recognition is most accurate when the system has been trained for a particular speaker. Unfortunately, in this application the recognizer must handle a wide variety of speakers using several different languages. Therefore, speaker-independent speech recognition is needed.

The main reason for the lower transcription accuracy of speaker-independent speech recognition systems is cross-speaker variability arising from factors such as gender, accent, and age. Researchers have adopted two approaches to this variability. One adapts the model to the current speaker during recognition; the other builds multiple models and uses a pre-processing stage to classify the current speaker in order to select the appropriate model for transcription.

In our project, we investigate the automatic selection of an appropriate model for speech transcription based on the speaker's gender and the language spoken. We consider two methods for automatic gender classification: one using a fundamental frequency threshold, and one using Gaussian Mixture Models (GMMs) trained on Mel Frequency Cepstral Coefficients (MFCCs). The pitch method achieves an 89% accuracy rate for gender classification, and a GMM using 16 MFCCs and 128 mixture components correctly identifies the gender of 94% of our test samples. We use a commercially available ASR software package, IBM ViaVoice, to test the improvement in transcription accuracy when a speaker-dependent model matches the speaker's gender, obtaining a 353% improvement in transcription accuracy over a model trained by a speaker of the opposite gender. Finally, we use the number of words transcribed by the English ViaVoice system as the basis for an English/non-English classifier, achieving an accuracy of 80.5% over 65 foreign-language and 22 English clips.
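To illustrate the first method, a minimal pitch-threshold classifier could look like the following sketch. This is not the project's actual implementation: the autocorrelation F0 estimator, the 60-400 Hz search range, and the 160 Hz male/female threshold are all illustrative assumptions.

```python
import numpy as np

def estimate_f0(signal, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate the fundamental frequency (Hz) of a voiced segment
    with a simple autocorrelation method, searching pitch periods
    between 1/f0_max and 1/f0_min seconds."""
    signal = signal - np.mean(signal)
    corr = np.correlate(signal, signal, mode="full")
    corr = corr[len(corr) // 2:]            # keep non-negative lags only
    lag_min = int(sample_rate / f0_max)     # shortest plausible pitch period
    lag_max = int(sample_rate / f0_min)     # longest plausible pitch period
    peak_lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sample_rate / peak_lag

def classify_gender(signal, sample_rate, threshold_hz=160.0):
    """Label a clip 'male' or 'female' by comparing its estimated F0
    to a fixed threshold (the 160 Hz value is an assumed choice)."""
    return "male" if estimate_f0(signal, sample_rate) < threshold_hz else "female"
```

On real speech, a practical version would first restrict the estimate to voiced frames; pure tones suffice to show the idea (a 120 Hz signal falls below the threshold, a 220 Hz signal above it).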

Here is a snapshot of our demo, which estimates a speaker's gender in real time:




Future Work

In our experiments, we showed that speech recognition accuracy improves when the model is trained by a person of the same gender as the speaker rather than the opposite gender, but these speaker-dependent models were still generally outperformed by the IBM speaker-independent model. The average accuracy of that model could likely be improved by developing separate speaker-independent male and female models. This would require gender classification before transcribing each new speaker, but our research has shown that this can be done accurately and in real time.

In both the gender classification accuracy and the accuracy of ViaVoice transcriptions in our work, we have noticed that perhaps the most important source of variability across speakers besides gender is accent. Any recognition system intended to be robust to a wide variety of speakers and languages should therefore make some effort to account for the different accents it might encounter. Since accents fall into many categories, as opposed to gender's two, the extent to which accent information should be used requires further research. A multilingual system would also have to consider accents within each language.

In addition to information from the content of the transcription, a voice scoring system for the homeland security application or a commercial call center could use information about the speaker's agitation when deciding whether to examine a clip. The system could estimate agitation from the speaking rate, for which the voiced/sec measure we have already calculated could serve as a rough estimate, or from the pitch contour, information that would also already be available if the fundamental frequency method were used for the gender decision.
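As a rough illustration, and assuming the voiced/sec measure counts voiced frames per second, an energy-based estimate could be sketched as follows. The 25 ms frame length and the 10% energy threshold are illustrative choices, not the project's actual parameters.

```python
import numpy as np

def voiced_frames_per_second(signal, sample_rate, frame_ms=25, energy_ratio=0.1):
    """Crude speaking-rate proxy: count the frames whose short-time
    energy exceeds a fixed fraction of the loudest frame's energy,
    then normalize by the clip duration."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.sum(frames ** 2, axis=1)          # energy per frame
    voiced = int(np.sum(energies > energy_ratio * energies.max()))
    return voiced / (len(signal) / sample_rate)
```

A clip that is half speech and half silence would score roughly half the frame rate; a genuine agitation feature would combine this with pitch-contour statistics as described above.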

To extend our language classification method to decide among multiple languages, we would need an ASR system for each target language, along with a method similar to our English/non-English classifier for each one. Considering grammatical and semantic information in the transcript could also improve the classifier.
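One way to sketch that extension is a simple word-count vote over the available recognizers. The per-language transcription callables and the `min_words` cutoff below are hypothetical stand-ins, not an interface our system defines.

```python
def classify_language(clip, recognizers, min_words=3):
    """Run the clip through each language's recognizer (a callable
    returning a transcript string) and pick the language whose
    transcript contains the most words; fall back to 'unknown'
    when no recognizer produces enough words to be credible."""
    word_counts = {lang: len(asr(clip).split())
                   for lang, asr in recognizers.items()}
    best = max(word_counts, key=word_counts.get)
    return best if word_counts[best] >= min_words else "unknown"
```

The intuition matches our two-class result: a recognizer faced with the wrong language produces few recognizable words, so the matching language tends to yield the longest transcript.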


Final Paper
Final Presentation (.zip)


We would like to thank our advisor, Dr. Sethi. We would also like to acknowledge the help of Mingkun Li, Victor Kulesh, and the entire Intelligent Information Engineering Laboratory at Oakland University.





Data Collection:
National Public Radio (NPR)
Cable News Network (CNN)
World Radio Network

Speech Processing:
Speech Analysis
Audio Signal Processing
A Brief Introduction to Speech Analysis and Recognition, An Internet Tutorial.