Computer-Aided Language Learning (CALL) is an important application of intelligent speech technology, and pronunciation evaluation is its core technology. With the growing popularity of online education, and in particular the shift to online teaching driven by the COVID-19 pandemic, pronunciation evaluation is attracting attention and research from more and more scholars.
Unlike speech recognition and other fields, pronunciation evaluation has long lacked a public data set that allows results to be compared side by side. Researchers can usually test only on their own private data sets, which to some extent hinders exchange and progress in the field.
In response, MI and Speechocean have open-sourced the industry's first relatively complete public data set for English pronunciation evaluation and contributed the corresponding sample code to Kaldi, with the aim of promoting exchange among researchers and advancing research on pronunciation evaluation. The data set was recently uploaded to the OpenSLR website, and the sample code has been merged into the Kaldi mainline.
[Image source: OpenSLR]
The following is a brief introduction to the data set:
Dataset name: speechocean762
Dataset language: English spoken by native Mandarin speakers
Balanced samples and comprehensive content
The data set contains 5,000 English sentences covering many aspects of daily life, recorded by 250 non-native English speakers whose mother tongue is Mandarin (Putonghua). Gender and age are balanced: the ratio of male to female speakers is 1:1, as is the ratio of children to adults. The speakers' English proficiency was deliberately designed and screened so that the ratio of good, medium and poor speakers is 2:1:1, which makes the data suitable for testing pronunciation feedback for learners at different proficiency levels.
Independent scoring by five experts, down to the phoneme level
The data set provides multi-dimensional manual scores at three levels of granularity: sentence, word and phoneme. The sentence-level score has four dimensions: accuracy, completeness, fluency and prosody. The word-level score has two dimensions: phoneme accuracy and stress placement accuracy. The phoneme-level score has one dimension: accuracy.
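The three scoring levels described above can be pictured as a nested record. The following is only a minimal sketch: the class names, field names, phoneme labels and score values are all hypothetical and do not reflect the data set's actual file format, which is described in its own documentation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhonemeScore:
    phoneme: str     # phoneme label (the label set here is an assumption)
    accuracy: float  # the single phoneme-level dimension


@dataclass
class WordScore:
    text: str
    accuracy: float  # phoneme accuracy of the word
    stress: float    # stress placement accuracy
    phonemes: List[PhonemeScore] = field(default_factory=list)


@dataclass
class SentenceScore:
    text: str
    accuracy: float
    completeness: float
    fluency: float
    prosody: float
    words: List[WordScore] = field(default_factory=list)


def phoneme_accuracies(sentence: SentenceScore) -> List[float]:
    """Flatten all phoneme-level accuracy scores of a sentence, in order."""
    return [p.accuracy for w in sentence.words for p in w.phonemes]


# Hypothetical example; the score values and ranges are made up for illustration.
example = SentenceScore(
    text="good morning",
    accuracy=8.0, completeness=10.0, fluency=9.0, prosody=9.0,
    words=[
        WordScore("good", accuracy=9.0, stress=10.0,
                  phonemes=[PhonemeScore("G", 2.0), PhonemeScore("UH", 1.0),
                            PhonemeScore("D", 2.0)]),
        WordScore("morning", accuracy=8.0, stress=10.0,
                  phonemes=[PhonemeScore("M", 2.0), PhonemeScore("AO", 2.0)]),
    ],
)
```

Nesting phonemes inside words and words inside sentences keeps each score attached to the unit it describes, so a phoneme-level experiment can flatten the structure while a sentence-level one can ignore the lower levels.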
An important feature of this data set is that every dimension was scored independently by five experts applying the same scoring criteria, which greatly reduces the subjectivity of the manual scores.
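To illustrate why five independent scores help, the sketch below combines the experts' scores for one item into a single value and reports how far they disagree. The article does not say how the five scores are combined; using the median (or the mean) is an assumption made here for illustration, and the function names are hypothetical.

```python
from statistics import mean, median
from typing import Sequence


def aggregate_expert_scores(scores: Sequence[float], method: str = "median") -> float:
    """Combine several experts' scores for one item into a single score.

    The aggregation method is an assumption, not taken from the data set.
    """
    if len(scores) == 0:
        raise ValueError("at least one score is required")
    if method == "median":
        return float(median(scores))
    if method == "mean":
        return float(mean(scores))
    raise ValueError(f"unknown method: {method}")


def score_spread(scores: Sequence[float]) -> float:
    """Difference between the highest and lowest score, a rough sign of disagreement."""
    if len(scores) == 0:
        raise ValueError("at least one score is required")
    return float(max(scores) - min(scores))


# Hypothetical scores from five experts for a single item.
expert_scores = [7, 8, 8, 9, 6]
combined = aggregate_expert_scores(expert_scores)  # median of the five scores
spread = score_spread(expert_scores)               # highest minus lowest
```

A median is less sensitive to a single outlying expert than a mean, which is one reason to prefer it when only a handful of raters are available.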
Customized Kaldi recipe
The MI speech team contributed a customized recipe for this data set to Kaldi, demonstrating how to score pronunciation at the phoneme level. MI has also added some new functions to Kaldi's C++ code and shared scripts to better support the recipe. The recipe runs a phoneme-level scoring test on the data set, and researchers can use the results as a baseline in their papers.
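When comparing a new system against a baseline of this kind, predicted phoneme scores are typically measured against the human scores. The sketch below shows two common measures, mean squared error and the Pearson correlation coefficient. The choice of these metrics is an assumption here, since the article does not state which metrics the recipe reports, and the function names are hypothetical.

```python
import math
from typing import Sequence


def mean_squared_error(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Average squared difference between predicted and human scores."""
    if len(pred) != len(ref) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred)


def pearson_correlation(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Pearson correlation coefficient between predicted and human scores."""
    if len(pred) != len(ref) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical predicted and human phoneme-level scores.
human = [0.0, 1.0, 2.0, 2.0]
predicted = [0.0, 1.0, 2.0, 1.0]
mse = mean_squared_error(predicted, human)
pcc = pearson_correlation(predicted, human)
```

Reporting both is useful because they capture different things: mean squared error penalizes absolute deviations from the human scores, while correlation only reflects whether the predictions rise and fall with them.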
MI and Speechocean will also jointly publish a paper introducing the data set and the corresponding baseline system, for researchers to consult and cite.
The data set can be downloaded from http://www.openslr.org/101/ and the corresponding Kaldi recipe is at egs/gop_speechocean762. For more details about the data set, refer to the documentation included with it.