Nt1310 Unit 3 Speech Recognition System

3 Speech Recognition System
3.1 Pocketsphinx
The recognition framework used for acoustic modeling and recognition is Sphinx/pocketsphinx [11]. It was chosen for its low processing and memory footprint: fast feedback to the user is essential even when many clients connect to a server at the same time and many instances of the engine may be running in parallel.
The current pocketsphinx decoder supports a maximum of 128 word classes. The source code was therefore modified to accommodate a larger number of classes without any impact on performance.
3.2 Training Configuration
Acoustic model training and performance evaluation were conducted using the Sphinx training tools. A customized procedure for model training and testing was established. The critical parameters, such as the word recognition performance (word error rate, WER), real …
Three test sets were employed in the experiments. One is the test set described in section [???], “TEST 3”. The other two are test sets with “in-domain” content that share exactly the same speech utterances but have slightly different transcriptions.
The difference is that in the first, “TEST 1”, numerical values are represented as digits, while in the second, “TEST 2”, they are represented as words. This was done to see the influence of the general purpose models, in which digits are seldom represented in the language model and the corresponding dictionary. The “in-domain” speech (TEST 1 and 2) was recorded by 20 different speakers (13 male and 7 female), among them 2 non-native German speakers.
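The effect of the two transcription conventions on scoring can be illustrated with a minimal sketch of a WER computation. It assumes WER is the word-level edit distance divided by the number of reference words. The function name and the example utterance are hypothetical (and in English for readability), and this is not the scoring code of the Sphinx tools.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical decoder output and the two reference conventions.
hypothesis = "set the temperature to twenty one degrees"
test1_reference = "set the temperature to 21 degrees"          # digits (TEST 1 style)
test2_reference = "set the temperature to twenty one degrees"  # words (TEST 2 style)

wer_digits = word_error_rate(test1_reference, hypothesis)
wer_words = word_error_rate(test2_reference, hypothesis)
```

In this sketch the same decoder output is penalized against the digit-style reference (one substitution and one insertion) but matches the word-style reference exactly, which is why the two transcriptions of identical audio can yield different scores.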
The experiments were divided into the following categories, in which new LMs are produced:
Adaptation of the general purpose language models (“freespeech” and “sdewac”) with the transcriptions of the “in-domain” TEST 2
