First evaluation of keyword spotting in Czech supported by Ministries of Interior and Defense

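The Czech Republic, with a population of 10 million, is surprisingly home to three successful university research groups working in the area of speech recognition: at Brno University of Technology (BUT), the Technical University of Liberec (TUL), and the University of West Bohemia (UWB).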

Since 2007, these three groups have been cooperating on a research project “Overcoming the language barrier complicating investigation into financing terrorism and serious financial crimes”, sponsored by the Czech Ministry of Interior under number VD20072010B16. The project aims at the analysis of spontaneous telephone calls from the security and defense domains. Unlike English, where corpora such as Switchboard and Fisher provide sufficient amounts of training material, Czech lacked a well-transcribed database of spontaneous telephone calls. This is why, in 2007, the project's activities concentrated on creating such a database; the consortium now has almost 100 hours of transcribed and checked spontaneous speech data at its disposal.

After several discussions with Interior and Defense representatives (the defense side joined the project later and now also provides valuable support to the consortium), keyword spotting in Czech was defined as the priority task for security and defense analysts. To compare the performance of the systems, an evaluation of keyword spotting systems was organized in 2008, with the final run in November and a post-evaluation workshop on November 21st. Systems were compared using the standard metrics FOM (figure of merit) and EER (equal error rate), but their speed and ability to handle OOV (out-of-vocabulary) words were also important.
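As an illustration of the EER metric mentioned above, the following minimal sketch finds the decision threshold at which the miss rate and the false-alarm rate are closest. It is not the evaluation's actual scoring tool: the function name `equal_error_rate` and the toy scores are hypothetical, and real keyword spotting evaluations usually normalize false alarms per keyword and per hour of audio.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) at the point where miss and false-alarm rates are closest.

    target_scores: detection scores of true keyword occurrences (hypothetical input).
    nontarget_scores: detection scores of false candidates (hypothetical input).
    A candidate is accepted when its score is >= threshold.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    best = None  # (gap between the two rates, averaged error rate, threshold)
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        # Miss rate: true occurrences rejected at this threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False-alarm rate: false candidates accepted at this threshold.
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - fa)
        if best is None or gap < best[0]:
            best = (gap, (miss + fa) / 2, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    targets = [0.9, 0.8, 0.7, 0.4]
    nontargets = [0.6, 0.3, 0.2, 0.1]
    eer, thr = equal_error_rate(targets, nontargets)
    print(f"EER = {eer:.2f} at threshold {thr}")
```

A lower EER means that misses and false alarms can be balanced at a lower common error level; FOM, by contrast, summarizes detection rates over a range of false-alarm rates.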

TUL built on its extensive experience with the recognition of Czech and designed a system based on LVCSR with a large vocabulary of 350k words and 410k pronunciation variants, without a language model (note that Czech is a highly inflected language, so the 50k vocabularies common for English are not sufficient). The advantages of the system are its high speed of 0.15xRT (0.15 times real time) and its ability to detect colloquial variants even when the standard form of a word is entered. The TUL group uses simple acoustic modeling based on context-independent models with high numbers of Gaussians, together with state-of-the-art feature extraction: MFCC coefficients processed by a Heteroscedastic Linear Discriminant Analysis (HLDA) transform.

UWB experimented with two systems: the first purely acoustic (a keyword model competes against a background model and the resulting likelihood ratio is thresholded), the second working with LVCSR lattices. While the acoustic system provided better results on the development set, with its rather artificial selection of “good” (read: “long”) keywords, the advantages of the LVCSR-based system fully emerged in the test on evaluation data, where the selection of keywords was not limited. The acoustic modeling in this system uses not only discriminative training of HMMs, but also discriminative adaptation to individual conversation sides. UWB also experimented with fusion of the individual systems and showed that they are complementary.
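The likelihood-ratio thresholding used in purely acoustic keyword spotting can be sketched as follows. This is a simplified illustration under stated assumptions, not UWB's implementation: the per-frame log-likelihoods are assumed to be already computed by a keyword model and a background model, and the names `keyword_llr` and `detect` as well as the threshold values are hypothetical.

```python
def keyword_llr(keyword_loglik, background_loglik):
    """Average per-frame log-likelihood ratio of keyword model vs. background model.

    Both arguments are per-frame log-likelihoods over the same speech segment
    (assumed inputs; a real system would obtain them from HMM decoding).
    """
    if not keyword_loglik or len(keyword_loglik) != len(background_loglik):
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [k - b for k, b in zip(keyword_loglik, background_loglik)]
    # Dividing by the number of frames makes scores comparable across
    # keywords of different durations.
    return sum(diffs) / len(diffs)


def detect(keyword_loglik, background_loglik, threshold=0.0):
    """Report a keyword hit when the normalized ratio reaches the threshold."""
    return keyword_llr(keyword_loglik, background_loglik) >= threshold


if __name__ == "__main__":
    # Invented values for a three-frame segment.
    kw = [-4.0, -3.5, -4.5]
    bg = [-5.0, -5.0, -5.0]
    print(keyword_llr(kw, bg), detect(kw, bg, threshold=0.5))
```

Raising the threshold reduces false alarms at the cost of more misses; sweeping it over a range of values is what produces the operating points summarized by metrics such as EER.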

BUT tested four systems in this evaluation: FastLVCSR, based on LVCSR with keywords inserted into the language model; HybridLVCSR, which performs full-fledged word and subword recognition and indexing; and two acoustic systems based on GMM/HMM and NN/HMM. While the LVCSR systems are more precise, the advantage of the acoustic ones is their speed. HybridLVCSR is worth mentioning because it allows large quantities of data to be pre-processed off-line, with very fast subsequent searches, including searches for OOVs. BUT built on its experience with LVCSR and KWS in the EC-sponsored AMI and AMIDA projects, as well as its participation in the 2006 NIST STD evaluation.

The research groups consider this event very important, as it builds confidence in speech technologies within the Czech security and defense community, and they hope it will have a positive impact on their future funding.

The leaders of the Czech speech groups (from the left): Honza Cernocky (BUT), Jan Nouza (TUL), Ludek Muller (UWB)