QUESST 2015 Multilingual Database for Query-by-Example Keyword Spotting

Links to files:

If you publish any paper regarding QUESST 2015, please cite Query by Example Search on Speech at Mediaeval 2015 (Igor Szoke, Luis Javier Rodriguez-Fuentes, Andi Buzo, Xavier Anguera, Florian Metze, Jorge Proenca, Martin Lojka, Xiao Xiong)

The task of QUESST ("QUery by Example Search on Speech Task") is to search FOR audio content WITHIN audio content USING an audio query. As in previous years, the search database was collected from heterogeneous sources, covering multiple languages, and under diverse acoustic conditions. Some of these languages are resource-limited, some are recorded in challenging acoustic conditions and some contain heavily accented speech (typically from non-native speakers). No transcriptions, language tags or any other metadata are provided to participants. The task therefore requires researchers to build a language-independent audio-to-audio search system.

QUESST 2014 Multilingual Database for Query-by-Example Keyword Spotting

Link to database: [1 116 MB]

The QUESST 2014 search dataset consists of 23 hours or around 12.500 spoken documents in the following languages: Albanian, Basque, Czech, non-native English, Romanian and Slovak. The languages were chosen so that relatively little annotated data can be found for them, as would be the case for a ``low resource'' language. The recordings were PCM encoded with 8 KHz sampling rate and 16 bit resolution (down-sampling or re-encoding were done when necessary to homogenize the database). The spoken documents (6.6 seconds long on average) were extracted from longer recordings of different types: read, broadcast, lecture and conversational speech. Besides language and speech type variability, the search dataset also features acoustic environment and channel variability. The distribution of spoken documents per language is shown in Table 1. The database is free for research purposes. Feel free to use the setup for evaluation and comparison of you results with results achieved by others and MediaEval QUESST 2014 evaluations.

Here ( you can find the MediaEval QUESST 2014 task description, and here ( are the system descriptions of particular teams which participated in MediaEval QUESST 2014 evaluations. An overview paper discussing achieved results is published at ICASSP 2015 and is available here. If you publish any results based on this QUESST2014 database, please refer the paper (bibtex).

Table 1:

According to the spoken language and the recording conditions, the database is organized into 5 language subsets:

SWS 2013 Multilingual Database for Query-by-Example Keyword Spotting

Link to database:

The database used for the SWS 2013 evaluation has been collected thanks to a joint effort from several participating institutions that provided search utterances and queries on multiple languages and acoustic conditions (see Table 1). The database is available to the community for research purposes. Fell free to evaluate your query-by-example approaches for keyword spotting (spoken term detection). The database contains 20 hours of utterance audio (the data you search in), ~500 development and ~500 evaluation audio queries (the data you search for), scoring scripts and references.

Here ( you can find the MediaEval SWS2013 task description, and here ( are the system descriptions of particular teams which participated in MediaEval SWS2013 evaluations. An overview paper discussing achieved results was published at SLTU 2014 and is available here. If you publish any results based on this SWS2013 database, please refer the paper (bibtex).

VB Diarization with Eigenvoice and HMM Priors

This python code implements a generalized version of speaker diarization described in:

Kenny, P. Bayesian Analysis of Speaker Diarization with Eigenvoice Priors, Montreal, CRIM, May 2008,

Kenny, P., Reynolds, D., and Castaldo, F. Diarization of Telephone Conversations using Factor Analysis IEEE Journal of Selected Topics in Signal Processing, December 2010,

The generalization introduced in this implementation lies in using an HMM instead of the simple mixture model when modeling generation of segments (or even frames) from speakers. HMM limits the probability of switching between speakers when changing frames, which makes it possible to use the model on frame-by-frame bases without any need to iterate between 1) clustering speech segments and 2) re-segmentation (i.e. as it was done in the paper above).

Download: (attention, the file has 93 MB as it contains example data).

RNNLM Toolkit

Neural network based language models are nowdays among the most successful techniques for statistical language modeling. They can be easily applied in wide range of tasks, including automatic speech recognition and machine translation, and provide significant improvements over classic backoff n-gram models. The 'rnnlm' toolkit can be used to train, evaluate and use such models.

The goal of this toolkit is to speed up research progress in the language modeling field. First, by providing useful implementation that can demonstrate some of the principles. Second, for the empirical experiments when used in speech recognition and other applications. And finally third, by providing a strong state of the art baseline results, to which future research that aims to "beat state of the art techniques" should compare to.

Speech search

Source code of three systems for speech search is available here:

Neural Network Trainer TNet

Thanks for the interest in TNet! At the moment let's consider it as a dead project, as I fully switched the efforts to 'nnet1' recipe in kaldi : You can still use the TNet, but it is not to be extended anymore. Thanks!


TNet is a tool for parallel training of neural networks for classification, containing two independent sets of tools, the CPU and GPU tools. The CPU training is based on multithread data-parallelization, the GPU training is implemented in CUDA, both are implementing mini-batch Stochastic Gradient Descent, optimizing per-frame Cross-entropy.

The toolkit contains example of NN training on TIMIT, which can be easily transfered to your data. You may also be interested in hierarchical "Universal Context Network" a.k.a. "Stacked bottleneck netowork", which can be built using one-touch-script : tools/train/

Lattice Spoken Term Detection toolkit (LatticeSTD)

Toolkit for experiments with lattice based spoken term detection. It allows you to define set of terms and search them in lattices.

* Searches for defined sequence of links in lattice and outputs label assign to this sequence.
* Calculates confidence of found sequence (posterior probability)
* Filter overlapped detection
* Allows handle substitutions/deletion/insertion (useful for phone lattices)

KWSViewer - Interactive viewer for Keyword spotting output


This tool can load an output of a keyword-spotting system (KWS) and reference file in HTK-MLF format and show detections in a tabular view. You can also use it to replay detections, tune and visualize scores, hits, misses and false-alarms using sliders on the right-side panel.

Installation - Windows

Download here (6MB). No installation is needed. You can run kwsviewer.exe directly without any installation.

Joint Factor Analysis Matlab Demo

This set of Matlab functions and data by Ondrej Glembek ( is a simple tutorial of Joint Factor Analysis (JFA), as it was investigated at the JHU 2008 workshop

The tutorial is based on Patrick Kenny's paper:

Kenny, P "Joint factor analysis of speaker and session variability: Theory and algorithms" - Technical report CRIM-06/08-13 Montreal, CRIM, 2005,

especially on the simplified version of the training in: