QUESST 2015 Multilingual Database for Query-by-Example Keyword Spotting

If you publish any paper regarding QUESST 2015, please cite "Query by Example Search on Speech at MediaEval 2015" (Igor Szoke, Luis Javier Rodriguez-Fuentes, Andi Buzo, Xavier Anguera, Florian Metze, Jorge Proenca, Martin Lojka, Xiao Xiong).

The task of QUESST ("QUery by Example Search on Speech Task") is to search FOR audio content WITHIN audio content USING an audio query. As in previous years, the search database was collected from heterogeneous sources, covers multiple languages, and was recorded under diverse acoustic conditions. Some of these languages are resource-limited, some were recorded in challenging acoustic conditions, and some contain heavily accented speech (typically from non-native speakers). No transcriptions, language tags or any other metadata are provided to participants. The task therefore requires researchers to build a language-independent audio-to-audio search system.
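As a concrete illustration (not part of the benchmark itself), one common baseline for language-independent audio-to-audio search is subsequence dynamic time warping (DTW) over per-frame acoustic features such as MFCCs or phone posteriors. The sketch below is a minimal version of that idea; the function name and the choice of cosine distance are assumptions of this example, not a description of any participant's system.

```python
import numpy as np

def subsequence_dtw_score(query_feats, utt_feats):
    """Score how well a query matches anywhere inside an utterance.

    query_feats: (Tq, D) array of per-frame features (e.g. MFCCs);
    utt_feats:   (Tu, D) array for the search utterance.
    Returns a score where higher means a better match.
    """
    # Frame-level cosine distances between query and utterance frames.
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + 1e-8)
    u = utt_feats / (np.linalg.norm(utt_feats, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - q @ u.T                          # shape (Tq, Tu)

    Tq, Tu = dist.shape
    acc = np.full((Tq, Tu), np.inf)
    acc[0, :] = dist[0, :]                        # a match may start at any frame
    for i in range(1, Tq):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, Tu):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],      # query frame repeats
                                         acc[i, j - 1],      # utterance frame repeats
                                         acc[i - 1, j - 1])  # both advance
    # A match may end at any frame; normalize by query length.
    return -np.min(acc[-1, :]) / Tq
```

Returning the negated, length-normalized path cost makes higher scores indicate better matches, which fits the scoring convention described below.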

Compared to the previous year, two main changes were introduced for this year's evaluation. First, queries are provided with longer context, so participants can use the surrounding speech to adapt their systems. Second, noise and reverberation were artificially added to the data, in order to measure the robustness of particular feature extraction methods and algorithms under heavy channel mismatch.

As in the previous year, the proposed task does not require localization (time stamps) of query matches within audio files. However, systems must provide a score (a real number) for each candidate query match: the higher (the more positive) the score, the more likely it is that the query appears in the audio file. The normalized cross-entropy cost (Cnxe) is used as the primary metric, whereas the Actual Term-Weighted Value (ATWV) is kept as a secondary metric for diagnostic purposes, which means that systems must provide not only scores but also Yes/No decisions.
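As a rough, unofficial sketch of the primary metric: Cnxe treats system scores as log-likelihood ratios, converts them to target posteriors under an assumed prior, and normalizes the resulting empirical cross-entropy by that of a trivial system that always outputs the prior (Cnxe = 1 means the scores carry no information; lower is better). The prior value and function name below are assumptions of this example; the evaluation used its own scoring tools.

```python
import numpy as np

def cnxe(llr_scores, labels, p_target=0.5):
    """Normalized cross-entropy cost (Cnxe), per the standard definition.

    llr_scores: log-likelihood-ratio scores (natural log), one per trial.
    labels:     1 if the query occurs in the file, 0 otherwise.
    p_target:   assumed prior probability of a target trial (illustrative).
    """
    llr = np.asarray(llr_scores, dtype=float)
    lab = np.asarray(labels, dtype=bool)

    # Convert LLRs to target posteriors using the assumed prior.
    log_odds = llr + np.log(p_target / (1.0 - p_target))
    post = 1.0 / (1.0 + np.exp(-log_odds))

    eps = 1e-12  # guard against log(0)
    # Empirical cross-entropy of the system's scores, in bits.
    c_xe = (p_target * np.mean(-np.log2(post[lab] + eps))
            + (1.0 - p_target) * np.mean(-np.log2(1.0 - post[~lab] + eps)))

    # Cross-entropy of a trivial system that always outputs the prior.
    c_prior = (-p_target * np.log2(p_target)
               - (1.0 - p_target) * np.log2(1.0 - p_target))

    return c_xe / c_prior
```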

Three types of query matches are considered: the first (T1) involves "exact matches", whereas the second (T2) allows for inflectional variations of words or word re-orderings (that is, "approximate matches"); the third (T3) is similar to T2, but queries are drawn from conversational speech, thus containing strong coarticulation and some filler content between words.

The QUESST 2015 dataset is the result of a joint effort by several institutions to put together a sizable amount of data to be used in this evaluation and in later research on query-by-example search on speech. The search corpus is composed of around 18 hours of audio (11662 files) in the following 7 languages: Albanian, Czech, English, Mandarin, Portuguese, Romanian and Slovak, with different amounts of audio per language. The search utterances, which are relatively short (5.8 seconds long on average), were automatically extracted from longer recordings and manually checked to discard very short or very long utterances. The dataset includes 445 development queries and 447 evaluation queries, with the number of queries per language roughly proportional to the amount of audio available in the search corpus. A significant effort was made to record the queries manually, in order to avoid the problems observed in previous years, where unwanted acoustic context was carried over when queries were cut from longer sentences. Speakers recruited to record the queries were asked to maintain a normal speaking speed and a clear speaking style. All audio files are PCM encoded at 8 kHz, 16 bits/sample, and stored in WAV format.
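Given the stated format, a quick sanity check on a downloaded file can be done with Python's standard wave module; the file name below is illustrative, and mono audio is an assumption of this example.

```python
import wave

# Check that an utterance matches the stated format:
# PCM WAV, 8 kHz, 16 bits (2 bytes) per sample.
# The file name is illustrative, not an actual dataset path.
with wave.open("quesst2015_utterance_0001.wav", "rb") as w:
    assert w.getframerate() == 8000, "expected 8 kHz"
    assert w.getsampwidth() == 2, "expected 16 bits/sample"
    duration = w.getnframes() / w.getframerate()
    print(f"{w.getnchannels()} channel(s), {duration:.1f} s")
```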

The data was then artificially noised and reverberated, with equal amounts of clean, noisy, reverberated and noisy+reverberated speech. We used both stationary and transient noises downloaded from https://www.freesound.org. Reverberation was obtained by passing the audio through a filter with an artificially generated room impulse response (RIR).
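A minimal sketch of this kind of augmentation is shown below, assuming a synthetic exponentially decaying RIR and additive noise scaled to a target SNR; the actual corpus was produced with the organizers' own noises and RIRs, and all parameter values here are illustrative.

```python
import numpy as np
from scipy.signal import fftconvolve

def augment(clean, noise, fs=8000, snr_db=10.0, rt60=0.3):
    """Illustrative noising + reverberation, not the organizers' exact recipe.

    clean, noise: 1-D float arrays sampled at fs Hz.
    """
    # Synthetic RIR: exponentially decaying white noise, with the
    # envelope chosen so the amplitude drops by ~60 dB at t = rt60.
    t = np.arange(int(rt60 * fs)) / fs
    rir = np.random.randn(t.size) * np.exp(-6.9 * t / rt60)
    rir /= np.abs(rir).max()
    reverbed = fftconvolve(clean, rir)[: clean.size]

    # Loop the noise to cover the signal, then scale it to the target SNR.
    reps = int(np.ceil(clean.size / noise.size))
    n = np.tile(noise, reps)[: clean.size].astype(float)
    sig_pow = np.mean(reverbed ** 2)
    noise_pow = np.mean(n ** 2) + 1e-12
    n *= np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10.0)))

    return reverbed + n
```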