SWS 2013 Multilingual Database for Query-by-Example Keyword Spotting
Link to database: http://speech.fit.vutbr.cz/files/sws2013Database.tgz
The database used for the SWS 2013 evaluation was collected through a joint effort of several participating institutions, which provided search utterances and queries in multiple languages and acoustic conditions (see Table 1). The database is available to the community for research purposes. Feel free to use it to evaluate your query-by-example approaches to keyword spotting (spoken term detection). The database contains 20 hours of utterance audio (the data you search in), ~500 development and ~500 evaluation audio queries (the data you search for), scoring scripts and references.
Here (http://ceur-ws.org/Vol-1043/mediaeval2013_submission_92.pdf) you can find the MediaEval SWS2013 task description, and here (http://ceur-ws.org/Vol-1043/) are the system descriptions of the teams that participated in the MediaEval SWS2013 evaluation. An overview paper discussing the achieved results was published at SLTU 2014 and is available here. If you publish any results based on this SWS2013 database, please cite the paper (bibtex).
According to the spoken language and the recording conditions, the database is organized into 5 subsets:
4 African languages: Isixhosa, Isizulu, Sepedi and Setswana
Recordings come from the Lwazi Corpus. All 4 languages were recorded in similar acoustic conditions and contribute equally both to the search repository and the sets of queries. All files include read speech recorded at 8 kHz through a telephone channel. Queries were obtained by cutting segments from speech utterances not included in the search repository. This subset features speaker mismatch but not channel mismatch between the search utterances and the queries.
Albanian & Romanian
Recordings come from the University Politehnica of Bucharest (SpeeD Research Laboratory). All files include read speech recorded through common PC microphones, originally at 16 kHz and then downsampled to 8 kHz to keep consistency with other subsets. Queries were obtained by cutting segments from speech utterances not included in the search repository. This subset features speaker mismatch and some channel mismatch between the search utterances and the queries, since different microphones on different PCs were used in recordings.
Basque
Speech utterances in the search repository come from the recently created Basque subset of the COST278 Broadcast News database, whereas the queries were specifically recorded for this evaluation. COST278 data include TV broadcast news speech (planned and spontaneous) in clean (studio) and noisy (outdoor) environments, originally sampled at 16 kHz and downsampled to 8 kHz for this evaluation. Three examples per query were read by different speakers and recorded in an office environment using a Roland Edirol R09 digital recorder. The Basque subset features both channel and speaker mismatch between the search utterances and the queries.
Czech
This subset contains conversational (spontaneous) speech obtained from telephone calls into radio live broadcasts, recorded at 8 kHz. The fact that all the recordings contain telephone-quality (i.e. low-quality) speech makes this subset more challenging than others in the database. Queries (10 examples per query, most of them from different speakers) were automatically cut (by forced alignment) from speech utterances not included in the search repository. This subset features speaker mismatch between the search utterances and the queries.
Non-native English
This subset includes lecture speech in English obtained from technical conferences on SuperLectures.com, with speakers ranging from native to strongly accented non-native. Originally recorded at 44 kHz, audio files were downsampled to 8 kHz to keep consistency with other subsets. Queries were automatically extracted (by forced alignment) from speech utterances not included in the search repository. The original recordings were made using a high-quality microphone placed in front of the speaker, but might contain strong reverberation and some far-field channel effects. Therefore, besides speaker mismatch, there could be some channel mismatch between the search utterances and the queries.
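Several of the subsets above were downsampled to 8 kHz. The source does not say which tool or filter was used, so the following is only a minimal sketch of the 16 kHz to 8 kHz case, assuming a Hamming-windowed sinc low-pass filter followed by keeping every second sample. The function names, the 63-tap filter length and the cutoff value are illustrative choices and are not taken from the database documentation.

```python
import math


def lowpass_taps(num_taps=63, cutoff=0.225):
    """Hamming-windowed sinc low-pass filter (assumed design, not the authors').

    cutoff is in cycles per input sample; 0.225 sits just below the
    new Nyquist frequency (0.25) when halving the sampling rate.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalise to unity gain at DC


def downsample_by_2(samples, taps=None):
    """Low-pass filter the signal, then keep every second sample."""
    taps = taps or lowpass_taps()
    half = len(taps) // 2
    out = []
    for centre in range(0, len(samples), 2):
        acc = 0.0
        for k, t in enumerate(taps):
            i = centre + k - half  # filter is centred, so no added delay
            if 0 <= i < len(samples):  # treat samples outside the signal as zero
                acc += t * samples[i]
        out.append(acc)
    return out
```

For example, `downsample_by_2(samples_16k)` returns half as many samples, which can then be treated as 8 kHz audio. In practice a library resampler would normally be preferred; the pure-Python loop is only meant to make the two steps (filter, then decimate) explicit.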
The 9 languages selected for this database cover European and African language families. As a special case, the non-native English subset consists of a mixture of native and non-native English speakers presenting their talks at different events. This subset thus presents a large variability in pronunciation, as it includes, for example, strongly accented Indian English, French English and Chinese English, among others. Another interesting aspect of the database is the variety of speaking styles (read, planned, lecture, spontaneous) and the variety of acoustic (environment/channel) conditions, which forces systems to be built with low/zero resource constraints. The Basque subset is a good example of this variability, with read-speech queries recorded in an office environment and a set of search utterances extracted from TV broadcast news recordings including planned and spontaneous speech from a completely different set of speakers.
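In several subsets the queries were obtained by cutting segments out of longer utterances, in some cases at boundaries found by forced alignment. As a rough illustration of the cutting step only, the sketch below copies a time span from one WAV file into another using Python's standard `wave` module. It assumes uncompressed PCM WAV input and that the start and end times are already known; the function name and the file names in the usage note are hypothetical, and this is not the procedure the data providers actually used.

```python
import wave


def cut_segment(src_path, dst_path, start_s, end_s):
    """Copy the audio between start_s and end_s (seconds) into a new WAV file.

    Returns the number of frames written.
    """
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        first = int(round(start_s * rate))
        # Clamp the end of the segment to the end of the file.
        last = min(int(round(end_s * rate)), src.getnframes())
        if not 0 <= first < last:
            raise ValueError("empty or out-of-range segment")
        src.setpos(first)
        frames = src.readframes(last - first)
        params = src.getparams()
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is corrected on close
        dst.writeframes(frames)
    return last - first
```

A call such as `cut_segment("utterance.wav", "query.wav", 1.20, 1.85)` would write the 0.65 s span to `query.wav`, keeping the sampling rate and sample width of the input.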
Co-author Igor Szoke was supported by the Czech Science Foundation under post-doctoral project No. GPP202/12/P567.