Link to database: http://speech.fit.vutbr.cz/files/quesst14Database.tgz [1 116 MB]
The QUESST 2014 search dataset consists of 23 hours or around 12.500 spoken documents in the following languages: Albanian, Basque, Czech, non-native English, Romanian and Slovak. The languages were chosen so that relatively little annotated data can be found for them, as would be the case for a ``low resource'' language. The recordings were PCM encoded with 8 KHz sampling rate and 16 bit resolution (down-sampling or re-encoding were done when necessary to homogenize the database). The spoken documents (6.6 seconds long on average) were extracted from longer recordings of different types: read, broadcast, lecture and conversational speech. Besides language and speech type variability, the search dataset also features acoustic environment and channel variability. The distribution of spoken documents per language is shown in Table 1. The database is free for research purposes. Feel free to use the setup for evaluation and comparison of you results with results achieved by others and MediaEval QUESST 2014 evaluations.
Here (http://ceur-ws.org/Vol-1263/mediaeval2014_submission_35.pdf) you can find the MediaEval QUESST 2014 task description, and here (http://ceur-ws.org/Vol-1263/) are the system descriptions of particular teams which participated in MediaEval QUESST 2014 evaluations. An overview paper discussing achieved results is published at ICASSP 2015 and is available here. If you publish any results based on this QUESST2014 database, please refer the paper (bibtex).
Table 1:

According to the spoken language and the recording conditions, the database is organized into 5 language subsets: