Software

BUT Speech@FIT Reverb Database

This is the first release of BUT Speech@FIT Reverb Database. The database is being built with respect to collect a large number of various Room Impulse Responses, Room environmental noises (or "silences"), Retransmitted speech (for ASR and SID testing), and meta-data (positions of microphones, speakers etc.).

The goal is to provide speech community with a dataset for data enhancement and distant microphone or microphone array experiments in ASR and SID.

The database has Apache 2.0 license and you can download it here:
http://www.fit.vutbr.cz/~szoke/speech/ReverbDB/BUT_ReverbDB_rel_18_11.tgz [126 GB]

Multilingual Region-Dependent Transforms

This shell package allows to extract features based on Region Dependent Transforms (RDT) models from audio files. The features are well suitable mainly for Gaussian Mixture Models (GMM) in Automatic Speech Recognition (ASR) systems but they could be used in other applications as well.

The whole process could be split into 3 steps. Standard PLP-HLDA features are concatenated with Stacked Bottle-Neck Features trained in multilingual fashion on Babel data coming from 17 different languages. This features are going into discriminatively trained RDT transforms on 17 Babel languages which generates final outputs.

BUT/Phonexia Bottleneck feature extractor

This python package allows to extract bottleneck, stacked bottleneck features and phoneme/senones posteriors from audio files. Primarily, bottleneck features are tuned for the task of spoken language recognition but can be used in other applications (e.g. speaker recognition, speech recognition) as well. Package includes three neural networks i.e. there are three types of features one can extract with it. Two networks are trained on English data only and the third is trained in multilingual fashion on data coming from 17 different languages.

Speaker Diarization based on Bayesian HMM with Eigenvoice Priors

This python code implements speaker diarization algorithm described in:

Diez Mireia, Burget Lukáš and Matějka Pavel. Speaker Diarization based on Bayesian HMM with Eigenvoice Priors. In: Proceedings of Odyssey 2018. Les Sables d´Olonne: International Speech Communication Association, 2018, pp. 147-154. ISSN 2312-2846. available at: http://www.fit.vutbr.cz/research/groups/speech/publi/2018/diez_odyssey2018_63.pdf

Abstract:

Nowadays, most speaker diarization methods address the task in two steps: segmentation of the input conversation into (preferably) speaker homogeneous segments, and clustering. Generally, different models and techniques are used for the two steps. In this paper we present a very elegant approach where a straightforward and efficient Variational Bayes (VB) inference in a single probabilistic model addresses the complete SD problem. Our model is a Bayesian Hidden Markov Model, in which states represent speaker specific distributions and transitions between states represent speaker turns. As in the i-vector or JFA models, speaker distributions are modeled by GMMs with parameters constrained by eigenvoice priors. This allows to robustly estimate the speaker models from very short speech segments. The model, which was released as open source code and has already been used by several labs, is fully described for the first time in this paper. We present results and the system is compared and combined with other state-of-the-art approaches. The model provides the best results reported so far on the CALLHOME dataset.

Download: http://www.fit.vutbr.cz/~burget/VB_diarization.zip (attention, the file has 35 MB as it contains example data).

QUESST 2015 Multilingual Database for Query-by-Example Keyword Spotting

Links to files:

If you publish any paper regarding QUESST 2015, please cite Query by Example Search on Speech at Mediaeval 2015 (Igor Szoke, Luis Javier Rodriguez-Fuentes, Andi Buzo, Xavier Anguera, Florian Metze, Jorge Proenca, Martin Lojka, Xiao Xiong)

The task of QUESST ("QUery by Example Search on Speech Task") is to search FOR audio content WITHIN audio content USING an audio query. As in previous years, the search database was collected from heterogeneous sources, covering multiple languages, and under diverse acoustic conditions. Some of these languages are resource-limited, some are recorded in challenging acoustic conditions and some contain heavily accented speech (typically from non-native speakers). No transcriptions, language tags or any other metadata are provided to participants. The task therefore requires researchers to build a language-independent audio-to-audio search system.

QUESST 2014 Multilingual Database for Query-by-Example Keyword Spotting

Link to database: http://speech.fit.vutbr.cz/files/quesst14Database.tgz [1 116 MB]

The QUESST 2014 search dataset consists of 23 hours or around 12.500 spoken documents in the following languages: Albanian, Basque, Czech, non-native English, Romanian and Slovak. The languages were chosen so that relatively little annotated data can be found for them, as would be the case for a ``low resource'' language. The recordings were PCM encoded with 8 KHz sampling rate and 16 bit resolution (down-sampling or re-encoding were done when necessary to homogenize the database). The spoken documents (6.6 seconds long on average) were extracted from longer recordings of different types: read, broadcast, lecture and conversational speech. Besides language and speech type variability, the search dataset also features acoustic environment and channel variability. The distribution of spoken documents per language is shown in Table 1. The database is free for research purposes. Feel free to use the setup for evaluation and comparison of you results with results achieved by others and MediaEval QUESST 2014 evaluations.

Here (http://ceur-ws.org/Vol-1263/mediaeval2014_submission_35.pdf) you can find the MediaEval QUESST 2014 task description, and here (http://ceur-ws.org/Vol-1263/) are the system descriptions of particular teams which participated in MediaEval QUESST 2014 evaluations. An overview paper discussing achieved results is published at ICASSP 2015 and is available here. If you publish any results based on this QUESST2014 database, please refer the paper (bibtex).

Table 1:

According to the spoken language and the recording conditions, the database is organized into 5 language subsets:

SWS 2013 Multilingual Database for Query-by-Example Keyword Spotting

Link to database: http://speech.fit.vutbr.cz/files/sws2013Database.tgz

The database used for the SWS 2013 evaluation has been collected thanks to a joint effort from several participating institutions that provided search utterances and queries on multiple languages and acoustic conditions (see Table 1). The database is available to the community for research purposes. Fell free to evaluate your query-by-example approaches for keyword spotting (spoken term detection). The database contains 20 hours of utterance audio (the data you search in), ~500 development and ~500 evaluation audio queries (the data you search for), scoring scripts and references.

Here (http://ceur-ws.org/Vol-1043/mediaeval2013_submission_92.pdf) you can find the MediaEval SWS2013 task description, and here (http://ceur-ws.org/Vol-1043/) are the system descriptions of particular teams which participated in MediaEval SWS2013 evaluations. An overview paper discussing achieved results was published at SLTU 2014 and is available here. If you publish any results based on this SWS2013 database, please refer the paper (bibtex).

RNNLM Toolkit

Neural network based language models are nowdays among the most successful techniques for statistical language modeling. They can be easily applied in wide range of tasks, including automatic speech recognition and machine translation, and provide significant improvements over classic backoff n-gram models. The 'rnnlm' toolkit can be used to train, evaluate and use such models.

The goal of this toolkit is to speed up research progress in the language modeling field. First, by providing useful implementation that can demonstrate some of the principles. Second, for the empirical experiments when used in speech recognition and other applications. And finally third, by providing a strong state of the art baseline results, to which future research that aims to "beat state of the art techniques" should compare to.

Speech search

Source code of three systems for speech search is available here:

Neural Network Trainer TNet

Thanks for the interest in TNet! At the moment let's consider it as a dead project, as I fully switched the efforts to 'nnet1' recipe in kaldi : http://kaldi.sourceforge.net/dnn1.html You can still use the TNet, but it is not to be extended anymore. Thanks!

Overview:

TNet is a tool for parallel training of neural networks for classification, containing two independent sets of tools, the CPU and GPU tools. The CPU training is based on multithread data-parallelization, the GPU training is implemented in CUDA, both are implementing mini-batch Stochastic Gradient Descent, optimizing per-frame Cross-entropy.

The toolkit contains example of NN training on TIMIT, which can be easily transfered to your data. You may also be interested in hierarchical "Universal Context Network" a.k.a. "Stacked bottleneck netowork", which can be built using one-touch-script : tools/train/tnet_train_uc.sh