Current projects

IARPA MATERIAL: Machine Translation for English Retrieval of Information in Any Language (2017-2021)

The MATERIAL Program seeks to develop methods for finding speech and text content in low-resource languages that is relevant to domain-contextualized English queries. Such methods must use minimal training data and be rapidly deployable to new languages and domains.

BUT’s task in MATERIAL is to work on automatic speech recognition in the MATERIAL target languages, supported by other technologies such as automatic language identification, which filters out non-target speech data. We are part of the “FLAIR” team coordinated by Raytheon BBN Technologies. BUT's principal investigator in MATERIAL is Dr. Martin Karafiat.


TAČR NOSIČI (2018-2019)

The goals of this project are to (1) improve existing and design new neural network techniques for speech signal processing and speech data mining, mainly in the fields of remote sensing (microphone arrays), training on limited real data, language modeling, speaker recognition and detection of out-of-vocabulary (OOV) words, and (2) prepare the research results for industrial adoption in the form of functioning software, consultations with the industrial partner and intensive transfer of know-how.

BUT is the prime contractor of this project, with Phonexia as an industrial partner. The project is sponsored by the Technology Agency of the Czech Republic under the "Zeta" program. As this program accentuates gender equality, the research team is composed in large part of female researchers and developers from the BUT Speech@FIT group and Phonexia.


Robust SPEAKER DIarization systems using Bayesian inferenCE and deep learning methods - SpeakerDice (2017-2019)

The proposed project deals with Speaker Diarization (SD), which is commonly defined as the task of answering the question "who spoke when?" in a speech recording. The first objective of the proposal is to optimize the Bayesian approach to SD, which has been shown to be promising for the task. For Variational Bayes (VB) inference, which is very sensitive to initialization, we will develop new fast ways of obtaining a good starting point. We will also explore alternative inference methods, such as collapsed VB or collapsed Gibbs sampling, and investigate alternative priors similar to those introduced for Bayesian speaker recognition models. The second part of the proposal is motivated by the huge performance gains that, in recent years, have been brought to other recognition tasks by Deep Neural Networks (DNNs). In the context of SD, DNNs have been used in the computation of i-vectors, but their potential has never been explored for other stages of SD. We will study ways of integrating DNNs in the different stages of SD systems. The objectives of the proposal will be achieved by theoretical work, implementation, and careful testing on real speech data. The outcomes of the project are intended not only for scientific publications, but are also eagerly awaited by the European speech data mining industry (for example Czech Phonexia or Spanish Agnitio). The project is proposed by an excellent female researcher, Dr. Mireia Diez, who finished her thesis in the GTTS group of the University of the Basque Country, one of the most important European labs dealing with speaker recognition and diarization. The proposed host is the Speech@FIT group of Brno University of Technology, with a 20-year track record of top speech data mining research. The proposed research training and the combination of skills of Dr. Diez and the host institution have the potential to advance the state of the art in speaker diarization, provide the applicant with improved career opportunities and benefit European industry.

SpeakerDice is funded by the European Union's Horizon 2020 Marie Sklodowska-Curie Action.


Sequence summarizing neural networks for speaker recognition - SEQUENS (2016-2019)

The proposed project deals with speaker recognition and is motivated by the huge performance gains that, in recent years, have been brought to other recognition tasks by so-called neural networks (NNs). The objective of the proposal is to develop a new type of NN that is suitable for speaker recognition and take it to the state where it is ready for practical use. So far, attempts to take advantage of NNs in speaker recognition have replaced one or more components in the state-of-the-art speaker recognition chain with NN equivalents. However, this approach has the same limitations as the state-of-the-art processing chain in terms of what kinds of patterns in the speech signal can be modeled. Instead, our proposed project aims at replacing the whole speaker recognition chain with one NN that processes whole utterances in one step. This approach should take better advantage of NNs' ability to model complex patterns in the speech signal. The objectives of the proposal will be achieved by theoretical work (derivation of NN structure, training criteria etc.), implementation (parallelization, scalability etc.) and careful testing on real speech data (finding appropriate default settings etc.).

SEQUENS is funded from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie action, co-financed by the South Moravian Region.


End-to-end DNN Speaker recognition system

Text-independent speaker verification (SV) is currently the only bastion in the domain of speech data mining that resists the massive attack of deep neural networks (DNNs). We have already seen the end-to-end DNN approach yield very good performance in the area of text-dependent SV, and DNNs have been very successful in the related domain of spoken language recognition. In this project, we will depart from existing DNN approaches for SV and advance towards full-DNN systems.

This project is financed by a Google Faculty Research Award; its principal investigator is Oldrich Plchot.

DARPA Lorelei (2015-2019)

The goal of the Low Resource Languages for Emergent Incidents (LORELEI) Program is to dramatically advance the state of computational linguistics and human language technology to enable rapid, low-cost development of capabilities for low-resource languages. The program aims at providing situational awareness by identifying elements of information in foreign language and English sources, such as topics, names, events, sentiment and relationships.

BUT works on information mining from speech and concentrates on topic detection, sentiment analysis and system training with few or no resources in the target language. We are part of the “ELISA” team coordinated by the University of Southern California (USC) in Los Angeles.


MoI Dolování infoRmAcí z řeči Pořízené vzdÁlenými miKrofony - DRAPÁK (Information mining in speech acquired by distant microphones) (2014-2020)

Speech data mining is becoming indispensable for units fighting criminality and terrorism. The current versions allow for successful deployment on data acquired from close-talk microphones. The goal of DRAPAK is to increase the performance of speech data mining from distant microphones in real environments and to generate relevant information in corresponding operational scenarios. The output is a set of software tools to be tested by the Police of the Czech Republic and other state agencies.

DRAPAK is supported by the Ministry of Interior of the Czech Republic and is coordinated by BUT, which is responsible for core speech and signal processing R&D. The project is tightly linked to the 2015 Frederick Jelinek workshop group “Far-Field Enhancement and Recognition in Mismatched Settings”. Our partner in the project is Phonexia, responsible for industrial R&D and relations with security-oriented customers.


Important past projects

EU H2020 BIg Speech data analytics for cONtact centers - BISON (2015-2017)

Contact centers (CCs) are an important business for Europe: 35,000 contact centers generate 3.2 million jobs (~1% of Europe’s active population). A typical CC produces a wealth of multilingual spoken data. BISON works toward (1) basic speech data mining technologies (systems quickly adaptable to new languages, domains and CC campaigns), (2) business outcome mining from speech (translated into improvement of CCs' Key Performance Indicators) and (3) CC support systems integrating both speech and business outcome mining in a user-friendly way.

BUT works on speech mining technologies adapted to the CC domain, adaptable to the needs of CC users and capable of making use of CC resources. BISON is coordinated by BUT’s spin-off company Phonexia; the consortium includes eight partners across Europe.


TAČR Meeting assIstaNT - MINT (2014-2017)

The goal of this project is R&D in the field of meeting audio processing (including meetings, team briefings, customer relations, etc.) leading to the creation of a prototype of an intelligent meeting assistant that helps during a meeting (on-line), with the processing of meeting minutes (off-line), and with the subsequent storage and sharing of meeting-related materials.

BUT coordinates this project and is responsible for the core speech data mining R&D. We have partnered with Phonexia (prototype integration, production aspects of speech data mining, speech I/O), Lingea (terminology, natural language processing and translation) and Tovek (data mining from heterogeneous resources, use cases).


DARPA RATS (2010-2017)

Existing speech signal processing technologies are inadequate for most noisy or degraded speech signals that are important to military intelligence. The Robust Automatic Transcription of Speech (RATS) program is creating algorithms and software for performing the following tasks on potentially speech-containing signals received over communication channels that are extremely noisy and/or highly distorted: Speech Activity Detection (SAD), Language Identification (LID), Speaker Identification (SID) and Key Word Spotting (KWS).

BUT’s task in RATS is to work on robust techniques for SAD, LID and SID, especially using neural-network based algorithms. We are part of the “RATS-Patrol” team coordinated by Raytheon BBN.


IARPA BABEL (2012-2016)

The Babel Program develops agile and robust speech recognition technology that can be rapidly applied to any human language in order to provide effective search capability for analysts to efficiently process massive amounts of real-world recorded speech. Today's transcription systems are built on technology that was originally developed for English, with markedly lower performance on non-English languages. These systems have often taken years to develop and cover only a small subset of the languages of the world. Babel intends to demonstrate the ability to generate a speech transcription system for any new language within one week to support keyword search performance for effective triage of massive amounts of speech recorded in challenging real-world situations.

BUT’s task in Babel is to develop algorithms and solutions for fast prototyping of recognizers in ever shorter times and on ever smaller amounts of data (note that the “VLLP” condition has only 3 hours of training data). We are part of the “Babelon” team coordinated by Raytheon BBN.


EU FP7 Applying Pilot Models for Safer Aircraft A-PiMod (2013-2016)

Within the A-PiMod project a hybrid of multimodal pilot (crew) interaction, operator modeling and real-time risk assessment approaches to adaptive automation is suggested for the design of future cockpit systems.

A-PiMod is coordinated by Deutsches Zentrum für Luft- und Raumfahrt (DLR, the “German NASA”). BUT closely cooperates with Honeywell Brno. BUT’s activities have two tracks: the graphics/video group works on gaze and gesture detection, while the speech group works on in-cockpit speech recognition (combining grammar-based approaches and classical LVCSR).


MoI Zpřístupnění Automatického Ověřování Mluvčího širokému spektru uživatelů v oblasti bezpečnosti (Enabling automatic speaker verification to broad spectrum of users in the security domain) – ZAOM (2013-2015)

The past few years have witnessed substantial progress in the theory and algorithmization of speaker recognition (SRE). ZAOM aimed at adapting SRE algorithms to the specific needs of police and intelligence services, in order to (1) provide precise but easy-to-understand visualization so that responsible personnel obtain the timely information needed to cope with threats and to speed up investigation, and (2) be able to adapt systems to target user data and substantially improve their performance.

ZAOM was supported by the Ministry of Interior of the Czech Republic and was coordinated by BUT, which was responsible for core speech and signal processing R&D. Phonexia was our industrial partner, responsible for the development part and interaction with security-oriented customers. An important output of the project is our proposal of the Voice Biometry Standard.


TAČR Technologie zpracování Řeči Pro efektIvní komunikaci člověk-počíTač (Technologies of speech processing for efficient human-machine communication) – TAČR TŘPIT (2011-2014)

The project aimed at the development of advanced techniques in speech recognition and their deployment in functional applications: search in electronic dictionaries on mobile devices, dictating translations, defense and security, dialogue systems, client-care systems (CRM, helpdesk etc.) and audio-visual access to teaching materials.

BUT coordinated this project and partnered with Phonexia (security and defense applications), Lingea (electronic dictionaries) and Optimsys (interactive voice response (IVR) systems). The main output of BUT is the lecture browsing system, which is available online.


IARPA BEST (2009-2011)

The IARPA Biometrics Exploitation Science & Technology (BEST) program sought to significantly advance the state of the science for biometrics technologies. The overarching goals for the program were (1) to significantly advance the Intelligence Community's (IC) ability to achieve high-confidence match performance, even when the features are derived from non-ideal data, and (2) to significantly relax the constraints currently required to acquire high-fidelity biometric signatures.

BUT was part of the PRISM team coordinated by the STAR laboratory of SRI International in Menlo Park, CA, USA. We were working on high-level features for speaker recognition (SRE). Among the notable achievements were the advances on the multinomial distribution describing discrete features for SRE and the definition of the PRISM data set.


EOARD Improving the capacity of language recognition systems to handle rare languages using radio broadcast data (2008-2010)

This project proposed to fill the gap of insufficient training data for language recognition (LRE) by using the data acquired from public sources, namely radio broadcasts.

The project was financed by the U.S. Air Force European Office of Aerospace Research & Development (EOARD). This work helped NIST and LDC to generate data for the NIST 2009 language recognition evaluation. See the technical report for details.


MoI Overcoming the language barrier complicating investigation into financing terrorism and serious financial crimes (2007-2010)

The project aimed at bringing speech data mining technologies to the use of the Czech national security community.

The project was supported by the Ministry of Interior of the Czech Republic. BUT was a member of a consortium that included the University of West Bohemia in Pilsen and the Technical University of Liberec. In addition to advances in language recognition, speaker recognition and speech transcription, the project produced a very valuable Czech spontaneous speech database that still serves R&D in Czech ASR. It also started the tradition of annual meetings of Czech speech researchers with members of the national security community.


MoC Multilingual recognition and search in speech for electronic dictionaries (2009-2013)

The project aimed at research, development and assessment of technologies for prototyping of speech recognition and search systems with only a few hours of transcribed training data, without the need for phonetic or linguistic expertise. These technologies were tested in the domain of electronic dictionaries.

The project was supported by the Ministry of Trade and Commerce of the Czech Republic under the “TIP” program. It was coordinated by Lingea; BUT was responsible for the development of training paradigms requiring small amounts of training data. The MoC project contributed to the definition of Subspace Gaussian Mixture Models (SGMMs) and allowed us to jump-start the work under IARPA Babel.


EU FP7 MObile BIOmetry (MOBIO, 2007-2010)

The concept of the MOBIO project was to develop new mobile services secured by biometric authentication means. Scientific and technical objectives included robust-to-illumination face authentication, robust-to-noise speaker authentication, joint bi-modal authentication, model adaptation and scalability.

The project was coordinated by the IDIAP research institute. BUT concentrated on algorithms for robust and computationally inexpensive speaker verification. This work was strongly linked to the landmark JHU 2008 workshop “Robust Speaker Recognition Over Varying Channels” that gave birth to i-vectors (the dominant paradigm in speaker recognition nowadays). At ICASSP 2011 in Prague, Ondrej Glembek received the Ganesh Ramaswamy prize for his paper “Simplification and Optimization of i-vector extraction”, supported by MOBIO.


EU FP6 Detection and Identification of Rare Audio-visual Cues – DIRAC (2006-2010)

Unexpected rare events are potentially information-rich but still poorly processed by today's computing systems. The DIRAC project addressed this crucial machine weakness and developed methods for environment-adaptive autonomous artificial cognitive systems that detect, identify and classify possibly threatening rare events from the information derived by multiple active information-seeking audio-visual sensors.

The project was coordinated by the Hebrew University of Jerusalem (scientific coordination) and Carl von Ossietzky University Oldenburg (administrative coordination). BUT mainly worked on out-of-vocabulary (OOV) word detection and handling.


EU FP4-6 Multi Modal Meeting Manager (M4), Augmented Multi-party Interaction (AMI) and Augmented Multi-party Interaction with Distance Access (AMIDA) (2002-2009)

This series of projects has laid solid foundations in multiple research areas related to human-human interaction modeling, computer-enhanced human-human communication (especially in the context of face-to-face and remote meetings), social communication sensing, and social signal processing.

The projects were coordinated by the IDIAP research institute (scientific coordination) and the University of Edinburgh (administrative coordination). A notable output of M4/AMI/AMIDA is the AMI meeting corpus, a valuable resource for training ASR of spontaneous non-native English. The work of BUT concentrated on the ASR of meetings; the AMI ASR team was headed by Thomas Hain from the University of Sheffield.


EU FP6 Content Analysis and REtrieval Technologies to Apply Knowledge Extraction to massive Recording – CARETAKER (2006-2008)

The project aimed at studying, developing and assessing multimedia knowledge-based content analysis, knowledge extraction components, and metadata management sub-systems in the context of automated situation awareness, diagnosis and decision support. It focused on the extraction of structured knowledge from large multimedia collections recorded over networks of cameras and microphones deployed in real sites.

The project was coordinated by Thales Communications. BUT worked on both video and audio analysis. In audio, we were applying the know-how from speech recognition to the identification of rare audio events.


EU IST HLT Speech-driven Interfaces for Consumer Devices – SpeeCon (2000-2003)

During the lifetime of the project, originally scheduled to last two years, partners collected speech data for 18 languages or dialectal zones, including most of the languages spoken in the EU. SpeeCon devoted special attention to the environment of the recordings - at home, in the office, in public places or in moving vehicles.

The project was coordinated by Siemens R&D. BUT, together with the Czech Technical University in Prague, was sub-contracted by Harman/Becker to collect the data for Czech. Czech (as well as other) SpeeCon databases are currently available from ELRA.


EU IST HLT Eastern European Speech Databases for Creation of Voice Driven Teleservices – SpeechDat-E (1998-2000)

The project focused on Spoken Language Resources, namely speech databases for fixed telephone networks including associated annotations and pronunciation lexica. Speech from 2500 speakers was collected for Russian and from 1000 speakers each for Czech, Slovak, Polish and Hungarian.

The project was coordinated by Lernout & Hauspie. BUT, together with the Czech Technical University in Prague, worked, as you might expect, on Czech. The project was the first EU project funded at BUT and we worked on it while still at the “old” Faculty of Electrical Engineering and Computer Science (the transition of the speech group to FIT happened only in 2002). Czech (as well as other) SpeechDat-E databases are currently available from ELRA.