Answer to EC Tender CNECT/LUX/2022/OP/0030 - LANGUAGE TECHNOLOGY SOLUTIONS Lot 2 (2023-2024)
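Jazyková paměť regionů České republiky. Metody strojového učení pro uchování, dokumentaci a prezentaci nářečí českého jazyka (Language memory of the regions of the Czech Republic: machine learning methods for the preservation, documentation and presentation of Czech dialects) (2023-2027)
Praktické ověření možnosti integrace umělé inteligence pro příjem tísňových volání pomocí hlasového chatbota, vyvinutého v rámci výzkumného projektu BV č. VI20192022169, s technologií pro příjem tísňové komunikace 112 a 150 v ČR (TCTV 112) (Practical verification of the possibility of integrating artificial intelligence for emergency call reception, using the voice chatbot developed within the research project BV No. VI20192022169, with the technology for receiving emergency communication on 112 and 150 in the Czech Republic) (2023-2025)
Robustní zpracování nahrávek pro operativu a bezpečnost (ROZKAZ) řešení (Robust processing of recordings for operations and security) (2020-2025)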
H2020 ESPERANTO - Exchanges for SPEech ReseArch aNd TechnOlogies (2021-2024)
Multi-lingualita v řečových technologií (Multi-linguality in speech technologies) (2020-2023)
HumanE AI Network (2020-2023)
GACR EXPRO Neural Representations in multi-modal and multi-lingual modeling - NEUREM3 (2019-2023)

The NEUREM3 project encompasses basic research in speech processing (SP) and natural language processing (NLP) with an emphasis on multi-linguality and multi-modality (speech and text processing with the support of visual information). Current deep machine learning methods are based on continuous vector representations that are created by the neural networks (NNs) themselves during training. Although the empirical results of such NNs are often excellent, our knowledge and understanding of these representations is insufficient. NEUREM3 has the ambition to fill this gap and to study neural representations for speech and text units of different scopes (from phonemes and letters to whole spoken and written documents), acquired both for isolated tasks and in multi-task setups. NEUREM3 will also improve NN architectures and training techniques, so that they can be trained on incomplete or incoherent data.

The project is supported by the program "excellence in basic research" (EXPRO) of the Czech Science Foundation (GACR). We are working with partners from Charles University in Prague. The PI is Lukas Burget.


H2020 WELCOME - Multiple Intelligent Conversation Agent Services for Reception, Management and Integration of Third Country Nationals (2020-2023)

WELCOME is an EU-funded project which aims to research and develop intelligent technologies to support the reception and integration of migrants and refugees in Europe. WELCOME will offer a personalized and psychologically and socially competent solution for both migrants and refugees and public administrations. It will develop immersive and intelligent services, in which embodied intelligent multilingual agents will act as dedicated personalized assistants of migrants and refugees in contexts of registration, orientation, language teaching, civic education, and social and societal inclusion.


The project is funded by the European Horizon 2020 Migration program. It is coordinated by Leo Wanner from Universitat Pompeu Fabra Barcelona. BUT is working on multi-linguality in ASR and on language recognition; we're also busy with integration issues. Katia Egorova and Jan Svec are the main BUT people on WELCOME.

TACR DEEPSY - Deep learning in psychotherapy: Machine learning applied on therapeutic session recordings (2020-2023)

Psychotherapy is an expert activity requiring continuous decision-making and continuous evaluation of the course of the psychotherapeutic process by the psychotherapist. In practice, however, psychotherapists suffer from a lack of immediate feedback to support these decisions. The project aims to create a tool for automated analysis of audio recordings of psychotherapeutic sessions that gives psychotherapists feedback on the course of a session within a short time. The project draws on automatic speech recognition, natural language processing, machine learning, expert coding of the psychotherapeutic process and self-assessment questionnaire methods. Its expected outcome is software providing psychotherapists with user-friendly and practically beneficial feedback with the potential to improve psychotherapeutic care.

The project was proposed in cooperation between Brno University of Technology and Masaryk University, and it is funded by the Technology Agency of the Czech Republic within the ETA program. The PI of the project is Pavel Matejka.

Past European

H2020 HAAWAII - Highly Automated Air Traffic Controller Workstations with Artificial Intelligence Integration (2020-2022)

Advanced automation support developed in Wave 1 of SESAR IR includes the use of automatic speech recognition (ASR) to reduce the amount of manual data input by air-traffic controllers. The HAAWAII project aims to research and develop a reliable, error-resilient and adaptable solution to automatically transcribe voice commands issued by both air-traffic controllers and pilots. The project will build on a very large collection of data, organized with minimum expert effort, to develop a new set of models for the complex environments of Icelandic en-route and London TMA. HAAWAII aims to perform proof-of-concept trials in challenging environments, i.e. to be directly connected with real-life data from the ops room. As pilot read-back error detection is the main application, HAAWAII aims to significantly enhance the validity of the speech recognition models.


The project is funded by the European Horizon 2020 SESAR Joint Undertaking. It is coordinated by DLR (the "German NASA"). BUT is working on NLP issues and ASR; our PI in HAAWAII is Pavel Smrz and the speech part is coordinated by Franta Grezl.

H2020 ROXANNE - Real time network, text, and speaker analytics for combating organized crime (2019-2022)

ROXANNE (Real time network, text, and speaker analytics for combating organized crime) is an EU-funded collaborative research and innovation project that aims to unmask criminal networks and their members, and to reveal the true identity of perpetrators, by combining the capabilities of speech/language technologies and visual analysis with network analysis.

ROXANNE collaborates with Law Enforcement Agencies (LEAs), industry and researchers to develop new tools to speed up investigative processes and support LEA decision-making. The end-product will be an advanced technical platform which uses new tools to uncover and track organized criminal networks, underpinned by a strong legal framework.


The project is funded by the European Horizon 2020 Security program. The consortium comprises 24 European organisations from 16 countries, of which 11 are LEAs from 10 different countries. It is coordinated at IDIAP by BUT alumnus Petr Motlicek. BUT is leading workpackage 5 "Speech, text and video data analysis". Honza Cernocky and Johan Rohdin are BUT co-PIs in ROXANNE.

H2020 ATCO2 - Automatic collection and processing of voice data from air-traffic communications (2019-2022)

The ATCO2 project aims to develop a unique platform for collecting, organizing and pre-processing air-traffic control (voice communication) data from air space. Initially, the project will consider the real-time voice communication between air-traffic controllers and pilots available either directly through publicly accessible radio frequency channels, or indirectly from air-navigation service providers (ANSPs). In addition to the voice communication, the contextual information available in the form of metadata (i.e. surveillance data) will be exploited.


The project is funded by the European Horizon 2020 CleanSky 2 program. It is coordinated by IDIAP. We are proud that BUT spin-off ReplayWell is a partner as well. BUT is working on ASR (issues of robustness, semi-supervised and unsupervised training); our PI in ATCO2 is Karel Vesely.

H2020 Marie Curie Robust End-To-End SPEAKER recognition based on deep learning and attention models (2019-2021)

This project focuses on automatic speaker recognition (SID), the task of determining the identity of the speaker in a speech recording. We aim for end-to-end SID where the system is optimized as a whole for the target task.


This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 843627. It supports Dr. Alicia Lozano-Diez, an excellent female researcher with a Ph.D. from Audias (Universidad Autonoma de Madrid, Spain), during her post-doctoral stay in Brno.

Robust SPEAKER DIarization systems using Bayesian inferenCE and deep learning methods - SpeakerDice (2017-2019)

The proposed project deals with Speaker Diarization (SD), which is commonly defined as the task of answering the question "who spoke when?" in a speech recording. The first objective of the proposal is to optimize the Bayesian approach to SD, which has been shown to be promising for the task. For Variational Bayes (VB) inference, which is very sensitive to initialization, we will develop new fast ways of obtaining a good starting point. We will also explore alternative inference methods, such as collapsed VB or collapsed Gibbs Sampling, and investigate alternative priors similar to those introduced for Bayesian speaker recognition models. The second part of the proposal is motivated by the huge performance gains that, in recent years, have been brought to other recognition tasks by Deep Neural Networks (DNNs). In the context of SD, DNNs have been used in the computation of i-vectors, but their potential was never explored for other stages of SD. We will study ways of integrating DNNs in the different stages of SD systems. The objectives of the proposal will be achieved by theoretical work, implementation, and careful testing on real speech data. The outcomes of the project are intended not only for scientific publications, but are also eagerly awaited by the European speech data mining industry (for example Czech Phonexia or Spanish Agnitio). The project is proposed by an excellent female researcher, Dr. Mireia Diez, who finished her thesis in the GTTS group of the University of the Basque Country, one of the most important European labs dealing with speaker recognition and diarization. The proposed host is the Speech@FIT group of Brno University of Technology, with a 20-year track record of top speech data mining research. The proposed research training and the combination of skills of Dr. Diez and the host institution have a good chance to advance the state of the art in speaker diarization, provide the applicant with improved career opportunities and benefit European industry.

SpeakerDice is funded from the European Union's Horizon 2020 Marie Sklodowska Curie Action.


Sequence summarizing neural networks for speaker recognition - SEQUENS (2016-2019)

The proposed project deals with speaker recognition and is motivated by the huge performance gains that, in recent years, have been brought to other recognition tasks by so-called neural networks (NNs). The objective of the proposal is to develop a new type of NN that is suitable for speaker recognition and take it to the state where it is ready for practical use. So far, attempts to take advantage of NNs in speaker recognition have replaced one or more components in the state-of-the-art speaker recognition chain with NN equivalents. However, this approach has the same limitations as the state-of-the-art processing chain in terms of what kinds of patterns in the speech signals can be modeled. Instead, our proposed project aims at replacing the whole speaker recognition chain with one NN that processes whole utterances in one step. This approach should take better advantage of NNs' ability to model complex patterns in the speech signals. The objectives of the proposal will be achieved by theoretical work (derivation of NN structure, training criteria etc.), implementation (parallelization, scalability etc.) and careful testing on real speech data (finding appropriate default settings etc.).

SEQUENS is funded from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie action, co-financed by the South Moravian Region.


EU H2020 BIg Speech data analytics for cONtact centers - BISON (2015-2017)

Contact centers (CC) are an important business for Europe: 35,000 contact centers generate 3.2 million jobs (~1% of Europe’s active population). A typical CC produces a wealth of multilingual spoken data. BISON works toward (1) basic speech data mining technologies (systems quickly adaptable to new languages, domains and CC campaigns), (2) business outcome mining from speech (translated into improvement of CCs' Key Performance Indicators) and (3) CC support systems integrating both speech and business outcome mining in a user-friendly way.

BUT works on speech mining technologies adapted to the CC domain, adaptable to the needs of CC users and capable of making use of CC resources. BISON is coordinated by BUT’s spin-off company Phonexia; the consortium includes eight partners across Europe.


EU FP7 Applying Pilots Models for Safer Aircraft A-PiMod (2013-2016)

Within the A-PiMod project a hybrid of multimodal pilot (crew) interaction, operator modeling and real-time risk assessment approaches to adaptive automation is suggested for the design of future cockpit systems.

A-PiMod is coordinated by Deutsches Zentrum für Luft- und Raumfahrt (DLR, the “German NASA”). BUT closely cooperates with Honeywell Brno. BUT’s activities have two tracks: the graphics/video group works on gaze and gesture detection, while the speech guys are on in-cockpit speech recognition (combining grammar-based approaches and classical LVCSR).


EU FP7 MObile BIOmetry (MOBIO, 2007-2010)

The concept of the MOBIO project was to develop new mobile services secured by biometric authentication means. Scientific and technical objectives included robust-to-illumination face authentication, robust-to-noise speaker authentication, joint bi-modal authentication, model adaptation and scalability.

The project was coordinated by the IDIAP research institute. BUT concentrated on algorithms for robust and computationally inexpensive speaker verification. This work was strongly linked to the landmark JHU 2008 workshop “Robust Speaker Recognition Over Varying Channels” that gave birth to i-vectors (the dominating paradigm in speaker recognition nowadays). At ICASSP 2011 in Prague, Ondrej Glembek received the Ganesh Ramaswamy prize for his paper “Simplification and Optimization of i-vector extraction” supported by MOBIO.


EU FP6 Detection and Identification of Rare Audio-visual Cues – DIRAC (2006-2010)

Unexpected rare events are potentially information-rich but still poorly processed by today's computing systems. The DIRAC project addressed this crucial machine weakness and developed methods for environment-adaptive autonomous artificial cognitive systems that detect, identify and classify possibly threatening rare events from the information derived by multiple active information-seeking audio-visual sensors.

The project was coordinated by the Hebrew University Of Jerusalem (scientific coordination) and Carl von Ossietzky University Oldenburg (administrative coordination). BUT mainly worked on out-of-vocabulary (OOV) word detection and handling.


EU FP4-6 Multi Modal Meeting Manager (M4), Augmented Multi-party Interaction (AMI) and Augmented Multi-party Interaction with Distance Access (AMIDA) (2002-2009)

This series of projects laid solid foundations in multiple research areas related to human-human interaction modeling, computer-enhanced human-human communication (especially in the context of face-to-face and remote meetings), social communication sensing, and social signal processing.

The projects were coordinated by the IDIAP research institute (scientific coordination) and the University of Edinburgh (administrative coordination). A notable output of M4/AMI/AMIDA is the AMI meeting corpus – a valuable resource to train ASR of spontaneous non-native English. The work of BUT concentrated on the ASR of meetings; the AMI ASR team was headed by Thomas Hain from the University of Sheffield.


EU FP6 Content Analysis and REtrieval Technologies to Apply Knowledge Extraction to massive Recording – CARETAKER (2006-2008)

The project aimed at studying, developing and assessing multimedia knowledge-based content analysis, knowledge extraction components, and metadata management sub-systems in the context of automated situation awareness, diagnosis and decision support. It focused on the extraction of structured knowledge from large multimedia collections recorded over networks of cameras and microphones deployed in real sites.

The project was coordinated by Thales Communications. BUT worked on both video and audio analysis. In audio, we were applying the know-how from speech recognition to the identification of rare audio events.


EU IST HLT Speech-driven Interfaces for Consumer Devices – SpeeCon (2000-2003)

During the lifetime of the project, originally scheduled to last two years, partners collected speech data for 18 languages or dialectal zones, including most of the languages spoken in the EU. SpeeCon devoted special attention to the environment of the recordings - at home, in the office, in public places or in moving vehicles.

The project was coordinated by Siemens R&D. BUT, together with the Czech Technical University in Prague, was sub-contracted by Harman/Becker to collect the data for Czech. Czech (as well as other) SpeeCon databases are currently available from ELRA.


EU IST HLT Eastern European Speech Databases for Creation of Voice Driven Teleservices – SpeechDat-E (1998-2000)

The project focused on Spoken Language Resources, namely speech databases for fixed telephone networks including associated annotations and pronunciation lexica. Speech from 2500 speakers was collected for Russian and from 1000 speakers for Czech, Slovak, Polish and Hungarian.

The project was coordinated by Lernout & Hauspie. BUT took part together with the Czech Technical University in Prague and, as you might expect, we were working on Czech. The project was the first EU project funded at BUT and we worked on it while still at the “old” Faculty of Electrical Engineering and Computer Science (the transition of the speech group to FIT happened only in 2002). Czech (as well as other) SpeechDat-E databases are currently available from ELRA.


Past US

IARPA MATERIAL: Machine Translation for English Retrieval of Information in Any Language (2017-2021)

The MATERIAL Program seeks to develop methods for finding speech and text content in low-resource languages that is relevant to domain-contextualized English queries. Such methods must use minimal training data and be rapidly deployable to new languages and domains.

BUT’s task in MATERIAL is to work on automatic speech recognition in the MATERIAL target languages, supported by other technologies, such as automatic language identification to filter out non-target speech data. We are part of the “FLAIR” team coordinated by Raytheon BBN Technologies. BUT's principal investigator in MATERIAL is Dr. Martin Karafiat.


End-to-end DNN Speaker recognition system

Text-independent speaker verification (SV) is currently the only bastion in the domain of speech data mining that resists the massive attack of deep neural networks (DNNs). We have already seen the end-to-end DNN approach yield very good performance in the area of text-dependent SV, and DNNs have been very successful in the related domain of spoken language recognition. In this project, we will depart from existing DNN approaches for SV and advance towards full-DNN systems.

This project is financed by a Google Faculty Research Award; its principal investigator is Oldrich Plchot.

DARPA LORELEI (2015-2019)

The goal of the Low Resource Languages for Emergent Incidents (LORELEI) Program is to dramatically advance the state of computational linguistics and human language technology to enable rapid, low-cost development of capabilities for low-resource languages. The program aims at providing situational awareness by identifying elements of information in foreign language and English sources, such as topics, names, events, sentiment and relationships.

BUT works on information mining from speech and concentrates on topic detection, sentiment analysis and system training with few or no resources in the target language. We are part of the “ELISA” team coordinated by the University of Southern California (USC) in L.A.


DARPA RATS (2010-2017)

Existing speech signal processing technologies are inadequate for most noisy or degraded speech signals that are important to military intelligence. The Robust Automatic Transcription of Speech (RATS) program is creating algorithms and software for performing the following tasks on potentially speech-containing signals received over communication channels that are extremely noisy and/or highly distorted: Speech Activity Detection (SAD), Language Identification (LID), Speaker Identification (SID) and Key Word Spotting (KWS).

BUT’s task in RATS is to work on robust techniques for SAD, LID and SID, especially using neural-network based algorithms. We are part of the “RATS-Patrol” team coordinated by Raytheon BBN.


IARPA BABEL (2012-2016)

The Babel Program develops agile and robust speech recognition technology that can be rapidly applied to any human language in order to provide effective search capability for analysts to efficiently process massive amounts of real-world recorded speech. Today's transcription systems are built on technology that was originally developed for English, with markedly lower performance on non-English languages. These systems have often taken years to develop and cover only a small subset of the languages of the world. Babel intends to demonstrate the ability to generate a speech transcription system for any new language within one week to support keyword search performance for effective triage of massive amounts of speech recorded in challenging real-world situations.

BUT’s task in Babel is to develop algorithms and solutions for fast prototyping of recognizers in ever shorter times and on ever smaller amounts of data (note that the “VLLP” condition has only 3 hours of training data). We are part of the “Babelon” team coordinated by Raytheon BBN.


IARPA BEST (2009-2011)

The IARPA Biometrics Exploitation Science & Technology (BEST) program sought to significantly advance the state of the science for biometrics technologies. The overarching goals of the program were (1) to significantly advance the Intelligence Community's (IC) ability to achieve high-confidence match performance, even when the features are derived from non-ideal data, and (2) to significantly relax the constraints currently required to acquire high-fidelity biometric signatures.

BUT was part of the PRISM team coordinated by the STAR laboratory of SRI International in Menlo Park, CA, USA. We were working on high-level features for speaker recognition (SRE). Among the notable achievements were the advances in modeling discrete features for SRE with multinomial distributions and the definition of the PRISM data set.


EOARD Improving the capacity of language recognition systems to handle rare languages using radio broadcast data (2008-2010)

This project proposed to fill the gap of insufficient training data for language recognition (LRE) by using the data acquired from public sources, namely radio broadcasts.

The project was financed by the U.S. Air Force European Office of Aerospace Research & Development (EOARD). This work helped NIST and LDC to generate data for the NIST 2009 language recognition evaluation. See the technical report for details.


Past Czech

MoI Employment of artificial intelligence into an emergency call reception (AI v TiV) (2019-2022)

The project focuses on research and development of artificial intelligence technologies for automated reception and processing of emergency calls in the environment of the integrated rescue system by means of a voice chatbot (HCHB).

BUT is a member of a consortium coordinated by

We are responsible for R&D in speech data mining and for data processing; the BUT PI is Ondrej Glembek.

MoI Dolování infoRmAcí z řeči Pořízené vzdÁlenými miKrofony - DRAPÁK (Information mining in speech acquired by distant microphones) (2014-2020)

Speech data mining is becoming indispensable for units fighting criminality and terrorism. The current versions allow for successful deployment on data acquired from close-talk microphones. The goal of DRAPAK is to increase the performance of speech data mining from distant microphones in real environments and to generate relevant information in corresponding operational scenarios. The output is a set of software tools to be tested by the Police of the Czech Republic and other state agencies.

DRAPAK is supported by the Ministry of Interior of the Czech Republic and is coordinated by BUT, which is responsible for core speech and signal processing R&D. The project is tightly linked to the 2015 Frederick Jelinek workshop group “Far-Field Enhancement and Recognition in Mismatched Settings”. Our partner in the project is Phonexia, responsible for industrial R&D and relations with security-oriented customers.


TAČR NOSIČI (2018-2019)

The goals of this project are to (1) improve existing and design new neural network techniques for speech signal processing and speech data mining, mainly in the fields of remote sensing (microphone arrays), training on limited real data, language modeling, speaker recognition and detection of out-of-vocabulary (OOV) words, and (2) prepare the research results for industrial adoption in the form of functioning software, consultations with the industrial partner and intensive transfer of know-how.

BUT is the prime contractor of this project, with Phonexia as an industrial partner. The project is sponsored by the Technology Agency of the Czech Republic under the "Zeta" program. As this program accentuates gender equality, the research team is composed in large part of female researchers and developers from the BUT Speech@FIT group and Phonexia.


TAČR Meeting assIstaNT - MINT (2014-2017)

The goal of this project is R&D in the field of meeting audio processing (including meetings, team briefings, customer relations, etc.) leading to the creation of a prototype of an intelligent meeting assistant that helps during a meeting (on-line), with the processing of meeting minutes (off-line), and with the subsequent storage and sharing of meeting-related materials.

BUT coordinates this project and is responsible for the core speech data mining R&D. We have partnered with Phonexia (prototype integration, production aspects of speech data mining, speech I/O), Lingea (terminology, natural language processing and translation) and Tovek (data mining from heterogeneous resources, use cases).


MoI Zpřístupnění Automatického Ověřování Mluvčího širokému spektru uživatelů v oblasti bezpečnosti (Enabling automatic speaker verification to broad spectrum of users in the security domain) – ZAOM (2013-2015)

The past few years have witnessed substantial progress in the theory and algorithmization of speaker recognition (SRE). ZAOM aimed at adapting SRE algorithms to the specific needs of police and intelligence services, in order to (1) provide precise but easy-to-understand visualization so that responsible personnel obtain timely information needed to cope with threats and to speed up investigation, and (2) be able to adapt systems to target user data and substantially improve their performance.

ZAOM was supported by the Ministry of Interior of the Czech Republic and was coordinated by BUT, which was responsible for core speech and signal processing R&D. Phonexia was our industrial partner, responsible for the development part and interaction with security-oriented customers. An important output of the project is our proposal of the Voice Biometry Standard.


TAČR Technologie zpracování Řeči Pro efektIvní komunikaci člověk-počíTač (Technologies of speech processing for efficient human-machine communication) – TAČR TŘPIT (2011-2014)

The project aimed at the development of advanced techniques in speech recognition and their deployment in functional applications: search in electronic dictionaries on mobile devices, dictating translations, in defense and security, in dialogue systems, in client-care systems (CRM, helpdesk etc.) and in audio-visual access to teaching materials.

BUT coordinated this project and partnered with Phonexia (security and defense applications), Lingea (electronic dictionaries) and Optimsys (interactive voice response (IVR) systems). The main output of BUT is the lecture browsing system now available at and


MoI Overcoming the language barrier complicating investigation into financing terrorism and serious financial crimes (2007-2010)

The project aimed at bringing speech data mining technologies to the use of the Czech national security community.

The project was supported by the Ministry of Interior of the Czech Republic. BUT was a member of a consortium including the University of West Bohemia in Pilsen and the Technical University of Liberec. In addition to advances in language recognition, speaker recognition and speech transcription, the project produced a very valuable Czech spontaneous speech database that still serves R&D in the ASR of Czech. It also started the tradition of annual meetings of Czech speech researchers with members of the national security community.


MoC Multilingual recognition and search in speech for electronic dictionaries (2009-2013)

The project aimed at research, development and assessment of technologies for prototyping of speech recognition and search systems with only a few hours of transcribed training data, without the need for phonetic or linguistic expertise. These technologies were tested in the domain of electronic dictionaries.

The project was supported by the Ministry of Trade and Commerce of the Czech Republic under the “TIP” program. It was coordinated by Lingea; BUT was responsible for the development of training paradigms requiring small amounts of training data. The MoC project contributed to the definition of Subspace Gaussian Mixture Models (SGMMs) and it allowed us to jump-start the work under IARPA Babel.