Phoneme recognizer based on long temporal context

The phoneme recognizer was developed at Brno University of Technology, Faculty of Information Technology, and has been successfully applied to tasks including language identification [4], indexing and search of audio records, and keyword spotting [5]. The main purpose of this distribution is research. Outputs from this phoneme recognizer can be used as a baseline for subsequent processing, for example phonotactic language modeling.

Authors:

Petr Schwarz, Pavel Matejka, Lukas Burget, Ondrej Glembek

Description

  • Split temporal context (STC) [1, 2, 3] based feature extraction
  • Neural network classifiers
  • Viterbi algorithm is used for phoneme string decoding
  • The English system was trained on the TIMIT database
  • Czech, Hungarian and Russian systems were trained on the SpeechDat-E databases
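
The decoding step listed above can be illustrated with a toy example. The following is a minimal sketch of Viterbi decoding over frame-level phoneme posteriors, assuming one state per phoneme and a log-domain penalty added on every phoneme change. The function name viterbi_decode, the single-state topology and the example posteriors are illustrative assumptions; they are not taken from the phnrec source code, whose model topology is not described here.

```python
import math


def viterbi_decode(log_post, penalty=0.0):
    """Find the best phoneme path through per-frame log-posteriors.

    log_post: list of frames, each a list of log-scores (one per phoneme).
    penalty:  log-domain score added whenever the path changes phoneme
              (negative values discourage insertions).
    Returns a list of (start_frame, end_frame, phoneme_index) segments,
    with end_frame exclusive.
    """
    n_frames = len(log_post)
    n_phn = len(log_post[0])
    score = list(log_post[0])  # best path score ending in each phoneme
    back = []                  # back-pointers for frames 1..n_frames-1

    for t in range(1, n_frames):
        new_score, ptr = [], []
        for k in range(n_phn):
            # Either stay in phoneme k or enter it from another phoneme.
            best_j, best_s = k, score[k]
            for j in range(n_phn):
                if j != k and score[j] + penalty > best_s:
                    best_j, best_s = j, score[j] + penalty
            new_score.append(best_s + log_post[t][k])
            ptr.append(best_j)
        score = new_score
        back.append(ptr)

    # Trace the best path backwards from the best final phoneme.
    k = max(range(n_phn), key=lambda j: score[j])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    path.reverse()

    # Collapse runs of identical frame labels into segments.
    segments, start = [], 0
    for t in range(1, n_frames + 1):
        if t == n_frames or path[t] != path[start]:
            segments.append((start, t, path[start]))
            start = t
    return segments


# Hypothetical posteriors for two phonemes over five frames.
EXAMPLE_PROBS = [[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1], [0.9, 0.1]]
EXAMPLE_LOG_POST = [[math.log(p) for p in frame] for frame in EXAMPLE_PROBS]

print(viterbi_decode(EXAMPLE_LOG_POST, penalty=0.0))
print(viterbi_decode(EXAMPLE_LOG_POST, penalty=-3.0))
```

With no penalty the single frame that slightly favors the second phoneme becomes its own segment; with a penalty of -3.0 that short segment is suppressed, which is the effect an insertion penalty is meant to have.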

Compilation:

The source code has been successfully compiled under Linux (GCC) and under Windows (MinGW32). The program can be compiled with or without BLAS (Basic Linear Algebra Subprograms) for acceleration. When BLAS support is enabled, ATLAS (Automatically Tuned Linear Algebra Software) is used.

  • Compilation under Linux with BLAS support: make -f makefile.lin
  • Compilation under Linux without BLAS support: make -f makefile_noblas.lin
  • Compilation under Windows with BLAS support: make -f makefile.win
  • Compilation under Windows without BLAS support: make -f makefile_noblas.win

How to:

  • set the recognition system:
    phnrec -c PHN_CZ_SPDAT_LCRC_N1500|PHN_HU_SPDAT_LCRC_N1500|PHN_RU_SPDAT_LCRC_N1500|PHN_EN_TIMIT_LCRC_N500
  • set the input format:
    phnrec -c PHN_EN_TIMIT_LCRC_N500 -w alaw|lin16
  • set input and output files:
    The output is an HTK label file or a Master Label File (MLF). The input is either a single speech file or a list of files. The recognizer can also save intermediate results such as Mel-bank outputs or posteriors. Saving intermediate results can, for example, significantly speed up tuning of the word insertion penalty.
    phnrec -c PHN_EN_TIMIT_LCRC_N500 -l list -m out.mlf
    Example MLF output:
      #!MLF!#
      "*/faem0.rec"
      000000 1300000 pau
      1300000 2000000 ah
      2000000 3500000 s
      3500000 4500000 ih
    phnrec -c PHN_EN_TIMIT_LCRC_N500 -i input.raw -o output.rec
  • change the word (phoneme) insertion penalty:
    phnrec -c PHN_EN_TIMIT_LCRC_N500 -i input.raw -o output.rec -p -3.0
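
For subsequent processing, the MLF output written with -m can be read back into per-file segment lists. The following is a minimal sketch, assuming the file follows the usual HTK conventions: a "#!MLF!#" header, a quoted file pattern per entry, an optional "." line closing each entry, and times in units of 100 ns. The function name parse_mlf and the returned structure are illustrative choices, not part of the phnrec distribution.

```python
HTK_TIME_UNIT = 1e-7  # HTK label times are given in units of 100 ns


def parse_mlf(text):
    """Parse MLF text into {file_pattern: [(start_s, end_s, label), ...]}."""
    result = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue
        if line.startswith('"') and line.endswith('"'):
            # A quoted line opens the label list for one file.
            current = result.setdefault(line[1:-1], [])
        elif line == ".":
            # A single dot closes the current entry.
            current = None
        elif current is not None:
            # Any extra fields (for example scores) after the label are ignored.
            start, end, label = line.split()[:3]
            current.append(
                (int(start) * HTK_TIME_UNIT, int(end) * HTK_TIME_UNIT, label)
            )
    return result


# The example output shown above, as a string.
SAMPLE_MLF = """#!MLF!#
"*/faem0.rec"
000000 1300000 pau
1300000 2000000 ah
2000000 3500000 s
3500000 4500000 ih
"""

for start, end, label in parse_mlf(SAMPLE_MLF)["*/faem0.rec"]:
    print(f"{start:.2f} {end:.2f} {label}")
```

In this example the first segment, "pau", spans 0.00 s to 0.13 s.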

Systems:

  • PHN_CZ_SPDAT_LCRC_N1500 - 8kHz, 2 block STC, trained on Czech SpeechDat-E, 15 banks, 31 points, the DCT is applied on each temporal vector to reduce its size to 11 values, 1500 neurons in all nets
  • PHN_HU_SPDAT_LCRC_N1500 - 8kHz, 2 block STC, trained on Hungarian SpeechDat-E, 15 banks, 31 points, the DCT is applied on each temporal vector to reduce its size to 11 values, 1500 neurons in all nets
  • PHN_RU_SPDAT_LCRC_N1500 - 8kHz, 2 block STC, trained on Russian SpeechDat-E, 15 banks, 31 points, the DCT is applied on each temporal vector to reduce its size to 11 values, 1500 neurons in all nets
  • PHN_EN_TIMIT_LCRC_N500 - 16kHz, 2 block STC, trained on TIMIT, 15 banks, 31 points, the DCT is applied on each temporal vector to reduce its size to 11 values, 500 neurons in all nets

System                     # labels   ERR (%)
PHN_CZ_SPDAT_LCRC_N1500    45         24.24
PHN_HU_SPDAT_LCRC_N1500    61         33.32
PHN_RU_SPDAT_LCRC_N1500    52         39.27
PHN_EN_TIMIT_LCRC_N500     39         24.24
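
The feature sizes implied by the system descriptions (15 banks, 31 points, 11 DCT values, 2 blocks) can be worked through in a short sketch. It assumes that the 31-point temporal vector of each band is split into a left and a right half of 16 points that share the center frame, and that an unnormalized DCT-II is used. Both are assumptions made for illustration, windowing is omitted, and all function names are hypothetical; see [1] for the actual method.

```python
import math

N_BANKS = 15   # Mel filter-bank bands (from the system description)
N_POINTS = 31  # frames of temporal context per band
N_DCT = 11     # DCT coefficients kept per temporal vector


def dct_truncate(x, n_keep):
    """Unnormalized DCT-II of x, keeping only the first n_keep coefficients."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * k * (t + 0.5) / n) for t in range(n))
        for k in range(n_keep)
    ]


def split_context(trajectory):
    """Assumed split: left and right halves share the center frame."""
    center = len(trajectory) // 2
    return trajectory[: center + 1], trajectory[center:]


def block_features(bands):
    """Map N_BANKS trajectories of N_POINTS values to (left, right) vectors."""
    left, right = [], []
    for trajectory in bands:
        left_half, right_half = split_context(trajectory)
        left += dct_truncate(left_half, N_DCT)
        right += dct_truncate(right_half, N_DCT)
    return left, right


# Constant dummy input, only to show the resulting vector sizes.
dummy_bands = [[1.0] * N_POINTS for _ in range(N_BANKS)]
left_vec, right_vec = block_features(dummy_bands)
print(len(left_vec), len(right_vec))
```

Under these assumptions each block yields 15 × 11 = 165 values per frame.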

Note: The Czech, Hungarian and Russian SpeechDat systems were used in NIST LRE2005.
Results obtained with this system may differ slightly from the published ones due to implementation differences.

Licence:

Source code and binaries can be redistributed and/or modified under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. Model files (directories PHN_CZ_SPDAT_LCRC_N1500, PHN_HU_SPDAT_LCRC_N1500, PHN_RU_SPDAT_LCRC_N1500, PHN_EN_TIMIT_LCRC_N500) can be used for research and educational purposes only. For any other use, please contact Jan Cernocky.

Download:

phnrec_v2_21.tgz [changes]
The archive contains a Windows executable. If you want to run the recognizer under another operating system, you need to compile the source code.

PhnRec Google group

This group was created for you. We are not able to personally answer every question, test the package on all possible platforms, or explain the training scripts in detail, so we decided to build a community around the BUT phoneme recognizer that can share knowledge. Please feel free to ask questions on the group mailing list. If you already know the answer to someone's question, please help by answering it and guiding others. Also feel free to upload any documents you have created that could help others.

http://groups.google.com/group/phnrec

Training scripts:

We developed two sets of training scripts: one based on the excellent QuickNet software from ICSI and a second one based on our STK toolkit. The QuickNet based scripts can be used to train a new set of neural networks for the BUT phoneme recognizer directly. The STK based scripts, on the other hand, work with HTK feature files and use a quite powerful macro language that makes it easy to set up any TRAP based or Split Temporal Context based feature extraction. This can simplify experiments.

  • QuickNet based scripts: Download phnrec_tscripts.tgz and tscripts_patch.tgz and follow the instructions in readme.txt. The readme.txt lists the other packages that are required.
  • STK based scripts: Download NNScripts.tgz and follow the instructions in scripts/doc/readme.txt. The readme.txt lists the other packages that are required. See also scripts/doc/nn_scripts.ppt for a basic introduction to the scripts and scripts/doc/stk_macros.ppt for a basic introduction to STK macros.

Frequently asked questions

  • How to generate phoneme lattices?
    The software itself does not generate lattices. You can save phoneme posterior probabilities and use another decoder, for example HVite from HTK. For this, you can use our script lattice_generation.tgz. Before you start generating lattices with this script, be sure to change the softening function in the 'config' file of the particular configuration directory:
    [posteriors]
    softening_func=gmm_bypass 0 0 0
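
The configuration change described above can also be scripted. The following is a minimal sketch, assuming the 'config' file is INI-like, with section headers in square brackets and key=value lines; only the [posteriors] section name and the softening_func key come from the text above. The function name set_softening and the placeholder values in the example are hypothetical.

```python
def set_softening(config_text, value="gmm_bypass 0 0 0"):
    """Return config_text with softening_func set in the [posteriors] section.

    Assumes an INI-like layout. If the file has no [posteriors] section,
    the text is returned without a softening_func entry being added.
    """
    out = []
    in_section = False
    done = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving [posteriors] without having seen the key: add it first.
            if in_section and not done:
                out.append("softening_func=" + value)
                done = True
            in_section = stripped == "[posteriors]"
        elif in_section and stripped.split("=")[0].strip() == "softening_func":
            # Replace the existing setting.
            line = "softening_func=" + value
            done = True
        out.append(line)
    if in_section and not done:
        out.append("softening_func=" + value)
    return "\n".join(out) + "\n"


# Placeholder config content; the original value is made up for illustration.
EXAMPLE_CONFIG = "[posteriors]\nsoftening_func=placeholder 1 2 3\n"
print(set_softening(EXAMPLE_CONFIG), end="")
```

Editing the file by hand works just as well; a script is mainly useful when several configuration directories need the same change.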

References:

[1] P. Schwarz, "Phoneme Recognition based on Long Temporal Context", PhD thesis, Brno University of Technology, 2009
[2] P. Schwarz, P. Matejka, J. Cernocky, "Hierarchical Structures of Neural Networks for Phoneme Recognition", submitted for publication to ICASSP2006
[3] P. Schwarz, P. Matejka, J. Cernocky, "Towards Lower Error Rates in Phoneme Recognition", in Proc. TSD2004, Brno, Czech Republic, 2004
[4] P. Matejka, P. Schwarz, J. Cernocky, P. Chytil, "Phonotactic Language Identification using High Quality Phoneme Recognition", in Proc. Eurospeech2005, Sep, 2005
[5] I. Szoke, P. Schwarz, L. Burget, M. Fapso, M. Karafiat, J. Cernocky, P. Matejka, "Comparison of Keyword Spotting Approaches for Informal Continuous Speech", in Proc. Eurospeech2005, Sep, 2005