We sometimes observe monotonically decreasing cross-attention weights in our Conformer-based global attention encoder-decoder (AED) models. Further investigation shows that the Conformer encoder internally reverses the sequence in the time dimension. We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to build a connection between the initial frames and all other informative frames. Furthermore, we show that, at some point in training, the self-attention module of the Conformer starts to dominate the output over the preceding feed-forward module, which then only allows the reversed information to pass through. We propose several methods and ideas for avoiding this flipping. Additionally, we investigate a novel method to obtain label-frame-position alignments by using the gradients of the label log probabilities w.r.t. the encoder input frames.
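As a rough illustration of the gradient-based alignment idea, the sketch below (PyTorch, with a hypothetical `encoder`/`decoder` interface and tensor shapes that are assumptions, not the paper's actual setup) aligns each label to the input frame whose features receive the largest gradient from that label's log probability:

```python
import torch

def gradient_label_frame_alignment(encoder, decoder, feats, labels):
    """Minimal sketch: align each label to the encoder input frame with the
    largest gradient norm of that label's log probability (assumed interface)."""
    feats = feats.clone().requires_grad_(True)      # [T, F] input frames
    enc_states = encoder(feats)                     # [T', D] encoder output
    log_probs = decoder(enc_states, labels)         # [S, V] per-label log probs

    alignment = []
    for s, label in enumerate(labels):
        # gradient of the s-th label log probability w.r.t. the input frames
        grad, = torch.autograd.grad(log_probs[s, label], feats, retain_graph=True)
        saliency = grad.norm(dim=-1)                # [T] per-frame gradient magnitude
        alignment.append(int(saliency.argmax()))    # frame that influences this label most
    return alignment
```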
This work studies knowledge distillation (KD) and addresses its constraints for recurrent neural network transducer (RNN-T) models. In hard distillation, a teacher model transcribes large amounts of unlabelled speech to train a student model. Soft distillation is another popular KD method that distills the output logits of the teacher model. Due to the nature of RNN-T alignments, applying soft distillation between RNN-T architectures with different posterior distributions is challenging. In addition, bad teachers with high word error rates (WERs) reduce the efficacy of KD. We investigate how to effectively distill knowledge from variable-quality ASR teachers, which, to the best of our knowledge, has not been studied before. We show that a sequence-level KD, full-sum distillation, outperforms other distillation methods for RNN-T models, especially for bad teachers. We also propose a variant of full-sum distillation that distills the sequence-discriminative knowledge of the teacher, leading to further improvements in WER. We conduct experiments on public datasets, namely SpeechStew and LibriSpeech, as well as on in-house production data.
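A minimal sketch of the sequence-level idea, assuming the RNN-T loss from `torchaudio` and a simple "do not fall behind the teacher" criterion (the paper's exact full-sum distillation criterion may differ):

```python
import torch
import torchaudio.functional as TAF

def full_sum_distillation_loss(student_logits, teacher_logits, targets,
                               logit_lens, target_lens, blank=0):
    """Sketch: compare teacher and student full-sum (all-alignment) sequence
    scores instead of distilling per-frame output logits."""
    # rnnt_loss returns the negative log-likelihood summed over all alignments
    nll_student = TAF.rnnt_loss(student_logits, targets, logit_lens, target_lens,
                                blank=blank, reduction="none")
    with torch.no_grad():
        nll_teacher = TAF.rnnt_loss(teacher_logits, targets, logit_lens, target_lens,
                                    blank=blank, reduction="none")
    # penalize the student only where its sequence likelihood is worse than the teacher's
    return (nll_student - nll_teacher).clamp(min=0).mean()
```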
As long as a computational precision above 8 bits is preferred, digital designs generally outperform analog ones while incurring less hardware cost. This motivates our recent studies on digital approximate computing as presented in this paper. Rather than using fixed-point numbers, we particularly explore discrete approximation steps based on floating-point number representations such as the BFloat16 and posit formats. Time-domain computing is addressed as well; it starts in the digital domain with discrete delay values and moves towards the analog domain under increased delay uncertainty when pushed towards energy efficiency by voltage scaling. The proposed approximate arithmetic and nonlinear activation functions are further evaluated in various artificial neural networks, achieving competitive Quality-of-Service compared to the state of the art with full-precision computing.
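To make the floating-point approximation concrete, here is a minimal sketch of emulating BFloat16 by keeping only the upper 16 bits of an IEEE-754 float32 word (round-to-nearest, NaN/Inf handling omitted); this is generic BFloat16 emulation, not the paper's hardware design:

```python
import numpy as np

def to_bfloat16(x):
    """Emulate BFloat16 by rounding a float32 to its upper 16 bits
    (8 exponent bits, 7 mantissa bits)."""
    bits = np.atleast_1d(np.asarray(x, dtype=np.float32)).view(np.uint32)
    # round to nearest even on the 16 mantissa bits that get dropped
    rounding = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    bits = (bits + rounding) & np.uint32(0xFFFF0000)
    return bits.view(np.float32)

print(to_bfloat16(3.14159265))  # -> [3.140625], roughly 2-3 decimal digits of precision
```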
Recent publications on automatic speech recognition (ASR) have a strong focus on attention-based encoder-decoder (AED) architectures, which tend to suffer from overfitting in low-resource scenarios. One solution to tackle this issue is to generate synthetic data with a trained text-to-speech (TTS) system if additional text is available. This has been successfully applied in many publications with AED systems, but only to a very limited extent in the context of other ASR architectures. We investigate the effect of varying pre-processing, speaker embedding, and input encoding of the TTS system on the effectiveness of the synthesized data for AED-ASR training. Additionally, we consider internal language model subtraction for the first time, resulting in up to 38% relative improvement. We compare the AED results to a state-of-the-art hybrid ASR system, a monophone-based system using connectionist temporal classification (CTC), and a monotonic transducer based system. We show that for the latter systems the addition of synthetic data has no relevant effect, but they still outperform the AED systems on LibriSpeech-100h. We achieve a final word error rate of 3.3%/10.0% with the hybrid system on the clean/noisy test sets, surpassing any previous state-of-the-art systems on LibriSpeech-100h that do not include unlabeled audio data.
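The internal language model subtraction mentioned above amounts to a log-linear score combination during decoding; a minimal sketch (the scale values and the ILM estimate itself are assumptions to be tuned, not the values used in the paper):

```python
def combined_label_score(log_p_aed, log_p_ext_lm, log_p_ilm,
                         lm_scale=0.6, ilm_scale=0.4):
    """Shallow fusion with ILM subtraction: add the external LM and subtract
    the estimated internal LM in log space for each beam-search hypothesis."""
    return log_p_aed + lm_scale * log_p_ext_lm - ilm_scale * log_p_ilm
```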
Training deep neural networks is often challenging in terms of training stability. It often requires careful hyperparameter tuning or a pre-training scheme to converge. Layer normalization (LN) has been shown to be a crucial ingredient in training deep encoder-decoder models. We explore various layer-normalized long short-term memory (LSTM) recurrent neural network (RNN) variants by applying LN to different parts of the internal recurrency of the LSTM. There is no previous work that investigates this. We carry out experiments on the Switchboard 300h task for both hybrid and end-to-end ASR models and show that LN improves the final word error rate (WER) and the stability during training, allows training even deeper models, requires less hyperparameter tuning, and works well even without pre-training. We find that applying LN to both the forward and recurrent inputs globally, which we denote as the Global Joined Norm variant, gives a 10% relative improvement in WER.
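As an illustration of one such variant, here is a minimal PyTorch sketch of an LSTM cell where LN is applied jointly to the forward and recurrent pre-activations of all gates (a plausible reading of the Global Joined Norm variant, not the paper's exact implementation):

```python
import torch
import torch.nn as nn

class GlobalJoinedNormLSTMCell(nn.Module):
    """Sketch: one layer normalization over the joint (input + recurrent)
    pre-activation of all four LSTM gates."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.ih = nn.Linear(input_size, 4 * hidden_size, bias=False)
        self.hh = nn.Linear(hidden_size, 4 * hidden_size, bias=True)
        self.norm = nn.LayerNorm(4 * hidden_size)  # joint LN over all gate inputs

    def forward(self, x, state):
        h, c = state
        gates = self.norm(self.ih(x) + self.hh(h))   # normalize fwd + rec parts jointly
        i, f, g, o = gates.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```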
In this paper, alternating weak triphone/BPE alignment supervision is proposed to improve end-to-end model training. To this end, triphone and BPE alignments are extracted using a pre-existing hybrid ASR system. A regularization effect is then obtained through cross-entropy-based intermediate auxiliary losses computed on these alignments, at a mid-layer representation of the encoder for triphone alignments and at the encoder output for BPE alignments. Weak supervision is achieved through strong label smoothing with a smoothing parameter of 0.5. Experimental results on TED-LIUM 2 indicate that either triphone or BPE alignment based weak supervision improves ASR performance over the standard CTC auxiliary loss. Moreover, their combination lowers the word error rate further. We also investigate alternating between the two auxiliary tasks during model training and observe additional performance gains. Overall, the proposed techniques yield a relative word error rate reduction of over 10% compared to a CTC-regularized baseline system.
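A minimal sketch of one such auxiliary loss, assuming frame-level alignment labels and a linear projection on an intermediate encoder layer (names and shapes are illustrative, not the paper's code):

```python
import torch.nn.functional as F

def weak_alignment_aux_loss(mid_layer_logits, alignment_labels, smoothing=0.5):
    """Framewise cross-entropy on a fixed triphone or BPE alignment,
    weakened by strong label smoothing (parameter 0.5)."""
    # mid_layer_logits: [T, V] projected from an intermediate encoder layer
    # alignment_labels: [T] frame labels from the pre-existing hybrid system
    return F.cross_entropy(mid_layer_logits, alignment_labels,
                           label_smoothing=smoothing)
```

During training, such a loss would be added to the main objective, with the triphone and BPE variants alternating, e.g., per epoch.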
Attention-based encoder-decoder (AED) models learn an implicit internal language model (ILM) from the training transcriptions. Integration with an external LM trained on much larger amounts of unpaired text usually leads to better performance. A Bayesian interpretation, as in the hybrid autoregressive transducer (HAT), suggests dividing by the prior of the discriminative acoustic model, which corresponds to this implicit LM, similar to the hybrid hidden Markov model approach. The implicit LM cannot be calculated efficiently in general, and it is still unclear which methods estimate it best. In this work, we compare different approaches from the literature and propose several novel methods to estimate the ILM directly from the AED model. Our proposed methods outperform all previous approaches. We also investigate other methods to suppress the ILM, mainly by decreasing the capacity of the AED model, limiting the label context, and training the AED model together with a pre-existing LM.
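One simple way to approximate the ILM, shown here only as an illustration of the general idea, is to run the AED decoder with the attention context zeroed out so that only the label history contributes; the decoder interface below is an assumption:

```python
import torch

def estimate_ilm_log_probs(decoder, label_history, context_dim):
    """Sketch: estimate the internal LM by replacing the attention context
    with zeros, leaving only the label-history dependence."""
    zero_context = torch.zeros(label_history.size(0), context_dim)
    logits = decoder(label_history, attention_context=zero_context)
    return torch.log_softmax(logits, dim=-1)
```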
Subword units are commonly used for end-to-end automatic speech recognition (ASR), yet a fully acoustic-oriented subword modeling approach is still missing. We propose an acoustic data-driven subword modeling (ADSM) approach that combines the advantages of several text-based and acoustic-based subword methods in one pipeline. With a fully acoustic-oriented label design and learning process, ADSM produces acoustically structured subword units and acoustically matched target sequences for further ASR training. The obtained ADSM labels are evaluated with different end-to-end ASR approaches, including CTC, RNN-transducer and attention models. Experiments on the LibriSpeech corpus show that ADSM clearly outperforms both byte pair encoding (BPE) and pronunciation-assisted subword modeling (PASM) in all cases. Detailed analysis shows that ADSM achieves acoustically more logical word segmentation and a more balanced sequence length, and is thus suitable for both time-synchronous and label-synchronous models. We also briefly describe how to apply acoustic-based subword regularization and unseen text segmentation using ADSM.