Video-based analysis of a person's mood or behavior is in general performed by interpreting various features observed on the body. Facial actions, such as speaking, yawning or laughing, are considered key features. Dynamic changes within the face can be modeled with the well-known Hidden Markov Models (HMM). Unfortunately, even within one class, examples can show high variance because of unknown start and end states or the varying length of a facial action. In this work we therefore decompose facial actions into so-called submotions. These can be robustly recognized with HMMs, using selected points in the face and their geometrical distances as features. Additionally, the first and second derivatives of the distances are included. A sequence of submotions is then interpreted with a dictionary and dynamic programming, as the order may be crucial. Analyzing the frequency of sequences confirms the relevance of the submotion order. In an experimental section we show that our novel submotion approach outperforms a standard HMM with the same set of features by nearly 30% absolute recognition rate.
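A minimal sketch of the feature side of this idea, assuming facial landmarks have already been tracked per frame; the chosen point pairs, the HMM toolkit, and the dictionary entries are placeholders rather than the authors' exact setup. It shows distances between selected facial points with their first and second temporal derivatives, plus a dynamic-programming distance for order-sensitive matching of a recognized submotion sequence against a dictionary entry.

```python
import numpy as np

def submotion_features(landmarks, point_pairs):
    """landmarks: (T, N, 2) tracked facial points over T frames.
    point_pairs: list of (i, j) index pairs whose distances serve as features.
    Returns a (T, 3 * len(point_pairs)) matrix: distances plus their first and
    second temporal derivatives, as described in the abstract."""
    diffs = landmarks[:, [i for i, _ in point_pairs], :] - \
            landmarks[:, [j for _, j in point_pairs], :]
    dist = np.linalg.norm(diffs, axis=-1)   # geometrical distances per frame
    d1 = np.gradient(dist, axis=0)          # first derivative over time
    d2 = np.gradient(d1, axis=0)            # second derivative over time
    return np.concatenate([dist, d1, d2], axis=1)

def edit_distance(seq, entry):
    """Dynamic-programming (edit) distance between a recognized submotion
    sequence and a dictionary entry; order-sensitive, as the abstract requires."""
    T = np.zeros((len(seq) + 1, len(entry) + 1), dtype=int)
    T[:, 0] = np.arange(len(seq) + 1)
    T[0, :] = np.arange(len(entry) + 1)
    for a in range(1, len(seq) + 1):
        for b in range(1, len(entry) + 1):
            T[a, b] = min(T[a - 1, b] + 1, T[a, b - 1] + 1,
                          T[a - 1, b - 1] + (seq[a - 1] != entry[b - 1]))
    return T[len(seq), len(entry)]
```

Each submotion class would get its own HMM trained on such feature sequences; the recognized submotion string is then matched against the dictionary of facial actions with the edit-distance routine above.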
One of the challenges in reaching widespread autonomous driving is the establishment of driver trust in the technology. We suggest a windshield display interface that shows the perceptive abilities and decision-making of an automated car while driving. We took a human-centered design approach to determine user expectations and requirements. We present our resulting interface prototype, which runs in a mixed-reality environment. We plan to evaluate its impact on situation awareness and trust in hard-to-predict urban scenarios.
The asynchronous hidden Markov model (AHMM) can model the joint likelihood of two observation sequences, even if the streams are not synchronised. Previously this model has been applied to audio-visual recognition tasks. The main drawback of the concept is its rather high training and decoding complexity. In this work we show how the complexity can be reduced significantly with advanced running indices for the calculations, while the characteristics of the AHMM and its advantages are preserved. The improvement also allows a scaling procedure to keep numerical values in a reasonable range. In an experimental section we compare the complexity of the original and the improved concept and validate the theoretical results. The model is then tested on a bimodal speech and gesture user input fusion task: compared to a late-fusion HMM, an improvement of more than 10% absolute recognition performance has been achieved.
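For orientation, a minimal sketch of the baseline AHMM forward recursion (Bengio-style), in which each state either emits a joint pair from both streams or a symbol of the first stream alone. The emission models `p_joint` and `p_single` and the stream data are assumed placeholders; the paper's improved running-index scheme and the scaling procedure are not reproduced here, only the O(T·S·N²) baseline they speed up.

```python
import numpy as np

def ahmm_forward(x, y, pi, A, eps, p_joint, p_single):
    """x: first observation stream (length T), y: second stream (length S <= T),
    pi: initial state probabilities (N,), A: transition matrix (N, N),
    eps[i]: probability that state i also consumes a symbol of y,
    p_joint(i, xt, ys), p_single(i, xt): emission likelihoods (assumed callables)."""
    T, S, N = len(x), len(y), len(pi)
    alpha = np.zeros((T, S + 1, N))      # s = number of consumed y symbols
    for i in range(N):
        alpha[0, 0, i] = pi[i] * (1 - eps[i]) * p_single(i, x[0])
        alpha[0, 1, i] = pi[i] * eps[i] * p_joint(i, x[0], y[0])
    for t in range(1, T):
        for s in range(0, min(t + 1, S) + 1):
            for i in range(N):
                stay = (1 - eps[i]) * p_single(i, x[t]) * np.dot(A[:, i], alpha[t - 1, s])
                adv = 0.0
                if s > 0:   # consume the next y symbol together with x[t]
                    adv = eps[i] * p_joint(i, x[t], y[s - 1]) * np.dot(A[:, i], alpha[t - 1, s - 1])
                alpha[t, s, i] = stay + adv
    return alpha[T - 1, S].sum()         # joint likelihood p(x, y)
```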
This paper presents a new framework for real-time multimodal data processing. The framework comprises modules for different input and output signals and was designed for human-human or human-robot interaction scenarios. Single modules for recording selected channels such as speech, gestures or facial expressions can be combined with different output options (e.g. robot reactions) in a highly flexible manner. Depending on the included modules, online as well as offline data processing is possible. The framework was used to analyze human-human interaction to gain insights into important factors and their dynamics. The recorded data comprise speech, facial expressions, gestures and physiological data. This naturally produced data was annotated and labeled in order to train recognition modules which will be integrated into the existing framework. The overall aim is to create a system that is able to recognize and react to those parameters that humans take into account during interaction. In this paper, the technical implementation and its application in a human-human and a human-robot interaction scenario are presented.
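The abstract does not name concrete APIs, so the module and pipeline classes below are hypothetical stand-ins that merely illustrate the described idea: recording modules for single channels are chained flexibly with output options such as robot reactions, and the same chain can run online on live sensors or offline on recorded data.

```python
from abc import ABC, abstractmethod

class Module(ABC):
    @abstractmethod
    def process(self, frame: dict) -> dict:
        """Consume the shared data frame, add this module's results, pass it on."""

class SpeechRecorder(Module):
    def process(self, frame):
        frame["speech"] = "audio chunk"   # placeholder: capture or read audio
        return frame

class RobotReaction(Module):
    def process(self, frame):
        print("react to:", frame)         # placeholder: map cues to a robot behaviour
        return frame

class Pipeline:
    """Modules are combined in a configurable order, online or offline."""
    def __init__(self, modules):
        self.modules = modules
    def run(self, frame=None):
        frame = frame or {}
        for module in self.modules:
            frame = module.process(frame)
        return frame

Pipeline([SpeechRecorder(), RobotReaction()]).run()
```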
This paper introduces the software framework MMER Lab which allows an effective assembly of modular signal processing systems optimized for memory efficiency and performance. Our C/C++ framework is designed to constitute the basis of a well-organized and simplified development process in industrial and academic research teams. It supports the structuring of modular systems by providing basic data, parameter, and command interfaces, ensuring the re-usability of the system components. Due to the underlying multi-threading capabilities, applications built in MMER Lab can fully exploit the increasing computational power of multi-core CPU architectures. This feature is realized by a buffering concept which controls the data flow between the connected modules and allows for the parallel processing of consecutive signal segments (e.g. video frames). We introduce the concept of the multi-threading environment and the data flow architecture with its convenient programming interface. We illustrate the proposed module concept for the generic assembly of processing chains and show applications from the area of video analysis and pattern recognition.
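A hedged sketch of the buffering idea only: MMER Lab itself is a C/C++ framework, and this small Python analogue simply illustrates how bounded buffers between threaded modules let consecutive signal segments be processed in parallel. Module names and the toy processing functions are invented for illustration.

```python
import threading, queue

def run_module(process, in_q, out_q):
    """Pull segments from the input buffer, process them, push results onward."""
    while True:
        segment = in_q.get()
        if segment is None:              # end-of-stream marker
            if out_q is not None:
                out_q.put(None)
            break
        result = process(segment)
        if out_q is not None:
            out_q.put(result)

# Two chained modules (e.g. decode -> analyze) with a bounded buffer in between,
# so the first stage can already work on frame t+1 while the second handles frame t.
buf_in, buf_mid = queue.Queue(maxsize=4), queue.Queue(maxsize=4)
decode = threading.Thread(target=run_module,
                          args=(lambda f: f.upper(), buf_in, buf_mid))
analyze = threading.Thread(target=run_module,
                           args=(lambda f: print("processed", f), buf_mid, None))
decode.start(); analyze.start()
for frame in ["frame0", "frame1", "frame2"]:
    buf_in.put(frame)
buf_in.put(None)
decode.join(); analyze.join()
```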
Within the car, recognition of emotion largely helps to make communication more natural. Speech interaction is increasingly used in this context, and affective cues are contained in both acoustic and linguistic parameters. Here we introduce novel concepts and results on the estimation of a driver's emotion, focusing on acoustic information. As a database we recorded 2k dialog turns directed to an automotive infotainment interface during extensive usability studies. Speech recognition and natural language interpretation were thereby realized once as a Wizard-of-Oz simulation and once by actual recognition technology. Recorded utterances have been labelled with a closed set of four emotions, namely anger, confusion, joy, and neutrality. As acoustic features we apply a large number of prosodic, speech quality, and articulatory functionals derived by descriptive statistical analysis from base contours such as intonation, intensity, and spectral information. Self-learning feature generation and selection is employed to optimize complexity for the subsequent classification by Support Vector Machines. Semantic information is included by a vector-space representation of the spoken content within an early feature fusion. Overall, high recognition performance can be reported for this task with the suggested approach.
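An illustrative sketch of the described pipeline under simplifying assumptions: statistical functionals of acoustic base contours are early-fused with a vector-space representation of the spoken words and classified by an SVM. The concrete contour set, functional set, feature generation/selection step, and data are not those of the paper; the toy transcripts and random contours below are placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer

def functionals(contour):
    """Descriptive statistics of one base contour (e.g. F0 or intensity)."""
    return [np.mean(contour), np.std(contour), np.min(contour),
            np.max(contour), np.ptp(contour)]

def turn_features(f0, intensity, transcript, vectorizer):
    acoustic = functionals(f0) + functionals(intensity)
    linguistic = vectorizer.transform([transcript]).toarray()[0]
    return np.concatenate([acoustic, linguistic])   # early feature fusion

# toy data: two dialog turns with contours, transcripts, and emotion labels
transcripts = ["this is great", "this does not work"]
labels = ["joy", "anger"]
vectorizer = CountVectorizer().fit(transcripts)
X = np.vstack([turn_features(np.random.rand(100), np.random.rand(100), t, vectorizer)
               for t in transcripts])
clf = SVC(kernel="linear").fit(X, labels)
```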
Generative adversarial networks (GANs) have shown their superiority for speech enhancement. Nevertheless, most previous attempts had convolutional layers as the backbone, which may obscure long-range dependencies across an input sequence due to the convolution operator's local receptive field. One popular solution is substituting recurrent neural networks (RNNs) for convolutional neural networks, but RNNs are computationally inefficient because their temporal iterations cannot be parallelized. To circumvent this limitation, we propose an end-to-end system for speech enhancement by applying the self-attention mechanism to GANs. We aim to achieve a system that is flexible in modeling both long-range and local interactions and can be computationally efficient at the same time. Our work is implemented in three phases: firstly, we apply the stand-alone self-attention layer in speech enhancement GANs. Secondly, we employ locality modeling on the stand-alone self-attention layer. Lastly, we investigate the functionality of the self-attention augmented convolutional speech enhancement GANs. Systematic experiment results indicate that, equipped with the stand-alone self-attention layer, the system outperforms baseline systems across classic evaluation criteria with up to 95% fewer parameters. Moreover, locality modeling can be a parameter-free approach for further performance improvement, and self-attention augmentation also surpasses all baseline systems with an acceptable increase in parameters.
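A minimal numpy sketch of the scaled dot-product self-attention building block discussed here, to make the receptive-field argument concrete: every output frame attends to all input frames, in contrast to a convolution's local window. The projection matrices and dimensions are arbitrary; the full GAN generator/discriminator and the locality-modeling variants are beyond this illustration.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (T, d) sequence of frame features; Wq/Wk/Wv: (d, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (T, T) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over key positions
    return weights @ V                                  # (T, d_k) attended features

T, d, dk = 128, 64, 32
rng = np.random.default_rng(0)
out = self_attention(rng.standard_normal((T, d)),
                     rng.standard_normal((d, dk)),
                     rng.standard_normal((d, dk)),
                     rng.standard_normal((d, dk)))
```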
An expert system supporting the operation of electric power systems was enhanced by adding a bi-directional acoustic dialogue interface. The appertaining subsystems present the expert system's proposals for switching, setting or other actions as synthetic voice in natural/switching language (in addition to the original screen display) and recognize the operators' spoken answers (e.g. confirmation or rejection of operations, or requests for explanation). The operators can thus stay focused on the operational surface of the power system without frequently switching over to the man-machine interface of the expert system, and are able to observe and understand the context and reactions of the performed operations more continuously.
In this paper we present a context-dependent hybrid MMI-connectionist / Hidden Markov Model (HMM) speech recognition system for the Wall Street Journal (WSJ) database. The hybrid system is built from a neural network, which is used as a vector quantizer (VQ), and an HMM with discrete probability density functions, which has the advantage of faster decoding. The neural network is trained with an algorithm that maximizes the mutual information between the classes of the input features (e.g. phones, triphones, etc.) and the firing sequence of the network. The system has been trained on the 1992 WSJ corpus (si-84). Tests were performed on the five- and twenty-thousand word, speaker-independent (si_et) tasks. The error rates of the new context-dependent neural network are 29% lower (relative) than those of a standard (k-means) discrete system, and very close to those of the best continuous/semi-continuous HMM speech recognizers.
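A sketch of the mutual-information criterion only, under the assumption that the network's discrete codebook outputs (its "firing sequence") and the class labels (e.g. phones) are available as index arrays: the quantity estimated below is what the training algorithm maximizes, but the gradient-based network optimization itself is not shown, and the toy data are invented.

```python
import numpy as np

def mutual_information(classes, codes, n_classes, n_codes):
    """Estimate I(class; code) in bits from joint counts."""
    joint = np.zeros((n_classes, n_codes))
    for c, k in zip(classes, codes):
        joint[c, k] += 1
    joint /= joint.sum()
    p_c = joint.sum(axis=1, keepdims=True)   # class marginal
    p_k = joint.sum(axis=0, keepdims=True)   # codebook marginal
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (p_c @ p_k)[nz])))

# toy example: 3 phone classes quantized onto 4 codebook entries
rng = np.random.default_rng(1)
classes = rng.integers(0, 3, size=1000)
codes = (classes + rng.integers(0, 2, size=1000)) % 4   # codes partly informative
print(mutual_information(classes, codes, 3, 4))
```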
The integration of more and more functionality into the human-machine interface (HMI) of vehicles increases the complexity of device handling. Thus, making optimal use of different human sensory channels is one approach to simplifying the interaction with in-car devices; in this way, user convenience increases while driver distraction may decrease. In this paper, a video-based real-time hand gesture recognition system for in-car use is presented. It was developed in the course of extensive usability studies. In combination with a gesture-optimized HMI, it allows intuitive and effective operation of a variety of in-car multimedia and infotainment devices with hand poses and dynamic hand gestures.