Video-based analysis of a person's mood or behavior is in general performed by interpreting various features observed on the body. Facial actions, such as speaking, yawning or laughing, are considered key features. Dynamic changes within the face can be modeled with the well-known Hidden Markov Models (HMM). Unfortunately, even within one class, examples can show high variance because of unknown start and end states or the varying length of a facial action. In this work we therefore decompose facial actions into so-called submotions. These can be robustly recognized with HMMs, using selected points in the face and their geometrical distances as features. Additionally, the first and second derivatives of the distances are included. A sequence of submotions is then interpreted with a dictionary and dynamic programming, as the order may be crucial. Analyzing the frequency of sequences confirms the relevance of the submotion order. In an experimental section we show that our novel submotion approach outperforms a standard HMM with the same set of features by nearly 30% absolute recognition rate.
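A minimal sketch of the feature side of this idea, assuming facial landmarks have already been tracked per frame; the chosen point pairs, the HMM toolkit, and the dictionary entries are placeholders rather than the authors' exact setup. It shows distances between selected facial points with their first and second temporal derivatives, plus a dynamic-programming distance for order-sensitive matching of a recognized submotion sequence against a dictionary entry.

```python
import numpy as np

def submotion_features(landmarks, point_pairs):
    """landmarks: (T, N, 2) tracked facial points over T frames.
    point_pairs: list of (i, j) index pairs whose distances serve as features.
    Returns a (T, 3 * len(point_pairs)) matrix: distances plus their first and
    second temporal derivatives, as described in the abstract."""
    diffs = landmarks[:, [i for i, _ in point_pairs], :] - \
            landmarks[:, [j for _, j in point_pairs], :]
    dist = np.linalg.norm(diffs, axis=-1)   # geometrical distances per frame
    d1 = np.gradient(dist, axis=0)          # first derivative over time
    d2 = np.gradient(d1, axis=0)            # second derivative over time
    return np.concatenate([dist, d1, d2], axis=1)

def edit_distance(seq, entry):
    """Dynamic-programming (edit) distance between a recognized submotion
    sequence and a dictionary entry; order-sensitive, as the abstract requires."""
    T = np.zeros((len(seq) + 1, len(entry) + 1), dtype=int)
    T[:, 0] = np.arange(len(seq) + 1)
    T[0, :] = np.arange(len(entry) + 1)
    for a in range(1, len(seq) + 1):
        for b in range(1, len(entry) + 1):
            T[a, b] = min(T[a - 1, b] + 1, T[a, b - 1] + 1,
                          T[a - 1, b - 1] + (seq[a - 1] != entry[b - 1]))
    return T[len(seq), len(entry)]
```

Each submotion class would get its own HMM trained on such feature sequences; the recognized submotion string is then matched against the dictionary of facial actions with the edit-distance routine above.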
One of the challenges in reaching widespread autonomous driving is the establishment of driver trust in the technology. We suggest a windshield display interface that shows the perceptive abilities and decision-making of an automated car while driving. We took a human-centered design approach to determine user expectations and requirements. We present our resulting interface prototype, which runs in a mixed-reality environment. We plan to evaluate its impact on situation awareness and trust in hard-to-predict urban scenarios.
The asynchronous hidden Markov model (AHMM) can model the joint likelihood of two observation sequences, even if the streams are not synchronised. Previously this model has been applied to audio-visual recognition tasks. The main drawback of the concept is its rather high training and decoding complexity. In this work we show how the complexity can be reduced significantly with advanced running indices for the calculations, while the characteristics of the AHMM and its advantages are preserved. The improvement also allows a scaling procedure to keep numerical values in a reasonable range. In an experimental section we compare the complexity of the original and the improved concept and validate the theoretical results. The model is then tested on a bimodal speech and gesture user input fusion task: compared to a late-fusion HMM, an improvement of more than 10% absolute recognition performance has been achieved.
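For orientation, a minimal sketch of the baseline AHMM forward recursion (Bengio-style), in which each state either emits a joint pair from both streams or a symbol of the first stream alone. The emission models `p_joint` and `p_single` and the stream data are assumed placeholders; the paper's improved running-index scheme and the scaling procedure are not reproduced here, only the O(T·S·N²) baseline they speed up.

```python
import numpy as np

def ahmm_forward(x, y, pi, A, eps, p_joint, p_single):
    """x: first observation stream (length T), y: second stream (length S <= T),
    pi: initial state probabilities (N,), A: transition matrix (N, N),
    eps[i]: probability that state i also consumes a symbol of y,
    p_joint(i, xt, ys), p_single(i, xt): emission likelihoods (assumed callables)."""
    T, S, N = len(x), len(y), len(pi)
    alpha = np.zeros((T, S + 1, N))      # s = number of consumed y symbols
    for i in range(N):
        alpha[0, 0, i] = pi[i] * (1 - eps[i]) * p_single(i, x[0])
        alpha[0, 1, i] = pi[i] * eps[i] * p_joint(i, x[0], y[0])
    for t in range(1, T):
        for s in range(0, min(t + 1, S) + 1):
            for i in range(N):
                stay = (1 - eps[i]) * p_single(i, x[t]) * np.dot(A[:, i], alpha[t - 1, s])
                adv = 0.0
                if s > 0:   # consume the next y symbol together with x[t]
                    adv = eps[i] * p_joint(i, x[t], y[s - 1]) * np.dot(A[:, i], alpha[t - 1, s - 1])
                alpha[t, s, i] = stay + adv
    return alpha[T - 1, S].sum()         # joint likelihood p(x, y)
```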
This paper presents a new framework for real-time multimodal data processing. The framework comprises modules for different input and output signals and was designed for human-human or human-robot interaction scenarios. Single modules for recording selected channels such as speech, gestures or facial expressions can be combined with different output options (e.g. robot reactions) in a highly flexible manner. Depending on the included modules, online as well as offline data processing is possible. The framework was used to analyze human-human interaction to gain insights into important factors and their dynamics. The recorded data comprise speech, facial expressions, gestures and physiological data. This naturally produced data was annotated and labeled in order to train recognition modules which will be integrated into the existing framework. The overall aim is to create a system that is able to recognize and react to those parameters that humans take into account during interaction. In this paper, the technical implementation and its application in a human-human and a human-robot interaction scenario are presented.
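The abstract does not name concrete APIs, so the module and pipeline classes below are hypothetical stand-ins that merely illustrate the described idea: recording modules for single channels are chained flexibly with output options such as robot reactions, and the same chain can run online on live sensors or offline on recorded data.

```python
from abc import ABC, abstractmethod

class Module(ABC):
    @abstractmethod
    def process(self, frame: dict) -> dict:
        """Consume the shared data frame, add this module's results, pass it on."""

class SpeechRecorder(Module):
    def process(self, frame):
        frame["speech"] = "audio chunk"   # placeholder: capture or read audio
        return frame

class RobotReaction(Module):
    def process(self, frame):
        print("react to:", frame)         # placeholder: map cues to a robot behaviour
        return frame

class Pipeline:
    """Modules are combined in a configurable order, online or offline."""
    def __init__(self, modules):
        self.modules = modules
    def run(self, frame=None):
        frame = frame or {}
        for module in self.modules:
            frame = module.process(frame)
        return frame

Pipeline([SpeechRecorder(), RobotReaction()]).run()
```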
This paper introduces the software framework MMER Lab which allows an effective assembly of modular signal processing systems optimized for memory efficiency and performance. Our C/C++ framework is designed to constitute the basis of a well-organized and simplified development process in industrial and academic research teams. It supports the structuring of modular systems by providing basic data, parameter, and command interfaces, ensuring the re-usability of the system components. Due to the underlying multi-threading capabilities, applications built in MMER Lab can fully exploit the increasing computational power of multi-core CPU architectures. This feature is realized by a buffering concept which controls the data flow between the connected modules and allows for the parallel processing of consecutive signal segments (e.g. video frames). We introduce the concept of the multi-threading environment and the data flow architecture with its convenient programming interface. We illustrate the proposed module concept for the generic assembly of processing chains and show applications from the area of video analysis and pattern recognition.
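A hedged sketch of the buffering idea only: MMER Lab itself is a C/C++ framework, and this small Python analogue simply illustrates how bounded buffers between threaded modules let consecutive signal segments be processed in parallel. Module names and the toy processing functions are invented for illustration.

```python
import threading, queue

def run_module(process, in_q, out_q):
    """Pull segments from the input buffer, process them, push results onward."""
    while True:
        segment = in_q.get()
        if segment is None:              # end-of-stream marker
            if out_q is not None:
                out_q.put(None)
            break
        result = process(segment)
        if out_q is not None:
            out_q.put(result)

# Two chained modules (e.g. decode -> analyze) with a bounded buffer in between,
# so the first stage can already work on frame t+1 while the second handles frame t.
buf_in, buf_mid = queue.Queue(maxsize=4), queue.Queue(maxsize=4)
decode = threading.Thread(target=run_module,
                          args=(lambda f: f.upper(), buf_in, buf_mid))
analyze = threading.Thread(target=run_module,
                           args=(lambda f: print("processed", f), buf_mid, None))
decode.start(); analyze.start()
for frame in ["frame0", "frame1", "frame2"]:
    buf_in.put(frame)
buf_in.put(None)
decode.join(); analyze.join()
```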
Within the car, recognition of emotion largely helps to make communication more natural. Speech interaction is increasingly used in this context, and affective cues are contained in both acoustic and linguistic parameters. Here we introduce novel concepts and results on the estimation of a driver's emotion, focusing on acoustic information. As a database we recorded 2k dialog turns directed to an automotive infotainment interface during extensive usability studies. Speech recognition and natural language interpretation were thereby realized once as a Wizard-of-Oz simulation and once by actual recognition technology. Recorded utterances have been labelled with a closed set of four emotions, namely anger, confusion, joy, and neutrality. As acoustic features we apply a large number of prosodic, speech quality, and articulatory functionals derived by descriptive statistical analysis from base contours such as intonation, intensity, and spectral information. Self-learning feature generation and selection is employed to optimize complexity for the subsequent classification by Support Vector Machines. Semantic information is included by a vector-space representation of the spoken content within an early feature fusion. Overall, high recognition performance can be reported for this task with the suggested approach.
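An illustrative sketch of the described pipeline under simplifying assumptions: statistical functionals of acoustic base contours are early-fused with a vector-space representation of the spoken words and classified by an SVM. The concrete contour set, functional set, feature generation/selection step, and data are not those of the paper; the toy transcripts and random contours below are placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer

def functionals(contour):
    """Descriptive statistics of one base contour (e.g. F0 or intensity)."""
    return [np.mean(contour), np.std(contour), np.min(contour),
            np.max(contour), np.ptp(contour)]

def turn_features(f0, intensity, transcript, vectorizer):
    acoustic = functionals(f0) + functionals(intensity)
    linguistic = vectorizer.transform([transcript]).toarray()[0]
    return np.concatenate([acoustic, linguistic])   # early feature fusion

# toy data: two dialog turns with contours, transcripts, and emotion labels
transcripts = ["this is great", "this does not work"]
labels = ["joy", "anger"]
vectorizer = CountVectorizer().fit(transcripts)
X = np.vstack([turn_features(np.random.rand(100), np.random.rand(100), t, vectorizer)
               for t in transcripts])
clf = SVC(kernel="linear").fit(X, labels)
```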
Generative adversarial networks (GANs) have shown their superiority for speech enhancement. Nevertheless, most previous attempts had convolutional layers as the backbone, which may obscure long-range dependencies across an input sequence due to the convolution operator's local receptive field. One popular solution is substituting recurrent neural networks (RNNs) for convolutional neural networks, but RNNs are computationally inefficient because their temporal iterations cannot be parallelized. To circumvent this limitation, we propose an end-to-end system for speech enhancement by applying the self-attention mechanism to GANs. We aim to achieve a system that is flexible in modeling both long-range and local interactions and can be computationally efficient at the same time. Our work is implemented in three phases: firstly, we apply the stand-alone self-attention layer in speech enhancement GANs. Secondly, we employ locality modeling on the stand-alone self-attention layer. Lastly, we investigate the functionality of the self-attention augmented convolutional speech enhancement GANs. Systematic experiment results indicate that, equipped with the stand-alone self-attention layer, the system outperforms baseline systems across classic evaluation criteria with up to 95% fewer parameters. Moreover, locality modeling can be a parameter-free approach for further performance improvement, and self-attention augmentation also surpasses all baseline systems with an acceptable increase in parameters.
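A minimal numpy sketch of the scaled dot-product self-attention building block discussed here, to make the receptive-field argument concrete: every output frame attends to all input frames, in contrast to a convolution's local window. The projection matrices and dimensions are arbitrary; the full GAN generator/discriminator and the locality-modeling variants are beyond this illustration.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (T, d) sequence of frame features; Wq/Wk/Wv: (d, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (T, T) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over key positions
    return weights @ V                                  # (T, d_k) attended features

T, d, dk = 128, 64, 32
rng = np.random.default_rng(0)
out = self_attention(rng.standard_normal((T, d)),
                     rng.standard_normal((d, dk)),
                     rng.standard_normal((d, dk)),
                     rng.standard_normal((d, dk)))
```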
An expert system supporting the operation of electric power systems was enhanced by adding a bi-directional acoustic dialogue interface. The appertaining subsystems present the expert system's proposals for switching, setting or other actions as synthetic voice in natural/switching language (in addition to the original screen display) and recognize the operators' spoken answers (e.g. confirmation or rejection of operations, or requests for explanation). The operators can thus stay focused on the operational surface of the power system without frequently switching over to the man-machine interface of the expert system, and are able to observe and understand the context and reactions of the performed operations more continuously.
In this paper we present a context-dependent hybrid MMI-connectionist / Hidden Markov Model (HMM) speech recognition system for the Wall Street Journal (WSJ) database. The hybrid system is built from a neural network, which is used as a vector quantizer (VQ), and an HMM with discrete probability density functions, which has the advantage of faster decoding. The neural network is trained with an algorithm that maximizes the mutual information between the classes of the input features (e.g. phones, triphones, etc.) and the firing sequence of the network. The system has been trained on the 1992 WSJ corpus (si-84). Tests were performed on the five- and twenty-thousand word, speaker-independent (si_et) tasks. The error rates of the new context-dependent neural network are 29% lower (relative) than those of a standard (k-means) discrete system, and very close to those of the best continuous/semi-continuous HMM speech recognizers.
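A sketch of the mutual-information criterion only, under the assumption that the network's discrete codebook outputs (its "firing sequence") and the class labels (e.g. phones) are available as index arrays: the quantity estimated below is what the training algorithm maximizes, but the gradient-based network optimization itself is not shown, and the toy data are invented.

```python
import numpy as np

def mutual_information(classes, codes, n_classes, n_codes):
    """Estimate I(class; code) in bits from joint counts."""
    joint = np.zeros((n_classes, n_codes))
    for c, k in zip(classes, codes):
        joint[c, k] += 1
    joint /= joint.sum()
    p_c = joint.sum(axis=1, keepdims=True)   # class marginal
    p_k = joint.sum(axis=0, keepdims=True)   # codebook marginal
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (p_c @ p_k)[nz])))

# toy example: 3 phone classes quantized onto 4 codebook entries
rng = np.random.default_rng(1)
classes = rng.integers(0, 3, size=1000)
codes = (classes + rng.integers(0, 2, size=1000)) % 4   # codes partly informative
print(mutual_information(classes, codes, 3, 4))
```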
The integration of more and more functionality into the human-machine interface (HMI) of vehicles increases the complexity of device handling. Thus, making optimal use of different human sensory channels is one approach to simplifying the interaction with in-car devices; in this way, user convenience increases while driver distraction may decrease. In this paper, a video-based real-time hand gesture recognition system for in-car use is presented. It was developed in the course of extensive usability studies. In combination with a gesture-optimized HMI, it allows intuitive and effective operation of a variety of in-car multimedia and infotainment devices with hand poses and dynamic hand gestures.