An active disturbance rejection control (ADRC) technique based on the nonlinear dynamic inversion (NDI) method was proposed for the nonlinear flight control law design of a tailless unmanned aerial vehicle. The dynamics to be controlled were separated into two groups. A control loop was designed for each group, and the integrated controller was constructed by cascading the two loops. The controller was compared with the classic NDI method and shown to yield better disturbance rejection and robustness. Simulation results also showed that a controller designed at a single equilibrium condition could perform well over a large flight envelope.
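The disturbance-rejection idea behind ADRC can be illustrated with a linear extended state observer (ESO) on a first-order plant. This is a minimal sketch and not the NDI-based design of the abstract: the plant `x' = -x + d + u`, the bandwidth-style gains, and every name below are assumptions chosen for illustration.

```python
def simulate_adrc(r=1.0, d=0.5, b0=1.0, wo=20.0, kp=5.0, dt=1e-3, steps=5000):
    """Track reference r on a toy plant x' = f + b0*u (hypothetical example).

    The controller never sees f = -x + d directly. The observer state z2
    estimates this "total disturbance" and the control law cancels it.
    """
    l1, l2 = 2.0 * wo, wo * wo          # observer gains from bandwidth wo
    x = z1 = z2 = 0.0                   # plant state, ESO estimates of x and f
    for _ in range(steps):
        # Control law: cancel estimated disturbance, then proportional tracking.
        u = (kp * (r - z1) - z2) / b0
        # Toy plant (assumed); f is unknown to the controller.
        f = -x + d
        dx = f + b0 * u
        # Linear ESO driven by the output estimation error.
        e = x - z1
        dz1 = z2 + b0 * u + l1 * e
        dz2 = l2 * e
        # Forward-Euler integration.
        x += dt * dx
        z1 += dt * dz1
        z2 += dt * dz2
    return x, z2


final_x, disturbance_estimate = simulate_adrc()
```

With these assumed gains the output settles at the reference while `z2` converges to the lumped disturbance, which is why no explicit model of `f` is needed in the control law.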
In this paper, we describe our contributions to the challenge of detection and classification of acoustic scenes and events. We propose the multi-scale convolutional recurrent neural network (Multi-scale CRNN), a novel weakly-supervised learning framework for sound event detection. By integrating information from different time resolutions, the multi-scale method captures both fine-grained and coarse-grained features of sound events and models both fine-grained and long-term temporal dependencies. Furthermore, the ensemble method proposed in the paper uses classification results to reduce frame-level prediction errors. On the development set of DCASE2018 task4, the proposed method achieves an event-based F1-score of 29.2% and an event-based error rate of 1.40, compared with the baseline's 14.1% F1-score and 1.54 error rate [1].
A novel acoustic modeling method for Chinese speech recognition based on an Intra-Syllable Dependent Phone (ISDP) set is proposed and implemented. The ISDP set extends the traditional phone set using intra-syllable information from Chinese phonetic knowledge. The acoustic models based on the ISDP set (ISDPMs) have two main features. First, they are suitable when only a small amount of training data is available. Second, the scheme integrates tri-phone modeling and syllable modeling. Gaussian mixture densities are used to describe the feature space of each ISDP, and the Viterbi algorithm is adopted for decoding. In addition, an ISDP-syllable search tree is designed to reduce decoding complexity. Experimental results show that ISDP modeling is more flexible and faster than syllable modeling, with little loss in performance.
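The Viterbi decoding step mentioned above can be sketched generically as follows. This is a textbook log-domain Viterbi over a small hidden Markov model, not the ISDP-syllable search tree itself. The emission scores are assumed to be precomputed (in the abstract's setting they would come from Gaussian mixture densities), and all function names are illustrative.

```python
import math


def safe_log(p):
    """Log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path and its log score.

    log_init[s]     : log prior of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log likelihood of frame t under state s (assumed given)
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor for state s at frame t.
            best = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best] + log_trans[best][s] + log_emit[t][s])
            ptr.append(best)
        score = new_score
        backpointers.append(ptr)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


# Two-state left-to-right toy model with made-up probabilities.
init = [safe_log(1.0), safe_log(0.0)]
trans = [[safe_log(0.5), safe_log(0.5)], [safe_log(0.0), safe_log(1.0)]]
emit = [[safe_log(0.9), safe_log(0.1)],
        [safe_log(0.8), safe_log(0.2)],
        [safe_log(0.1), safe_log(0.9)]]
best_path, best_score = viterbi(init, trans, emit)
```

A search tree such as the one described in the abstract would restrict which predecessor states are considered at each frame. The sketch above scans all of them.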
Dynamic music emotion prediction is crucial to emerging applications in music retrieval and recommendation. Considering the influence of temporal context and hierarchical structure on emotion in music, we propose a multi-scale regression method based on Deep Bidirectional Long Short-Term Memory (DBLSTM). In this method, a post-processing component is applied to each individual DBLSTM output to further enhance temporal context processing, and a fusion component integrates the outputs of all DBLSTM models at different scales. In addition, we investigate how a difference in sequence length between the training and prediction phases affects the performance of DBLSTM. We conduct our experiments on the public database of the Emotion in Music task at MediaEval 2015, and the results show that our method achieves a significant improvement over state-of-the-art methods.
Question detection is important for many speech applications. Only parts of a speech utterance provide useful clues for question detection. Previous work on question detection using acoustic features in Mandarin conversation is weak at capturing this temporal context, which can be modeled naturally by a recurrent neural network (RNN) structure. In this paper, we investigate recurrent approaches to this problem. Based on the gated recurrent unit (GRU), we build different RNN and bidirectional RNN (BRNN) models to extract efficient features at the segment and utterance levels. The particular advantage of the GRU is that it can determine a proper time scale for extracting high-level contextual features. Experimental results show that features extracted within a proper time scale make the classifier perform better than the baseline method with a pre-designed lexical and acoustic feature set.
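How a GRU adapts its time scale can be seen in a single recurrence step: the update gate `z` decides, per hidden unit, how much of the previous state is kept. The sketch below is a plain-Python GRU step under one common sign convention (`h_new = (1 - z) * h + z * candidate`; some libraries swap the roles of `z` and `1 - z`). The parameter dictionary layout and all names are assumptions for illustration, not the models built in the paper.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gate(W, U, b, x, h, squash):
    """Compute squash(W x + U h + b) element-wise."""
    return [squash(wx + uh + bi)
            for wx, uh, bi in zip(matvec(W, x), matvec(U, h), b)]


def gru_step(x, h, p):
    """One GRU step; p holds weights Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh."""
    z = gate(p["Wz"], p["Uz"], p["bz"], x, h, sigmoid)      # update gate
    r = gate(p["Wr"], p["Ur"], p["br"], x, h, sigmoid)      # reset gate
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    cand = gate(p["Wh"], p["Uh"], p["bh"], x, gated_h, math.tanh)
    # z near 0 keeps the old state (long time scale);
    # z near 1 overwrites it with the candidate (short time scale).
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


# Toy parameters: unit 0 has a closed update gate, unit 1 an open one.
zeros_in = [[0.0], [0.0]]
zeros_h = [[0.0, 0.0], [0.0, 0.0]]
params = {"Wz": zeros_in, "Uz": zeros_h, "bz": [-50.0, 50.0],
          "Wr": zeros_in, "Ur": zeros_h, "br": [0.0, 0.0],
          "Wh": zeros_in, "Uh": zeros_h, "bh": [0.5, 0.5]}
h_next = gru_step([1.0], [0.3, 0.3], params)
```

In this toy setting the first hidden unit carries its previous value forward almost unchanged while the second is replaced by the new candidate, so the two units operate on different time scales within the same layer.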
To address the challenge of cruise-phase control in high-speed aircraft, this paper introduces a novel approach to position and attitude controller design based on reinforcement learning. Reflecting the operating environment of high-speed aircraft, the reinforcement learning training environment is built on a 3DOF longitudinal dynamic model. Because both the state and control signals in position and attitude control are continuous variables, a controller design method based on the twin delayed deep deterministic policy gradient (TD3) algorithm is studied. Built on the actor-critic architecture, the controller learns continuously through iterative interaction with its environment, which reduces the reliance on highly accurate aircraft models. Simulation results show that the TD3-based position and attitude controller outperforms counterparts based on PID, NDI, or LQR. The TD3-based controller not only achieves good control accuracy but also streamlines the controller design process by adopting an end-to-end approach.
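The critic update in TD3 relies on a target value that combines clipped double-Q learning with target policy smoothing. The sketch below computes that target for a single transition with a scalar action. The target actor and critics are passed in as plain callables, and the default hyperparameters and all names are illustrative assumptions, not values taken from the paper.

```python
import random


def clip(value, low, high):
    return max(low, min(high, value))


def td3_target(reward, next_state, done, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0,
               rng=random):
    """Bootstrapped TD3 target for one transition (scalar action assumed)."""
    # Target policy smoothing: perturb the target action with clipped noise.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = clip(target_actor(next_state) + noise, -act_limit, act_limit)
    # Clipped double-Q: take the smaller of the two target critic estimates.
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    # No bootstrapping past a terminal state.
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * q_min


# Stand-in target networks (hypothetical constants, for illustration only).
example_target = td3_target(
    reward=1.0, next_state=None, done=False,
    target_actor=lambda s: 0.5,
    target_q1=lambda s, a: 2.0,
    target_q2=lambda s, a: 3.0,
    gamma=0.9, rng=random.Random(0),
)
```

The "delayed" part of TD3 is not shown: the actor and the target networks are updated less often than the critics, typically once every few critic updates.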
Recently, the number of online videos has been booming. However, their openness allows horror clips to threaten children's physical and mental health. It is therefore necessary to design an algorithm that filters horror clips from online videos. In this paper, we propose a multimodal multilevel attention neural network for horror clip detection. Information from the visual and auditory modalities is used to describe the various factors of horror, including violence, blood, deformed humans, screams, and sudden sounds. The temporal-level attention is designed to give the model the ability to capture horror moments. The modal-level attention automatically balances the weights across modalities. We evaluate the model on the dataset used in the MediaEval 2017 Emotional Impact of Movies Task. The experimental results show the advantages of our proposed model compared with the results of other groups.
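Both attention levels described above reduce to the same operation: softmax-normalise a set of relevance scores and take the weighted sum of the corresponding vectors. The sketch below shows that operation in plain Python. In a trained network the scores would come from learned layers, so here they are simply given as inputs, and the feature values and names are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, scores):
    """Weighted sum of equal-length vectors using softmax(scores) as weights."""
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors))
              for i in range(dim)]
    return pooled, weights


# Temporal level: pool per-frame features within each modality (toy values).
visual_frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
audio_frames = [[0.2, 0.8], [0.9, 0.1], [0.4, 0.6]]
visual_vec, visual_w = attention_pool(visual_frames, [0.1, 2.0, 0.3])
audio_vec, audio_w = attention_pool(audio_frames, [1.5, 0.2, 0.2])

# Modal level: balance the two pooled modality vectors (toy scores).
clip_vec, modal_w = attention_pool([visual_vec, audio_vec], [0.7, 0.3])
```

A high temporal score concentrates the pooled vector on a few frames, which is how such a layer can emphasise brief moments in a longer clip.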
This paper introduces a Chinese Spontaneous Telephone Speech Corpus in the flight enquiry and reservation domain (CSTSC-Flight), comprising 6 GB of raw data with about 50 hours of valid speech, and outlines its collection and transcription principles. The spoken language phenomena contained in the corpus are then analyzed. Based on this analysis, four types of grammatical rules are proposed to cover as many Chinese spoken language phenomena as possible for robust natural language parsing and understanding in spoken dialogue systems.