    Chinese chunking method based on conditional random fields and semantic classes
    Abstract:
To improve the accuracy of Chinese chunking and utilize the semantic information of words, a new Chinese chunking method is proposed based on conditional random fields and semantic classes. Through an analysis of the Chinese chunking task and its sequential characteristics, conditional random fields, which can incorporate various types of features, were applied to overcome the label bias problem. Semantic features were utilized to improve the chunking performance. Experimental results show that the algorithm achieves an impressive accuracy of 92.77% in terms of the F-score. A further experiment indicates the effects of feature template selection and of the scale of the training data on chunking performance.
    Keywords:
    Chunking (psychology)
    Semantic feature
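The abstract's CRF chunker combines word, POS, and semantic-class features over feature templates. As a minimal sketch of that idea, the following builds per-token feature dictionaries for a BIO-style sequence labeler; the template names (w[0], pos[-1], sem[0], ...) and the sample tags are illustrative assumptions, not the templates used in the paper.

```python
# Hedged sketch: per-token feature extraction for a CRF-style chunker.
# Template names and semantic-class labels below are invented for illustration.

def token_features(words, pos_tags, sem_classes, i):
    """Build a feature dict for position i from word, POS, and semantic class."""
    feats = {
        "w[0]": words[i],
        "pos[0]": pos_tags[i],
        "sem[0]": sem_classes[i],   # semantic class of the current word
    }
    if i > 0:
        feats["w[-1]"] = words[i - 1]
        feats["pos[-1]"] = pos_tags[i - 1]
    else:
        feats["BOS"] = True         # sentence-initial marker
    if i < len(words) - 1:
        feats["w[+1]"] = words[i + 1]
        feats["pos[+1]"] = pos_tags[i + 1]
    else:
        feats["EOS"] = True         # sentence-final marker
    return feats

def sentence_features(words, pos_tags, sem_classes):
    """Extract features for every position in a sentence."""
    return [token_features(words, pos_tags, sem_classes, i)
            for i in range(len(words))]
```

A CRF toolkit would consume these dictionaries as the observation features for each token, with BIO chunk tags as the label sequence.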
We present new statistical models for jointly labeling multiple sequences and apply them to the combined task of part-of-speech tagging and noun phrase chunking. The model is based on the Factorial Hidden Markov Model (FHMM) with distributed hidden states representing part-of-speech and noun phrase sequences. We demonstrate that this joint labeling approach, by enabling information sharing between the tagging and chunking subtasks, outperforms the traditional method of tagging and chunking in succession. Further, we extend this into a novel model, the Switching FHMM, to allow for explicit modeling of cross-sequence dependencies based on linguistic knowledge. We report tagging/chunking accuracies for varying dataset sizes and show that our approach is relatively robust to data sparsity.
    Chunking (psychology)
    Sequence labeling
    Phrase
    Citations (14)
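The FHMM above keeps the POS and chunk chains as separate hidden-state factors. A crude way to see the "joint label" idea without the factorial machinery is to cross-product the two tag sets so a single-chain tagger labels both at once; the sketch below only shows that label bookkeeping, not the FHMM itself, and the tag strings are illustrative.

```python
# Hedged sketch: emulate joint POS+chunk labeling with cross-product labels.
# This is label bookkeeping only, not an implementation of the FHMM.

def join_labels(pos_seq, chunk_seq):
    """Combine parallel POS and chunk tags into joint labels like 'NN|B-NP'."""
    assert len(pos_seq) == len(chunk_seq), "sequences must be aligned"
    return [f"{p}|{c}" for p, c in zip(pos_seq, chunk_seq)]

def split_labels(joint_seq):
    """Recover the two tag sequences from joint labels."""
    pairs = [lab.split("|", 1) for lab in joint_seq]
    return [p for p, _ in pairs], [c for _, c in pairs]
```

The cost of this flattening is a quadratic blow-up in the label set, which is exactly what the factorial state representation avoids.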
    A new Chinese chunking algorithm is proposed based on conditional random fields and semantic features. Through the analysis of Chinese chunking task and its sequential characteristics, conditional random fields that combine various kinds of features were applied. Semantic features were utilized to further improve the chunking performance. Experimental results on the Chinese chunking corpus of Microsoft Research Asia show that the algorithm achieves impressive accuracy of 92.52% in terms of the F-score.
    Chunking (psychology)
State-of-the-art sequence labeling systems traditionally used handcrafted n-gram features and data pre-processing, but usually ignored character-level information. In this paper, we propose to apply a word hashing method, which can capture the morphological information of words, to sequence labeling tasks. An auto-encoder is first employed to learn latent morphological representations in a pre-training stage. Our model benefits from both the morphological and semantic features of words by using a bidirectional LSTM structure. Experimental results show that our model achieves the best results on the chunking task (94.93%) and the NP-chunking task (95.70%) on the CoNLL-2000 dataset, and obtains competitive performance on the NER task (89.29%) on the CoNLL-2003 dataset.
    Chunking (psychology)
    Sequence labeling
    Sequence (biology)
    Citations (0)
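"Word hashing" in this line of work plausibly refers to decomposing a word into boundary-marked character n-grams (letter-trigram hashing), so that morphologically related words share sub-word features. A minimal sketch under that assumption:

```python
# Hedged sketch: character n-gram word hashing (letter-trigram style).
# Assumes '#' as the word-boundary marker; the paper may use a different scheme.

def word_hash(word, n=3):
    """Decompose a word into boundary-marked character n-grams,
    e.g. 'cat' -> {'#ca', 'cat', 'at#'}."""
    padded = f"#{word}#"
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}
```

Each word is then represented by the (sparse) set of its n-grams, which an auto-encoder can compress into a dense morphological vector as the abstract describes.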
Multiword chunking is designed as a shallow parsing technique to recognize the external constituent and internal relation tags of a chunk in a sentence. In this paper, we propose a new solution to this problem. We design a new relation tagging scheme to represent different intra-chunk relations and conduct several feature engineering experiments to select the best baseline statistical model. We also apply outside knowledge from a large-scale lexical relationship knowledge base to improve parsing performance. By integrating all of the above techniques, we develop a new Chinese MWC parser. Experimental results show that its parsing performance greatly exceeds that of a rule-based parser trained and tested on the same data set.
    Chunking (psychology)
    Feature Engineering
A novel methodology is presented to enhance Chinese text chunking with the aid of transductive hidden Markov models (transductive HMMs), where chunking is considered as a special tagging problem. An attempt is thus made to utilize a number of transformation functions to introduce as much relevant contextual information as possible into model training. These functions enable the models to make use of contextual information to a greater extent while avoiding costly changes to the original training and tagging process. Each of them results in an individual model with certain pros and cons. Through a number of experiments, the best two models are integrated into a significantly better one. The chunking experiments were carried out on the HIT Chinese Treebank corpus; the results show that this is an effective approach to the recognition of Chinese chunks, achieving an F-score of 82.38%.
    Chunking (psychology)
    Treebank
    Citations (3)
In this paper, we address the issue of improving a Chinese chunking system with rich lexicalized information. To tackle the data sparseness problem given a limited amount of labeled training data, we propose a method that incorporates into a state-of-the-art CRF-based chunking model both statistical information, based on the distributional similarity between words obtained from a large unlabeled corpus, and morphological knowledge. Evaluations are performed on the latest release of the Chinese Treebank, and experimental results show that our method outperforms chunking models based on features over words and automatically assigned POS tags when using the same amount of training data.
    Chunking (psychology)
    Treebank
    Similarity (geometry)
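One common way to realize distributional similarity from an unlabeled corpus is to collect context-word counts in a small window around each word and compare words by cosine similarity; the sketch below assumes that simple formulation (the paper's actual similarity measure may differ).

```python
# Hedged sketch: distributional similarity via cosine over context counts.
# Window size and the raw-count weighting are illustrative choices.
import math
from collections import Counter

def context_vectors(sentences, window=1):
    """Count the words appearing within +/-window of each word occurrence."""
    vecs = {}
    for sent in sentences:
        for i, w in enumerate(sent):
            vec = vecs.setdefault(w, Counter())
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[sent[j]] += 1
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    num = sum(a[k] * b[k] for k in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0
```

Words with high similarity under such a measure can then share features (e.g. cluster IDs) in the CRF model, which is how unlabeled data alleviates sparseness for rare words.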
In the field of Chinese natural language processing, recognizing simple, non-recursive base phrases is an important task for applications such as information processing and machine translation. Instead of a rule-based model, we adopt a statistical machine learning method, the newly proposed Latent semi-CRF model, to solve the Chinese noun phrase chunking problem. Chinese base phrase chunking can be treated as a sequence labeling problem, which involves the prediction of a class label for each frame in an unsegmented sequence. Chinese noun phrases have sub-structures that cannot be observed in the training data. We propose a latent discriminative model called the Latent semi-CRF (Latent Semi Conditional Random Field), which incorporates the advantages of LDCRF (Latent Dynamic Conditional Random Fields) and semi-CRF, modeling the sub-structure of a class sequence and learning the dynamics between class labels, to detect Chinese noun phrases. Our results demonstrate that this latent dynamic discriminative model compares favorably to Support Vector Machines, Maximum Entropy Models, and Conditional Random Fields (including LDCRF and semi-CRF) on Chinese noun phrase chunking.
    Discriminative model
    Determiner phrase
    Chunking (psychology)
    Sequence labeling
    Phrase
    Citations (1)
Text chunking is an effective method to decrease the difficulty of natural language parsing. In this paper, a statistical method based on the hidden Markov model (HMM) is used for Chinese text chunking. Moreover, a transformation-based error-driven learning approach is adopted to improve the performance. The definition of transformation rule templates is the key problem of this machine learning approach; in this paper, all the templates are learned from the corpus automatically. The precision using the HMM alone is 88.19%, rising to 92.67% when the HMM is combined with transformation-based error-driven learning.
    Chunking (psychology)
    Template
    Citations (3)
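Transformation-based error-driven (Brill-style) learning repeatedly picks the rewrite rule that best corrects the current model's output against the gold standard. The sketch below scores candidates from a single invented template, "if the previous predicted tag is P, rewrite tag A as B", counting the errors a rule would fix minus the correct tags it would break; the paper learns its templates automatically, so this template is purely illustrative.

```python
# Hedged sketch: one round of Brill-style rule selection for tag sequences.
# The single rule template (previous-tag context) is an illustrative assumption.
from collections import Counter

def best_transformation(gold, pred):
    """Return the rule (prev_tag, from_tag, to_tag) with the highest net gain,
    or None if no rule would fix any error."""
    gains = Counter()
    # Credit: tokens the rule would correct.
    for i in range(1, len(gold)):
        prev, a = pred[i - 1], pred[i]
        if a != gold[i]:
            gains[(prev, a, gold[i])] += 1
    # Penalty: currently-correct tokens the rule would rewrite wrongly.
    for i in range(1, len(gold)):
        prev, a = pred[i - 1], pred[i]
        if a == gold[i]:
            for (p, x, b) in list(gains):
                if p == prev and x == a and b != a:
                    gains[(p, x, b)] -= 1
    return max(gains, key=gains.get) if gains else None
```

In a full learner this selection would loop: apply the best rule to the predictions, rescore, and stop when no rule has positive gain.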
    This paper presents a lexicalized HMM-based approach to Chinese text chunking. To tackle the problem of unknown words, we formalize Chinese text chunking as a tagging task on a sequence of known words. To do this, we employ the uniformly lexicalized HMMs and develop a lattice-based tagger to assign each known word a proper hybrid tag, which involves four types of information: word boundary, POS, chunk boundary and chunk type. In comparison with most previous approaches, our approach is able to integrate different features such as part-of-speech information, chunk-internal cues and contextual information for text chunking under the framework of HMMs. As a result, the performance of the system can be improved without losing its efficiency in training and tagging. Our preliminary experiments on the PolyU Shallow Treebank show that the use of lexicalization technique can substantially improve the performance of a HMM-based chunking system.
    Chunking (psychology)
    Lexicalization
    Treebank
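The hybrid tags above pack four components (word boundary, POS, chunk boundary, chunk type) into one label per known word, so decoding a tag sequence recovers the chunks. A minimal sketch, assuming an invented underscore-joined tag format like 'B_n_B_NP' (the paper's actual encoding may differ):

```python
# Hedged sketch: decode chunk spans from hybrid tags of the assumed form
# wordBoundary_POS_chunkBoundary_chunkType, e.g. 'B_n_B_NP'.

def chunks_from_hybrid(tags):
    """Recover (start, end, type) chunk spans from the chunk-boundary part
    of a hybrid tag sequence. 'B' opens a chunk, 'I' continues it, 'O' is outside."""
    spans, start, cur_type = [], None, None
    for i, t in enumerate(tags):
        _, _, cb, ct = t.split("_")
        if cb == "B":
            if start is not None:          # close the previous chunk
                spans.append((start, i, cur_type))
            start, cur_type = i, ct
        elif cb == "O":
            if start is not None:
                spans.append((start, i, cur_type))
                start = None
    if start is not None:                  # chunk running to sentence end
        spans.append((start, len(tags), cur_type))
    return spans
```

The appeal of such an encoding is that one lattice-based tagging pass jointly resolves segmentation, POS, and chunking, at the cost of a larger tag set.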
This paper proposes a distributed strategy for Chinese text chunking on the basis of conditional random fields (CRFs) and an error-driven technique. First, eleven types of Chinese chunks are divided into different groups, and a CRF model is built for each group. Then, the error-driven technique is applied to the CRF chunking results for further modification. Finally, a method is described to resolve conflicting chunking results according to their F-measure values. The experimental results show that this approach is effective, outperforming the single CRF-based approach, the distributed method, and other hybrid approaches in the open test by achieving 94.90%, 91.00%, and 92.91% in recall, precision, and F-measure respectively.
CRFs
    Chunking (psychology)
    Citations (1)
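The F-measure used for conflict resolution is the standard harmonic mean of precision and recall; note that the abstract's reported numbers are internally consistent, since F1(P=0.9100, R=0.9490) ≈ 0.9291. A small sketch of that computation, plus a hypothetical conflict-resolution policy that keeps the chunk proposed by the group model with the higher held-out F-measure:

```python
# Hedged sketch: F-beta measure and a hypothetical F-measure-based
# conflict policy; the candidate-dict fields are invented for illustration.

def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean F1 of precision and recall."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def resolve_conflict(candidates):
    """Among overlapping chunk proposals, keep the one from the model
    with the highest held-out F-measure (field 'f' is an assumed name)."""
    return max(candidates, key=lambda c: c["f"])
```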