This paper studies the connectivity of the undirected Basic Element complex network generated from a topic-related document set and obtains key properties of the network. The networks are mostly connected, apart from a few components that contain only a very small fraction of the nodes of the whole network, and the key semantic units do not occur in these small components. These properties will simplify the work when network methods are applied to natural language processing tasks.
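The connectivity measurement described in the abstract above can be illustrated with a minimal sketch. It assumes the network is available as a list of undirected edges; the function names `component_sizes` and `giant_component_ratio` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def component_sizes(edges):
    """Return connected-component sizes of an undirected graph, largest first."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    sizes = defaultdict(int)
    for node in list(parent):
        sizes[find(node)] += 1
    return sorted(sizes.values(), reverse=True)


def giant_component_ratio(edges):
    """Fraction of all nodes that lie in the largest component."""
    sizes = component_sizes(edges)
    return sizes[0] / sum(sizes) if sizes else 0.0
```

A ratio close to 1 corresponds to the "mostly connected" observation; nodes without any edge are not counted in this sketch because they never appear in the edge list.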
This paper presents a new access-density-based prefetching strategy to improve prefetching for access patterns that the current Linux read-ahead algorithm does not handle. These access patterns include reading file data backwards, reading files in a strided way (leaving holes between two adjacent references), alternating references between multiple file regions, and reading files randomly. When these patterns occur, the current Linux read-ahead algorithm cannot handle them because the read-ahead operation is not activated. Three metrics are proposed for evaluating the algorithm. The current results, obtained from a real prototype implementation in the Linux kernel, show that such prefetching can yield a significant performance improvement on the aforementioned access patterns.
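The abstract above does not define access density, so the following is only one possible reading, given as a user-space sketch and not as the kernel implementation. It assumes that a file is divided into fixed-size regions, that density is the fraction of a region's pages already referenced, and that the rest of a region is prefetched once density reaches a threshold. The class `DensityPrefetcher` and its parameters `region_pages` and `threshold` are hypothetical.

```python
from collections import defaultdict


class DensityPrefetcher:
    """Toy model: prefetch the rest of a region once enough of it was accessed."""

    def __init__(self, region_pages=8, threshold=0.5):
        self.region_pages = region_pages  # assumed region size, in pages
        self.threshold = threshold        # assumed density that triggers prefetch
        self.touched = defaultdict(set)   # region index -> pages seen so far

    def access(self, page):
        """Record an access and return the list of pages to prefetch."""
        region = page // self.region_pages
        seen = self.touched[region]
        seen.add(page)
        density = len(seen) / self.region_pages
        if density < self.threshold:
            return []
        first = region * self.region_pages
        missing = [p for p in range(first, first + self.region_pages)
                   if p not in seen]
        seen.update(missing)  # treat prefetched pages as already resident
        return missing
```

Because the decision depends only on how many pages of a region were referenced and not on their order, backward and strided accesses trigger prefetching in this model just as sequential ones do.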
We focus on the physical distribution of streams and address the problem of reducing communication for continuous monitoring of extreme values over distributed data streams. We first develop an effective pruning technique that minimizes the number of elements to be kept for extreme-value queries. We then consider the distributed environment, where remote nodes defer data transmission for as long as possible and adopt the pruning strategy to filter local stream tuples, which reduces communication efficiently. The method is extended to run adaptively in a degraded manner under resource limitations. Theoretical analysis and experimental evidence show the efficiency of the proposed approach in reducing communication.
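The abstract above does not spell out its pruning rule. As an illustration of the general idea, the sketch below uses a standard technique, the monotonic deque, and assumes a count-based sliding-window maximum query: an element can be discarded as soon as a newer element with an equal or larger value arrives, since it can never again be the window maximum. The class name `WindowMax` is hypothetical.

```python
from collections import deque


class WindowMax:
    """Keep only the elements that can still become the window maximum."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.candidates = deque()  # (timestamp, value), values strictly decreasing
        self.t = 0

    def push(self, value):
        """Add one stream element and return the current window maximum."""
        # Prune older candidates dominated by the new element.
        while self.candidates and self.candidates[-1][1] <= value:
            self.candidates.pop()
        self.candidates.append((self.t, value))
        # Expire the front candidate once it leaves the window.
        while self.candidates[0][0] <= self.t - self.window:
            self.candidates.popleft()
        self.t += 1
        return self.candidates[0][1]
```

In a distributed setting, a remote node that applies such a filter locally would only need to consider the surviving candidates for transmission; how and when they are sent is not covered by this sketch.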
Resource discovery is a challenging problem in grid computing because computational resources are large in scale and geographically distributed. Traditional decentralized resource discovery algorithms often focus on the search method in the forward direction, and the response message is used only to report the matching node or a matching failure. In this paper, a new resource discovery algorithm is introduced. Under this mechanism, a request message and its corresponding response message may take different paths to the destination node. We therefore add a probe feedback mechanism to the response message to rediscover the requested resource if it cannot be found on the forward path, which provides more chances to satisfy the request. Furthermore, if the environment supports advance reservation and no suitable resource is found during the rediscovery period, the response message can return the node that will be the first to provide a matching resource in the near future. Simulation shows that the algorithm can improve the performance of resource discovery, especially when the job size is large and turn-around time is not very important to users.
One important characteristic of wireless sensor networks is their stringent energy constraint. Constructing a connected dominating set (CDS) has been widely used as a topology control strategy to reduce network communication overhead. In this paper, a novel energy-efficient distributed connected dominating set algorithm based on a coordinated reconstruction mechanism is presented to further prolong the network lifetime and balance energy consumption. The algorithm has O(n) time complexity and O(n) message complexity. The simulation results show that our algorithm outperforms several existing algorithms in terms of network lifetime and CDS performance.
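For readers unfamiliar with the term, the two conditions that define a connected dominating set can be checked with the short sketch below. It is not the distributed construction algorithm of the abstract above; it only verifies a candidate set on a graph given as an adjacency dictionary. The function name `is_connected_dominating_set` is hypothetical.

```python
def is_connected_dominating_set(adj, cds):
    """Check the CDS definition on an undirected graph.

    adj maps each node to an iterable of its neighbours.
    """
    cds = set(cds)
    if not cds:
        return False
    # Domination: every node is in the set or has a neighbour in it.
    for node, neighbours in adj.items():
        if node not in cds and not (cds & set(neighbours)):
            return False
    # Connectivity: the subgraph induced by the set is connected.
    start = next(iter(cds))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v in cds and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == cds
```

On the path 1-2-3-4-5, the set {2, 3, 4} passes both checks, while {2, 4} dominates every node but is not connected.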
This paper proposes a strategy for Chinese multi-document summarization based on clustering and sentence extraction. It adopts the term vector to represent the linguistic units in a Chinese document, which to a certain extent achieves higher representation quality than the traditional word-based vector space model. For clustering, we propose two heuristics to automatically detect the proper number of clusters: the first makes full use of the summary length fixed by the user; the second is a stability method, which has been applied to other unsupervised learning problems. We also discuss a global search method for selecting sentences from the clusters. To evaluate our summarization strategy, an extrinsic evaluation method based on a classification task is adopted. Experimental results on a news document set show that the new strategy can significantly enhance the performance of Chinese multi-document summarization.
This paper emphasizes the role of named-entity-based patterns in measuring the relevance between document sentences and the topic for topic-focused extractive summarization. Patterns are defined as informative, semantically sensitive text bi-grams that contain at least one named entity or the semantic class of a named entity. They are extracted automatically according to eight pre-specified templates. When topic questions are handled, question types are also taken into consideration if they are available. To alleviate coverage problems, the pattern and uni-gram models are integrated so that they complement each other in the similarity calculation. Automatic ROUGE evaluations indicate that the proposed approach yields a system that outperforms the best-performing system at the Document Understanding Conference (DUC) 2005.
Document re-ranking sits between initial retrieval and query expansion in an information retrieval system. In this paper, we propose an entropy-based clustering approach for document re-ranking. The value of the within-cluster entropy determines whether two classes should be merged, and the value of the between-cluster entropy determines how many clusters are reasonable. The next step is to find a suitable cluster in the clustering result from which to construct pseudo-labeled documents, and then to conduct document re-ranking as in our previous method. We focus on the clustering strategy for documents after initial retrieval. Experiments with NTCIR-5 data show that the approach can improve on the performance of initial retrieval and helps to improve the quality of document re-ranking.
This paper presents a novel automatic news story segmentation scheme based on audio-visual features and text information. The basic idea is to first detect the shot boundaries of the news video and then identify the topic-caption frames with a text detection algorithm to obtain segmentation cues. In the next step, silence clips are detected using the short-time energy and short-time average zero-crossing rate (ZCR) parameters. Finally, the audio-visual features and text information are integrated to achieve automatic story segmentation. On test data with 135,400 frames, an accuracy rate of 85.8% and a recall rate of 97.5% are obtained. The experimental results show that the approach is valid and robust.
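The two audio parameters named in the abstract above can be computed per frame as in the following minimal sketch. It assumes non-overlapping frames of at least two samples over a list of amplitude values. The thresholding rule in `silent_frames` (both values below a threshold) is an assumption made for illustration, since the abstract does not state how the two parameters are combined; the function names and threshold parameters are hypothetical.

```python
def frame_features(samples, frame_len):
    """Return (short-time energy, zero-crossing rate) for each full frame."""
    if frame_len < 2:
        raise ValueError("frame_len must be at least 2")
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Mean squared amplitude of the frame.
        energy = sum(x * x for x in frame) / frame_len
        # Count sign changes between consecutive samples.
        crossings = sum(1 for a, b in zip(frame, frame[1:])
                        if (a >= 0) != (b >= 0))
        features.append((energy, crossings / (frame_len - 1)))
    return features


def silent_frames(samples, frame_len, energy_thresh, zcr_thresh):
    """Flag frames as silent under the assumed rule: low energy and low ZCR."""
    return [energy < energy_thresh and zcr < zcr_thresh
            for energy, zcr in frame_features(samples, frame_len)]
```

Runs of consecutive flagged frames would then form candidate silence clips; suitable thresholds depend on the recording and would have to be tuned on real data.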