Text Similarity Computing Based on Word Co-Occurrence
1 Citation · 0 References · 20 Related Papers
Abstract:
In text retrieval, insufficient expression of the user's requirements usually leads to large amounts of inappropriate information, which makes retrieval inconvenient for the user. The text similarity computing based on word co-occurrence presented in this paper enables users to delete or retain text collections similar to a given text in order to improve retrieval efficiency.
Keywords: Similarity, Co-occurrence
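The abstract does not reproduce the paper's exact similarity formula, so the following is a minimal sketch of the general idea under assumptions: each text is reduced to a term-frequency vector of the words it contains, cosine similarity measures how strongly two texts share word occurrences, and a hypothetical threshold decides which texts in a collection count as "similar" to a reference text. The function names and the threshold value are illustrative, not from the paper.

# Illustrative sketch: similarity between texts from shared word occurrences.
# The exact weighting used in the paper is not given in the abstract; this
# version uses simple term-frequency vectors and cosine similarity as an assumption.
import math
import re
from collections import Counter

def term_vector(text):
    """Build a term-frequency vector from the words of a text."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[w] * vec_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def filter_similar(reference_text, collection, threshold=0.5):
    """Return the texts whose similarity to the reference exceeds the threshold."""
    ref_vec = term_vector(reference_text)
    return [t for t in collection
            if cosine_similarity(ref_vec, term_vector(t)) >= threshold]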
Word truncation is a familiar technique employed by online searchers in order to increase recall in free text retrieval. The use of truncation, however, can be a mixed blessing since many words starting with the same root are not semantically or logically related. Consequently, online searchers often select words to be OR‐ed together from an alphabetic display of neighbouring terms in the inverted file in order to assure precision in the search. Automatic stemming algorithms typically function in a manner analogous to word truncation, with the added risk of the word roots being incorrectly identified by the algorithm. This paper describes a two‐phase stemming algorithm that consists of the identification of the word root and the automatic selection of ‘well‐formed’ morphological word variants from the actual inverted file entries that start with the same word root. The algorithm has been successfully used in an end‐user interface to NLM's Catline book catalog file.
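A minimal sketch of the two-phase idea described above: phase one supplies a word root, phase two keeps only the inverted-file entries that start with that root and look like well-formed morphological variants. The suffix whitelist below is a stand-in assumption for the algorithm's actual "well-formed" test, which the abstract does not spell out.

# Phase 1: a word root is given. Phase 2: scan the sorted inverted-file terms
# that share the root and keep only plausible morphological variants.
import bisect

WELL_FORMED_SUFFIXES = {"", "e", "s", "es", "ed", "ing", "er", "ers", "ation", "ations"}

def variants_from_index(root, index_terms):
    """Select plausible morphological variants of `root` from sorted index terms."""
    index_terms = sorted(index_terms)
    start = bisect.bisect_left(index_terms, root)
    selected = []
    for term in index_terms[start:]:
        if not term.startswith(root):
            break                      # past the block of terms sharing the root
        if term[len(root):] in WELL_FORMED_SUFFIXES:
            selected.append(term)      # well-formed variant, to be OR-ed into the query
    return selected

# e.g. variants_from_index("comput", ["compulsory", "computation", "compute",
#                                     "computer", "computers", "computing"])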
There are many reasons why an information retrieval system may fail to return appropriate information. Allomorphs are one cause of search failure due to keyword mismatch. This research proposes a method to construct alternative word candidates automatically in order to minimize search failure due to keyword mismatch. Assuming that two words have similar meanings if they have similar co-occurrence words, the proposed method uses the concept of concentration, association word sets, cosine similarity between association word sets, and a filtering technique based on confidence. Performance of the proposed method is evaluated against a manually extracted list of alternatives. Evaluation results show that the proposed method outperforms the context-window-overlap approach in both precision and recall.
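A sketch of the comparison step only: two words are treated as likely alternatives if the sets of words they co-occur with are similar. The paper's concentration measure and confidence-based filtering are omitted here, and using a whole sentence as the co-occurrence window is an assumption.

# Core comparison: cosine similarity between two association word sets,
# with each set treated as a binary vector.
import math

def association_set(target, sentences):
    """Set of words that co-occur with `target` in the same sentence."""
    assoc = set()
    for sentence in sentences:
        words = set(sentence.lower().split())
        if target in words:
            assoc |= words - {target}
    return assoc

def set_cosine(word_a, word_b, sentences):
    """Cosine similarity between two association word sets (as binary vectors)."""
    a = association_set(word_a, sentences)
    b = association_set(word_b, sentences)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))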
The Chinese word-cloud system visualizes Chinese documents supplied by the user or fetched from a given URL, using the frequency with which words appear in the documents to generate a multi-form image resembling a cloud, namely a "word cloud". The system first obtains the Chinese documents the user wants to view, segments them into words, then counts the frequency of each word, and finally places the words in an appropriate layout according to their frequency, producing the "word cloud" effect. The system uses a hash algorithm to keep the rendered words from overlapping and generates personalized Chinese word-cloud images, which let users quickly grasp the key information in Chinese documents and offer them varied visual impressions.
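A sketch of the counting stage only: segment the Chinese text into words and count the frequencies that would drive font size in the cloud. The choice of the jieba segmenter is an assumption (the paper does not name its segmenter), and the hash-based non-overlapping placement step is not reproduced here.

# Segmentation and frequency counting for a Chinese word cloud.
from collections import Counter

import jieba  # third-party Chinese word segmenter, assumed here: pip install jieba

def word_frequencies(text, top_n=50):
    """Return the top-n (word, count) pairs after segmentation."""
    words = [w.strip() for w in jieba.cut(text) if len(w.strip()) > 1]
    return Counter(words).most_common(top_n)

# Each (word, count) pair can then be rendered with a font size proportional
# to its count, with a hash of the word used to pick a non-overlapping slot.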
With the popularity and importance of document images as an information source, information retrieval in document image databases has become a challenge. This paper proposes an approach that matches partial word images to address two issues in document image retrieval: word spotting and similarity measurement between documents. Initially, each word image is represented by a primitive string. Then, an inexact string matching technique is used to measure the similarity between the string generated from the query word and the string generated from a word in the document. Based on this similarity, we can determine how relevant one word image is to another and decide whether one is a portion of the other. In order to deal with various character fonts, a primitive string representation that is tolerant to serif and font differences is used. Thanks to the inexact string matching, the method is able to handle the problem of heavily touching characters. Experimental results on a variety of document image databases confirm that the proposed approach is feasible, valid, and efficient in document image retrieval.
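A sketch of the matching step only, assuming each word image has already been converted into a string of shape primitives. A standard dynamic-programming edit distance stands in here for the paper's inexact string matching; the primitive-extraction step and the serif/font-tolerant encoding are not reproduced.

# Inexact matching between two primitive strings via edit distance,
# normalized into a 0..1 similarity score.
def edit_distance(query, candidate, sub_cost=1, gap_cost=1):
    """Dynamic-programming edit distance between two primitive strings."""
    m, n = len(query), len(candidate)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i * gap_cost
    for j in range(n + 1):
        dp[0][j] = j * gap_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if query[i - 1] == candidate[j - 1] else sub_cost
            dp[i][j] = min(dp[i - 1][j - 1] + cost,     # substitute / match
                           dp[i - 1][j] + gap_cost,     # delete from query
                           dp[i][j - 1] + gap_cost)     # insert into query
    return dp[m][n]

def similarity(query, candidate):
    """Normalize edit distance to a 0..1 similarity score."""
    longest = max(len(query), len(candidate)) or 1
    return 1.0 - edit_distance(query, candidate) / longest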
During the indexing process of a traditional search engine, web pages are reduced to lists of terms, but single terms cannot represent the rich content of web pages, so retrieval methods based mainly on term matching often suffer from depressed precision. This paper proposes a novel query expansion technique that uses phrases as its expansion unit. Phrases typically have higher information content and a smaller degree of ambiguity than their constituent words, and therefore represent the concepts expressed in text more accurately than single terms. The method extracts key phrases from the original results, calculates the semantic similarity between the query phrase and each extracted phrase using a semantic similarity algorithm based on WordNet, and then expands the query with the most similar phrases to search again. Experimental results show that the proposed algorithm provides higher precision than traditional query expansion methods.
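The abstract says phrase similarity is computed with a WordNet-based measure but does not name it, so the sketch below uses Wu-Palmer similarity via NLTK and an average-of-best-word-matches aggregation as stand-ins for the paper's actual algorithm; both choices are assumptions.

# WordNet-based similarity between two phrases (sketch under assumptions).
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def word_similarity(w1, w2):
    """Best Wu-Palmer similarity over all synset pairs of two words."""
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            score = s1.wup_similarity(s2)
            if score and score > best:
                best = score
    return best

def phrase_similarity(phrase_a, phrase_b):
    """Average, over words of phrase_a, of the best matching word in phrase_b."""
    words_a, words_b = phrase_a.lower().split(), phrase_b.lower().split()
    if not words_a or not words_b:
        return 0.0
    return sum(max(word_similarity(a, b) for b in words_b) for a in words_a) / len(words_a)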
This paper proposes a novel text representation and matching scheme for Chinese text retrieval. At present, the indexing methods of Chinese retrieval systems are either character-based or word-based. Character-based indexing methods, such as bi-gram or tri-gram indexing, have high false-drop rates due to mismatches between queries and documents. On the other hand, it is difficult to efficiently identify all the proper nouns, domain terminology, and phrases in word-based indexing systems. The new indexing method uses both proximity and mutual information of word pairs to represent the text content, so as to overcome the high false-drop, new-word, and phrase problems that exist in character-based and word-based systems. The evaluation results indicate that the average query precision of proximity-based indexing is 5.2% higher than the best results of TREC-5.
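A sketch of one ingredient only: pointwise mutual information for word pairs that co-occur within a proximity window. The window size and the way scored pairs are turned into index terms are assumptions; the abstract does not give the exact formulation used in the proposed proximity-based indexing.

# Pointwise mutual information for word pairs co-occurring within a window.
import math
from collections import Counter

def pair_mutual_information(documents, window=5):
    """Return PMI scores for word pairs co-occurring within `window` positions."""
    word_counts, pair_counts, total = Counter(), Counter(), 0
    for doc in documents:
        words = doc.split()
        total += len(words)
        word_counts.update(words)
        for i, w in enumerate(words):
            for v in words[i + 1:i + 1 + window]:
                pair_counts[tuple(sorted((w, v)))] += 1
    total = total or 1
    total_pairs = sum(pair_counts.values()) or 1
    pmi = {}
    for (w, v), c in pair_counts.items():
        p_pair = c / total_pairs
        p_w, p_v = word_counts[w] / total, word_counts[v] / total
        pmi[(w, v)] = math.log(p_pair / (p_w * p_v))
    return pmi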
Although word-based methods are commonly used in document retrieval, they cannot be applied directly to languages that have no explicit word separator. Given a lexicon, it is possible to identify words in documents, but a large lexicon is troublesome to maintain and makes retrieval systems large and complicated. This paper proposes an effective and efficient ranking method that does not use a large lexicon; words need not be identified during document registration because a character-based signature file is used as the access structure. During document retrieval, a user request is statistically analyzed to generate an appropriate query, and the query is evaluated efficiently in a word-based manner using the character-based index. We also propose two optimization techniques to accelerate retrieval.
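A sketch under assumptions: a character-bigram index stands in for the paper's character-based signature file, and a word query is evaluated by intersecting the bigrams' document sets and then verifying the full word, so no word lexicon is needed at registration time. The statistical query analysis and the two optimizations from the paper are not reproduced, and query words of length at least two characters are assumed.

# Character-based index built at registration time; word-based evaluation at query time.
from collections import defaultdict

def build_char_index(documents):
    """Map each character bigram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for i in range(len(text) - 1):
            index[text[i:i + 2]].add(doc_id)
    return index

def evaluate_word_query(word, index, documents):
    """Intersect bigram postings, then verify the word actually occurs."""
    bigrams = [word[i:i + 2] for i in range(len(word) - 1)]
    postings = [index.get(b, set()) for b in bigrams]
    candidates = set.intersection(*postings) if postings else set()
    return [doc_id for doc_id in sorted(candidates) if word in documents[doc_id]]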
A number of projects are dedicated to creating digital libraries from scanned books, such as Google Books, UDL, and the Digital Library of India (DLI). The ability to search the content of document images is essential for the usability and popularity of these digital libraries. In this work, we aim to build a retrieval system over 120K document images coming from 1000 scanned books of Telugu literature. This is a challenge because: i) OCRs are not robust enough for Indian languages, especially the Telugu script; ii) the document images contain a large number of degradations and artifacts; iii) scaling to large collections is hard. Moreover, users expect the search system to accept text queries and retrieve relevant results in interactive time.
For compression of text databases, semi-static word-based models are a pragmatic choice. Previous experiments have shown that, where there is not sufficient memory to store a full word-based model, encoding rare words as sequences of characters can still allow good compression, while a pure character-based model performs poorly. We propose a further kind of model that reduces main-memory costs: approximate models, in which rare words are represented by similarly spelt common words and a sequence of edits. We investigate the compression available with different models, including characters, words, word pairs, and edits, and with combinations of these approaches. We show experimentally that carefully chosen combinations of models can improve the compression available in limited memory and greatly reduce overall memory requirements.
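A sketch under assumptions: a rare word is encoded as a similarly spelt common word plus an edit script. Python's difflib is used purely to illustrate the idea; the paper's actual coding of edits (and its entropy coder) is not reproduced, and get_close_matches stands in for its spelling-similarity search.

# Represent a rare word as (common base word, edit script) and recover it later.
import difflib

def encode_rare_word(rare_word, common_words):
    """Encode `rare_word` as (base common word, list of edits), if a close match exists."""
    matches = difflib.get_close_matches(rare_word, common_words, n=1, cutoff=0.6)
    if not matches:
        return None                    # fall back to character-level coding
    base = matches[0]
    ops = difflib.SequenceMatcher(None, base, rare_word).get_opcodes()
    edits = [(tag, i1, i2, rare_word[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]
    return base, edits

def decode_rare_word(base, edits):
    """Apply the recorded edits to the base word to recover the rare word."""
    out, cursor = [], 0
    for tag, i1, i2, replacement in edits:
        out.append(base[cursor:i1])
        out.append(replacement)        # covers 'replace' and 'insert'; empty for 'delete'
        cursor = i2
    out.append(base[cursor:])
    return "".join(out)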