Block linked list index structure for large data full text retrieval

2017 
Existing Chinese full-text retrieval algorithms handle large data poorly: their data structures are difficult to expand, they are not suited to incremental indexing, and their retrieval efficiency is low. To address these problems, this paper builds on the traditional inverted index and proposes the block linked-list index structure, which supports extensible storage of large data and real-time updates. First, the new algorithm introduces the block unit as a management concept. Each block unit manages a document set, the master index holds an index entry for each term, and an index linked list maps each term's index entry to the block units. This structure greatly improves the index's capacity for expansion. Second, the master index and the document index both use fixed-length storage with the same record length, so the position of a term's index information is fixed in both the master index file and the document index file of each block unit. This makes incremental index updates straightforward and improves update efficiency. Finally, in the experiments, 350,000 documents (about 1.46 TB of data) are randomly selected from the internet corpus SogouT and used to compare the two index algorithms in three respects: creating the index for the initial data set, updating a large number of files, and retrieving from massive data sets. The results show that the new index algorithm has higher processing performance, especially in update efficiency, where it improves by nearly 10%.
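The structure described in the abstract can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the class names, the 16-byte record layout, the integer term ids, and the rule of opening a new block after a fixed number of documents are all assumptions made for illustration, and byte arrays stand in for the master index file and the per-block document index files.

```python
import struct

# Hypothetical fixed-length record shared by the master and document indexes.
SLOT = struct.Struct("<qq")
NIL = -1  # end-of-list marker for the index linked list


class BlockUnit:
    """Manages one document set, with one fixed-length slot per term id."""

    def __init__(self, block_id, num_terms):
        self.block_id = block_id
        # Stand-in for the block's document index file (zero-filled slots).
        self.index = bytearray(SLOT.size * num_terms)
        # Stand-in for posting storage: term id -> document ids.
        self.postings = {}

    def add(self, term_id, doc_id):
        docs = self.postings.setdefault(term_id, [])
        docs.append(doc_id)
        # The slot offset depends only on term_id, so the update is in place.
        SLOT.pack_into(self.index, term_id * SLOT.size, term_id, len(docs))

    def count(self, term_id):
        return SLOT.unpack_from(self.index, term_id * SLOT.size)[1]


class BlockLinkedListIndex:
    def __init__(self, num_terms, docs_per_block):
        self.num_terms = num_terms
        self.docs_per_block = docs_per_block  # assumed block-splitting rule
        # Master index: one slot per term holding (list head, total postings).
        self.master = bytearray(SLOT.size * num_terms)
        for t in range(num_terms):
            SLOT.pack_into(self.master, t * SLOT.size, NIL, 0)
        self.nodes = []   # index linked list: (block_id, next_node)
        self.blocks = []
        self.doc_count = 0

    def add_document(self, doc_id, term_ids):
        # Open a new block unit whenever the current one is full.
        if self.doc_count % self.docs_per_block == 0:
            self.blocks.append(BlockUnit(len(self.blocks), self.num_terms))
        block = self.blocks[-1]
        self.doc_count += 1
        for t in set(term_ids):
            head, total = SLOT.unpack_from(self.master, t * SLOT.size)
            if block.count(t) == 0:
                # First time this term appears in the block: link the block
                # at the front of the term's list.
                self.nodes.append((block.block_id, head))
                head = len(self.nodes) - 1
            block.add(t, doc_id)
            SLOT.pack_into(self.master, t * SLOT.size, head, total + 1)

    def search(self, term_id):
        head, _ = SLOT.unpack_from(self.master, term_id * SLOT.size)
        result, node = [], head
        while node != NIL:
            block_id, node = self.nodes[node]
            result.extend(self.blocks[block_id].postings.get(term_id, []))
        return sorted(result)


index = BlockLinkedListIndex(num_terms=4, docs_per_block=2)
index.add_document(10, [0, 1])
index.add_document(11, [1])
index.add_document(12, [1, 2])  # third document opens a second block unit
hits = index.search(1)
```

In this sketch, adding a document only appends to the newest block and overwrites slots at known offsets, so existing blocks are never rewritten. That is the property the abstract credits for cheaper incremental updates.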