An algorithm for suffix stripping

Program: electronic library and information systems (1980)

Martin Porter

Citations (8,177)

Abstract:

The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
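The stepwise idea in the abstract can be illustrated with a minimal Python sketch. It is not the published BCPL program: the rule table `STEPS`, the helper names, and the thresholds are a small hand-picked subset chosen here for illustration, and the "measure" is assumed to count vowel-consonant sequences in a stem of the form [C](VC)^m[V].

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """Treat 'y' as a consonant unless it follows a consonant (assumed convention)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def apply_step(word, rules, min_measure=None):
    """Apply the first matching rule; rules are listed longest suffix first."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if min_measure is None or measure(stem) > min_measure:
                return stem + replacement
            return word  # suffix matched but the stem condition failed
    return word

# Hypothetical, heavily reduced rule table: (rules, required measure of the stem).
STEPS = [
    ([("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")], None),
    ([("ization", "ize")], 0),
    ([("alize", "al")], 0),
    ([("al", "")], 1),
]

def strip_suffixes(word):
    """Remove a compound suffix one simple suffix at a time."""
    for rules, min_measure in STEPS:
        word = apply_step(word, rules, min_measure)
    return word

print(strip_suffixes("generalizations"))  # gener
```

With this reduced table, "generalizations" passes through "generalization", "generalize", and "general" before reaching "gener", which shows how a compound suffix is removed as a sequence of simple ones.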

Keywords:

Stripping (fiber)

Suffix array

Generalized suffix tree

Compressed suffix array

SIMPLE algorithm

Topics:

Natural Language Processing Techniques

Algorithms and Data Compression

Web Data Mining and Analysis

10.1108/eb046814

A time and space efficient data structure for string searching on large texts

Information Processing Letters (1996)

Livio Colussi, Alessia De Col

Generalized suffix tree

Compressed suffix array

Suffix array

10.1016/0020-0190(96)00061-0

Citations (26)

A Fast Algorithm for Constructing Suffix Arrays for Fixed-Size Alphabets

Lecture notes in computer science (2004)

Dong Ki Kim, Jun-Ha Jo, Heejin Park

Generalized suffix tree

Compressed suffix array

Suffix array

Sequence (biology)

10.1007/978-3-540-24838-5_23

Citations (19)

Space efficient linear time construction of suffix arrays

Journal of Discrete Algorithms (2004)

Pang Ko, Srinivas Aluru

Generalized suffix tree

Compressed suffix array

Suffix array

Linear space

10.1016/j.jda.2004.08.002

Citations (182)

Indexing Genome with the External Construction of Compressed Suffix Tree Using LCP Array

Asian Journal of Engineering and Applied Technology (2013)

Vijay Kumar Vishwakarma, Abhishek Srivastava

We propose a genome indexing algorithm based on a compressed form of suffix trees, in which every node has four parts: a suffix array number, a suffix start number, an LCP count, and a pointer to another node. The proposed algorithm does not use the whole suffix array. It takes only the necessary information: the LCPs of two suffix arrays, which it compares; the suffix start number, which aligns each suffix to its proper position; and the suffix array number, which distinguishes among the partitions. Using the compressed suffix array minimizes the number of trees. It also minimizes random access to the input data, because the compressed suffix tree for two suffix arrays is built sequentially using pairwise sorting.
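The four-part node described in the abstract might be sketched as follows. This is only an illustration under assumptions: the field names, the `build_chain` helper, and the use of a naive sort are hypothetical and are not taken from the paper, which builds its structure externally and sequentially.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    array_id: int                  # which partition's suffix array this suffix belongs to
    start: int                     # start position of the suffix in the text
    lcp: int                       # longest common prefix with the previous node
    next: Optional["Node"] = None  # pointer to another node

def lcp(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:]."""
    k = 0
    while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k

def build_chain(text, starts, array_id):
    """Link the given suffixes in lexicographic order, recording LCP values."""
    ordered = sorted(starts, key=lambda s: text[s:])  # naive sort, for illustration only
    head = prev = None
    for s in ordered:
        node = Node(array_id, s, lcp(text, prev.start, s) if prev else 0)
        if prev:
            prev.next = node
        else:
            head = node
        prev = node
    return head

node = build_chain("ACGACG", range(6), array_id=0)
while node:
    print(node.start, node.lcp)
    node = node.next
```

The LCP field is what lets two such chains be merged without rereading whole suffixes from the input, which is the kind of saving in random access the abstract refers to.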

Compressed suffix array

Generalized suffix tree

Suffix array

10.51983/ajeat-2013.2.1.652

Citations (0)

Fast Construction of Suffix Arrays for DNA Strings

Journal of KIISE: Computer Systems and Theory (2007)

Jun-Ha Jo, Namhee Kim, Ki-Ryong Kwon, Dong-Kyue Kim

To search massive data such as DNA strings quickly, the most efficient method is to construct a full-text index data structure for the given strings. The most widely used full-text index structures are suffix trees and suffix arrays. Since the suffix array uses less space than the suffix tree, it is well suited to DNA strings. Previously developed suffix array construction algorithms are not suitable for DNA strings, since they are designed for integer alphabets. We propose a fast algorithm for constructing suffix arrays on DNA strings, whose alphabet size is fixed at 4. We reduce the construction time by improving the encoding and merging steps of the algorithm of Kim et al. [1]. Experimental results show that our algorithm constructs suffix arrays on DNA strings 1.3 to 1.6 times faster than the algorithm of Kim et al., and faster than other algorithms in most cases.
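A fixed alphabet of size 4 means each base fits in 2 bits, so several bases can be compared at once as a single integer. The sketch below shows that encoding idea with a naive sort. It is not the algorithm proposed in the paper or that of Kim et al.; the names `CODE`, `pack`, and `suffix_array` and the window size `k` are assumptions made for illustration.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base

def encode(dna):
    """Map a DNA string to a list of 2-bit codes."""
    return [CODE[base] for base in dna]

def pack(codes, i, k):
    """Pack k bases starting at position i into one integer, padding with 0 past the end."""
    value = 0
    for j in range(i, i + k):
        value = (value << 2) | (codes[j] if j < len(codes) else 0)
    return value

def suffix_array(dna, k=8):
    """Naive suffix array: sort by the packed k-base prefix, then by the full suffix."""
    codes = encode(dna)
    return sorted(range(len(codes)), key=lambda i: (pack(codes, i, k), codes[i:]))

print(suffix_array("GATTACA"))  # [6, 4, 1, 5, 0, 3, 2]
```

Padding with the smallest code keeps the packed key consistent with lexicographic order, so the full-suffix comparison is only needed to break ties.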

Compressed suffix array

Generalized suffix tree

Suffix array

Citations (0)

The Enhanced Suffix Array and Its Applications to Genome Analysis

Lecture notes in computer science (2002)

Mohamed Abouelhoda, Stefan Kurtz, Enno Ohlebusch

Generalized suffix tree

Compressed suffix array

Suffix array

Tree (set theory)

10.1007/3-540-45784-4_35

Citations (123)

Space Efficient Linear Time Construction of Suffix Arrays

Lecture notes in computer science (2003)

Pang Ko, Srinivas Aluru

Generalized suffix tree

Compressed suffix array

Suffix array

10.1007/3-540-44888-8_15

Citations (220)

Linear work suffix array construction

Journal of the ACM (2006)

Juha Kärkkäinen, Peter Sanders, Stefan Burkhardt

Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space-time tradeoff. For any v ∈ [1, √n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
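The difference cover concept named in the abstract can be checked with a few lines of Python. This sketch covers only the definition and the property that makes sampled suffixes useful for comparisons. It is not the DC3 construction itself, and the function names are hypothetical.

```python
def is_difference_cover(cover, v):
    """True if every residue modulo v is a difference of two elements of cover."""
    residues = {(a - b) % v for a in cover for b in cover}
    return residues == set(range(v))

def sample_positions(n, cover, v):
    """Text positions whose residue modulo v lies in the cover."""
    return [i for i in range(n) if i % v in cover]

def common_shift(i, j, cover, v):
    """Smallest shift l < v such that i + l and j + l are both sample positions."""
    for l in range(v):
        if (i + l) % v in cover and (j + l) % v in cover:
            return l
    return None  # cannot happen when cover is a difference cover modulo v

print(is_difference_cover({1, 2}, 3))    # True
print(sample_positions(10, {1, 2}, 3))   # [1, 2, 4, 5, 7, 8]
print(common_shift(0, 1, {1, 2}, 3))     # 1
```

Because such a shift always exists, any two suffixes can be compared by looking at fewer than v characters and then consulting the ranks of two sampled suffixes, which is what allows the sample to be sorted first and the rest derived from it.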

Compressed suffix array

Generalized suffix tree

Suffix array

10.1145/1217856.1217858

Citations (397)

Lempel–Ziv Factorization Using Less Time & Space

Mathematics in Computer Science (2008)

Gang Chen, Simon J. Puglisi, W. F. Smyth

Compressed suffix array

Generalized suffix tree

Suffix array