
Parser combinator

In computer programming, a parser combinator is a higher-order function that accepts several parsers as input and returns a new parser as its output. In this context, a parser is a function accepting strings as input and returning some structure as output, typically a parse tree or a set of indices representing locations in the string where parsing stopped successfully. Parser combinators enable a recursive descent parsing strategy that facilitates modular piecewise construction and testing. This parsing technique is called combinatory parsing.
Parsers built using combinators are straightforward to construct, readable, modular, well-structured, and easily maintainable. They have been used extensively in the prototyping of compilers and processors for domain-specific languages such as natural-language interfaces to databases, where complex and varied semantic actions are closely integrated with syntactic processing.
In 1989, Richard Frost and John Launchbury demonstrated the use of parser combinators to construct natural-language interpreters. Graham Hutton used higher-order functions for basic parsing in 1992, and S. D. Swierstra demonstrated the practical aspects of parser combinators in 2001. In 2008, Frost, Hafiz and Callaghan described a set of parser combinators in Haskell that solve the long-standing problem of accommodating left recursion, and work as a complete top-down parsing tool in polynomial time and space.
In any programming language that has first-class functions, parser combinators can be used to combine basic parsers to construct parsers for more complex rules. For example, a production rule of a context-free grammar (CFG) may have one or more alternatives, and each alternative may consist of a sequence of non-terminals and/or terminals, a single non-terminal or terminal, or the empty string. If a simple parser is available for each of these alternatives, a parser combinator can be used to combine these parsers, returning a new parser which can recognize any or all of the alternatives.
In languages that support operator overloading, a parser combinator can take the form of an infix operator, used to glue different parsers together to form a complete rule. Parser combinators thereby enable parsers to be defined in an embedded style, in code which is similar in structure to the rules of the formal grammar. As such, implementations can be thought of as executable specifications with all the associated advantages, notably readability.
To keep the discussion relatively straightforward, we discuss parser combinators in terms of recognizers only. If the input string is of length #input and its members are accessed through an index j, a recognizer is a parser which returns, as output, a set of indices representing positions at which the parser successfully finished recognizing a sequence of tokens that began at position j. An empty result set indicates that the recognizer failed to recognize any sequence beginning at index j. A non-empty result set indicates that the recognizer finished successfully at one or more positions. Note that there may be multiple distinct ways to parse a string while finishing at the same index: this indicates an ambiguous grammar. Simple recognizers do not acknowledge these ambiguities; each possible finishing index is listed only once in the result set.
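The recognizer described above can be sketched in Python as a function from an input string and a start position j to a set of finishing positions. This is a minimal illustration, not a reference implementation: the names `empty` and `term`, the 1-based positions, and the convention that a successful `term` returns j + 1 are all assumptions made for this sketch.

```python
from typing import Callable, FrozenSet

# A recognizer maps (input, start position j) to the set of positions at
# which recognition can finish. Positions are 1-based (an assumption of
# this sketch), so the token at position j is inp[j - 1].
Recognizer = Callable[[str, int], FrozenSet[int]]


def empty(inp: str, j: int) -> FrozenSet[int]:
    """Recognize the empty string: always succeed without consuming input."""
    return frozenset({j})


def term(x: str) -> Recognizer:
    """Build a recognizer for the single terminal x."""

    def recognize(inp: str, j: int) -> FrozenSet[int]:
        # Succeed only if position j lies inside the input and holds x.
        if 1 <= j <= len(inp) and inp[j - 1] == x:
            return frozenset({j + 1})
        # An empty result set signals failure at position j.
        return frozenset()

    return recognize
```

Because the result is a set, a finishing position reached in several different ways still appears only once, which is why a recognizer of this kind cannot report ambiguity.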
For a more complete set of results, a more complicated object such as a parse tree must be returned.
Two basic recognizers serve as building blocks: empty, which recognizes the empty string and returns the singleton set {j}, and term x, which recognizes the terminal x and returns {j + 1} if the token at position j is x and the empty set otherwise. Given two recognizers p and q, we can define two major parser combinators, for alternative and for sequencing. The alternative combinator, written <+>, applies p and q at the same position j and returns the union of their result sets. The sequencing combinator, written <*>, applies p at position j, then applies q at every position returned by p, and returns the union of the results of q.
Consider a highly ambiguous context-free grammar, s ::= 'x' s s | ε. Using these combinators, we can modularly define an executable notation for this grammar in a modern functional language (e.g. Haskell) as s = term 'x' <*> s <*> s <+> empty. When the recognizer s is applied to the input sequence xxxxx at position 1, with positions counted from 1, it returns the result set {1, 2, 3, 4, 5, 6}: the empty alternative succeeds at position 1 itself, and each consumed x adds the following position.
Parser combinators, like all recursive descent parsers, are not limited to the context-free grammars and thus do no global search for ambiguities in the LL(k) parsing First_k and Follow_k sets. Ambiguities therefore remain unknown until run time, and surface only if the input triggers them. In such cases, the recursive descent parser may default (perhaps unknown to the grammar designer) to one of the possible ambiguous paths, resulting in semantic confusion (aliasing) in the use of the language. This leads to bugs in programs written in ambiguous programming languages, which are not reported at compile time and which are introduced not by human error but by the ambiguous grammar. The only solution that eliminates these bugs is to remove the ambiguities and use a context-free grammar.
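The alternative and sequencing combinators, and the ambiguous grammar s ::= 'x' s s | ε, can be sketched in Python as follows. The function names `alt` and `seq` stand in for the infix operators <+> and <*> and are hypothetical, as are the 1-based positions and the convention that `empty` returns {j} and a successful `term` returns {j + 1}. The sketch uses plain recursion without memoization, so it is only suitable for short inputs.

```python
from typing import Callable, FrozenSet

# Recognizer: (input, 1-based start position) -> set of finishing positions.
Recognizer = Callable[[str, int], FrozenSet[int]]


def empty(inp: str, j: int) -> FrozenSet[int]:
    """Recognize the empty string at position j."""
    return frozenset({j})


def term(x: str) -> Recognizer:
    """Recognize the single terminal x."""

    def recognize(inp: str, j: int) -> FrozenSet[int]:
        if 1 <= j <= len(inp) and inp[j - 1] == x:
            return frozenset({j + 1})
        return frozenset()

    return recognize


def alt(p: Recognizer, q: Recognizer) -> Recognizer:
    """Alternative: apply p and q at the same position, union the results."""

    def recognize(inp: str, j: int) -> FrozenSet[int]:
        return p(inp, j) | q(inp, j)

    return recognize


def seq(p: Recognizer, q: Recognizer) -> Recognizer:
    """Sequencing: apply p, then apply q at every position p returned."""

    def recognize(inp: str, j: int) -> FrozenSet[int]:
        ends = set()
        for k in p(inp, j):
            ends |= q(inp, k)
        return frozenset(ends)

    return recognize


def s(inp: str, j: int) -> FrozenSet[int]:
    """s ::= 'x' s s | empty, built on each call so s can refer to itself."""
    return alt(seq(seq(term("x"), s), s), empty)(inp, j)


print(sorted(s("xxxxx", 1)))  # prints [1, 2, 3, 4, 5, 6]
```

Under the conventions assumed here, applying `s` at position 1 of xxxxx yields every position from 1 through 6. The recursion terminates because `term("x")` must consume a token before `s` is applied again. Each finishing position appears once even though the grammar can reach it through many different derivations.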

[ "Grammar", "Rule-based machine translation", "Syntax", "Parsing", "Compiler-compiler", "Memoization", "Chart parser", "LALR parser", "LR parser" ]