Crate pars

Crate pars 

Source
Expand description

§Parser-combinator library for Rust.

pars is a general purpose parser-combinator library, with support for no_std. pars is heavily inspired by nom, but pars aims to generalize more over different kinds of input and provide more Rust-idiomatic and ergonomic combinators. pars also provides direct support for higher level concepts such as Unicode and regular expressions.

§Contents

§Example

use pars::prelude::*;
use pars::ErrorKind;
use pars::unicode::{
    strict::{char as uchar, verbatim},
    UnicodeInput as UInput, PResult,
};

#[derive(Debug, PartialEq)]
struct Color {
    red: u8,
    green: u8,
    blue: u8,
}

fn hex_digit<I: UInput>(input: I) -> PResult<u8, I> {
    uchar.try_map(|ch| match ch {
        '0'..='9' => Ok((ch as u8) - b'0'),
        'a'..='f' => Ok((ch as u8) - b'a' + 10),
        'A'..='F' => Ok((ch as u8) - b'A' + 10),
        _ => Err(ErrorKind::InvalidInput),
    }).parse(input)
}

fn hex_byte<I: UInput>(input: I) -> PResult<u8, I> {
    hex_digit.array().map(|[d0, d1]| (d0 << 4) | d1).parse(input)
}

fn hex_color<I: UInput>(input: I) -> PResult<Color, I> {
    prefix(verbatim("#"), hex_byte.array())
        .map(|[r, g, b]| Color {
            red: r,
            green: g,
            blue: b,
        })
        .parse(input)
}

fn main() {
    assert_eq!(
        hex_color.parse("#2F14DF"),
        Ok(Success(Color {
            red: 0x2f,
            green: 0x14,
            blue: 0xdf,
        }, ""))
    );
}

§Parser Combinators

Classically, parsers are defined as a grammar that parser generators, such as lex and yacc, will transpile to an implementation of a state machine. This generally produces the most efficient parsers, to be sure. Some popular parser generators for Rust include pest, LALRPOP, and peg.

A parser combinator, on the other hand, is generally a library of parser building blocks for building a recursive descent parser. The trade off here is that we gain the ability to write plain Rust (in the case of pars) as well the ability to easily reuse parsers to build more complex parsers, but at the costs associated with recursive descent parsers. The downside of recursive descent parsers is that there is no limit to how far back the parser may need to backtrack in the input stream during parsing, and thus are generally not as efficient as parser produced by parser generators. But note that, for example, most practical compilers are implemented with a recursive descent parser because of their power and flexibility, despite not being the most efficient approach. Also note that in some cases, it is possible define a complex parser without any backtracking with a little extra effort. See the example msgpack_alt.rs in the crate repo for an example of a parser with no backtracking.

§Parser Input

In pars a parser is defined as a function that accepts a stream of symbols as a parameter, and returns either a parse success or parse failure. A stream of symbols is generalized by the Input trait. Input is very similar to the Iterator trait in the Rust standard library, but it has some extra properties. Input must also implement Clone, and it must be possible to test whether two Input objects are at the same stream position. Note that for good performance, the Clone implementation should be cheap, but it is not required that it implement Copy.

Input is automatically implemented for slice references, &[T] and &str. To illustrate the properties of an Input consider input: &[u8]. Advancing the input is equivalent to input = &input[1..]. Testing that two inputs are at the same position is just testing that the underlying pointers are equal, input.as_ptr() == other_input.as_ptr(). Also, &[u8] clearly has a cheap Clone implementation.

See the documentation for Input for more details.

pars also provides the IntoInput trait, which is similarly analogous to the IntoIterator trait in the Rust standard library. The Parse::parse function accepts a value implementing IntoInput instead of Input for the convience of the caller. For example, &Vec<u8>, &Box<[u8]>, &[u8; N] all implement IntoIterator and will convert to the input type &[u8].

See the documentation for IntoInput for more details.

Since Input does not have any constraints on what a “symbol” can be, pars can be adapted to consume pretty much any kind of input stream. For example, in some cases it might be desirable to build a lexer (or scanner) to interpret the source input as a stream of tokens, and parse the token stream with pars instead of the source input. Sufficiently complex parsers, such as compiler frontends, generally prefer this approach. The lexer just needs to provide an interface with a type that implements Input. Note that it will usually be necessary for the lexer (or whatever the custom input source is) to be “multi-pass” in order to support parser backtracking.

TODO: Include a custom input example.

§Defining a Parser

Most user defined parsers should be implemented as a plain function. Types implementing Fn(I) -> PResult<T, I, E>, where I implements Input and E implements Error<I> automatically implement Parse<I, Parsed = T, Error = E>. Closures can also be used to define a parser inline, but note that occaisionally closures may need some type hinting assistance to compile properly as a parser.

use pars::{Input, PResult, Success, bytes::Error, Parse};

// `my_parser` implements `Parse<I, Parsed = i32, Error = Error<I>>`
fn my_parser<I: Input>(input: I) -> PResult<i32, I, Error<I>> {
    Success(42, input).into()
}

When needed, Parse can always be implemented manually.

§Defining a Combinator

Occaisionly, there is a need to implement a custom combinator for reuse. The simplest way is to define a function returning impl Parse.

use pars::{basic::array, Input, Parse};

// Creates a parser that applies `parser` twice and
// returns the parsed values as an array of length 2.
const fn twice<I, P>(parser: P)
    -> impl Parse<I, Parsed = [P::Parsed; 2], Error = P::Error>
where
    I: Input,
    P: Parse<I>,
{
    array(parser)
}

It is usually necessary for the impl Parse return to explicitly define the Parsed and Error associated types, in order for type inference to work properly in conjunction with other combinators.

§Parsing Errors

pars does not define a general purpose error for parsing failures. Instead, it provides the Error trait. Users can define their own parsing error types, and pars provides a basic error type for each of the modules bytes, ascii, and unicode.

A parsing error, aside from other error information it communicates, contains the input position at which the error occurred. The generic parsers and combinators defined throughout pars use the interface described by the Error trait to instantiate an error as needed, also providing where the error occurred. Note that the input position described by an Error may be different from the input position described by a Failure within a PResult. The Failure always describes the remaining input that has not been successfully parsed, but the error may have occurred further into the input and caused the parser to backtrack.

An important partner to the Error trait is the ErrorSeed trait. Essentially, a type implementing ErrorSeed is a parsing error without the accompanying input position of the error. Such a type can be combined with an input position via the ErrorSeed trait to create an actual error.

Many combinators defined in pars, particularly those named try_*, require the user to provide a function or closure that returns a Result. The Err variant of that Result must be convertible to an Error via ErrorSeed. The reason for this is combinators generally abstract away the input stream, letting the user focus more on how parsers fit together and produce a parsed value. When some intermediate operation is fallible, being able to return an ErrorSeed simplifies matters by letting the combinator implementation deal with tracking input positions.

A typical user defined error will be defined as a pair of types: One implementing ErrorSeed and another implementing Error.

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
struct MyErrorKind(&'static str);

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct MyError<I: Input> {
    msg: &'static str,
    pos: I,
}

impl<I: Input> ErrorSeed<I, MyError<I>> for MyErrorKind {
    fn into_error(self, pos: I) -> MyError<I> {
        MyError {
            msg: self.0,
            pos,
        }
    }
}

impl<I: Input> Error<I> for MyError<I> {
    fn need_more_input(pos: I) -> Self {
        MyErrorKind("need more input").into_error(pos)
    }

    fn expected_eof(pos: I) -> Self {
        MyErrorKind("expected end of input").into_error(pos)
    }

    fn invalid_input(pos: I) -> Self {
        MyErrorKind("invalid input").into_error(pos)
    }

    fn position(&self) -> &I { &self.pos }
}

For convenience, ErrorKind is available for use as an ErrorSeed type that can be converted to any type implementing Error.

When combining a user defined error with parsers that use a different error type, such as bytes::u8, there are a few ways this can be dealt with. The simplest way is to use Parse::err_into.

// Given `MyError` defined previously.
impl<I: Input> From<bytes::Error<I>> for MyError<I> {
    fn from(err: bytes::Error<I>) -> Self {
        match err.kind() {
            bytes::ErrorKind::NeedMoreInput => Self::need_more_input(err.position().clone()),
            bytes::ErrorKind::ExpectedEof => Self::expected_eof(err.position().clone()),
            _ => Self::invalid_input(err.position().clone()),
        }
    }
}

fn my_u8<I: bytes::ByteInput>(input: I) -> PResult<u8, I, MyError<I>> {
    // Changes `bytes::u8` to return `MyError` instead of `bytes::Error`
    bytes::u8.err_into().parse(input)
}

When relying on a From implementation doesn’t work Parse::map_err and Parse::map_res are also available to define error conversions separately for each instance.

§Features

Note that there is no std feature, since pars never makes use of types that are only available in std.

  • alloc - Enable usage of the alloc crate. This enables IntoInput impls for types such as &Vec<[T]> and &String. Enabled by default.
  • bytes - Enable the bytes module, which provides parsers and utilities for byte streams.
  • unicode - Enable the unicode module, which provides parsers and utilities for UTF-8, UTF-16, and UTF-32 streams.
  • ascii - Enable the ascii module, which provides parsers and utilities for ASCII character streams.

Modules§

asciiascii
basic
Generic parser and combinator building blocks.
bytesbytes
prelude
The pars prelude.
unicodeunicode

Structs§

Failure
Type returned by a parser when parsing fails.
Span
A slice of parser input.
Success
Type returned by a parser when parsing succeeds.

Enums§

ErrorKind
A type implementing ErrorSeed for all error and input types.

Traits§

Error
A parsing error.
ErrorSeed
A parsing error descriptor that can be used to build a parsing error.
Input
A parsable symbol stream.
IntoInput
Trait for types that can be directly converted into an Input type.
PResultExt
Additional convenience methods for PResult.
Parse
Trait implemented by all parsers.

Type Aliases§

PResult
The Result type returned by a parser.