SimpleParse A Parser Generator for mxTextTools

SimpleParse is a BSD-licensed Python package providing a simple and fast parser generator using a modified (included) version of the mxTextTools text-tagging engine. SimpleParse allows you to generate parsers directly from your EBNF grammar.

Unlike most parser generators, SimpleParse generates single-pass parsers (there is no distinct tokenization stage), an approach taken from the predecessor project (mcf.pars) which attempted to create "autonomously parsing regex objects". The resulting parsers are not as generalized as those created by, for instance, the Earley algorithm, but they do tend to be useful for the parsing of computer file formats and the like (as distinct from natural language and similar "hard" parsing problems).

As of version 2.1.0 the SimpleParse project includes a patched copy of the mxTextTools tagging library with the non-recursive rewrite of the core parsing loop. This means that you will need to build the extension module to use SimpleParse, but the effect is to provide a uniform parsing platform where all of the features of a give SimpleParse version are always available.

SimpleParse is developed on GitHub and you can create pull requests for the project there. The author is not working on the project much other than as part of maintaining OpenGLContext, so please be patient with response times.

Documentation

Scanning with SimpleParse -- describes the process of creating a Parser object with your EBNF grammar, and using that parser to scan input texts
SimpleParse Grammars -- reference to the various features of the default SimpleParse EBNF grammar variant
Processing Result Trees -- brief description of the results of the tagging/scanning process and the features available for processing (and altering) those results
Common Problems -- description of a number of common bugs, errors, pitfalls and anti-patterns when using the engine.
IBM DeveloperWorks Article by Dr. David Mertz -- discusses (and teaches the use of) SimpleParse 1.0, contrasting the EBNF-based parser with tools such as regexen for text-processing tasks. Watch also for Dr. Mertz' upcoming book Text Processing with Python
mxTextTools documentation -- documents the underlying mxTextTools engine. Hopefully most users of SimpleParse who aren't actually creating custom/prebuilt parsing elements shouldn't need this link.
PyDoc references -- automatically generated documentation on the various elements within the package. Of particular interest are the library of reusable structures (simpleparse.common) and the Parser class, which is the primary interface for the parsing system.

Acquisition and Installation

You will need a copy of Python 2.7, 3.3 or above. If you are compiling the package you'll also need a C compiler compatible with your Python. Pre-built windows whls are available.

To install the SimpleParse engine:

$ pip install SimpleParse

Features/Changelog

New in 2.2.4

Python 3.11 support

New in 2.2.3

Python 3.9 support
Build process fixes

New in 2.2.1

Python 3.8 support
Python 3.x is primary support platform

New in 2.2:

Experimental Python 3.3+ support (thanks to Anthony Tuininga). Python 2.7 is still the recommended version.
Under Python 3.x the default is to generate unicode parsers; these are *much* slower than byte parsers
Test suite is now tox based
Small code cleanups throughout

New in 2.1.1:

Fixes for stubbed-in Unicode character ranges, unicode support is still very much experimental

New in 2.1.1a2:

Disable all of the mxDebugPrintf functionality, which should allow us to build on Win32 with Mingw32 for Python 2.6

New in 2.1.1a1:

Fixes to build under Python 2.6
Rename of simpleparse.xml to simpleparse.xmlparser to avoid conflicts with standard library "xml"
Eliminate use of .message on exceptions, as this has been deprecated in Python 2.6

New in 2.1.0a1:

Includes (patched) mxTextTools extension as part of SimpleParse, no longer uses stand-alone mxTextTools installations
Retooled setup environment to build and distribute directly from the CVS checkout
Bug-fixes in c_comment and c_nest_comment common productions (thanks to Stephen Waterbury), basic tests for the comment productions added

New in 2.0.1:

Bug fix in ISO Date Time parser test cases, was assuming Canadian EST timezone.
Bug fix for error-on-fail SyntaxError's when used with optional string message (2.0.1.a3)

diff -w -r1.4 error.py 32c32
< return '%s: %s'%( self.__class__.__name__, self.messageFormat(message) )
---
> return '%s: %s'%( self.__class__.__name__, self.messageFormat(self.message) )

Case-insensitive literal values declared with c"literal" in the default grammar (new in 2.0.1a2)
Significant optimisations to the generated parse tables, can result in huge speedups for very formal grammars

New in 2.0:

New, refactored and simplified API. Most of the time you only need to deal with a single class for all your interactions with the parser system (simpleparse.parser.Parser), and one module if you decide to use the provided post-processing mechanism (simpleparse.dispatchprocessor).
"Expanded Productions" -- allow you to define productions whose children are reported as if the enclosing production did not exist (allows you to use productions for organisational as well as reporting purposes)
Exposure of callout mechanism in mxTextTools
Exposure of "LookAhead" mechanism in mxTextTools (allows you to spell "is followed by", "is not followed by", or "matches x but doesn't match y" in SimpleParse EBNF grammars). Specified with the prefix ?, as in ?-"this" which matches iff "this" is not the next item, but on matching doesn't move the read-head forward (more precisely, it causes the engine to continue processing at the previous position).
"Error on fail" error-reporting facility, allows you to raise Parser Syntax Errors when particular productions (or element tokens) fail to match. Allow for fairly flexible error reporting. To specify, just add a '!' character after the element token that must match, or include it as a stand-alone token in a sequential group to specify that all subsequent items must succeed. You can specify an error message format by using a string literal after the ! character.
Library of common constructs (simpleparse.common package) which are easily included in your grammars
Hexidecimal escapes for string and character ranges
Rewritten generators -- the generator interface has been seperated from the parser interfaces, this makes it possible to write grammars directly using generator objects if desired, and allows defining the EBNF grammar using the same tools as generate derived parsers
An XML-Parser (including DTD parsing) based on the XML specification's EBNF (this is not a production parser, merely an example for parsing a complex file format, and is not yet Unicode capable)
Example VRML97 and LISP parsers
Compatability API for SimpleParse 1.0 applications
With the non-recursive mxTextTools, can process (albeit inefficiently) recursion-as-repetition grammars
Non-recursive rewrite of mxTextTools now ~95% of the speed of the recursive version

General

Simple-to-use interface, define an EBNF and start parsing
Fast for small files -- this is primarily a feature of the underlying TextTools engine, rather than a particular feature of the parser generator.
Allows pre-built and external parsing functions, which allows you to define Python methods to handle tricky parsing tasks

"Class" of Parsers Generated

Our (current) parsers are top-down, in that they work from the top of the parsing graph (the root production). They are not, however, tokenising parsers, so there is no appropriate LL(x) designation as far as I can see, and there is an arbitrary lookahead mechanism that could theoretically parse the entire rest of the file just to see if a particular character matches). I would hazard a guess that they are theoretically closest to a deterministic recursive-descent parser.

There are no backtracking facilities, so any ambiguity is handled by choosing the first successful match of a grammar (not the longest, as in most top-down parsers, mostly because without tokenisation, it would be expensive to do checks for each possible match's length). As a result of this, the parsers are entirely deterministic.

The time/memory characteristics are such that, in general, the time to parse an input text varies with the amount of text to parse. There are two major factors, the time to do the actual parsing (which, for simple deterministic grammars should be close to linear with the length of the text, though a pathalogical grammar might have radically different operating characteristics) and the time to build the results tree (which depends on the memory architecture of the machine, the currently free memory, and the phase of the moon). As a rule, SimpleParse parsers will be faster (for suitably limited grammars) than anything you can code directly in Python. They will not generally outperform grammar-specific parsers written in C.

Missing Features

SimpleParse does not current use an Earley or similar highly generalised parser, instead, it uses a simple deterministic parsing algorithm which, though fast for certain classes of problems, is incapable of dealing with ambiguity, backtracking or cross-references
The library of common patterns is extremely sparse
Unicode support
There is no analysis and only minimal reduction done on the grammar. Having now read most of Parsing Techniques - A Practical Guide, I can see how some fairly significant changes will be required to support such operations (and thereby the more common parsing techniques).

Possible Future Directions

Alternative parsing back-ends -- the new objectgenerator module is fairly well isolated from the rest of the system, and encompasses most of the dependencies on the mxTextTools engine. Adding an optional Earley or similar back-end should be possible with minimal upset to the project. A backend using re objects is another possibility (my precursor mcf.pars engine was written to use regexen for parsing, and was an acceptable (though not stellar) performer).
Alternative EBNF grammars -- SimpleParse's EBNF, though fairly readily understood, is not by any means the only EBNF variant, providing support for a number of EBNF variants would ease the job of porting grammars to the system.
More common/library code -- common data formats, HTML and/or SGML parsers

mxTextTools Rewrite Enhancements

Case-insensitive matching commands?
Backtracking support?

Alternate C Back-end?

Given the amount of effort poured into the mxTextTools engine, this may seem silly, but it would be nice to implement a more advanced parsing algorithm directly in C, without going through the assembly-like interface of mxTextTools. Given that Marc-André isn't interested in adopting the non-recursive codebase, there's not much point retaining compatability with mxTextTools, so moving to a more parser-friendly engine might be the best approach.

mxBase/mxTextTools Installation

NOTE: This section only applies to SimpleParse versions before 2.1.0, SimpleParse 2.1.0 and above include a patched version of mxTextTools already!

You will want an mxBase 2.1.0 distribution to run SimpleParse, preferably with the non-recursive rewrite. If you want to use the non-recursive implementation, you will need to get the source archive for mxTextTools. It is possible to use mxBase 2.0.3 with SimpleParse, but not to use it for building the non-recursive TextTools engine (2.0.3 also lacks a lot of features and bug-fixes found in the 2.1.0 versions).

Note: without the non-recursive rewrite of 2.1.0 (i.e. with the recursive version), the test suite will not pass all tests. I'm not sure why they fail with the recursive version, but it does argue for using the non-recursive rewrite.

To build the non-recursive TextTools engine, you'll need to get the source distribution for the non-recursive implementation from the SimpleParse file repository. Note, there are incompatabilities in the mxBase 2.1 versions that make it necessary to use the versions specified below to build the non-recursive versions.

Python 2.2.x, mxBase 2.1b5, non-recursive 1.0.0b4
Python 2.3.x, mxBase 2.1 August 2003 Shapshot, non-recursive 1.0.0b5+

This archive is intended to be expanded over the mxBase source archive from the top-level directory, replacing one file and adding four others.

cd egenix-mx-base-2.1.0
gunzip non-recursive-1.0.0b1.tar.gz
tar -xvf non-recursive-1.0.0b1.tar

(Or use WinZip on Windows). When you have completed that, run:

setup.py build --force install

in the top directory of the eGenix-mx-base source tree.

Copyright, License & Disclaimer

The 2.1.0 and greater releases include the eGenix mxTextTools extension:

Licensed under the eGenix.com Public License see the mxLicense.html file for details on licensing terms for the original library, the eGenix extensions are:

Copyright (c) 1997-2000, Marc-Andre Lemburg
Copyright (c) 2000-2001, eGenix.com Software GmbH

Extensions to the eGenix extensions (most significantly the rewrite of the core loop) are copyright Mike Fletcher and released under the SimpleParse License below:

SimpleParse License:

Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee or royalty is hereby granted, provided that the above copyright notice appear in all copies and that both the copyright notice and this permission notice appear in supporting documentation or portions thereof, including modifications, that you make.

THE AUTHOR MIKE C. FLETCHER DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE!