SimpleParse A Parser Generator for mxTextTools
SimpleParse is a BSD-licensed Python package
providing a simple and fast parser generator using a modified (included) version
of the mxTextTools
text-tagging engine. SimpleParse allows you to generate parsers
directly from your
EBNF grammar.
Unlike most parser generators, SimpleParse generates single-pass
parsers (there is no distinct tokenization stage), an
approach taken from the predecessor project (mcf.pars) which
attempted to create "autonomously parsing regex objects". The resulting
parsers are not as generalized as those created by,
for instance, the Earley algorithm, but they do tend to be useful for
the parsing of computer file formats and the like (as distinct from
natural language and similar "hard" parsing problems).
As of version 2.1.0 the SimpleParse project includes a patched copy
of the mxTextTools tagging library with the non-recursive rewrite of
the core parsing loop. This means that you will need to build the
extension module to use SimpleParse, but the effect is to provide a
uniform parsing platform where all of the features of a give
SimpleParse version are always available.
SimpleParse is developed on GitHub
and you can create pull requests for the project there.
The author is not working on the project much other
than as part of maintaining OpenGLContext, so please be patient with
response times.
Documentation
- Scanning with SimpleParse
-- describes the process of creating a Parser object with your EBNF
grammar, and using that parser to scan input texts
- SimpleParse Grammars --
reference to the various features of the default SimpleParse EBNF
grammar variant
- Processing Result Trees
-- brief description of the results of the tagging/scanning process and
the features available for processing (and altering) those results
- Common Problems -- description
of a number of common bugs, errors, pitfalls and anti-patterns when
using the engine.
- IBM
DeveloperWorks Article by Dr. David Mertz -- discusses (and teaches
the use of) SimpleParse 1.0, contrasting the EBNF-based parser with
tools such as regexen for text-processing tasks. Watch also
for Dr. Mertz' upcoming book Text Processing with Python
- mxTextTools
documentation -- documents the underlying mxTextTools engine.
Hopefully most users of SimpleParse who aren't actually
creating custom/prebuilt parsing elements shouldn't need this link.
- PyDoc references --
automatically generated documentation on the various elements within
the package. Of particular interest are the library of
reusable structures (simpleparse.common)
and the Parser class,
which is the primary interface for the parsing system.
Acquisition and Installation
You will need a copy of Python 2.7, 3.3 or above. If you are compiling
the package you'll also need a C compiler compatible with your Python.
Pre-built windows whls are available.
To install the SimpleParse engine:
$ pip install SimpleParse
Features/Changelog
New in 2.2.4
New in 2.2.3
- Python 3.9 support
- Build process fixes
New in 2.2.1
- Python 3.8 support
- Python 3.x is primary support platform
New in 2.2:
- Experimental Python 3.3+ support (thanks to Anthony Tuininga). Python 2.7 is still the recommended version.
- Under Python 3.x the default is to generate unicode parsers; these are *much* slower than byte parsers
- Test suite is now tox based
- Small code cleanups throughout
New in 2.1.1:
- Fixes for stubbed-in Unicode character ranges, unicode support is still very much experimental
New in 2.1.1a2:
- Disable all of the mxDebugPrintf functionality, which should allow us to build on Win32 with Mingw32 for Python
2.6
New in 2.1.1a1:
- Fixes to build under Python 2.6
- Rename of simpleparse.xml to simpleparse.xmlparser to avoid conflicts with standard library "xml"
- Eliminate use of .message on exceptions, as this has been deprecated in Python 2.6
New in 2.1.0a1:
- Includes (patched) mxTextTools extension as part of SimpleParse,
no longer uses stand-alone mxTextTools installations
- Retooled setup environment to build and distribute directly from
the CVS checkout
- Bug-fixes in c_comment and c_nest_comment common productions
(thanks to Stephen Waterbury), basic tests for the comment productions
added
New in 2.0.1:
- Bug fix in ISO Date Time parser test cases, was assuming Canadian
EST timezone.
- Bug fix for error-on-fail SyntaxError's when used with optional
string message (2.0.1.a3)
diff -w -r1.4 error.py
32c32
< return
'%s: %s'%( self.__class__.__name__, self.messageFormat(message) )
---
> return
'%s: %s'%( self.__class__.__name__, self.messageFormat(self.message) )
- Case-insensitive literal values declared with c"literal" in the
default grammar (new in 2.0.1a2)
- Significant optimisations to the generated parse tables, can
result in huge speedups for very formal grammars
New in 2.0:
- New, refactored and simplified API. Most of the time you only
need to deal with a single class for all your interactions with the
parser system (simpleparse.parser.Parser),
and one module if you decide to use the provided post-processing
mechanism (simpleparse.dispatchprocessor).
- "Expanded Productions" -- allow you to define productions whose
children are reported as if the enclosing production did not exist
(allows you to use productions for organisational as well as
reporting purposes)
- Exposure of callout
mechanism in mxTextTools
- Exposure of "LookAhead" mechanism in mxTextTools (allows you to
spell "is followed by", "is not followed by", or "matches x but
doesn't match y" in SimpleParse EBNF grammars). Specified with
the prefix ?, as in ?-"this" which matches iff "this" is not the next
item, but on matching doesn't move the read-head forward (more
precisely, it causes the engine to continue processing at the previous
position).
- "Error on fail" error-reporting facility, allows you to raise
Parser Syntax Errors when particular productions (or element tokens)
fail to match. Allow for fairly flexible error reporting.
To specify, just add a '!' character after the element token
that must match, or include it as a stand-alone token in a sequential
group to specify that all subsequent items must succeed. You can
specify an error message format by using a string literal after the !
character.
- Library of common constructs (simpleparse.common package)
which are easily included in your grammars
- Hexidecimal escapes for string and character ranges
- Rewritten generators -- the generator interface has been
seperated from the parser interfaces, this makes it possible to
write grammars directly using generator objects if desired, and
allows defining the EBNF grammar using the same tools as generate
derived parsers
- An XML-Parser (including DTD parsing) based on the XML
specification's EBNF (this is not a production parser, merely an
example for parsing a complex file format, and is not yet Unicode
capable)
- Example VRML97 and LISP parsers
- Compatability API for SimpleParse 1.0 applications
- With the non-recursive mxTextTools, can process (albeit
inefficiently) recursion-as-repetition grammars
- Non-recursive rewrite of mxTextTools now ~95% of the speed of the
recursive version
General
- Simple-to-use interface, define an EBNF and start parsing
- Fast for small files -- this is primarily a feature of the
underlying TextTools engine, rather than a particular feature of the
parser generator.
- Allows pre-built and external parsing functions, which allows you
to define Python methods to handle tricky parsing tasks
"Class" of Parsers Generated
Our (current) parsers are top-down, in that they work from the top
of the parsing graph (the root production). They are not, however,
tokenising parsers, so there is no appropriate LL(x) designation as far
as I can see, and there is an arbitrary lookahead mechanism that could
theoretically parse the entire rest of the file just to see if a
particular character matches). I would hazard a guess that they
are theoretically closest to a deterministic recursive-descent parser.
There are no backtracking facilities, so any ambiguity is handled by
choosing the first successful match of a grammar (not the longest, as
in most top-down parsers, mostly because without tokenisation, it would
be expensive to do checks for each possible match's length). As a
result of this, the parsers are entirely deterministic.
The time/memory characteristics are such that, in general, the time
to parse an input text varies with the amount of text to parse. There
are two major factors, the time to do the actual parsing (which, for
simple deterministic grammars should be close to linear with the length
of the text, though a pathalogical grammar might have radically
different operating characteristics) and the time to build the results
tree (which depends on the memory architecture of the machine, the
currently free memory, and the phase of the moon). As a rule,
SimpleParse parsers will be faster (for suitably limited grammars) than
anything you can code directly in Python. They will not generally
outperform grammar-specific parsers written in C.
Missing Features
- SimpleParse does not current use an Earley or similar highly
generalised parser, instead, it uses a simple deterministic parsing
algorithm which, though fast for certain classes of
problems, is incapable of dealing with ambiguity, backtracking or
cross-references
- The library of common patterns is extremely sparse
- Unicode support
- There is no analysis and only minimal reduction done on the
grammar. Having now read most of Parsing Techniques - A
Practical Guide, I can see how some fairly significant changes will
be required to support such operations (and thereby the more common
parsing techniques).
Possible Future Directions
- Alternative parsing back-ends -- the new objectgenerator module
is fairly well isolated from the rest of the system, and
encompasses most of the dependencies on the mxTextTools engine. Adding
an optional Earley or similar back-end should be
possible with minimal upset to the project. A backend using re
objects is another possibility (my precursor mcf.pars engine was
written to use regexen for parsing, and was an acceptable (though
not stellar) performer).
- Alternative EBNF grammars -- SimpleParse's EBNF, though fairly
readily understood, is not by any means the only EBNF variant,
providing support for a number of EBNF variants would ease the job
of porting grammars to the system.
- More common/library code -- common data formats, HTML and/or
SGML parsers
mxTextTools Rewrite Enhancements
- Case-insensitive matching commands?
- Backtracking support?
Alternate C Back-end?
- Given the amount of effort poured into the mxTextTools engine,
this may seem silly, but it would be nice to implement a more advanced
parsing algorithm directly in C, without going through the
assembly-like
interface of mxTextTools. Given that Marc-André isn't
interested
in adopting the non-recursive codebase, there's not much point
retaining compatability with mxTextTools, so moving to a more
parser-friendly engine might be the best approach.
mxBase/mxTextTools Installation
NOTE: This section only
applies to SimpleParse versions before 2.1.0, SimpleParse 2.1.0 and
above include a patched version of mxTextTools already!
You will want an mxBase
2.1.0 distribution to run SimpleParse, preferably with the
non-recursive rewrite. If you want to use
the non-recursive implementation, you will need to get the source
archive for mxTextTools. It is possible to use mxBase 2.0.3 with
SimpleParse,
but not to use it for building the non-recursive TextTools engine
(2.0.3 also lacks a lot of features and bug-fixes found in the 2.1.0
versions).
Note: without the non-recursive
rewrite of 2.1.0 (i.e. with the recursive version), the test
suite will not pass all tests.
I'm not sure why they fail with the recursive version, but it
does
argue for using the non-recursive rewrite.
To build the non-recursive TextTools engine, you'll need to
get the source distribution for the non-recursive implementation from
the SimpleParse
file repository. Note,
there are incompatabilities in the mxBase 2.1 versions that make it
necessary to use the versions specified below to build the
non-recursive versions.
This archive is intended to be expanded over the
mxBase source archive from the top-level directory, replacing one file
and
adding four others.
cd egenix-mx-base-2.1.0
gunzip non-recursive-1.0.0b1.tar.gz
tar -xvf non-recursive-1.0.0b1.tar
(Or use WinZip on Windows). When you have completed that, run:
setup.py build --force install
in the top directory of the eGenix-mx-base source tree.
Copyright, License & Disclaimer
The 2.1.0 and greater releases include the eGenix mxTextTools
extension:
Licensed under
the eGenix.com Public License see the mxLicense.html
file for details on
licensing terms for the original library, the eGenix extensions are:
Copyright (c) 1997-2000, Marc-Andre Lemburg
Copyright (c) 2000-2001, eGenix.com Software GmbH
Extensions to the eGenix extensions (most significantly the rewrite
of the core loop) are copyright Mike Fletcher and released under the
SimpleParse License below:
Copyright © 2003-2020, Mike Fletcher
SimpleParse License:
Copyright © 1998-2020, Copyright by
Mike C. Fletcher; All Rights Reserved.
mailto: mcfletch@users.sourceforge.net
Permission to use, copy, modify, and
distribute this software and
its documentation for any purpose and without fee or royalty is
hereby granted, provided that the above copyright notice
appear in all copies and that both the copyright notice and
this permission notice appear in supporting documentation or
portions thereof, including modifications, that you make.
THE AUTHOR MIKE C. FLETCHER DISCLAIMS ALL
WARRANTIES WITH REGARD TO
THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE AUTHOR
BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES
OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER
TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE!