Text Scanning with SimpleParse 2.0

SimpleParse 2.0 provides a parser generator which converts an EBNF grammar into a run-time parser for use in scanning/marking up texts. This document describes the process of developing and using an EBNF grammar to perform the text-scanning process.

Prerequisites:

Python 2.x programming
Some familiarity with EBNF grammars and other parsing terminology

Creation of a Simple Grammar

The primary function of SimpleParse is to convert an EBNF grammar into an in-memory object which can do the work of scanning (and potentially processing) data which conforms to that grammar. Therefore, to use the system effectively, we need to be able to create grammars.

For our first experiment, we'll define a simple grammar for use in parsing an INI-file-like format. Users of SimpleParse 1.0 will recognise the format from the original documentation. This version uses somewhat more features (and is shorter as a result) than was easily accomplished with SimpleParse 1.0.

Here's the grammar definition:

____ simpleexample2_1.py ____

from simpleparse.common import numbers, strings, comments

declaration = r'''# note use of raw string when embedding in python code...
file           :=  [ \t\n]*, section+
section        :=  '[',identifier!,']'!, ts,'\n', body
body           :=  statement*
statement      :=  (ts,semicolon_comment)/equality/nullline
nullline       :=  ts,'\n'
equality       :=  ts, identifier,ts,'=',ts,identified,ts,'\n'
identifier     :=  [a-zA-Z], [a-zA-Z0-9_]*
identified     :=  string/number/identifier
ts             :=  [ \t]*
'''

The first line incorporates a new feature of SimpleParse 2.0: the ability to automatically include libraries of commonly used productions (rules/patterns/grammars) and, incidentally, to build your own. By importing these three modules, I've made the productions “string”, “number” and “semicolon_comment” (among others) available to all the Parser instances I create for the rest of this session.

New Feature Note: The identifier! and ']'! element tokens in the "section" production tell the parser generator to report a ParserSyntaxError if we attempt to parse these element tokens and fail. We could also have spelled this particular segment of the grammar:

section        :=  '[',!,identifier,']', ts,'\n', body

which spelling is often easier to use in complex grammars.

If you are not familiar with EBNF grammars, or would like a reference to the various features of the SimpleParse grammar, please see “SimpleParse Grammars”. We will assume that you understand the grammars being presented.

Checking a Grammar

SimpleParse does not have a separate compilation step, but it's useful as you're writing your grammar to set up tests both for whether the grammar itself is syntactically correct, and for whether the productions match the values you expect them to (and don't match those you don't want them to).

To check that a grammar is syntactically correct, the easiest approach is to attempt to create a Parser with the grammar. The Parser will complain if your grammar is syntactically incorrect, generating a ValueError which reports the last line of the declaration which parsed correctly, and the remainder of the declaration.

from simpleparse.parser import Parser
parser = Parser( declaration)

If, for example, you had left out a comma in the “section” production between the literal ']' and ts, you would get an error like so:

S:\sp\simpleparse\examples>bad_declaration.py
Traceback (most recent call last):
  File "S:\sp\simpleparse\examples\bad_declaration.py", line 21, in ?
    parser = Parser( declaration, "file" ) # will raise ValueError
  File "S:\sp\simpleparse\parser.py", line 34, in __init__
    definitionSources = definitionSources,
  File "S:\sp\simpleparse\simpleparsegrammar.py", line 380, in __init__
    raise ValueError(
ValueError: Unable to complete parsing of the EBNF, stopped at line 3 (134 chars
 of 467)
Unparsed:
ts,'\n', body
body           :=  statement*
statement      :=  (ts,semicolon_comment)/equality/nulll...

You can see this for yourself by running examples/bad_declaration.py.

If your grammar is correct, Parser( declaration) will simply create the underlying generator objects which can produce a parser for your grammar. If you want to check that a particular production has all of its required sub-productions, you can call myparser.buildTagger( productionname ), but I normally leave that test to be caught during the “production checking” phase below.
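When the grammar check runs as part of a test suite, it can be handy to turn the ValueError into a value you can report rather than a traceback. The following is only a minimal sketch: the helper name grammar_error is hypothetical, and it takes the parser class as an argument so that the sketch itself does not need SimpleParse to be importable. In real use you would pass simpleparse.parser.Parser together with the declaration string shown earlier.

```python
def grammar_error(parser_factory, declaration):
    """Return None if the grammar builds, else the ValueError message.

    parser_factory is any callable taking the declaration string;
    with SimpleParse this is assumed to be simpleparse.parser.Parser,
    which (per the text above) raises ValueError on a malformed grammar.
    """
    try:
        parser_factory(declaration)
    except ValueError as err:
        return str(err)
    return None

# Intended use (requires SimpleParse to be installed):
#   from simpleparse.parser import Parser
#   message = grammar_error(Parser, declaration)
#   if message is not None:
#       print(message)
```

Returning the message instead of re-raising keeps the "Unparsed:" portion of the error available for a test report, which is usually the quickest pointer to the faulty line.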

Checking a Production

Now that we have our Parser object, and know that the grammar is syntactically correct, we can test that our productions match/don't match the values we expect. Depending on your particular philosophy, this may be done using the unittest module, or merely as informal tests during development.

In our grammar above, let's try checking that the equality production really does match some values we expect it to match:

testEquality = [
	"s=3\n",
	"s = 3\n",
	'''  s="three\\nthere"\n''',
	'''  s=three\n''',
]

production = "equality"

for testData in testEquality:
	# parse() returns (success flag, result tree, index of the next unparsed character)
	success, children, nextcharacter = parser.parse( testData, production=production)
	# a full match must succeed and consume the entire test string
	assert success and nextcharacter==len(testData), """Wasn't able to parse %s as a %s (%s chars parsed of %s), returned value was %s"""%( repr(testData), production, nextcharacter, len(testData), (success, children, nextcharacter))

You should be prepared to have those tests fail a few times. It's easy to miss the effect of a particular feature of your grammar (such as the inclusion of “newline” in the equality production above). It took three tries before I got the tests above properly defined. Setting up your tests within an automated framework such as unittest is probably a good idea. It's also a good idea to set up tests that check that values which shouldn't match don't.
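A minimal sketch of pairing positive and negative checks follows. The helper name check_production and the StandInParser class are hypothetical and not part of SimpleParse: the stand-in only imitates the (success, children, nextcharacter) return value of parser.parse( ) with a rough regular expression, so that the sketch runs without SimpleParse installed. In practice you would pass the Parser( declaration) object created above.

```python
import re

def check_production(parser, production, should_match, should_fail):
    """Return a list of messages describing unexpected results."""
    problems = []
    for text in should_match:
        success, children, nextchar = parser.parse(text, production=production)
        # a full match must succeed and consume the whole string
        if not (success and nextchar == len(text)):
            problems.append("%r should match %s (%s of %s chars parsed)" % (
                text, production, nextchar, len(text)))
    for text in should_fail:
        success, children, nextchar = parser.parse(text, production=production)
        if success and nextchar == len(text):
            problems.append("%r should NOT match %s" % (text, production))
    return problems

class StandInParser(object):
    """Hypothetical stand-in for a SimpleParse Parser (illustration only)."""
    # rough regex approximation of the "equality" production
    patterns = {
        "equality": re.compile(
            r"[ \t]*[a-zA-Z][a-zA-Z0-9_]*[ \t]*=[ \t]*[^ \t\n]+[ \t]*\n"),
    }
    def parse(self, data, production=None):
        match = self.patterns[production].match(data)
        if match:
            return 1, [], match.end()
        return 0, [], 0

problems = check_production(
    StandInParser(), "equality",
    should_match=["s=3\n", "s = 3\n", "  s=three\n"],
    should_fail=["s=3", "=3\n", "3s = 4\n"],  # no newline, no name, bad name
)
for message in problems:
    print(message)
```

Collecting messages instead of asserting on the first failure shows every mismatch in one run; the same helper can be called from a unittest test method that asserts the returned list is empty.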

Note: You may receive an error message from the parser.parse( ) call saying that a particular production name isn't defined within the grammar. You'll need to figure out why that name isn't there (did you include the common module you were planning to use, or did you mis-type a name somewhere?) and correct the problem before the tests will run. This error serves as a check that the production has all required sub-productions (as noted in the previous section).

Scanning Text with the Grammar

You saw the basic approach to parsing in the section on testing above, but there are a few differences when you're creating a “real world” parser. The first is that you will likely want to define a default root production for the parser. In the examples above, the “root” was specified explicitly during the call to parse to allow us to test any of the productions in the grammar. In normal use, you don't want users of your parser to need to know what production is used for parsing a buffer, so you provide a default in the Parser's initialiser:

parser = Parser( declaration, "file" )
parser.parse( testData)

Note: the root is treated differently than all other productions, as it doesn't return a result-tuple in the results tree, but instead governs the overall operation of the parser, determining whether it “succeeds” or “fails” as a whole. The children of the root production produce the top-level results of the parsing pass.

You can see the result tree returned from the parse method by running examples/simpleexample2_3.py. You can read about how to process the results tree in “Processing Result Trees”.
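To give a feel for what the top-level results look like, here is a small sketch that walks a result tree depth-first. It assumes each result-tuple has the layout (tag, start, stop, children), with children either a list or None. The sample tree is hand-written and abbreviated for illustration; it is not actual SimpleParse output, and the function name walk is hypothetical.

```python
def walk(tree, text, depth=0):
    """Yield (depth, tag, matched_text) for every node, depth-first."""
    for tag, start, stop, children in tree:
        yield depth, tag, text[start:stop]
        # children is assumed to be a list or None
        for item in walk(children or [], text, depth + 1):
            yield item

text = "[s]\na=3\n"

# Hand-written, abbreviated tree (an assumption, not real parser output).
# The root production "file" does not appear; its children are top-level.
sample_tree = [
    ("section", 0, 8, [
        ("identifier", 1, 2, []),
        ("body", 4, 8, [
            ("statement", 4, 8, [
                ("equality", 4, 8, [
                    ("identifier", 4, 5, []),
                    ("identified", 6, 7, [
                        ("number", 6, 7, None),
                    ]),
                ]),
            ]),
        ]),
    ]),
]

nodes = list(walk(sample_tree, text))
for depth, tag, matched in nodes:
    print("%s%s %r" % ("  " * depth, tag, matched))
```

Because the tuples carry offsets rather than copies of the text, the original buffer has to be kept around to recover the matched substrings.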
