Compiler Construction/A recipe for writing a reusable grammar
So you need to write a grammar. But there are several practical concerns that you will have to deal with:
- You need your parser to be usable from your language.
- You may also want it to be reusable from other languages.
- The language you use may have only very immature parser generators that are difficult to use for anything complex:
- It may completely lack debugging tooling.
- It may return very weird error messages even though your grammar looks OK.
- Even if everything compiles fine, the generated parser may crash at runtime.
- The parsers generated by convenient parser generators may be painfully slow.
So here is how we address these issues. The general idea is simple:
- do everything step-by-step;
- use the right tools.
1. Create enough simple samples of the language you want to parse.
2. Look at them. Ask yourself what the minimal class of your grammar must be. Is it possible to create a context-free grammar? If not, is it possible to reduce your grammar to a context-free one, resolving the non-context-free parts after the parse tree has been built?
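As a sketch of that last trick: XML-like nesting is context-free, but requiring a closing tag's name to match its opening tag is not. One can parse with a CFG that accepts any closing name and verify the names afterwards. The tree shape below is hypothetical, stdlib Python only:

```python
# Hypothetical illustration: a CFG parser accepted any </name> and produced
# element nodes {"open": ..., "close": ..., "children": [...]}. The
# non-context-free constraint (open == close) is checked in a post-pass.

def check_tags(node):
    """Recursively verify that open/close tag names match (post-parse)."""
    if isinstance(node, dict):  # an element node from the parse tree
        if node["open"] != node["close"]:
            raise ValueError(f"mismatched tags: <{node['open']}> vs </{node['close']}>")
        for child in node["children"]:
            check_tags(child)

# A parse tree as a CFG parser might have produced it (names unchecked):
tree = {"open": "a", "close": "a",
        "children": [{"open": "b", "close": "b", "children": []}]}
check_tags(tree)  # passes; a mismatched tree would raise ValueError
```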
3. Select the language you want to target. In my case it was Python.
4. Find parser generators for it. For each generator, find out the following:
- the class of grammars it can handle;
- whether it is lexerless (scannerless) or uses a tokenizer;
- whether it supports modular grammars, where each piece can be developed and tested separately;
- the availability of debugging and visualization tools:
- if the parser uses a tokenizer, it should be possible to print the token stream before parsing;
- a parse tree visualizer;
- the availability of libraries of pre-made grammars;
- the availability of good documentation and examples;
- whether it compiles into target-language code that parses directly, or into a data structure interpreted by a runtime;
- even if it is compiled to code, the code may still use some runtime. What are the efficiency and the dependencies of the runtime used?
- which other programming languages the parser generator supports.
5. Search the Internet for pre-made grammars. The grammar you need may already exist, and you only need to adapt it to your parser generator.
6. Select your initial parser generator. For now you need to write your grammar at a high level and make it work: your primary goal is to create a grammar that parses the strings correctly, not to fight with the generator.
- It should be GLR. You can lower the class and optimize later. This spares you from resolving conflicts.
- It should be tokenizerless. This spares you from defining distinct character classes for each token.
- It must have tools allowing you to trace its execution.
- It must return a parse tree, not build it using semantic actions. Semantic actions are language-specific and their syntax is often parser-specific. Some parser generators let you disambiguate using your own code, but this is not universal and the API is highly generator-specific.
- It must be possible to visualize the parse tree.
- It must support naming tokens and skipping irrelevant tokens.
I used parglare in GLR mode. parglare has a debugging tool that allowed me to see the transition graph, the possible paths through it, the path actually taken, and where an error occurred.
7. Select other parser generators. The criteria are the same, but they may be of another class. Your final goal is to make your grammar compatible with all of them and to wrap the resulting parse trees into an abstraction layer, so your hand-written code deals only with that layer. Changing to a parser generator of the same or a higher class is usually easy: just determine the mapping from one grammar DSL's syntax features to the other's.
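The abstraction layer can be as small as one uniform node type plus one adapter per generator. Everything below is a hypothetical sketch (tree shapes and names are made up, not UniGrammar's actual API):

```python
# Sketch: hand-written code talks to one uniform node type; a small adapter
# per parser generator converts that generator's native tree into it.

from dataclasses import dataclass, field
from typing import List

@dataclass
class UniNode:
    name: str                      # rule/terminal name
    text: str = ""                 # matched text for leaves
    children: List["UniNode"] = field(default_factory=list)

def from_tuples(node):
    """Adapter for one hypothetical generator whose trees are (name, payload) tuples."""
    name, payload = node
    if isinstance(payload, str):                       # a leaf node
        return UniNode(name, text=payload)
    return UniNode(name, children=[from_tuples(c) for c in payload])

# The same hand-written consumer works no matter which adapter built the tree:
tree = from_tuples(("name", [("nameSegment", "a"), ("nameSegment", "1")]))
assert [c.text for c in tree.children] == ["a", "1"]
```

Swapping parser generators then means writing one new adapter, not touching the consumer code.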
Downgrading the grammar class may require changing both the parser generator and the grammar. It is very important to keep the grammars for the different generators in sync, so that they have the same structure. Your goal is to write the grammar for one parser generator in such a way that it can be automatically transformed for the others. Keeping them in sync may mean that the grammar for the higher (or even the same) class gets unnecessarily uglier: different parser generators have different syntax sugar, so you have to use the common subset. It doesn't matter that the grammar gets a bit uglier; you can post-process the tree to make it more convenient to work with later, after it has been parsed. What matters is that each time you downgrade the grammar class, you:
- improve the explainability of error messages;
- increase the set of parser generators that can be used with your grammar (one can use an LL(1) grammar with PEG and LR(1) parser generators, but one cannot use an LR(1) grammar with an LL(1) parser generator, nor a GLR grammar with an LR(1) parser generator);
- increase the set of debugging tools available when developing the grammar further;
- make your grammar more useful, because more people can reuse it with the parser generators available for their languages;
- restrict your grammar to the features universal to all parser generators, so it will be easier to port further.
Fortunately, I have developed a tool for this called UniGrammar. It has the following features:
- its own YAML-based format (in fact, any format that can serialize JSON-serializable objects can be used), so one doesn't need a dedicated grammar just to parse and serialize it;
- it generates grammars for the other DSLs;
- it also generates wrappers (only in Python for now) that transform parser-specific parse trees into trees in our own format, which are meant to be identical for the same texts parsed by different parsers generated from the same UniGrammar grammar;
- it provides a unified CLI to test the grammar using any subset of the supported parsers;
- it provides a library that may make plugging third-party tools into UniGrammar easier;
- it has a modular architecture, allowing support for new parser generators to be added easily;
- it has so-called templates. These are high-level constructions transformed into lower-level grammar chunks. For example, you can specify that you have a list of items, each of which is specified using some (non)terminal, with separators specified using another (non)terminal between them, and the tool will generate the rules needed for that, plus a wrapper allowing you to get the results as a real Python list, so your code won't have to treat the first item specially.
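To illustrate the list-template idea (with made-up rule names and tree shapes, not UniGrammar's actual output): a separated list typically lowers to two rules, and the generated wrapper flattens the nested tree back into a flat list:

```python
# Sketch: what a "list with separators" template might expand to, and the
# wrapper that undoes the nesting. All names here are hypothetical.

def expand_list_template(name, item, sep):
    """Generate the two rules a separated list typically lowers to."""
    return [
        f"{name}: {item} {name}Rest*;",
        f"{name}Rest: {sep} {item};",
    ]

def flatten_list(tree):
    """Wrapper side: turn a nested (first_item, [(sep, item), ...]) tree into a flat list."""
    first, rest = tree
    return [first] + [item for _sep, item in rest]

print(expand_list_template("args", "expr", "comma"))
# ['args: expr argsRest*;', 'argsRest: comma expr;']
assert flatten_list(("a", [(",", "b"), (",", "c")])) == ["a", "b", "c"]
```

Thanks to the wrapper, consumer code never has to special-case the first item of the list.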
The rest of this article is written with an emphasis on using this tool.
8. Write a high-level GLR grammar.
- Resolve all the errors and warnings emitted by the generator.
- Debug it. Use tracing and parse tree visualization.
- Make it work, in any way, on the simplest example.
- Then make it work on all the examples.
- Then make it return the parse tree you want on all the examples.
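A minimal harness for the "all the examples" loop can collect every failure instead of stopping at the first one. Here parse() is just a stand-in for your generated parser:

```python
# Run the (hypothetical) parser over every sample and report all failures.

samples = {
    "simplest": "a",
    "two items": "a, b",
}

def parse(text):
    """Stand-in for your generated parser; raises on a parse error."""
    if not text.strip():
        raise ValueError("empty input")
    return text.split(", ")  # placeholder "parse tree"

failures = []
for label, text in samples.items():
    try:
        parse(text)
    except Exception as exc:
        failures.append((label, exc))

assert not failures, f"these samples no longer parse: {failures}"
```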
9. Optimize the grammar to make it look clean. Separate tokens from productions. Debug and test it.
10. Deoptimize the grammar:
- if the tokenizer uses regexps, replace the regexps with tokens and productions. Most generators don't support regexps as tokens. This may degrade performance (regexps are executed in native code, while the parser is executed in Python), but universality is more valuable here;
- split groups into separate tokens/productions. Rationale: some generators don't have them;
- separate character classes from the tokens consisting of them. Rationale: in some generators (more precisely, LL(1) ones) tokens cannot repeat the same characters as other tokens, otherwise they cannot be disambiguated. So all the tokens have to be partitioned into distinct character classes. What was once a single token will now be multiple tokens within a production. For example (in parglare syntax),
terminals
Something: /[^0-9]+/;
Name: /[a-zA-Z_0-9-]+/;
must be transformed into
somethingSegment: NonDigitNonAlphaNum | Alpha;
nameSegment: Digit | Alpha;
name: nameSegment+;
something: somethingSegment+;

terminals
Alpha: /[a-zA-Z_-]/;
NonDigitNonAlphaNum: /[^0-9a-zA-Z_-]/;
Digit: /[0-9]/;
This will be beneficial when transforming to LL. UniGrammar, since it must be able to generate LL grammars, enforces this structure in its own syntax. (Note: this doesn't mean that you can specify only LL grammars in UniGrammar.)
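Such a rewrite can be sanity-checked with plain regexps: the new character classes must be pairwise disjoint, and their unions must accept exactly the characters the original tokens did. A quick check in stdlib Python:

```python
import re

# The three partitioned classes from the example above.
ALPHA = re.compile(r"[a-zA-Z_-]")
NON_DIGIT_NON_ALNUM = re.compile(r"[^0-9a-zA-Z_-]")
DIGIT = re.compile(r"[0-9]")

# name = (Digit | Alpha)+ should accept exactly what /[a-zA-Z_0-9-]+/ did.
name_orig = re.compile(r"[a-zA-Z_0-9-]+")

for ch in map(chr, range(32, 127)):  # all printable ASCII characters
    classes = [bool(p.fullmatch(ch)) for p in (ALPHA, NON_DIGIT_NON_ALNUM, DIGIT)]
    assert sum(classes) == 1, f"classes overlap or miss {ch!r}"  # a true partition
    # membership in the rewritten 'name' equals membership in the original token
    assert (classes[0] or classes[2]) == bool(name_orig.fullmatch(ch))
```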
- separate the stuff that is captured into its own productions. Rationale: some parser generators don't have captures, so we have to implement our own surrogate: we save a mapping of the parser-specific AST node names that must be captured and recover the capture groups from them in a post-processing step.
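A sketch of such a surrogate (node shapes and names are hypothetical): keep a mapping from node names to capture names, and walk the finished tree to recover the captures:

```python
# The generator has no capture groups, so captures are recovered post-parse
# from a "node name -> capture name" mapping.

CAPTURES = {"name": "identifier"}  # node names whose text we want captured

def recover_captures(node, out=None):
    """Collect captured text from a (name, children_or_text) tree post-parse."""
    if out is None:
        out = {}
    name, payload = node
    if name in CAPTURES and isinstance(payload, str):
        out.setdefault(CAPTURES[name], []).append(payload)
    if isinstance(payload, list):
        for child in payload:
            recover_captures(child, out)
    return out

tree = ("stmt", [("keyword", "let"), ("name", "x")])
assert recover_captures(tree) == {"identifier": ["x"]}
```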
Test the grammar after each step.
11. Commit the grammar file into the repo.
12. Translate the grammar from the parser-generator-specific DSL into the UniGrammar DSL. Compile the grammar for your parser generator. Compare the generated file to the one you have hand-crafted. Eliminate the insignificant differences, decide what to do with the significant ones, and do it. Commit again. Repeat until the two files are the same. Then remove the original hand-crafted file from the repo.
13. Test your grammar using
UniGrammar test ....
14. Downgrade the grammar to LR(1) (I used the same parglare in LR mode), resolve all the compilation issues, and debug and improve it again.
15. Port the grammar to PEG. I used waxeye. It doesn't have any debugging tools, but since it is an interpreter, and a pretty simple one, I was able to embed some tracing code into its Python runtime. Debug, synchronize, fix the code, and test.
Make sure that all the backends work as intended, and fix them and the grammars until everything works.
16. Downgrade the grammar to LL(1). This may require further resolution of conflicts detected at compile time, in the same way as before (some of the conflicts reported here at compile time are not conflicts in LL(*)). I used a fork of CoCoPy, a variant of w:Coco/R. It has very weird error messages and almost no debugging tools. The only reason I was able to make this grammar work with Coco/R is its similarity to the other parser generators: adapting the grammar to them also adapted it to Coco/R.
17. Optimize, refactor, clean, test everything.
18. Add it to the repositories of grammars so other people can use it.