haqistan

⋔ Home ⋔ About ⋔ Hire Me ⋔ Links ⋔ License ⋔ Archives ⋔ RSS ⋔

Adventures in Ports: Parsing Expression Grammars

924 words by attila written on 2015-03-11, last edit: 2015-10-13, tags: markdown, open-source, openbsd, ports, text ⋔ Previous post: Fixed ⋔ Next post: Adventures in Ports: Tor Browser

Even in just doing the simple stuff I've tried so far in OpenBSD ports it is frequently an adventure that leads to good and unexpected places. I believe they call that fun or something.

My first idea was to do an OpenBSD port of Fletcher Penny's multimarkdown, a markdown derivative. It adds footnotes, tables and metadata. The latter is especially important because it allows for much greater flexibility and power when translating from !MultiMarkdown into other formats. When used with the memoir LaTeX environment you can do some pretty sophisticated stuff. For instance, here's my skeletal template for a book written in MultiMarkdown:

    Title: Am I There?
    Author: Portnoy Semantic
    Date: DD-MMM-YYYY
    Base Header Level: 2
    latex mode:         book
    latex input:        mmd-tufte-book-header
    latex input:        mmd-tufte-book-begin-doc
    latex input:        mmd-tufte-book-footer

# First Glance #
    
A snark glowered in the window.  Ravens crowed and crowed until the
sun came out, then went away.  All was darkness.  The cats clinked and
clinked as they patted each other with great affection and sharp
claws.
    
## A Moment ##
    
A moment in time.  I guess this is literature...
    
## Another ##
    
Another sibling moment in time.
    
# Second Heart #
    
"Ipsum lorem" cried the potters.
    
## The Moon ##
    
Rising, and yet setting.  I'm upside-down.

The stuff at the top is MultiMarkdown's way of letting you attach arbitrary metadata to your files, but in this case it is not arbitrary. It is input to the LaTeX generation back end in the multimarkdown tool. By issuing the commands:

$ multimarkdown -t latex -o book.tex book.md
$ pdflatex book.tex

I end up generating this PDF. Of course I have to arrange to get a few CTAN modules installed and my ~/texmf set up properly. I'm certainly no LaTeX wizard, but it was a hell of a lot nicer to just drop a few text files in the right place than screw around with an all-singing, all-dancing GUI that makes my machine foam at the mouth. The end result is pretty reasonable considering how light the markup is.

Okay, so great, I do this port. In the process of doing it I run into Parsing Expression Grammars, and that's really what this post is about. MultiMarkdown uses a PEG for its parser; it wanted a parser generator called greg but he decided to include it as a submodule in his repository. OpenBSD ports doesn't do git submodules right now (maybe never I hope since they are kind of evil) so I decided the best idea was to do a greg port first, to support the port of MultiMarkdown. This dragged me into the fascinating world of PEGs, totally on a tangent.

Parsing expression grammars are an alternate formalism to context-free grammars first published by Bryan Ford in POPL2004 (pdf). The paper is well worth reading if you are at all interested in parsing, translation and other related subjects: it is short, clear and thorough.

Here's the big idea: all that power in context-free grammars that Chomsky was on about is necessary for parsing human languages but not so much for machine-oriented parsing tasks, like configuration files or programming notations. His proposal in PEGs is to replace the unordered choice operation implicit in CFGs, where the right-hand side of a production is a set (inherently unordered) with an ordered choice primitive, where the RHS is a list and the first match wins. He uses forward-slash for his prioritized choice operator:

    ... The EBNF rules `A -> a | a b` and `A -> a b | a` are
    equivalent in a CFG, but the PEG rules `A <- a b / a` and `A
    <- a / a b` are different.  The second alternative in the latter
    PEG rule will never succeed because the first choice is always taken
    if the input string to be recognized begins with *a*.

Combined with some regular expression-like features the resulting notation he proposes as a concrete syntax for PEGs is remarkably powerful. It allows one to do away completely with a separate lexical analyzer; instead, a PEG describes both the hierarchical syntax and the lexical syntax of a language in a single spot. The resulting parsers are linear in time, have predictable memory requirements (no leaks!) and are, at least in my experience, much easier to debug than the standard stack of stuff we've become used to.

There's nothing not to like about this. In fact this is the first bit of computer science that really piqued my interest in a very long time... not that I'm all that great about keeping up on things, but PEGs felt like a breath of fresh air. I really want to go back and revisit some of my old parsers (all written with Yacc, Bison or Lemon combined with flex). The greg tool is pretty good, although I want to look at its output in more detail and do some more testing.

Anyone who has suffered through writing a parser of any complexity using the standard tools should consider checking greg out. It was accepted into ports in January 2015 (my first!) and I hope to do some cool things with it.

Frequently all the fun is on the tangents. In fact after I wrote this I realized I had forgotten to pop back up a level and see what was up with the multimarkdown port. It will become evident after my next post what other tangent distracted me...