Thoughts on Parsing
2022-09-16I’ve been working on a parser for djot, a light markup language similar to CommonMark. The parser is written in zig, so I’ve named it djot.zig
. In this series of posts I’ll share some of the thoughts I’ve had while writing it.
Note that djot.zig
is not yet finished, and the code in this post is for example only.
Parsed Representation
I’m designing djot.zig
to have a small in memory representation once parsed. My design looks something like this at the moment:
const Document = struct {
source: []const u8,
events: []Event,
/// Where each event starts in source
event_source: []u32,
};
const Event = enum(u8) {
text,
start_paragraph,
close_paragraph,
start_list,
close_list,
start_list_item,
close_list_item,
};
This design is very much inspired by the Zig compiler’s internals and data oriented design.
Thus the following markup would be parsed as so:
Hello world!
- a bullet
- another bullet
- more bullet
idx | event | src | source text |
---|---|---|---|
0 | .start_paragraph | 0 | "" |
1 | .text | 0 | "Hello, world" |
2 | .close_paragraph | 12 | "\n\n" |
3 | .start_list | 14 | "" |
4 | .start_list_item | 14 | "- " |
5 | .text | 16 | "a bullet\n" |
6 | .close_list_item | 25 | "" |
7 | .start_list_item | 25 | "- " |
8 | .text | 27 | "another bullet\n" |
9 | .close_list_item | 42 | "" |
10 | .start_list_item | 42 | "- " |
11 | .text | 44 | "more bullet" |
12 | .close_list_item | 54 | "" |
Which is 13 bytes for the events
list, and 52 bytes for the event_source
list, and 54 bytes for source
itself.
We can then turn this abstract representation into html by looping over the list of events:
pub fn toHtml(writer: anytype, doc: Document) !void {
for (doc.events) |event, i| {
switch (event) {
.text => try writer.writeAll(doc.text(i)),
.start_paragraph => try writer.writeAll("<p>"),
.close_paragraph => try writer.writeAll("</p>"),
.start_list => try writer.writeAll("<ul>"),
.close_list => try writer.writeAll("</ul>"),
.start_list_item => try writer.writeAll("<li>"),
.close_list_item => try writer.writeAll("</li>"),
}
}
}
Read Cursors
While working on djot.zig
, I’ve gravitated towards a pattern I’m taking to calling the Cursor
pattern (if this pattern has a name already, let me know! Edit: Recursive Descent Parsing. I’m using Cursor
s to implement Recursive Descent Parsing).
What is a Cursor
The simplest Cursor
is just a pointer to an index. Let’s make a function that parses an integer using this pattern.
fn parseInt(text: []const u8, parent: *usize) ?u64 {
const start = parent.*;
var index = start;
// Keep going until we find the first non-digit character
while (index < text.len and '0' <= text[index] and text[index] <= '9') : (index += 1) { }
// There weren't any digits; return null
if (index - start < 1) return null;
// Parse an integer, or return null.
const integer = std.fmt.parseInt(u64, text[start..index], 10) catch return null;
// "Move" the parent cursor to the end of the integer we just parsed
parent.* = index;
return integer;
}
The key difference between this and a Reader
comes right at the end of the function where we update the parent
index. This parseInt
function only “moves” the cursor if it succeeds. If parseInt
fails we can parse the text using a different method. And if it succeeds, we can chain it with another call.
Here’s an example of parsing a list of integers with this pattern:
fn parseIntList(allocator: std.mem.Allocator, text: []const u8, parent: *usize) !?[]u64 {
const start = parent.*;
var index = start;
// A list starts with `[`
_ = parseExpectCharacter(text, &index, '[') orelse return null;
var list = std.ArrayList(u64).init(allocator);
defer list.deinit();
while (true) {
try list.append(parseInt(text, &index) orelse break);
_ = parseExpectCharacter(text, &index, ',') orelse break;
}
_ = parseExpectCharacter(text, &index, ']') orelse return null;
parent.* = index;
return list.toOwnedSlice();
}
What does a Cursor
need?
A Cursor
must fulfill a couple properties:
- Easy to Copy: so we can try reading something without committing to it
- Easy to Update: to make chaining calls together easy
The word Transactional neatly sums up those properties. A Cursor
should be Transactional.
Where did this come from? Why?
While I’m parsing a djot
file, I don’t always know ahead of time if I’m parsing the correct thing.
Take, for example, two overlapping formatting spans:
*This is [strong*, not a link.](https://djot.net)
Here the bold span is closed out before the link inside it is, so the text is not a link. The produced html should look like this:
<strong>This is [strong</strong>, not a link.](https://djot.net)
When we get to the [
character, we don’t know if we need to produce a link, or if it should be interepted it as text. Using the Cursor
pattern means I can try parsing the link, and if it doesn’t work out because an earlier span closes it out, or because a it’s missing the parenthesis, we can instead parse it as text.
While the Cursor
makes it easy to roll back text that was just read in, it can also be applied to writing a list of events.
Write Cursors
In djot.zig
, the parsed document is, at it’s core, a list of events. While I’m parsing the document, I want to reduce the amount of redundant/temporary allocations that are done, so I put all of the events directly into the ArrayList
that will make up the finished Document
. To handle resetting the state, I use a pattern similar to the Read Cursors.
What is a Write Cursor
Like a read cursor, it’s really just a pointer to an index. However this will be an index into a growable array, not a slice. Let’s modify the parseIntList
function from last time to use this pattern:
fn parseIntList(list: *std.ArrayList(u64), parent_event_index: *usize, text: []const u8, parent: *usize) !?void {
const start = parent.*;
var index = start;
var event_index = parent_event_index.*;
// A list starts with `[`
_ = parseExpectCharacter(text, &index, '[') orelse return null;
while (true) {
// Here we ignore the `.len` field on the ArrayList.
// We use `event_index` as our source of truth of where
// to write to.
const int = parseInt(text, &index) orelse break;
try list.resize(event_index + 1);
list.items[event_index] = int;
event_index += 1;
_ = parseExpectCharacter(text, &index, ',') orelse break;
}
// Ends with a `]`
_ = parseExpectCharacter(text, &index, ']') orelse return null;
parent.* = index;
parent_event_index.* = event_index;
return;
}
Using an ArrayList
like this let’s us fulfill the properties I set out for Read Cursor’s above:
- Easy to Copy: It’s a pointer to an ArrayList and an index into it
- Easy to Update: The
ArrayList
handles copying over data when it needs to grow the list
Now we have a Transactional way to write data.
Importance
djot.zig
gets a couple of benefits from using this pattern:
- Reduced allocations: Since I can reuse a single
ArrayList
,djot.zig
not only avoids allocating lists that would’ve just been copied over anyway, it also benefits from the amortization of allocation thatArrayList
s provide. - Simpler lifetime management: As nested objects grow, it becomes more difficult to track what can be freed and what needs to stick around. The single
ArrayList
of events gives us a single space to check, and a single region of memory to free.
In the next post I’ll talk about how I’m wrapping these patterns into a concrete types.