Thoughts on Parsing

2022-09-16

I’ve been working on a parser for djot, a light markup language similar to CommonMark. The parser is written in zig, so I’ve named it djot.zig. In this series of posts I’ll share some of the thoughts I’ve had while writing it.

Note that djot.zig is not yet finished, and the code in this post is for example only.

Parsed Representation

I’m designing djot.zig to have a small in memory representation once parsed. My design looks something like this at the moment:

const Document = struct {
  source: []const u8,

  events: []Event,

  /// Where each event starts in source
  event_source: []u32,
};

const Event = enum(u8) {
  text,
  start_paragraph,
  close_paragraph,
  start_list,
  close_list,
  start_list_item,
  close_list_item,
};

This design is very much inspired by the Zig compiler’s internals and data oriented design.

Thus the following markup would be parsed as so:

Hello world!

- a bullet
- another bullet
- more bullet

idx	event	src	source text
0	`.start_paragraph`	0	`""`
1	`.text`	0	`"Hello, world"`
2	`.close_paragraph`	12	`"\n\n"`
3	`.start_list`	14	`""`
4	`.start_list_item`	14	`"- "`
5	`.text`	16	`"a bullet\n"`
6	`.close_list_item`	25	`""`
7	`.start_list_item`	25	`"- "`
8	`.text`	27	`"another bullet\n"`
9	`.close_list_item`	42	`""`
10	`.start_list_item`	42	`"- "`
11	`.text`	44	`"more bullet"`
12	`.close_list_item`	54	`""`

Which is 13 bytes for the events list, and 52 bytes for the event_source list, and 54 bytes for source itself.

We can then turn this abstract representation into html by looping over the list of events:

pub fn toHtml(writer: anytype, doc: Document) !void {
  for (doc.events) |event, i| {
    switch (event) {
      .text => try writer.writeAll(doc.text(i)),
      .start_paragraph => try writer.writeAll("<p>"),
      .close_paragraph => try writer.writeAll("</p>"),
      .start_list => try writer.writeAll("<ul>"),
      .close_list => try writer.writeAll("</ul>"),
      .start_list_item => try writer.writeAll("<li>"),
      .close_list_item => try writer.writeAll("</li>"),
    }
  }
}

Read Cursors

While working on djot.zig, I’ve gravitated towards a pattern I’m taking to calling the Cursor pattern (if this pattern has a name already, let me know! Edit: Recursive Descent Parsing. I’m using Cursors to implement Recursive Descent Parsing).

What is a Cursor

The simplest Cursor is just a pointer to an index. Let’s make a function that parses an integer using this pattern.

fn parseInt(text: []const u8, parent: *usize) ?u64 {
    const start = parent.*;
	var index = start;
	
	// Keep going until we find the first non-digit character
	while (index < text.len and '0' <= text[index] and text[index] <= '9') : (index += 1) { }

	// There weren't any digits; return null
	if (index - start < 1) return null;

    // Parse an integer, or return null.
    const integer = std.fmt.parseInt(u64, text[start..index], 10) catch return null;
	
	// "Move" the parent cursor to the end of the integer we just parsed
	parent.* = index;
	return integer;
}

The key difference between this and a Reader comes right at the end of the function where we update the parent index. This parseInt function only “moves” the cursor if it succeeds. If parseInt fails we can parse the text using a different method. And if it succeeds, we can chain it with another call.

Here’s an example of parsing a list of integers with this pattern:

fn parseIntList(allocator: std.mem.Allocator, text: []const u8, parent: *usize) !?[]u64 {
    const start = parent.*;
	var index = start;
	
	// A list starts with `[`
    _ = parseExpectCharacter(text, &index, '[') orelse return null;
    
    var list = std.ArrayList(u64).init(allocator);
	defer list.deinit();
	while (true) {
		try list.append(parseInt(text, &index) orelse break);
	    _ = parseExpectCharacter(text, &index, ',') orelse break;
	}
	
	_ = parseExpectCharacter(text, &index, ']') orelse return null;
	parent.* = index;
	return list.toOwnedSlice();
}

What does a `Cursor` need?

A Cursor must fulfill a couple properties:

Easy to Copy: so we can try reading something without committing to it
Easy to Update: to make chaining calls together easy

The word Transactional neatly sums up those properties. A Cursor should be Transactional.

Where did this come from? Why?

While I’m parsing a djot file, I don’t always know ahead of time if I’m parsing the correct thing.

Take, for example, two overlapping formatting spans:

*This is [strong*, not a link.](https://djot.net)

Here the bold span is closed out before the link inside it is, so the text is not a link. The produced html should look like this:

<strong>This is [strong</strong>, not a link.](https://djot.net)

When we get to the [ character, we don’t know if we need to produce a link, or if it should be interepted it as text. Using the Cursor pattern means I can try parsing the link, and if it doesn’t work out because an earlier span closes it out, or because a it’s missing the parenthesis, we can instead parse it as text.

While the Cursor makes it easy to roll back text that was just read in, it can also be applied to writing a list of events.

Write Cursors

In djot.zig, the parsed document is, at it’s core, a list of events. While I’m parsing the document, I want to reduce the amount of redundant/temporary allocations that are done, so I put all of the events directly into the ArrayList that will make up the finished Document. To handle resetting the state, I use a pattern similar to the Read Cursors.

What is a Write Cursor

Like a read cursor, it’s really just a pointer to an index. However this will be an index into a growable array, not a slice. Let’s modify the parseIntList function from last time to use this pattern:

fn parseIntList(list: *std.ArrayList(u64), parent_event_index: *usize, text: []const u8, parent: *usize) !?void {
    const start = parent.*;
	var index = start;
	var event_index = parent_event_index.*;
	
	// A list starts with `[`
    _ = parseExpectCharacter(text, &index, '[') orelse return null;
    
	while (true) {
		// Here we ignore the `.len` field on the ArrayList.
		// We use `event_index` as our source of truth of where
		// to write to.
		const int = parseInt(text, &index) orelse break; 
		try list.resize(event_index + 1);
		list.items[event_index] = int;
		event_index += 1;
		
	    _ = parseExpectCharacter(text, &index, ',') orelse break;
	}
	
	// Ends with a `]`
	_ = parseExpectCharacter(text, &index, ']') orelse return null;
	
	parent.* = index;
	parent_event_index.* = event_index;
	return;
}

Using an ArrayList like this let’s us fulfill the properties I set out for Read Cursor’s above:

Easy to Copy: It’s a pointer to an ArrayList and an index into it
Easy to Update: The ArrayList handles copying over data when it needs to grow the list

Now we have a Transactional way to write data.

Importance

djot.zig gets a couple of benefits from using this pattern:

Reduced allocations: Since I can reuse a single ArrayList, djot.zig not only avoids allocating lists that would’ve just been copied over anyway, it also benefits from the amortization of allocation that ArrayLists provide.
Simpler lifetime management: As nested objects grow, it becomes more difficult to track what can be freed and what needs to stick around. The single ArrayList of events gives us a single space to check, and a single region of memory to free.

In the next post I’ll talk about how I’m wrapping these patterns into a concrete types.