Friday, April 19, 2013

Node Data Evolution

While working on Thinker I searched for a data syntax that could gracefully handle the kind of challenges that Thinker would present. I wanted a human-readable, minimal syntax that handled node-based data. These considerations were mostly for the sake of the developer (in this case, me), though I was also thinking of the user. If the user ever needed to export or convert Thinker data into text for backup or transport, I wanted it to be something they could understand and pick up easily.
In the beginning, Thinker data was based on XML. XML certainly satisfied the node-based requirement. You can stick many nodes of the same kind under one parent, and XML even has explicit metadata syntax for each node in the form of attributes. But XML, despite being human-readable, is still a chore to work with manually. Markup really is best used for marking up text, not complex data. And although the metadata is explicit, it doesn't really have any structure. If you attempt to store structured metadata in an XML attribute (such as styles in HTML), the results are hardly readable.
Then Thinker used JSON. JSON was nice because it was easy to work with in JavaScript, to stringify and transport objects. The data was easier to work with manually than XML; there were no lengthy closing tags to deal with. But JSON is not inherently node-based. You can make node structures in JSON, since child nodes are just nested in an array, but nesting objects in an array gets very messy when you're stacking all those brackets, braces, and commas together.
Then I got into CoffeeScript. I've always liked the beauty of languages with whitespace syntax, though I hardly use Ruby or Python. So I thought I would just use CoffeeScript syntax to work with my data, in the form of CSON if you will. (That's actually a thing.) But despite the beautiful whitespace syntax of objects in CoffeeScript, there's one glaring problem: you still need brackets for arrays. And putting indented objects into an array over multiple lines wasn't much better than JSON for node-based structures... So close, yet so far.
But then I had a brilliant idea! Why not just use colons without keys to represent indexed items in an array? Since "arrays" in JavaScript are generally just objects (sparse hashes) anyway, colons without keys would just be shorthand for assigning indices. So I posted a ticket on GitHub...
Alas, they get people complaining about the array syntax all the time, and they are probably tired of hearing about it, especially since there are alternatives like Coco and LiveScript. I actually agree with jashkenas that stars and hyphens are not the best solution to the problem. One problem with my solution, though, would be the risk of accidentally mixing keyed and unkeyed properties. Arrays with keyed properties work at runtime, but not in a precompiled language. It would have to be a compiler error.
So ultimately, I had no choice but to create my own data syntax, which really means I had to create my own parser. The reason I didn't just go with YAML at this point was that 1) I don't actually prefer its hyphen syntax, and 2) I thought I would go all the way and create a syntax that satisfied my node-based requirements as well. Also, YAML doesn't accept tabs in whitespace. Yes, I see that as a flaw, and I don't care who knows.
So to start, I just created a parser that parses objects like CoffeeScript (or actually like YAML with unquoted strings). Then I added the ability to convert objects with unkeyed properties to arrays. Since this data will be parsed at runtime, it allows for arrays with some keyed properties. Of course, those properties don't output to JSON, but they could be set in the result.
So this:
obj:
 foo: bar
 key: val

arr:
 : 2
 : -1
 :
  foo: bar
  zip: zap
 :
  : 1
  : 2
  : 3
Translates to this in JSON:
{
        "obj": {
                "foo": "bar",
                "key": "val"
        },
        "arr": [
                2,
                -1,
                {
                        "foo": "bar",
                        "zip": "zap"
                },
                [
                        1,
                        2,
                        3
                ]
        ]
}
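For illustration, here is a minimal sketch of how the basic syntax above could be parsed. It is not the parser described in this post, and it is written in Python rather than the JavaScript that Thinker runs on. The names parse, parse_block, and parse_scalar are made up for the example. It assumes that only plain integers and decimals count as numbers, that indentation is compared by character count (so tabs and spaces should not be mixed), and, as a simplification, that mixing keyed and unkeyed entries in one block is an error rather than an array with extra properties.

```python
import json
import re

# Assumption: only plain integers and decimals are treated as numbers.
NUMBER = re.compile(r"-?\d+(\.\d+)?")


def parse_scalar(text):
    """Convert an unquoted token to int or float, else keep it as a string."""
    if NUMBER.fullmatch(text):
        return float(text) if "." in text else int(text)
    return text


def parse_block(rows, pos, indent):
    """Parse consecutive rows at `indent`; return (container, next position)."""
    entries = []
    while pos < len(rows) and rows[pos][0] == indent:
        _, key, text = rows[pos]
        pos += 1
        if text:
            value = parse_scalar(text)
        elif pos < len(rows) and rows[pos][0] > indent:
            # Nothing after the colon: the deeper-indented rows are the value.
            value, pos = parse_block(rows, pos, rows[pos][0])
        else:
            value = None  # a key with no value and no nested block
        entries.append((key, value))
    unkeyed = [key == "" for key, _ in entries]
    if all(unkeyed):
        # Every entry is a bare colon, so the block is an array.
        return [value for _, value in entries], pos
    if any(unkeyed):
        raise ValueError("mixed keyed and unkeyed entries in one block")
    return dict(entries), pos


def parse(source):
    """Parse the whole text into nested dicts and lists."""
    rows = []  # (indent, key, text after the colon) for each non-blank line
    for line in source.splitlines():
        if not line.strip():
            continue
        body = line.lstrip(" \t")
        key, colon, text = body.partition(":")
        if not colon:
            raise ValueError(f"expected ':' in line {line!r}")
        rows.append((len(line) - len(body), key.strip(), text.strip()))
    if not rows:
        return {}
    result, pos = parse_block(rows, 0, rows[0][0])
    if pos != len(rows):
        raise ValueError(f"unexpected indentation at non-blank line {pos + 1}")
    return result


SAMPLE = "\n".join([
    "obj:",
    " foo: bar",
    " key: val",
    "",
    "arr:",
    " : 2",
    " : -1",
    " :",
    "  foo: bar",
    "  zip: zap",
    " :",
    "  : 1",
    "  : 2",
    "  : 3",
])

print(json.dumps(parse(SAMPLE), indent=2))
```

Run on the sample, the sketch should print the same structure as the JSON above (with narrower indentation). Because a block becomes an array only when every entry is a bare colon, the "mixed" case surfaces as an error at parse time.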
Awesome! It's the most straightforward and simple hierarchical data syntax ever. But while I'm at it with a custom parser, I could go ahead and create an expanded syntax with support for node structures. So I added some extra key symbols as shorthand for Thinker node data.
So this:
root node
 some node prop: 0
 position:
  x: 0
  y: 0
 - link
  link prop: 0
  > a child node
   more node props:
    :1
    :2
    :3
   -
    > grandchild node
 -
  > another child node
...is equivalent to:
_value: root node
some node prop: 0
position:
 x: 0
 y: 0
_links:
 :
  _label: link
  link prop: 0
  _target:
   _value: a child node
   more node props:
    :1
    :2
    :3
   _links:
    :
     _target:
      _value: grandchild node
 :
  _target:
   _value: another child node
...which in JSON looks like this:
{
        "_value": "root node",
        "some node prop": 0,
        "position": {
                "x": 0,
                "y": 0
        },
        "_links": [
                {
                        "_label": "link",
                        "link prop": 0,
                        "_target": {
                                "_value": "a child node",
                                "more node props": [
                                        1,
                                        2,
                                        3
                                ],
                                "_links": [
                                        {
                                                "_target": {
                                                        "_value": "grandchild node"
                                                }
                                        }
                                ]
                        }
                },
                {
                        "_target": {
                                "_value": "another child node"
                        }
                }
        ]
}
...which would be a real chore to work with manually. This node structure uses the new link paradigm. Note that the links themselves are a kind of node and can contain their own properties. The hyphens in this case don't just designate array items, but link nodes that are added to a links array. I also added "=" as shorthand for a meta link. The ">" symbols point to link "targets", which are actual nodes and can themselves contain links to other nodes. I even added some support for node IDs and references to serve as the basis for associative links in Thinker. Normal node properties use the same basic object structure from before.
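As a rough illustration of the shorthand expansion, the sketch below turns the node example into the `_value` / `_links` / `_target` structure shown in the JSON. Again, this is Python and not the actual Thinker parser. The function names (build_tree, entry, to_data, to_node, to_link, parse_node) are invented for the example, and it covers only what the example demonstrates: a value line, keyed properties, "-" links with optional labels, and ">" targets. The "=" meta links and the node IDs and references are left out because their syntax isn't shown here.

```python
import re

NUMBER = re.compile(r"-?\d+(\.\d+)?")  # assumption: plain ints and decimals only


def scalar(text):
    """Convert an unquoted token to int or float, else keep it as a string."""
    if NUMBER.fullmatch(text):
        return float(text) if "." in text else int(text)
    return text


def build_tree(source):
    """Group non-blank lines into (text, children) pairs by indentation."""
    root = []
    stack = [(-1, root)]  # (indent, children list) of the open ancestors
    for line in source.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip(" \t"))
        while stack[-1][0] >= indent:
            stack.pop()  # close blocks that are not ancestors of this line
        children = []
        stack[-1][1].append((line.strip(), children))
        stack.append((indent, children))
    return root


def entry(text, kids):
    """Split a `key: value` row; an empty value means a nested block."""
    key, colon, rest = text.partition(":")
    if not colon:
        raise ValueError(f"expected ':' in {text!r}")
    rest = rest.strip()
    return key.strip(), (scalar(rest) if rest else to_data(kids))


def to_data(rows):
    """Plain block: a list if every key is empty, otherwise a dict."""
    pairs = [entry(text, kids) for text, kids in rows]
    if pairs and all(key == "" for key, _ in pairs):
        return [value for _, value in pairs]
    return dict(pairs)


def to_link(label, rows):
    """A '-' row: optional label, its own properties, and a '>' target node."""
    link = {"_label": label} if label else {}
    for text, kids in rows:
        if text.startswith(">"):
            link["_target"] = to_node(text[1:].strip(), kids)
        else:
            key, value = entry(text, kids)
            link[key] = value
    return link


def to_node(value, rows):
    """A node: its value, keyed properties, and any '-' links beneath it."""
    node = {"_value": value}
    links = []
    for text, kids in rows:
        if text.startswith("-"):
            links.append(to_link(text[1:].strip(), kids))
        else:
            key, val = entry(text, kids)
            node[key] = val
    if links:
        node["_links"] = links
    return node


def parse_node(source):
    """Parse text that holds exactly one root node."""
    (text, kids), = build_tree(source)
    return to_node(text, kids)


NODE_SAMPLE = "\n".join([
    "root node",
    " some node prop: 0",
    " - link",
    "  link prop: 0",
    "  > a child node",
    "   -",
    "    > grandchild node",
])

print(parse_node(NODE_SAMPLE))
```

The split into to_node and to_link mirrors the point that links are a kind of node with their own properties: each function handles one symbol specially and treats every other row as an ordinary property.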
This expanded node data syntax is obviously not quite as straightforward as the basic object structure, but it is much better than the alternative for node-based data. I'm not even going to share the parser code here because I know nothing about writing parsers, and it's probably the most clumsy, inefficient, fragile parser ever written. It just works... most of the time. So for now I'll just use it internally.

Edit: mixed up the terms compiler and parser...
