🎉 Exercism Research is now launched. Help Exercism, help science and have some fun at research.exercism.io 🎉
Avatar of pazzo83

pazzo83's solution

to SGF Parsing in the Python Track

Published at May 03 2021 · 0 comments
Test suite

Parsing a Smart Game Format string.

SGF is a standard format for storing board game files, in particular go.

SGF is a fairly simple format. An SGF file usually contains a single tree of nodes where each node is a property list. The property list contains key value pairs, each key can only occur once but may have multiple values.

An SGF file may look like this:


This is a tree with three nodes:

  • The top level node has three properties: FF[4] (key = "FF", value = "4"), C[root](key = "C", value = "root") and SZ[19] (key = "SZ", value = "19"). (FF indicates the version of SGF, C is a comment and SZ is the size of the board.)
    • The top level node has a single child which has a single property: B[aa]. (Black plays on the point encoded as "aa", which is the 1-1 point (which is a stupid place to play)).
      • The B[aa] node has a single child which has a single property: W[ab].

As you can imagine an SGF file contains a lot of nodes with a single child, which is why there's a shorthand for it.

SGF can encode variations of play. Go players do a lot of backtracking in their reviews (let's try this, doesn't work, let's try that) and SGF supports variations of play sequences. For example:


Here the root node has two variations. The first (which by convention indicates what's actually played) is where black plays on 1-1. Black was sent this file by his teacher who pointed out a more sensible play in the second child of the root node: B[dd] (4-4 point, a very standard opening to take the corner).

A key can have multiple values associated with it. For example:


Here AB (add black) is used to add three black stones to the board.

There are a few more complexities to SGF (and parsing in general), which you can mostly ignore. You should assume that the input is encoded in UTF-8, the tests won't contain a charset property, so don't worry about that. Furthermore you may assume that all newlines are unix style (\n, no \r or \r\n will be in the tests) and that no optional whitespace between properties, nodes, etc will be in the tests.

The exercise will have you parse an SGF string and return a tree structure of properties. You do not need to encode knowledge about the data types of properties, just use the rules for the text type everywhere.

Exception messages

Sometimes it is necessary to raise an exception. When you do this, you should include a meaningful error message to indicate what the source of the error is. This makes your code more readable and helps significantly with debugging. Not every exercise will require you to raise an exception, but for those that do, the tests will only pass if you include a message.

To raise a message with an exception, just write it as an argument to the exception type. For example, instead of raise Exception, you should write:

raise Exception("Meaningful message indicating the source of the error")

Running the tests

To run the tests, run pytest sgf_parsing_test.py

Alternatively, you can tell Python to run the pytest module: python -m pytest sgf_parsing_test.py

Common pytest options

  • -v : enable verbose output
  • -x : stop running tests on first failure
  • --ff : run failures from previous test before running other test cases

For other options, see python -m pytest -h

Submitting Exercises

Note that, when trying to submit an exercise, make sure the solution is in the $EXERCISM_WORKSPACE/python/sgf-parsing directory.

You can find your Exercism workspace by running exercism debug and looking for the line that starts with Workspace.

For more detailed information about running tests, code style and linting, please see Running the Tests.

Submitting Incomplete Solutions

It's possible to submit an incomplete solution so you can see how others have completed the exercise.


import unittest

from sgf_parsing import parse, SgfTree

# Tests adapted from `problem-specifications//canonical-data.json`

class SgfParsingTest(unittest.TestCase):
    def test_empty_input(self):
        input_string = ""
        with self.assertRaisesWithMessage(ValueError):

    def test_tree_with_no_nodes(self):
        input_string = "()"
        with self.assertRaisesWithMessage(ValueError):

    def test_node_without_tree(self):
        input_string = ";"
        with self.assertRaisesWithMessage(ValueError):

    def test_node_without_properties(self):
        input_string = "(;)"
        expected = SgfTree()
        self.assertEqual(parse(input_string), expected)

    def test_single_node_tree(self):
        input_string = "(;A[B])"
        expected = SgfTree(properties={"A": ["B"]})
        self.assertEqual(parse(input_string), expected)

    def test_multiple_properties(self):
        input_string = "(;A[b]C[d])"
        expected = SgfTree(properties={"A": ["b"], "C": ["d"]})
        self.assertEqual(parse(input_string), expected)

    def test_properties_without_delimiter(self):
        input_string = "(;A)"
        with self.assertRaisesWithMessage(ValueError):

    def test_all_lowercase_property(self):
        input_string = "(;a[b])"
        with self.assertRaisesWithMessage(ValueError):

    def test_upper_and_lowercase_property(self):
        input_string = "(;Aa[b])"
        with self.assertRaisesWithMessage(ValueError):

    def test_two_nodes(self):
        input_string = "(;A[B];B[C])"
        expected = SgfTree(properties={"A": ["B"]}, children=[SgfTree({"B": ["C"]})])
        self.assertEqual(parse(input_string), expected)

    def test_two_child_trees(self):
        input_string = "(;A[B](;B[C])(;C[D]))"
        expected = SgfTree(
            properties={"A": ["B"]},
            children=[SgfTree({"B": ["C"]}), SgfTree({"C": ["D"]})],
        self.assertEqual(parse(input_string), expected)

    def test_multiple_property_values(self):
        input_string = "(;A[b][c][d])"
        expected = SgfTree(properties={"A": ["b", "c", "d"]})
        self.assertEqual(parse(input_string), expected)

    def test_escaped_property(self):
        input_string = "(;A[\\]b\nc\nd\t\te \n\\]])"
        expected = SgfTree(properties={"A": ["]b\nc\nd  e \n]"]})
        self.assertEqual(parse(input_string), expected)

    # Utility functions
    def assertRaisesWithMessage(self, exception):
        return self.assertRaisesRegex(exception, r".+")

if __name__ == "__main__":
## SGF Parser
import re

class SgfTree:
    def __init__(self, properties=None, children=None):
        self.properties = properties or {}
        self.children = children or []

    def __eq__(self, other):
        if not isinstance(other, SgfTree):
            return False
        for k, v in self.properties.items():
            if k not in other.properties:
                return False
            if other.properties[k] != v:
                return False
        for k in other.properties.keys():
            if k not in self.properties:
                return False
        if len(self.children) != len(other.children):
            return False
        for a, b in zip(self.children, other.children):
            if a != b:
                return False
        return True

    def __ne__(self, other):
        return not self == other

def _is_boundary(char: str):
    if char == "(" or char == ")" or char == ";":
        return True
    return False

def _parse_property_key(property_string: str):
    i = 0
    # we will loop through the property string until we hit a bracket, indicating the value
    while i < len(property_string):
        char = property_string[i]
        if char == "[":
            # we are in the value part of the string
        elif char < "A" or char > "Z":
            # here we check to make sure the key is only upper case letters
            raise ValueError(f"Character {char} is invalid.  Only upper case letters are permitted for keys")
        i += 1
    # what we've iterated over thus far represents the key, and we also return the rest of the string to
    # later parse the value
    return property_string[:i], property_string[i:]

def _extract_property_value(property_string: str):
    property_string = re.sub(r"\t|\\\n|\\\r", " ", property_string)
    return re.sub(r"\\(?!\\)", "", property_string)

def _parse_property_values(property_string: str):
    values = []
    escaped = False
    in_property = False
    i = 0
    value_idx = 0

    while i < len(property_string):
        if escaped:
            escaped = False
            char = property_string[i]
            if char == "[":
                # we are in the property
                # we set the part of the string with the value to start at the next index
                value_idx = i + 1
                in_property = True
            elif char == "]":
                # we are exiting the property value
                # now we can extract the value itself
            elif char == "\\":
                escaped = True
            elif in_property == False:
        i += 1
    if len(values) == 0:
        raise ValueError("This is not a valid SGF string - no values present")
    return values, property_string[i:]

def _parse_property(property_string: str):
    key, property_string = _parse_property_key(property_string)
    values, property_string = _parse_property_values(property_string)

    return {key: values}, property_string

def _parse_properties(properties_string: str):
    properties = {}
    while properties_string != "":
        new_property, properties_string = _parse_property(properties_string)
        # update dictionary
        properties = {**properties, **new_property}
    return properties

def _parse_inner_node(node_string: str):
    # nothing is inside this node, something is wrong
    if node_string == "" or node_string[0] != ";":
        raise ValueError("The node string is not well-formed - missing semi-colon!")
    escaped = False

    # similarly here, we start at 1 because we've already processed the boundary character
    i = 1
    while i < len(node_string):
        if escaped:
            escaped = False
            char = node_string[i]
            if _is_boundary(char):
                # we've reached a boundary signifying the end of properties in this node string
                # we exit the loop
            elif char == "\\":
                escaped = True
        i += 1
    # our properties are now one past the initial index (so one past the boundary) to the end boundary we
    # reached by iterating
    properties_string = node_string[1:i]
    # the rest of our node is now left for processing
    node_string = node_string[i:]

    # first, extract the properites from our properties string
    properties = _parse_properties(properties_string)

    # then, loop through the rest of our node string to see if we find any child nodes
    children = []

    # we loop while there is still a string left to process
    while len(node_string) > 0:
        # recursive call here - we parse a node if we find it and add it to this node's children
        child, node_string = _parse_node(node_string)
        if child is None:
            # that parsing didn't give us anything, let's try to parse the inner node
            if len(children) == 0:
                child, node_string = _parse_inner_node(node_string)
                raise ValueError("The SGF string is not valid.")
    # finally we return the SgfTree object and what is left of the string to process
    return SgfTree(properties, children), node_string

def _parse_node(input_string: str):
    # To help with recursive nature, if we hit a node that is just an empty string
    # or does not begin with the proper boundary, we just return None and continue.
    if input_string == "" or input_string[0] != "(":
        return None, input_string

    # initialize node level so we know when to break out of the while loop
    node_level = 1
    escaped = False

    # we start at 1 because we've already processed the boundary character
    i = 1

    while i < len(input_string):
        if escaped:
            escaped = False
            char = input_string[i]
            if char == ")":
                # at the end of this node, let's move up a level
                node_level -= 1
                if node_level == 0:
                    # if we have moved back to the base node, we are done
            elif char == "(":
                # starting a new node
                node_level += 1
            elif char == "\\":
                # escape sign
                escaped = True
        i += 1

    # restrict our node string to being from index 1 (so all but the first character, which is a boundary) to the index we've
    # iterated through (which is the other boundary).
    node_string = input_string[1:i]
    # parse this node
    node, _ = _parse_inner_node(node_string)

    # now that we've processed this node, we move one beyond the boundary and return that string,
    # as well as the SgfTree we created
    return node, input_string[i+1:]

def parse(input_string: str ) -> SgfTree:
    """Parse an SGF String."""
    sgfTree, input_string = _parse_node(input_string)
    if sgfTree is None or len(input_string) > 0:
        raise ValueError("Error parsing SGF string.")
    return sgfTree

Community comments

Find this solution interesting? Ask the author a question to learn more.

What can you learn from this solution?

A huge amount can be learned from reading other people’s code. This is why we wanted to give exercism users the option of making their solutions public.

Here are some questions to help you reflect on this solution and learn the most from it.

  • What compromises have been made?
  • Are there new concepts here that you could read more about to improve your understanding?