CMSC 14100 — Lecture 20

Trees

Even though lists are recursive, we don't typically think of them as recursive because their linearity makes it very convenient to approach them iteratively. This is not the case for all recursive structures.

Recall that we can view lists as a pair of a value and a list. But what if we generalize this: instead of pairing a value with just one list, what if we had a value with multiple lists? Of course, this definition would carry on for those sublists as well. Such a structure couldn't be called a list anymore...

Trees are a fundamental structure in discrete mathematics and computer science. Informally, you can think of a tree as a node with a value and multiple nodes as "branches" (there are those multiple lists).

In general, trees are used to model hierarchical relationships and structures. For example, a document tree gives the outline of a document (sections, subsections, etc.). A phylogentic tree gives a representation of how species evolve and change. A parse tree encodes the grammatical structure of an expression.

From this description, it's not obvious that trees are a recursive structure. We will give the following definition.

A tree is

empty, or
a node containing
- a value, and
- zero or more subtrees, also called its children.

Notice the similarity between this definition and the one we had for lists.

The fact that a tree can be empty seems a bit strange, but it's no different from any of the other recursive definitions we've seen before—for instance, we have empty lists and empty strings.

Meanwhile, a non-empty tree can be thought of as a node that contains a value and acts as the root for some subtrees. Taking this and thinking back to lists, we can think of non-empty lists as nodes that have exactly zero or one children (a sub-list).

Now, let's immediately criticize this definition.

In fact, there are some applications and formulations of the tree where empty trees may not make sense. In cases like these, the base case is often taken to be a node with no children. However, there are also situations where we really would like to have empty trees as a base case.
Note that in the textbook, the tree definition includes something that is not present in the basic definition of a tree: a key. A key isn't necessary for a tree, but having a key has some nice conveniences for things like searching and visualization. However, we'll be focusing on the simplest definition. And we can of course either extend our definition or consider a (key, value) pair to be the value stored in the node—this specific point is not that important.

One can get special kinds of trees by restricting the number of children. For instance, a tree in which every node has exactly two subtrees is called a binary tree.

Representing trees

Unlike lists, trees are not a built-in data type in Python, so we will have to define our own. We have a few decisions to make. At the very least, we know that if a tree isn't empty, it needs to keep track of three things: a key, a value, and its subtrees.

We need to make a class for trees and the key, value, and children of the tree need to be attributes. The key and value is easy to take care of. Since our definition doesn't restrict the number of subtrees a tree can have, we can use a list.

However, we have another issue: we also need to represent empty trees. We will make the decision that if the key we store is None, then our tree is empty.

This gives us the following class definition.


    class Tree:
        """
        A class representing a tree. More specifically, an instance of this 
        class will represent either an empty tree or a tree with a node and 
        zero or more subtrees.
        """
        
        def __init__(self, value=None):
            """
            Creates either an empty tree or a tree with a node and no 
            subtrees. The node contains a value.
            
            Input:
                value (Any): Value for node. If key is None, then value 
                    must be None.
            """
            self.value = value 
            
            if value is None:
                self.children = None
            else:
                self.children = []

Our constructor will take in an optional value, value. If no value is provided, then we create an empty tree. This is similar to how the list constructor will create an empty list if no arguments are provided. We make the decision not to specify children and instead add them later.

Currently, our constructor distinguishes between two cases: empty or not. Since we define an empty tree to have a value of None, we enforce the condition that if value is None, then children must also be None.

Then if the tree is non-empty, it must have a value and children is initialized to an empty list.

One thing that would be helpful to have is a way to determine whether the tree is empty.


def is_empty(self):
    """
    Returns True if the tree is the empty tree, False otherwise.
    """
    return (self.value is None and 
            self.children is None)

So we can create trees like so.


    >>> empty_tree = Tree()
    >>> t1 = Tree(87)
    >>> t2 = Tree('dog')

Right now, we don't have any visibility into our tree, so let's whip up a quick implementation of __repr__.


    def __repr__(self):
        """
        Internal representation of Tree
        """
        if self.is_empty():
            return 'Tree()'
        else:
            return f'Tree({self.value}, {self.children})'


    >>> empty_tree
    Tree()
    >>> t1
    Tree(87, [])
    >>> t2
    Tree(dog, [])

Since we can't specify children when initializing the tree, we need to add them after creation. We do this with the add_child method, which takes as input a tree to be added as a child. The add_child method is pretty straightforward (i.e. not that interesting): it just takes a tree and adds it to the children list.


    def add_child(self, tree):
        """
        Input:
            tree (Tree): child tree
        """
        assert isinstance(tree, Tree)
        self.children.append(tree)


    >>> t1.add_child(t2)
    >>> t1
    Tree(87, [Tree(dog, [])])

Remember though that what we've defined are objects, so this means that we're linking two objects via their references. This means that if we still have a refernce to a child, we can mutate it and this change will be retained when we view it from other objects.


    >>> t2.add_child(empty_tree)
    >>> t2.add_child(Tree(55))
    >>> t2
    Tree(dog, [Tree(), Tree(55, [])])
    >>> t1
    Tree(87, [Tree(dog, [Tree(), Tree(55, [])])])

Functions on trees

The recursive structure of trees gives a natural way to process them, in the same way we've thought about functions on other recursive structures. First, let's consider the problem of counting how many nodes are in our tree. We can think of this as the size of the tree.

So as with all recursive algorithms, we want to think back to the definition of our structure. Well, what is a tree? It's either empty or a node. So we have two cases.

First, if the tree is empty, then it has 0 nodes.
But if the tree is a node, then we know it contains at least 1 node. But it could have more. How many more? Well, as many as are in its subtrees. But how do we know the size of the subtrees? Well, we can just run our size function on them, since they're trees!

This gives us the following definition. The size of a tree $T$ is the number of nodes in $T$, denoted $\operatorname{size}(T)$, and is defined as follows:

If the tree is empty, then $\operatorname{size}(T) = 0$.
If the tree is a node with $k$ subtrees, then \[\operatorname{size}(T) = 1 + \sum_{i=1}^k \operatorname{size}(T_k),\] where $T_k$ is the $k$th subtree of $T$.

As with the recursive functions we've seen, we can easily turn this definition into a function. Recall that we can override __len__ for this purpose.


    def __len__(self):
        """
        The number of nodes in the tree.
        """
        if self.is_empty():
            return 0
        else:
            count = 1 # count the current node
            for t in self.children:
                count = count + t.__len__()
            return count


    >>> len(empty_tree)
    0
    >>> len(t2)
    2
    >>> len(t1)
    3
    >>> t1.add_child(Tree(3000))
    >>> t1
    Tree(87, [Tree(dog, [Tree(), Tree(55, [])]), Tree(3000, [])])
    >>> len(t1)
    4

Just like the recursive functions we've seen before, we can "templatize" this.


    def tree_function(tree):
        if tree.is_empty():
            # base case
        else:
            # recursive case
            for t in tree.children:
                # do something with tree_function(t) and value

Containment

Suppose now that we wanted to determine whether a item, say $k$, is present in our tree. Again, we consider how to solve this recursively. There are two possibilities for our tree: it's empty or it's not. If our tree is empty, then it obviously does not contain $k$.

But what if our tree is a node? One thing we can do is check whether the item in our node is $k$. If it is, then we can report that our tree does contain $k$. But what if it isn't? Well, it might be in one of the subtrees. So if it is in one of our subtrees, we can report this fact. Otherwise, we know the tree doesn't contain $k$.

Let's call this $\operatorname{contains}(T,k)$ and let's call the item stored in the node at $T$ (if it's a node) $\operatorname{value}(T)$. This is going to be a boolean: it's either true or false that $T$ contains $k$. We then have two possibilities.

If $T$ is empty, then $\operatorname{contains}(T,k) = \texttt{False}$.
If $T$ is a node with $m$ subtrees, then \[(\operatorname{value}(T) = k) \vee \bigvee_{i=1}^m \operatorname{contains}(T_i, k).\]

This gives us the following code.


    def __contains__(self, item):
        if self.is_empty():
            return False
        else:
            if item == self.value:
                return True
            else:
                for t in self.children:
                    if t.__contains__(item):
                        return True
                return False

Here, we implement __contains__, which gets called when the operator in is used on our object (in the same way that __len__ gets called when len is called).

This looks very similar to the definition we worked out and the other definitions we saw before. Again, it's helpful to keep the idea of using the recursive definition of the structure you're working on in mind when coming up with recursive programs.

2009 Stimulus Package Data

In February 2009, Congress passed the American Recovery and Reinvestment Act (aka the Stimulus), which provided funds for many different purposes, including infrastructure projects. The data is provided in a CSV (really separated by semicolons, not commas) file with the fields

City
State
Project description
Jobs created by project
Project cost
Project program

Here are the first few lines of the file.


Anchorage;AK;Replace Airport Terminal and Administrative Offices (Merrill Field);50;15000000;Airport
Anchorage;AK;Connextion Day-Hab Center;5;200000;CDBG
Anchorage;AK;Catholic Social Services Building;10;600000;CDBG
Anchorage;AK;Eagle River Town Center & EOC;20;2000000;CDBG

The stimulus data can be considered as a tree. We can take this opportunity to point out other parts of a tree.

The root of the tree is the sole node that doesn't have a parent—this represents the whole country.
The interior nodes in the tree are nodes that have children—these represent states, cities, and categories of projects.
The leaf nodes or leaves are the nodes with no children—these represent specific projects.

One problem with this is that we only have cost values for the individual projects, since that's the data contained in the file. This means that our tree only have dollar amounts at the leaf nodes. What we might want to do is aggregate these values so, for example, the IL node contained the sum of all the project nodes under it.

So let's think about how we might do this.