CMSC 14100 — Lecture 20

Trees

Even though lists are recursive, we don't typically think of them as recursive because their linearity makes it very convenient to approach them iteratively. This is not the case for all recursive structures.

Recall that we can view lists as a pair of a value and a list. But what if we generalize this: instead of pairing a value with just one list, what if we had a value with multiple lists? Of course, this definition would carry on for those sublists as well. Such a structure couldn't be called a list anymore...

Trees are a fundamental structure in discrete mathematics and computer science. Informally, you can think of a tree as a node with a value and multiple nodes as "branches" (there are those multiple lists).

In general, trees are used to model hierarchical relationships and structures. For example, a document tree gives the outline of a document (sections, subsections, etc.). A phylogentic tree gives a representation of how species evolve and change. A parse tree encodes the grammatical structure of an expression.

From this description, it's not obvious that trees are a recursive structure. We will give the following definition.

A tree is

Notice the similarity between this definition and the one we had for lists.

The fact that a tree can be empty seems a bit strange, but it's no different from any of the other recursive definitions we've seen before—for instance, we have empty lists and empty strings.

Meanwhile, a non-empty tree can be thought of as a node that contains a value and acts as the root for some subtrees. Taking this and thinking back to lists, we can think of non-empty lists as nodes that have exactly zero or one children (a sub-list).

Now, let's immediately criticize this definition.

One can get special kinds of trees by restricting the number of children. For instance, a tree in which every node has exactly two subtrees is called a binary tree.

Representing trees

Unlike lists, trees are not a built-in data type in Python, so we will have to define our own. We have a few decisions to make. At the very least, we know that if a tree isn't empty, it needs to keep track of three things: a key, a value, and its subtrees.

We need to make a class for trees and the key, value, and children of the tree need to be attributes. The key and value is easy to take care of. Since our definition doesn't restrict the number of subtrees a tree can have, we can use a list.

However, we have another issue: we also need to represent empty trees. We will make the decision that if the key we store is None, then our tree is empty.

This gives us the following class definition.


    class Tree:
        """
        A class representing a tree. More specifically, an instance of this 
        class will represent either an empty tree or a tree with a node and 
        zero or more subtrees.
        """
        
        def __init__(self, value=None):
            """
            Creates either an empty tree or a tree with a node and no 
            subtrees. The node contains a value.
            
            Input:
                value (Any): Value for node. If key is None, then value 
                    must be None.
            """
            self.value = value 
            
            if value is None:
                self.children = None
            else:
                self.children = []
    

Our constructor will take in an optional value, value. If no value is provided, then we create an empty tree. This is similar to how the list constructor will create an empty list if no arguments are provided. We make the decision not to specify children and instead add them later.

Currently, our constructor distinguishes between two cases: empty or not. Since we define an empty tree to have a value of None, we enforce the condition that if value is None, then children must also be None.

Then if the tree is non-empty, it must have a value and children is initialized to an empty list.

One thing that would be helpful to have is a way to determine whether the tree is empty.


def is_empty(self):
    """
    Returns True if the tree is the empty tree, False otherwise.
    """
    return (self.value is None and 
            self.children is None)
    

So we can create trees like so.


    >>> empty_tree = Tree()
    >>> t1 = Tree(87)
    >>> t2 = Tree('dog')
    

Right now, we don't have any visibility into our tree, so let's whip up a quick implementation of __repr__.


    def __repr__(self):
        """
        Internal representation of Tree
        """
        if self.is_empty():
            return 'Tree()'
        else:
            return f'Tree({self.value}, {self.children})'
    

    >>> empty_tree
    Tree()
    >>> t1
    Tree(87, [])
    >>> t2
    Tree(dog, [])
    

Since we can't specify children when initializing the tree, we need to add them after creation. We do this with the add_child method, which takes as input a tree to be added as a child. The add_child method is pretty straightforward (i.e. not that interesting): it just takes a tree and adds it to the children list.


    def add_child(self, tree):
        """
        Input:
            tree (Tree): child tree
        """
        assert isinstance(tree, Tree)
        self.children.append(tree)
    

    >>> t1.add_child(t2)
    >>> t1
    Tree(87, [Tree(dog, [])])
    

Remember though that what we've defined are objects, so this means that we're linking two objects via their references. This means that if we still have a refernce to a child, we can mutate it and this change will be retained when we view it from other objects.


    >>> t2.add_child(empty_tree)
    >>> t2.add_child(Tree(55))
    >>> t2
    Tree(dog, [Tree(), Tree(55, [])])
    >>> t1
    Tree(87, [Tree(dog, [Tree(), Tree(55, [])])])
    

Functions on trees

The recursive structure of trees gives a natural way to process them, in the same way we've thought about functions on other recursive structures. First, let's consider the problem of counting how many nodes are in our tree. We can think of this as the size of the tree.

So as with all recursive algorithms, we want to think back to the definition of our structure. Well, what is a tree? It's either empty or a node. So we have two cases.

This gives us the following definition. The size of a tree $T$ is the number of nodes in $T$, denoted $\operatorname{size}(T)$, and is defined as follows:

As with the recursive functions we've seen, we can easily turn this definition into a function. Recall that we can override __len__ for this purpose.


    def __len__(self):
        """
        The number of nodes in the tree.
        """
        if self.is_empty():
            return 0
        else:
            count = 1 # count the current node
            for t in self.children:
                count = count + t.__len__()
            return count
    

    >>> len(empty_tree)
    0
    >>> len(t2)
    2
    >>> len(t1)
    3
    >>> t1.add_child(Tree(3000))
    >>> t1
    Tree(87, [Tree(dog, [Tree(), Tree(55, [])]), Tree(3000, [])])
    >>> len(t1)
    4
    

Just like the recursive functions we've seen before, we can "templatize" this.


    def tree_function(tree):
        if tree.is_empty():
            # base case
        else:
            # recursive case
            for t in tree.children:
                # do something with tree_function(t) and value
    

Containment

Suppose now that we wanted to determine whether a item, say $k$, is present in our tree. Again, we consider how to solve this recursively. There are two possibilities for our tree: it's empty or it's not. If our tree is empty, then it obviously does not contain $k$.

But what if our tree is a node? One thing we can do is check whether the item in our node is $k$. If it is, then we can report that our tree does contain $k$. But what if it isn't? Well, it might be in one of the subtrees. So if it is in one of our subtrees, we can report this fact. Otherwise, we know the tree doesn't contain $k$.

Let's call this $\operatorname{contains}(T,k)$ and let's call the item stored in the node at $T$ (if it's a node) $\operatorname{value}(T)$. This is going to be a boolean: it's either true or false that $T$ contains $k$. We then have two possibilities.

This gives us the following code.


    def __contains__(self, item):
        if self.is_empty():
            return False
        else:
            if item == self.value:
                return True
            else:
                for t in self.children:
                    if t.__contains__(item):
                        return True
                return False
    

Here, we implement __contains__, which gets called when the operator in is used on our object (in the same way that __len__ gets called when len is called).

This looks very similar to the definition we worked out and the other definitions we saw before. Again, it's helpful to keep the idea of using the recursive definition of the structure you're working on in mind when coming up with recursive programs.

2009 Stimulus Package Data

In February 2009, Congress passed the American Recovery and Reinvestment Act (aka the Stimulus), which provided funds for many different purposes, including infrastructure projects. The data is provided in a CSV (really separated by semicolons, not commas) file with the fields

  1. City
  2. State
  3. Project description
  4. Jobs created by project
  5. Project cost
  6. Project program

Here are the first few lines of the file.


Anchorage;AK;Replace Airport Terminal and Administrative Offices (Merrill Field);50;15000000;Airport
Anchorage;AK;Connextion Day-Hab Center;5;200000;CDBG
Anchorage;AK;Catholic Social Services Building;10;600000;CDBG
Anchorage;AK;Eagle River Town Center & EOC;20;2000000;CDBG
    

The stimulus data can be considered as a tree. We can take this opportunity to point out other parts of a tree.

One problem with this is that we only have cost values for the individual projects, since that's the data contained in the file. This means that our tree only have dollar amounts at the leaf nodes. What we might want to do is aggregate these values so, for example, the IL node contained the sum of all the project nodes under it.

So let's think about how we might do this.