CMSC 14100 — Lecture 22

2009 Stimulus Package Data

In February 2009, Congress passed the American Recovery and Reinvestment Act (aka the Stimulus), which provided funds for many different purposes, including infrastructure projects. The data is provided in a CSV file with the fields

  1. City
  2. State
  3. Project description
  4. Jobs created by project
  5. Project cost
  6. Project program

The stimulus data can be considered as a tree: states are internal nodes, and the individual projects are the leaves beneath them. We can take this opportunity to point out other parts of a tree.

One problem with this is that we only have cost values for the individual projects, since that's the data contained in the file. This means that our tree only has dollar amounts at the leaf nodes. What we might want to do is aggregate these values so that, for example, the IL node contains the sum of all the project nodes under it.

This involves two things:

  1. computing a value based on the subtrees, and
  2. mutating the tree so we store the values that we get.

Let's think about how we might do this. We have a tree, so let's break it up into the base and recursive cases.

This gives us the following function (notice that it's a function, not a method; sometimes we want to compute things about trees from outside the class).


def aggregate_stimulus_values(tree):
    """
    Given a tree, aggregate the values of the subtrees and store the
    total at each internal node.

    Input:
        tree (Tree): tree containing integers as values

    Output: sum of the values in the subtrees of tree (int)
    """
    if tree.children == []:
        return tree.value
    else:
        total_cost = 0
        for subtree in tree.children:
            total_cost += aggregate_stimulus_values(subtree)
        tree.value = total_cost
        return total_cost
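
To see the mutation in action, here's a sketch that pairs the function with a minimal, hypothetical Tree class (the real class from earlier lectures may differ in its details):

```python
# A minimal Tree class, assumed for illustration; the class from earlier
# lectures may differ in its details.
class Tree:
    def __init__(self, value, children=None):
        self.value = value
        self.children = children if children is not None else []

def aggregate_stimulus_values(tree):
    """Aggregate leaf values, storing the total at each internal node."""
    if tree.children == []:
        return tree.value
    else:
        total_cost = 0
        for subtree in tree.children:
            total_cost += aggregate_stimulus_values(subtree)
        tree.value = total_cost
        return total_cost

# A tiny state subtree: two IL projects costing 3 and 4 (the dollar
# amounts are made up). The IL node starts with a placeholder value of 0.
il = Tree(0, [Tree(3), Tree(4)])
print(aggregate_stimulus_values(il))  # 7
print(il.value)                       # 7 -- the IL node was updated
```

Note that the function both returns the total and mutates the node, which is exactly the two-part plan described above.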
    

Binary search trees

Trees are useful not just because they can represent hierarchical data, but because, like lists, they are a natural starting point for many other data structures. One such data structure is a binary search tree.

The idea behind a binary search tree is that it stores a set of data that has an ordering, in a way that lets items be retrieved quickly. BSTs have two important properties that make them different from ordinary trees.

  1. BSTs are binary trees. That is, every non-empty tree has exactly two children. These are called the left child and the right child. It is important that they are ordered and that they both always exist. (Recall that a tree can be empty!)
  2. BSTs have the binary search property:
    Given a non-empty tree with root key $k$, all the nodes in the left subtree must have keys that are strictly less than $k$, and all the nodes in the right subtree must have keys that are strictly greater than $k$.

Here's an example. (Here, I've chosen to render only the keys, since they're what is relevant)

 48
  ├──28
  │  ├──15
  │  │  ├──EMPTY
  │  │  └──21
  │  │     ├──EMPTY
  │  │     └──EMPTY
  │  └──31
  │     ├──EMPTY
  │     └──EMPTY
  └──56
     ├──54
     │  ├──EMPTY
     │  └──EMPTY
     └──93
        ├──92
        │  ├──EMPTY
        │  └──EMPTY
        └──94
           ├──EMPTY
           └──EMPTY

In this rendering, each node has two subtrees: the "upper" one is the left subtree and the "lower" one is the right subtree.

So our BST class will look a bit different from the ordinary Tree class.


class BinarySearchTree:
    """
    A class representing a binary search tree. A binary search tree is 
    either
    - empty, or
    - a node containing a key and value with a left subtree and a right 
      subtree, such that the binary search tree condition is satisfied.

    Attributes:
    - _key: a key, or None if empty
    - _value: a value, which must be None if key is None
    - _left: BinarySearchTree, or None if empty
    - _right: BinarySearchTree, or None if empty
    """

    def __init__(self, key=None, value=None):
        """
        Creates a new BST with root value. If the given value is None,
        then this creates an empty BST.
        """
        assert key is not None or value is None
        if key is None:
            self._key = None
            self._value = None
            self._left = None
            self._right = None
        else:
            self._key = key
            self._value = value
            self._left = BinarySearchTree()
            self._right = BinarySearchTree()

    def is_empty(self):
        """
        Returns True if the tree is empty and False otherwise.
        """
        return self._key is None
    

Not only is the structure of the tree constrained, but since we have a particular property to preserve, we need to control access to it. We've seen this idea before: stacks and queues can be based on lists but have their access controlled. Similarly, BSTs are based on trees and have their access controlled. There are many other data structures (quadtrees, binary heaps, etc.) that are based on restrictions on trees.

So instead of using add_child, we need an insert operation specifically for BSTs. There is an algorithm for this, and we find that as we insert items into the tree, there is exactly one place for each item to go. This will be the subject of discussion next week. But one observation we can make now is that the exact tree you get depends on the order in which you insert the items.
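
As a preview, here is one possible sketch of insertion, using a condensed version of the class above; the insert method itself is hypothetical and not yet part of our class:

```python
class BinarySearchTree:
    # Condensed version of the class above, for illustration.
    def __init__(self, key=None, value=None):
        self._key = key
        self._value = value
        self._left = BinarySearchTree() if key is not None else None
        self._right = BinarySearchTree() if key is not None else None

    def is_empty(self):
        return self._key is None

    def insert(self, key, value=None):
        """Sketch: walk down, comparing keys, until we find the one
        empty spot where the key belongs."""
        if self.is_empty():
            self._key = key
            self._value = value
            self._left = BinarySearchTree()
            self._right = BinarySearchTree()
        elif key < self._key:
            self._left.insert(key, value)
        elif key > self._key:
            self._right.insert(key, value)
        # if key == self._key, the key is already present; do nothing

# Inserting in this order reproduces the first example tree.
t = BinarySearchTree()
for k in [48, 28, 56, 15, 31, 54, 93, 21, 92, 94]:
    t.insert(k)
print(t._key, t._left._key, t._right._key)  # 48 28 56
```

Each comparison sends the new key down exactly one branch, which is why there is exactly one place for each item to go.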

For example, here is another possible BST using the same integers as above. In the above example, we inserted items into the tree in the following order: 48, 28, 56, 15, 31, 54, 93, 21, 92, 94. In the following example, we insert items in the following order: 54, 21, 92, 15, 28, 56, 93, 31, 94, 48.

 54
  ├──21
  │  ├──15
  │  │  ├──EMPTY
  │  │  └──EMPTY
  │  └──28
  │     ├──EMPTY
  │     └──31
  │        ├──EMPTY
  │        └──48
  │           ├──EMPTY
  │           └──EMPTY
  └──92
     ├──56
     │  ├──EMPTY
     │  └──EMPTY
     └──93
        ├──EMPTY
        └──94
           ├──EMPTY
           └──EMPTY

Instead, let's talk about how BSTs improve lookup by implementing __contains__. Recall that for Tree, we needed to check every child of a node when searching for a key. We claim that a BST improves on this.

This gives us the following algorithm.


    def __contains__(self, key):
        """
        Determine whether the given key is present in the tree.

        Input:
            key (int): the key to be found

        Output: True if key is in the tree, False otherwise (bool)
        """
        if self.is_empty():
            return False
        else:
            if key == self._key:
                return True
            elif key < self._key:
                return self._left.__contains__(key)
            else:
                return self._right.__contains__(key)
    

We can even adapt this so that it retrieves the associated value instead of simply determining whether the key is in the tree.


    def find(self, key):
        """
        Retrieve the value stored at the node with the given key, if it
        exists; return False otherwise.

        Input:
            key (int): the key to be found

        Output: the value associated with key if key is in the tree, 
            False otherwise (Any | False)
        """
        if self.is_empty():
            return False
        else:
            if key == self._key:
                return self._value
            elif key < self._key:
                return self._left.find(key)
            else:
                return self._right.find(key)
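
Here is a quick check of both methods, using a condensed copy of the class and a hypothetical node helper that builds the top of the example tree directly (since insert is deferred to next week):

```python
class BinarySearchTree:
    # Condensed version of the class, with the two lookup methods above.
    def __init__(self, key=None, value=None):
        self._key = key
        self._value = value
        self._left = BinarySearchTree() if key is not None else None
        self._right = BinarySearchTree() if key is not None else None

    def is_empty(self):
        return self._key is None

    def __contains__(self, key):
        if self.is_empty():
            return False
        elif key == self._key:
            return True
        elif key < self._key:
            return self._left.__contains__(key)
        else:
            return self._right.__contains__(key)

    def find(self, key):
        if self.is_empty():
            return False
        elif key == self._key:
            return self._value
        elif key < self._key:
            return self._left.find(key)
        else:
            return self._right.find(key)

def node(key, left=None, right=None):
    # Hypothetical helper: build nodes directly, for illustration only.
    t = BinarySearchTree(key, "value for " + str(key))
    if left is not None:
        t._left = left
    if right is not None:
        t._right = right
    return t

# The top of the first example tree: 48 with subtrees rooted at 28 and 56.
t = node(48, node(28, node(15), node(31)), node(56, node(54), node(93)))
print(31 in t)     # True  (path: 48 -> 28 -> 31)
print(50 in t)     # False (path: 48 -> 56 -> 54 -> empty)
print(t.find(54))  # value for 54
```

Defining __contains__ is what lets us use Python's built-in `in` operator on our trees.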
    

In theory, this is a huge improvement: we only ever go down exactly one branch of the tree. With a list or an ordinary tree, we have to check every element, since we don't know anything about how the data is organized. But here, our property allows us to cut our search space by up to half at each step, under ideal conditions.

This sounds very impressive until you realize that I've hedged my words a bit. Why only up to half the search space? What does "ideal conditions" mean?

Suppose we insert our items from our running example into the tree in order from least to greatest: 15, 21, 28, 31, 48, 54, 56, 92, 93, 94. We get the following tree.

 15
  ├──EMPTY
  └──21
     ├──EMPTY
     └──28
        ├──EMPTY
        └──31
           ├──EMPTY
           └──48
              ├──EMPTY
              └──54
                 ├──EMPTY
                 └──56
                    ├──EMPTY
                    └──92
                       ├──EMPTY
                       └──93
                          ├──EMPTY
                          └──94
                             ├──EMPTY
                             └──EMPTY

Everything is on one side! Even worse, this is basically a list: there's exactly one path to travel down for the entire tree. So this doesn't really save us any time at all. This suggests that an additional property we might want for our tree is that it should be balanced. By balanced, we mean that each side of the tree has roughly the same number of elements.

One strategy we can take is to randomize the items before insertion. However, this only works when we first populate our tree. Unless we can guarantee that items will continue to be evenly distributed, there's no way to guarantee that our tree remains balanced. And with our current definition, there is no way to guarantee this.
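
We can see the difference this makes in a sketch, using a condensed version of the class plus hypothetical insert and height helpers (height here is the number of nodes on the longest root-to-leaf path):

```python
import random

class BinarySearchTree:
    # Condensed version of the class, with a hypothetical insert method.
    def __init__(self, key=None, value=None):
        self._key = key
        self._value = value
        self._left = BinarySearchTree() if key is not None else None
        self._right = BinarySearchTree() if key is not None else None

    def is_empty(self):
        return self._key is None

    def insert(self, key, value=None):
        if self.is_empty():
            self._key, self._value = key, value
            self._left = BinarySearchTree()
            self._right = BinarySearchTree()
        elif key < self._key:
            self._left.insert(key, value)
        elif key > self._key:
            self._right.insert(key, value)

def height(t):
    # Hypothetical helper: nodes on the longest root-to-leaf path.
    if t.is_empty():
        return 0
    return 1 + max(height(t._left), height(t._right))

keys = [15, 21, 28, 31, 48, 54, 56, 92, 93, 94]

sorted_tree = BinarySearchTree()
for k in keys:                 # sorted order: degenerate, list-like tree
    sorted_tree.insert(k)

random.shuffle(keys)
shuffled_tree = BinarySearchTree()
for k in keys:                 # random order: usually much shallower
    shuffled_tree.insert(k)

print(height(sorted_tree))     # 10 -- one long path
print(height(shuffled_tree))   # typically much smaller
```

The shuffled tree is usually much shallower, but nothing prevents later insertions from piling up on one side again.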

To guarantee balance, we would need additional restrictions on how values are stored in the tree. There is a class of data structures called self-balancing binary search trees, such as AVL trees and red-black trees. Such trees maintain additional properties that place provable guarantees on the number of elements on each side (or on how different the two sides can be).