CMSC 14100 — Lecture 22

2009 Stimulus Package Data

In February 2009, Congress passed the American Recovery and Reinvestment Act (aka the Stimulus), which provided funds for many different purposes, including infrastructure projects. The data is provided in a CSV file with the fields

  1. City
  2. State
  3. Project description
  4. Jobs created by project
  5. Project cost
  6. Project program

The stimulus data can be considered as a tree: states are internal nodes, and the individual projects are the leaves beneath them. We can take this opportunity to point out other parts of a tree.

One problem with this is that we only have cost values for the individual projects, since that's the data contained in the file. This means that our tree only has dollar amounts at the leaf nodes. What we might want to do is aggregate these values so that, for example, the IL node contains the sum of all the project nodes under it.

This involves two things:

  1. computing a value based on the subtrees, and
  2. mutating the tree so we store the values that we get.

Let's think about how we might do this. We have a tree, so let's break it up into the base and recursive cases.

This gives us the following function (notice that it's a function, not a method; sometimes we want to compute things about trees from outside the class).


def aggregate_stimulus_values(tree):
    """
    Given a tree, aggregate the values of the subtrees and store the
    total at each internal node.

    Input:
        tree (Tree): tree containing integers as values

    Output: sum of the values in the subtrees of tree (int)
    """
    if tree.children == []:
        return tree.value
    else:
        total_cost = 0
        for subtree in tree.children:
            total_cost += aggregate_stimulus_values(subtree)
        tree.value = total_cost
        return total_cost
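
To see the mutation in action, here's a sketch that pairs the function with a minimal, hypothetical Tree class (the real class from earlier lectures may differ in its details):

```python
# A minimal Tree class, assumed for illustration; the class from earlier
# lectures may differ in its details.
class Tree:
    def __init__(self, value, children=None):
        self.value = value
        self.children = children if children is not None else []

def aggregate_stimulus_values(tree):
    """Aggregate leaf values, storing the total at each internal node."""
    if tree.children == []:
        return tree.value
    else:
        total_cost = 0
        for subtree in tree.children:
            total_cost += aggregate_stimulus_values(subtree)
        tree.value = total_cost
        return total_cost

# A tiny state subtree: two IL projects costing 3 and 4 (the dollar
# amounts are made up). The IL node starts with a placeholder value of 0.
il = Tree(0, [Tree(3), Tree(4)])
print(aggregate_stimulus_values(il))  # 7
print(il.value)                       # 7 -- the IL node was updated
```

Note that the function both returns the total and mutates the node, which is exactly the two-part plan described above.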
    

Binary search trees

Trees are useful not just because they can represent hierarchical data, but because, like lists, they are a natural starting point for many other data structures. One such data structure is a binary search tree.

The idea behind a binary search tree is that it stores a set of data that has an ordering, in a way that lets items be retrieved quickly. BSTs have two important properties that make them different from ordinary trees.

  1. BSTs are binary trees. That is, every non-empty tree has exactly two children. These are called the left child and the right child. It is important that they are ordered and that they both always exist. (Recall that a tree can be empty!)
  2. BSTs have the binary search property:
    Given a non-empty tree with root key $k$, all the nodes in the left subtree must have keys that are strictly less than $k$, and all the nodes in the right subtree must have keys that are strictly greater than $k$.

Here's an example. (Here, I've chosen to render only the keys, since they're what is relevant)

 48
  ├──28
  │  ├──15
  │  │  ├──EMPTY
  │  │  └──21
  │  │     ├──EMPTY
  │  │     └──EMPTY
  │  └──31
  │     ├──EMPTY
  │     └──EMPTY
  └──56
     ├──54
     │  ├──EMPTY
     │  └──EMPTY
     └──93
        ├──92
        │  ├──EMPTY
        │  └──EMPTY
        └──94
           ├──EMPTY
           └──EMPTY

In this rendering, each node has two subtrees: the "upper" one is the left subtree and the "lower" one is the right subtree.

So our BST class will look a bit different from the ordinary Tree class.


class BinarySearchTree:
    """
    A class representing a binary search tree. A binary search tree is 
    either
    - empty, or
    - a node containing a key and value with a left subtree and a right 
      subtree, such that the binary search tree condition is satisfied.

    Attributes:
    - _key: a key, or None if empty
    - _value: a value, which must be None if key is None
    - _left: BinarySearchTree, or None if empty
    - _right: BinarySearchTree, or None if empty
    """

    def __init__(self, key=None, value=None):
        """
        Creates a new BST with root value. If the given value is None,
        then this creates an empty BST.
        """
        assert key is not None or value is None
        if key is None:
            self._key = None
            self._value = None
            self._left = None
            self._right = None
        else:
            self._key = key
            self._value = value
            self._left = BinarySearchTree()
            self._right = BinarySearchTree()

    def is_empty(self):
        """
        Returns True if the tree is empty and False otherwise.
        """
        return self._key is None
    

Not only is the structure of the tree constrained, but since we have a particular property to preserve, we need to control access to it. We've seen this idea before: stacks and queues can be based on lists but have their access controlled. Similarly, BSTs are based on trees and have their access controlled. There are many other data structures (quadtrees, binary heaps, etc.) that are based on restrictions on trees.

So instead of using add_child, we need an insert operation specifically for BSTs. There is an algorithm for this, and we find that as we insert items into the tree, there is exactly one place for each item to go. This will be the subject of discussion next week. But one observation we can make now is that the exact tree you get depends on the order in which you insert the items.
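
As a preview, here is one possible sketch of insertion, using a condensed version of the class above; the insert method itself is hypothetical and not yet part of our class:

```python
class BinarySearchTree:
    # Condensed version of the class above, for illustration.
    def __init__(self, key=None, value=None):
        self._key = key
        self._value = value
        self._left = BinarySearchTree() if key is not None else None
        self._right = BinarySearchTree() if key is not None else None

    def is_empty(self):
        return self._key is None

    def insert(self, key, value=None):
        """Sketch: walk down, comparing keys, until we find the one
        empty spot where the key belongs."""
        if self.is_empty():
            self._key = key
            self._value = value
            self._left = BinarySearchTree()
            self._right = BinarySearchTree()
        elif key < self._key:
            self._left.insert(key, value)
        elif key > self._key:
            self._right.insert(key, value)
        # if key == self._key, the key is already present; do nothing

# Inserting in this order reproduces the first example tree.
t = BinarySearchTree()
for k in [48, 28, 56, 15, 31, 54, 93, 21, 92, 94]:
    t.insert(k)
print(t._key, t._left._key, t._right._key)  # 48 28 56
```

Each comparison sends the new key down exactly one branch, which is why there is exactly one place for each item to go.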

For example, here is another possible BST using the same integers as above. In the above example, we inserted items into the tree in the following order: 48, 28, 56, 15, 31, 54, 93, 21, 92, 94. In the following example, we insert items in the following order: 54, 21, 92, 15, 28, 56, 93, 31, 94, 48.

 54
  ├──21
  │  ├──15
  │  │  ├──EMPTY
  │  │  └──EMPTY
  │  └──28
  │     ├──EMPTY
  │     └──31
  │        ├──EMPTY
  │        └──48
  │           ├──EMPTY
  │           └──EMPTY
  └──92
     ├──56
     │  ├──EMPTY
     │  └──EMPTY
     └──93
        ├──EMPTY
        └──94
           ├──EMPTY
           └──EMPTY

Instead, let's talk about how BSTs improve lookup by implementing __contains__. Recall that for Tree, we needed to check every child of a node when searching for a key. We claim that a BST improves on this.

This gives us the following algorithm.


    def __contains__(self, key):
        """
        Determine whether the given key is present in the tree.

        Input:
            key (int): the key to be found

        Output: True if key is in the tree, False otherwise (bool)
        """
        if self.is_empty():
            return False
        else:
            if key == self._key:
                return True
            elif key < self._key:
                return self._left.__contains__(key)
            else:
                return self._right.__contains__(key)
    

We can even adapt this so that it retrieves the associated value instead of simply determining whether the key is in the tree.


    def find(self, key):
        """
        Retrieve the value stored at the node with the given key, if it
        exists; return False otherwise.

        Input:
            key (int): the key to be found

        Output: the value associated with key if key is in the tree, 
            False otherwise (Any | False)
        """
        if self.is_empty():
            return False
        else:
            if key == self._key:
                return self._value
            elif key < self._key:
                return self._left.find(key)
            else:
                return self._right.find(key)
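
Here is a quick check of both methods, using a condensed copy of the class and a hypothetical node helper that builds the top of the example tree directly (since insert is deferred to next week):

```python
class BinarySearchTree:
    # Condensed version of the class, with the two lookup methods above.
    def __init__(self, key=None, value=None):
        self._key = key
        self._value = value
        self._left = BinarySearchTree() if key is not None else None
        self._right = BinarySearchTree() if key is not None else None

    def is_empty(self):
        return self._key is None

    def __contains__(self, key):
        if self.is_empty():
            return False
        elif key == self._key:
            return True
        elif key < self._key:
            return self._left.__contains__(key)
        else:
            return self._right.__contains__(key)

    def find(self, key):
        if self.is_empty():
            return False
        elif key == self._key:
            return self._value
        elif key < self._key:
            return self._left.find(key)
        else:
            return self._right.find(key)

def node(key, left=None, right=None):
    # Hypothetical helper: build nodes directly, for illustration only.
    t = BinarySearchTree(key, "value for " + str(key))
    if left is not None:
        t._left = left
    if right is not None:
        t._right = right
    return t

# The top of the first example tree: 48 with subtrees rooted at 28 and 56.
t = node(48, node(28, node(15), node(31)), node(56, node(54), node(93)))
print(31 in t)     # True  (path: 48 -> 28 -> 31)
print(50 in t)     # False (path: 48 -> 56 -> 54 -> empty)
print(t.find(54))  # value for 54
```

Defining __contains__ is what lets us use Python's built-in `in` operator on our trees.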
    

In theory, this is a huge improvement: we only ever go down exactly one branch of the tree. With a list or an ordinary tree, we have to check every element, since we don't know anything about how the data is organized. But here, our property allows us to cut our search space by up to half at each step, under ideal conditions.

This sounds very impressive until you realize that I've hedged my words a bit. Why only up to half the search space? What does "ideal conditions" mean?

Suppose we insert our items from our running example into the tree in order from least to greatest: 15, 21, 28, 31, 48, 54, 56, 92, 93, 94. We get the following tree.

 15
  ├──EMPTY
  └──21
     ├──EMPTY
     └──28
        ├──EMPTY
        └──31
           ├──EMPTY
           └──48
              ├──EMPTY
              └──54
                 ├──EMPTY
                 └──56
                    ├──EMPTY
                    └──92
                       ├──EMPTY
                       └──93
                          ├──EMPTY
                          └──94
                             ├──EMPTY
                             └──EMPTY

Everything is on one side! Even worse, this is basically a list: there's exactly one path to travel down for the entire tree. So this doesn't really save us any time at all. This suggests that an additional property we might want for our tree is that it should be balanced. By balanced, we mean that each side of the tree has roughly the same number of elements.

One strategy we can take is to randomize the items before insertion. However, this only works when we first populate our tree. Unless we can guarantee that items will continue to be evenly distributed, there's no way to guarantee that our tree remains balanced. And with our current definition, there is no way to guarantee this.
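
We can see the difference this makes in a sketch, using a condensed version of the class plus hypothetical insert and height helpers (height here is the number of nodes on the longest root-to-leaf path):

```python
import random

class BinarySearchTree:
    # Condensed version of the class, with a hypothetical insert method.
    def __init__(self, key=None, value=None):
        self._key = key
        self._value = value
        self._left = BinarySearchTree() if key is not None else None
        self._right = BinarySearchTree() if key is not None else None

    def is_empty(self):
        return self._key is None

    def insert(self, key, value=None):
        if self.is_empty():
            self._key, self._value = key, value
            self._left = BinarySearchTree()
            self._right = BinarySearchTree()
        elif key < self._key:
            self._left.insert(key, value)
        elif key > self._key:
            self._right.insert(key, value)

def height(t):
    # Hypothetical helper: nodes on the longest root-to-leaf path.
    if t.is_empty():
        return 0
    return 1 + max(height(t._left), height(t._right))

keys = [15, 21, 28, 31, 48, 54, 56, 92, 93, 94]

sorted_tree = BinarySearchTree()
for k in keys:                 # sorted order: degenerate, list-like tree
    sorted_tree.insert(k)

random.shuffle(keys)
shuffled_tree = BinarySearchTree()
for k in keys:                 # random order: usually much shallower
    shuffled_tree.insert(k)

print(height(sorted_tree))     # 10 -- one long path
print(height(shuffled_tree))   # typically much smaller
```

The shuffled tree is usually much shallower, but nothing prevents later insertions from piling up on one side again.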

To guarantee balance, we would need additional restrictions on how values are stored in the tree. There is a class of data structures called self-balancing binary search trees, such as AVL trees and red-black trees. Such trees maintain additional properties that place provable guarantees on the number of elements on each side (or on how different the two sides can be).