CMSC 14100 — Lecture 14

Complexity

So far, we have had only a vague notion of what it means for our programs to be "efficient". It's time to make that notion a bit more concrete.

What is a good way to think about efficiency? This was a big question in the 50s and 60s. You may recall that I mentioned at the very beginning of the quarter that of the many things under the umbrella of computer science, it's the notion of complexity that is the main contribution of computer science to solving problems.

Computer science started with the idea that some problems are logically impossible to solve, due to the work of Alan Turing and Alonzo Church, among others, in the 1930s. However, after the Second World War, computers began to proliferate. It was at this point, as more and more computations were being carried out by computers, that computer scientists started thinking about frameworks for understanding the efficiency of algorithms.

The obvious start to this is to just use time and empirical measurements. IPython offers some handy timing tools.


    import random
    %time random.sample(range(1000), 50)
    %timeit random.sample(range(1000), 50)
    

However, we can quickly see that there are problems. For instance, how do we account for variability between machines? Or even the same machine in slightly different conditions? How do we judge how algorithms behave on particular inputs?
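Outside of IPython, the standard library's timeit module gives the same kind of measurement. A minimal sketch (the call being timed is just the example from above):

```python
import random
import timeit

# Time the same call several times: the measurements will differ from
# run to run (and from machine to machine), which is exactly the
# variability problem described above.
for _ in range(3):
    elapsed = timeit.timeit(
        lambda: random.sample(range(1000), 50), number=1000
    )
    print(f"{elapsed:.4f} seconds")
```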

Instead of measuring solely based on the literal time, we consider the number of steps or operations that are executed. This eliminates a big factor: how fast a particular computer is. However, this still raises a lot of other questions: what exactly is a "step" or "operation"? Unfortunately, this is a bit hard to pin down.

For instance, we can consider things like assignments, list accesses, evaluation of boolean and arithmetic expressions, and so on to be "a step". Of course, it's hard to draw the line and we won't try to do that. But it's important to start to build a sense for what may or may not be reasonable. It's also important to be aware that not every built-in operation has the same cost.

However, this framework is still enough to give us a reasonable, if rough, idea of how many steps our algorithm takes: simply count them for some execution of the algorithm. Here, we run into the next problem: which execution do we care about? Some inputs are obviously "easier" than others, so we want to consider the worst case. In other words, what is the worst our algorithm will do? What is the greatest number of steps it will take?

Broadly speaking, we want to know how many steps our function takes in the worst case based on the size of our input. For example, if we have a list, then we should expect our algorithm to take more steps the longer the list is. We want to know exactly what that relationship is.

Consider an input of size $n$. How many steps does it take? Presumably, if we counted very carefully, we would get a sequence: for $n = 1$, it takes $t_1$ steps, for $n = 2$, it takes $t_2$ steps, and so on. This sequence would describe a function $T(n) = t_n$, which is a function $T: \mathbb N \to \mathbb N$. The question we want to answer is: What is the relationship between the size of the input and the number of steps our algorithm takes—how fast does this function $T(n)$ grow with respect to $n$?

Let's look at a few examples. First, summing up the squares of a list of numbers.


    # lst is assumed to be a list of numbers defined earlier
    total = 0
    for item in lst:
        total = total + item ** 2
    print(total)
    

Let's count.

If we count everything up for a list of size $n$, we get: one step for the initial assignment, four steps per iteration (advancing the loop variable, squaring, adding, and assigning), and one step for the print, for a total of $4n + 2$ steps.
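We can sanity-check this count with an instrumented version of the loop (a sketch; the function name and the per-operation accounting are our own convention):

```python
def sum_squares_steps(lst):
    """Count 'steps' for the sum-of-squares loop, charging one step for
    each assignment, arithmetic operation, and loop-variable advance."""
    steps = 1              # the initial assignment
    total = 0
    for item in lst:
        steps += 4         # advance loop variable, square, add, assign
        total = total + item ** 2
    steps += 1             # the print at the end
    return steps

print(sum_squares_steps(list(range(10))))  # 4 * 10 + 2 = 42
```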

Now, we can consider a similar piece of code that does the same thing.


    new_lst = []
    for item in lst:
        new_item = item ** 2
        new_lst.append(new_item)

    total = 0
    for item in new_lst:
        total = total + item
    print(total)
    

Again, we can count.

If we add everything up, the first loop costs four steps per iteration plus one step to create the empty list, and the second loop costs three steps per iteration plus two steps for initializing the accumulator and printing: $7n + 3$ steps in total. Is this bad? Maybe, maybe not. We'll see that in the grand scheme of things the increase is only a constant factor and could be considered marginal.
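To see why the difference is marginal, compare the two step counts as $n$ grows; their ratio settles near the constant $7/4$, so the second version is only a constant factor slower:

```python
# Both step counts grow linearly; their ratio approaches 7/4 = 1.75.
for n in [10, 1_000, 100_000]:
    t1 = 4 * n + 2
    t2 = 7 * n + 3
    print(n, round(t2 / t1, 3))
```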

But first, let's take a look at a more complicated example. We will consider the following code, which finds all unordered pairs of elements from a list.


    pairs = []
    for i, val_i in enumerate(lst):
        for val_j in lst[i+1:]:
            pairs.append((val_i, val_j))
    

Let's count.

Adding up the steps for this is a bit trickier because each iteration of the outer loop performs a different number of steps. If we add them all up, we get the following expression: \[2 + 5(n-1) + 2 + 5(n-2) + \cdots + 2 + 5 \times 1 + 2 + 5 \times 0.\] If we group up terms, we get the following, \[2n + 5 \cdot \sum_{i=1}^n (n-i).\] Doing some simplification gets us \[2n + 5 \cdot \sum_{i=1}^{n-1} i = 2n + 5 \cdot \frac{n(n-1)}2 = \frac 5 2 n^2 - \frac 1 2 n.\]
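We can check both the number of pairs produced and the step formula numerically (a sketch: `all_pairs` and `pairs_steps` are our own helper names, wrapping the loop above and the accounting from the derivation):

```python
def all_pairs(lst):
    # The pair-generating loop from above, wrapped in a function.
    pairs = []
    for i, val_i in enumerate(lst):
        for val_j in lst[i + 1:]:
            pairs.append((val_i, val_j))
    return pairs

def pairs_steps(n):
    # Mirror the accounting in the text: 2 steps of overhead for outer
    # iteration i, plus 5 steps for each of its n - 1 - i inner iterations.
    return sum(2 + 5 * (n - 1 - i) for i in range(n))

for n in [1, 5, 100]:
    # A list of n elements has n(n-1)/2 unordered pairs.
    assert len(all_pairs(list(range(n)))) == n * (n - 1) // 2
    # The step count matches (5/2)n^2 - (1/2)n.
    assert pairs_steps(n) == (5 * n * n - n) // 2
print("formulas check out")
```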

But we don't actually want the exact function, because it doesn't really tell us anything useful. Instead, we turn to an idea from analysis: talk about the asymptotic growth of the function. In other words, we consider an estimate of how fast our function grows, because what we really care about is what happens when $n$ is very large and the specifics of what it does when it's small don't really matter.

The tool for discussing asymptotic complexity is Big O notation, introduced in the late 19th century by Paul Bachmann, popularized in mathematics by Edmund Landau, and borrowed by Knuth for algorithm analysis in the 1970s.

There is a formal definition for Big O which we will put aside. The big idea is to think about ballpark estimates: think about the terms we use to describe large quantities. We always say "hundreds" or "tens of thousands" rather than give an exact number.

The rule with big O is to talk about the dominating term of your function. For a polynomial, this is the term of highest degree. Some other general rules: exponentials always grow faster than polynomials, which always grow faster than logarithms.
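A quick numeric comparison makes these rules concrete; even at modest sizes, the exponential dwarfs the polynomial, which dwarfs the logarithm:

```python
import math

# Compare log2(n), n^2, and 2^n at a few sizes.
for n in [10, 20, 40]:
    print(n, round(math.log2(n), 1), n ** 2, 2 ** n)
```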

This means that $4n + 2$ and $7n + 3$ are both $O(n)$: they differ only by constant factors, which big O ignores. The pair-finding code, on the other hand, takes $\frac 5 2 n^2 - \frac 1 2 n$ steps, which is $O(n^2)$, a genuinely faster-growing function.

It's important to take care when using big O notation. Part of its advantage is that it abstracts certain specifics away, but it's also very easy to be misled by it. In particular, one easy point to trip up on is this idea of the "size" of the input. If you're not careful, you can arrive at incorrect conclusions. For example, a common mistake is to think of an integer $n$ as having size $n$. In fact, the size of an integer is the number of digits (or bits) needed to write it down, which is about $\log n$. Thinking of the size as $n$ would make our primality test from earlier a linear time algorithm, but measured against the true input size it is actually exponential!
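To make the point concrete, here is a trial-division primality test in the spirit of the one from earlier in the course (a reconstruction; the exact code we wrote may differ). The loop runs up to about $n$ times, but the input occupies only about $\log_2 n$ bits, so relative to the input's size the running time is exponential:

```python
def is_prime(n):
    """Trial division: check every candidate divisor from 2 to n - 1.
    About n iterations in the worst case, but n is roughly 2^(number
    of bits), so this is exponential in the size of the input."""
    if n < 2:
        return False
    for d in range(2, n):
        if n % d == 0:
            return False
    return True

print(is_prime(101), is_prime(100))  # True False
```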

Counting $k$-mers

Let's work through a more substantial problem. In computational biology, we can view a sequence of DNA as a string made up of four symbols: A, C, G, and T. A common operation in computational biology is finding the number of $k$-mers in a DNA sequence. $k$-mers are substrings of length $k$.

Here's a first attempt at the problem.

  1. Generate all $k$-mers.
  2. For each $k$-mer, count how many times it occurs in the given sequence.
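A direct translation of this plan into Python might look like the following sketch (the function name and output format are our own choices; we record only the $k$-mers that actually occur):

```python
from itertools import product

def count_kmers_naive(seq, k):
    """Step 1: generate all 4^k possible k-mers.
    Step 2: for each one, scan the whole sequence and count occurrences."""
    counts = {}
    for tup in product("ACGT", repeat=k):      # 4^k candidates
        kmer = "".join(tup)
        occurrences = 0
        for i in range(len(seq) - k + 1):      # ~n window positions
            if seq[i:i + k] == kmer:           # ~k character comparisons
                occurrences += 1
        if occurrences > 0:
            counts[kmer] = occurrences
    return counts

print(count_kmers_naive("ACGTAC", 2))  # {'AC': 2, 'CG': 1, 'GT': 1, 'TA': 1}
```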

Let's think about how long this takes. First, we generate all $k$-mers; there are $4^k$ of these. To count one particular $k$-mer, we compare a length-$k$ substring against each position of a string of length $n$, which takes about $n \times k$ steps. So in total, we do about $4^k \times n \times k$ steps. This is not very good as $k$ grows (exponentials are very bad).