CMSC 14100 — Lecture 15

Here's a function for generating all possible $k$-mers of a given length $k$.


    def generate_kmers(k):
        """
        Produce a list of all strings of length k over the DNA alphabet.

        Input:
            k (int): length of mers

        Output: (list[str]) list of k-mers
        """
        bases = ['A', 'C', 'G', 'T']

        last = bases
        current = []
        for i in range(k-1):
            for b in bases:
                for l in last:
                    current.append(l + b)
            last = current
            current = []
        return last

Here's a function for counting a particular $k$-mer in a sequence.


    def count_mer(mer, seq):
        """
        Given a particular k-mer, count how many there are in the given 
        sequence.

        Input:
            mer (str): the k-mer to be counted
            seq (str): a DNA sequence

        Output: number of mers in seq (int)
        """
        k = len(mer)
        count = 0
        for i in range(0, len(seq) - k + 1):
            if mer == seq[i:i+k]:
                count = count + 1
        return count

Then we do this for each $k$-mer.


    def kmer_count(k, seq):
        """
        Count number of k-mers in the given sequence, for each k-mer.

        Input:
            k (int): length
            seq (str): a DNA sequence

        Output: list of pairs of k-mers and their counts 
          (list[tuple[str,int]])
        """
        mers = generate_kmers(k)
        counts = []
        for m in mers:
            count = count_mer(m, seq)
            if count != 0:
                counts.append((m, count))
        return counts

But what if we use a dictionary?


    def kmer_count(k, seq):
        """
        Count number of $k$-mers in the given sequence, for each 
        $k$-mer.

        Input:
            k (int): length
            seq (str): a DNA sequence

        Output: dictionary of k-mers and their counts (dict[str,int])
        """
        counts = {}
        for i in range(len(seq) - k + 1):
            subseq = seq[i:i+k]
            num = counts.get(subseq, 0)
            counts[subseq] = num + 1
        return counts

Now, you might say okay, all we have to do is avoid precomputing the $4^k$ $k$-mers. But what would such a strategy involve?

Suppose we kept a list that we added $k$-mers to as we encountered them. Every time we wanted to add to our count, we would need to search the list for the appropriate $k$-mer and update the $k$-mer/count pair. But in the worst case, we do encounter all $4^k$ $k$-mers and every time we want to update a count, we have look through a list of $4^k$ items. Since there are roughly $n$ $k$-mers, this stil gives us $n \cdot k \cdot 4^k$.

But why can we avoid that search with dictionaries? It turns out this is a special property of dictionaries: key lookup is considered a single step (note that the mechanics of this are fairly complicated and you'll learn about this in detail in future courses). In other words, we don't need to search through the dictionary. Because we eliminate the searching step, the $4^k$ factor goes away.

When I was a postdoc, there was a graduate student who was working on a project in my lab that used $k$-mer counts to classify species. The innovation in this approach was that it was computationally less expensive to count and process $k$-mers than performing sequence alignment, which many other species classification tools used. But sequence alignment required quadratic time—very costly for large DNA sequences.

They reported that the best balance of accuracy and speed is with $k = 6$. This doesn't seem particularly long, but $4^6 = 4096$. If we ran our list implementation with a sequence of length one million, this means we're performing something on the order of 24 billion operations.

Abstract Data Types

At the beginning of the quarter, we learned that every value has a type. Since then, we've been adding to our repertoire of types. We began with simple data types, like integers, floats, booleans, and strings. Then, we realized that we often want to put pieces of data together and worked with compound data: lists, tuples, and dictionaries.

As we've seen, Python offers some very convenient data types. However, as we think more about more complex representations of data, we will begin to feel that these types are not quite the right representations. This will lead us to define our own types.

When we mentioned the idea of types at the beginning of the quarter, we said that types tell us not only what kind of data is represented but also what we can do with it. In other words, what are the operations and actions we can perform on this data?

It's this question of what we can do with the data that we'll consider first. We define an abstract data type (ADT) based on the things it can do, without necessarily thinking about the specifics of how they might be implemented.

The idea that we might be able to use a type without understanding how exactly it works might make you feel uneasy at first, but in fact, we've been doing this all along: we don't really know how strings or lists work, but we happily use them and the built-in functions that are applied to them.

All you, the programmer, needs to know about these types is:

which functions are available,
a description of what they do, and
their expected inputs and outputs and any conditions on the inputs.

You may notice that this happens to describe a function's header and docstring. This list of functions and their documentation describes an interface through which we can use the data. More formal interfaces, called application programming interfaces (API), are often provided by various software packages.

Stacks

A stack is a collection of values with controlled access: we only have access to the last item that was added to the stack. There's a pithy phrase for this: "last in, first out".

Broadly, there are two main operations we can perform on a stack that make it a stack:

push, which inserts an element into the stack, and
pop, which removes the last element that was inserted into the stack.

The name of the type, stack, is illustrative. We can imagine a stack as a physical stack of items. We can only add things to the stack by putting them on the very top, and we can only remove things from the stack by taking the top item off (otherwise, we end up in a dangerous Jenga-like situation).

Of course, we can define other operations: creating a stack, checking whether it's empty, and so on. But it's pushing and popping that distinguishes the stack from other similar data types.

A question one might have is why we might want to whip up a data type that seems to do less or is more constrained than something we already have. For instance, why not just use a list? Of course, we asked this same question about why we have both tuples and lists. The answer is similar: while stacks are more constrained than lists, these constraints allow them to make guarantees about the data that they represent.

For example, if we are given a list, all we know about it is that it is an ordered collection of items that can change. However, if we are given a stack, because the only way stacks can interact is via popping or pushing, we know and are guaranteed something about the order in which we receive items when we start retrieving them from the stack.

So if we sketch out the actions we want to perform on our stack, we will get something like the following.


    def create():
        """
        Creates an empty stack.

        Input: 
            None
        
        Output: empty stack (Stack)
        """

    def is_empty(st):
        """
        Determine whether the given stack is empty
        Input:
            st (Stack): the stack

        Output: True if stack is empty, False otherwise (bool)
        """

    def push(st, item):
        """
        Puts an item on the top of the stack.

        Input:
            st (Stack): the stack
            item (Any): an item

        Output: None
        """

    def pop(st):
        """
        Removes the item on the top of the stack and produces it.

        Input:
            st (Stack): the stack

        Output: the item on the top of the stack (Any)
        """

Notice that a specification of what these functions do is all we need to interact with a stack. As long as we adhere to the conditions specified, we are guaranteed that the stack will work as expected. We do not need to know how the stack is implemented.

However, things are different if we are to implement a stack. We need to choose a data structure to represent a stack and ensure that it obeys the conditions.

An obvious choice for an implementation is using lists. If we use a list to represent a stack, we may come up with an implementation like the following.


    def create():
        """
        Creates an empty stack.

        Input: 
            None
        
        Output: empty stack (list)
        """
        return []

    def is_empty(st):
        """
        Determine whether the given stack is empty
        Input:
            st (list): the stack

        Output: True if stack is empty, False otherwise (bool)
        """
        return st == []

    def push(st, item):
        """
        Puts an item on the top of the stack.

        Input:
            st (list): the stack
            item (Any): an item

        Output: None
        """
        st.append(item)

    def pop(st):
        """
        Removes the item on the top of the stack and produces it.

        Input:
            st (list): the stack

        Output: the item on the top of the stack (Any)
        """
        return st.pop()

We can then define a stack and use it with the given operations.


    import stack
    s = stack.create()
    stack.push(s, 23)
    stack.push(s, -9)
    stack.push(s, "wa")
    x = stack.pop(s)

There's an obvious issue here. The treatment of our stack and our use of the stack operations depends solely on our good behaviour. There is nothing to stop us from treating the stack like a list, because it really is just a list. In other words, there is no way for us to enforce how we access a stack in the same way that, say, attempting to mutate a tuple will immediately stop the show, so to speak.

This can be problematic for a number of reasons. For instance, at this point, we have chosen to use a list to implement our stack, but later on, we can choose some other way to implement a stack. Since we've detailed the expected actions and conditions in our interface, this shouldn't pose a problem—if you were a good programmer and followed the instructions. But if you treated the stack like a list, you may find that your program stops working if the implementation ever changes.

Soon, we will introduce mechanisms that allow us to hide this implementation information and enforce conditions for desired behaviour.

Queues

Stacks encode the idea of recency (last in, first out). There is also the opposite idea: first in, first out. This is another natural form of controlled access: waiting in line.

The ADT that represents this form of access is called a queue (the word that the British use for a line).

Like stacks, there are two operations in particular that distinguish a queue:

enqueue, which inserts an element into the queue, and
dequeue, which removes the first element that was inserted into the queue.

Just as we used a list to implement a stack, we can use a list to implement a queue. However, we didn't really interrogate how or why we chose the list operations we did for stacks because it just seemed natural to add and take things off of the end of the list. However, with a queue, we have to consider both ends of the list. So which end is which?

It turns out there is a difference in efficiency whether we add an item to the front of a list (list.insert) or at the back (list.append). Internally, when we add an item to the front of a list, we must shift all items in the list forward—this is an $O(n)$ cost. But when we add to the back, we don't need to do this, so it is an $O(1)$ cost.

As it turns out, this same difference holds for when removing items from a list. It is $O(n)$ to remove an item at the front of the list and $O(1)$ to do so from the back. This means that there really wasn't much of a difference between the two options after all. (Of course, we needed to reason our way here in the first place)


    def create():
        """
        Creates an empty queue.

        Input:
            None

        Output: empty queue (Queue)
        """
        return []

    def is_empty(qu):
        """
        Determines whether the given queue is empty.

        Input:
            qu (Queue): the queue

        Output: True if queue is empty, False otherwise (bool)
        """
        return qu == []

    def enqueue(qu, item):
        """
        Adds an item to the back of the queue.

        Input:
            qu (Queue): the queue
            item (Any): a value

        Output: None
        """
        qu.append(item)

    def dequeue(qu):
        """
        Removes the item at the front of the queue and produces it.

        Input:
            qu (Queue): the queue

        Output: item at the front of the queue (Any)
        """
        return qu.pop(0)