CMSC 14100 — Lecture 13

Like lists, we can put arbitrary values, like lists or even other dictionaries inside of dictionaries as values (not keys though). Here's an example: instead of just counting the word, record the location of the word, creating an index.


    def locate_all_words(words):
        """
        For each word in the list, record each position in which it occurs.

        Input:
            words (list[str]): list of words

        Output (dict[str,list[int]]): dictionary that maps a word to a list
            containing the positions at which the word occurs in the list
        """
        positions = {}
        for i, word in enumerate(words):
            if word not in positions:
                positions[word] = []
            positions[word].append(i)
        return positions


    >>> locate_all_words(['it', 'is', 'a', 'melancholy', 'object', 'to',
    ... 'those', 'who', 'walk', 'through', 'this', 'great', 'town', 'or',
    ... 'travel', 'in', 'the', 'country', 'when', 'they', 'see', 'the',
    ... 'streets', 'the', 'roads', 'and', 'cabbin-doors', 'crowded', 'with',
    ... 'beggars', 'of', 'the', 'female', 'sex', 'followed', 'by', 'three',
    ... 'four', 'or', 'six', 'children', 'all', 'in', 'rags', 'and',
    ... 'importuning', 'every', 'passenger', 'for', 'an', 'alms', 'these',
    ... 'mothers', 'instead', 'of', 'being', 'able', 'to', 'work', 'for',
    ... 'their', 'honest', 'livelihood', 'are', 'forced', 'to', 'employ',
    ... 'all', 'their', 'time', 'in', 'stroling', 'to', 'beg', 'sustenance',
    ... 'for', 'their', 'helpless', 'infants', 'who', 'as', 'they', 'grow',
    ... 'up', 'either', 'turn', 'thieves', 'for', 'want', 'of', 'work',
    ... 'or', 'leave', 'their', 'dear', 'native', 'country', 'to', 'fight',
    ... 'for', 'the', 'pretender', 'in', 'spain', 'or', 'sell',
    ... 'themselves', 'to', 'the', 'barbadoes']) 
    {'it': [0], 'is': [1], 'a': [2], 'melancholy': [3], 'object': [4],
     'to': [5, 57, 65, 72, 97, 107], 'those': [6], 'who': [7, 79],
     'walk': [8], 'through': [9], 'this': [10], 'great': [11], 'town': [12],
     'or': [13, 38, 91, 104], 'travel': [14], 'in': [15, 42, 70, 102],
     'the': [16, 21, 23, 31, 100, 108], 'country': [17, 96], 'when': [18],
     'they': [19, 81], 'see': [20], 'streets': [22], 'roads': [24],
     'and': [25, 44], 'cabbin-doors': [26], 'crowded': [27], 'with': [28],
     'beggars': [29], 'of': [30, 54, 89], 'female': [32], 'sex': [33],
     'followed': [34], 'by': [35], 'three': [36], 'four': [37], 'six': [39],
     'children': [40], 'all': [41, 67], 'rags': [43], 'importuning': [45],
     'every': [46], 'passenger': [47], 'for': [48, 59, 75, 87, 99],
     'an': [49], 'alms': [50], 'these': [51], 'mothers': [52],
     'instead': [53], 'being': [55], 'able': [56], 'work': [58, 90],
     'their': [60, 68, 76, 93], 'honest': [61], 'livelihood': [62],
     'are': [63], 'forced': [64], 'employ': [66], 'time': [69],
     'stroling': [71], 'beg': [73], 'sustenance': [74], 'helpless': [77],
     'infants': [78], 'as': [80], 'grow': [82], 'up': [83], 'either': [84],
     'turn': [85], 'thieves': [86], 'want': [88], 'leave': [92],
     'dear': [94], 'native': [95], 'fight': [98], 'pretender': [101],
     'spain': [103], 'sell': [105], 'themselves': [106], 'barbadoes': [109]}

Counting $k$-mers

Let's transform our word count problem into a slightly different domain. In computational biology, we can view a sequence of DNA as a string made up of four symbols: A, C, G, and T. A common operation in computational biology is finding the number of $k$-mers in a DNA sequence. $k$-mers are substrings of length $k$.

We can translate our word count pretty easily.


    def kmer_count(k, seq):
        """
        Count number of k-mers in the given sequence.

        Input:
            k (int): length of subsequence
            seq (str): a DNA sequence

        Output (list[tuple[str,int]]): list of (k-mer, count) pairs
        """
        counts = []
        for i in range(len(seq) - k + 1):
            subseq = seq[i:i+k]
            found = False
            for j, (kmer, count) in enumerate(counts):
                if kmer == subseq:
                    result[j] = (kmer, count + 1)
                    found = True
                    break
            if not found:
                counts.append((subseq, 1))
        return counts

Let's think about how long this takes. Let $n$ be the length of seq and $k$ be k.

Consider what happens when we consider a particular subsequence. In order to locate it in our list, we need to search through it. How long could this list be? Since we're dealing with DNA, we're dealing with all ways to create strings of length $k$ out of four letters. This is $4^k$.
Now, how many times do we do this? We do this once for each substring of length $k$. How many of those are there in our string? Well, our loop counts from 0 up to $n-k$, so this is pretty close to $n$.
Finally, there's one piece that we take for granted: how long does it take to check whether two strings are the same? This depends on how long our strings are: there can be a big difference between $k = 3$ and $k = 30$.

In total, we have something like approximately $4^k \times n \times k$ operations to do, or at least the number of operations is proportional to this number. Notice that we can't get away with some of this cost: we will always need to search through all subsequences, so we're stuck with the factor of $n$ and we will always need to do string comparison of our subsequence, so we will always have a factor of $k$. The big problem (which is noticeable if you run the code for even relatively small $k$, like 5 or 6), is the factor of $4^k$—exponential functions grow very, very quickly.

Now, let's consider the dictionary.


    def kmer_count(k, seq):
        """
        Count number of k-mers in the given sequence.

        Input:
            k (int): length of subsequence
            seq (str): a DNA sequence

        Output (dict[str,int]): dictionary mapping k-mers to counts
        """
        counts = {}
        for i in range(len(seq) - k + 1):
            subseq = seq[i:i+k]
            if subseq not in counts:
                counts[subseq] = 0
            counts[subseq] = counts[subseq] + 1
        return counts

Here, we seem to do less: we still need to consider each substring and we still need to do a comparison, but we avoid having to search through a list. So we can think of this as $n \times k \times c$, where $c$ is the cost of looking a key up in the dictionary.

What exactly is $c$? It's a bit hard to get into at this point. There is a lot of interesting mathematics that goes into this but the end result is that keys are organized and computed in such a way that it is much faster than relying on sequential ordering. The difference here is a constant $c$ against a function that's exponential in the size of the $k$-mers, $4^k$.

Why might we be interested in this? When I was a postdoc, there was a graduate student in my group who was working on a project that used $k$-mer counts to classify species. The innovation in this approach was that it was computationally less expensive to just count $k$-mers (not even trying to figure out their order) than trying to align two different sequences, which many other species classification tools used. Trying to align two large DNA sequences can be very costly.

Sets

The existence of dictionaries as an unordered collection leads to an interesting question: why don't we have unordered collections that aren't key-value structures?

Conceptually, this is the same thing as a set in mathematics. Sets are collections of values that are

distinct: an item is either in the set or it isn't, and there can't be multiples of an item in a set; and
unordered: the items are simply in the set or not, and they aren't in any particular order.

The answer is that there is such a data type: set. And just like in mathematics, we define sets using braces.


    >>> {3, 1, 5}
    {1, 3, 5}
    >>> {'a', 23}
    {23, 'a'}

And just as in mathematics, the basic things we want to do with sets are

test for equality


            >>> {3, 1, 5} == {4}
            False
            >>> {'a', 23} = {23, 'a'}
            True

and,

test for membership.


            >>> 3 in {3, 1, 5} 
            True
            >>> {23} in {'a', 23}
            False

The way we performed these operations is very similar to lists. However, because lists are ordered, they may contain repeats of elements and equality is only true if the elements of the lists are in the same order.

This notation can be a bit confusing because both dictionaries and sets are defined using braces. One way to think about sets is as a collection of keys without values—keys in dictionaries are also distinct and unordered. In fact, many people approach sets as 'value-less' dictionaries or dictionaries that only have keys. But because they share a bit of the same notation, the following is an important distinction to keep in mind:

{} denotes an empty dictionary. There is never a situation where this is used to denote an empty set.
```
        >>> type({})
        dict
        
```

set() is used to denote and define an empty set.


        >>> type(set())
        set

Python also has built in set operations, like the ones that you might remember from mathematics. Consider the two sets $A$ and $B$,


    >>> a = set('abracadabra')
    >>> b = set('alacazam')
    >>> a                                 
    {'a', 'r', 'b', 'c', 'd'}
    >>> b                                  
    {'a', 'c', 'l', 'm', 'z'}

Then we have the following set operations:

union, $A \cup B$:


        >>> a | b                              
        {'a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'}

intersection, $A \cap B$:


        >>> a & b                              
        {'a', 'c'}

set difference, $A \setminus B$:


        >>> a - b                              
        {'r', 'd', 'b'}