CMSC 14100 — Lecture 13

Like lists, we can put arbitrary values, like lists or even other dictionaries inside of dictionaries as values (not keys though). Here's an example: instead of just counting the word, record the location of the word, creating an index.


    def locate_all_words(words):
        """
        For each word in the list, record each position in which it occurs.

        Input:
            words (list[str]): list of words

        Output (dict[str,list[int]]): dictionary that maps a word to a list
            containing the positions at which the word occurs in the list
        """
        positions = {}
        for i, word in enumerate(words):
            if word not in positions:
                positions[word] = []
            positions[word].append(i)
        return positions
    

    >>> locate_all_words(['it', 'is', 'a', 'melancholy', 'object', 'to',
    ... 'those', 'who', 'walk', 'through', 'this', 'great', 'town', 'or',
    ... 'travel', 'in', 'the', 'country', 'when', 'they', 'see', 'the',
    ... 'streets', 'the', 'roads', 'and', 'cabbin-doors', 'crowded', 'with',
    ... 'beggars', 'of', 'the', 'female', 'sex', 'followed', 'by', 'three',
    ... 'four', 'or', 'six', 'children', 'all', 'in', 'rags', 'and',
    ... 'importuning', 'every', 'passenger', 'for', 'an', 'alms', 'these',
    ... 'mothers', 'instead', 'of', 'being', 'able', 'to', 'work', 'for',
    ... 'their', 'honest', 'livelihood', 'are', 'forced', 'to', 'employ',
    ... 'all', 'their', 'time', 'in', 'stroling', 'to', 'beg', 'sustenance',
    ... 'for', 'their', 'helpless', 'infants', 'who', 'as', 'they', 'grow',
    ... 'up', 'either', 'turn', 'thieves', 'for', 'want', 'of', 'work',
    ... 'or', 'leave', 'their', 'dear', 'native', 'country', 'to', 'fight',
    ... 'for', 'the', 'pretender', 'in', 'spain', 'or', 'sell',
    ... 'themselves', 'to', 'the', 'barbadoes']) 
    {'it': [0], 'is': [1], 'a': [2], 'melancholy': [3], 'object': [4],
     'to': [5, 57, 65, 72, 97, 107], 'those': [6], 'who': [7, 79],
     'walk': [8], 'through': [9], 'this': [10], 'great': [11], 'town': [12],
     'or': [13, 38, 91, 104], 'travel': [14], 'in': [15, 42, 70, 102],
     'the': [16, 21, 23, 31, 100, 108], 'country': [17, 96], 'when': [18],
     'they': [19, 81], 'see': [20], 'streets': [22], 'roads': [24],
     'and': [25, 44], 'cabbin-doors': [26], 'crowded': [27], 'with': [28],
     'beggars': [29], 'of': [30, 54, 89], 'female': [32], 'sex': [33],
     'followed': [34], 'by': [35], 'three': [36], 'four': [37], 'six': [39],
     'children': [40], 'all': [41, 67], 'rags': [43], 'importuning': [45],
     'every': [46], 'passenger': [47], 'for': [48, 59, 75, 87, 99],
     'an': [49], 'alms': [50], 'these': [51], 'mothers': [52],
     'instead': [53], 'being': [55], 'able': [56], 'work': [58, 90],
     'their': [60, 68, 76, 93], 'honest': [61], 'livelihood': [62],
     'are': [63], 'forced': [64], 'employ': [66], 'time': [69],
     'stroling': [71], 'beg': [73], 'sustenance': [74], 'helpless': [77],
     'infants': [78], 'as': [80], 'grow': [82], 'up': [83], 'either': [84],
     'turn': [85], 'thieves': [86], 'want': [88], 'leave': [92],
     'dear': [94], 'native': [95], 'fight': [98], 'pretender': [101],
     'spain': [103], 'sell': [105], 'themselves': [106], 'barbadoes': [109]}
    

Counting $k$-mers

Let's transform our word count problem into a slightly different domain. In computational biology, we can view a sequence of DNA as a string made up of four symbols: A, C, G, and T. A common operation in computational biology is finding the number of $k$-mers in a DNA sequence. $k$-mers are substrings of length $k$.

We can translate our word count pretty easily.


    def kmer_count(k, seq):
        """
        Count number of k-mers in the given sequence.

        Input:
            k (int): length of subsequence
            seq (str): a DNA sequence

        Output (list[tuple[str,int]]): list of (k-mer, count) pairs
        """
        counts = []
        for i in range(len(seq) - k + 1):
            subseq = seq[i:i+k]
            found = False
            for j, (kmer, count) in enumerate(counts):
                if kmer == subseq:
                    result[j] = (kmer, count + 1)
                    found = True
                    break
            if not found:
                counts.append((subseq, 1))
        return counts
    

Let's think about how long this takes. Let $n$ be the length of seq and $k$ be k.

In total, we have something like approximately $4^k \times n \times k$ operations to do, or at least the number of operations is proportional to this number. Notice that we can't get away with some of this cost: we will always need to search through all subsequences, so we're stuck with the factor of $n$ and we will always need to do string comparison of our subsequence, so we will always have a factor of $k$. The big problem (which is noticeable if you run the code for even relatively small $k$, like 5 or 6), is the factor of $4^k$—exponential functions grow very, very quickly.

Now, let's consider the dictionary.


    def kmer_count(k, seq):
        """
        Count number of k-mers in the given sequence.

        Input:
            k (int): length of subsequence
            seq (str): a DNA sequence

        Output (dict[str,int]): dictionary mapping k-mers to counts
        """
        counts = {}
        for i in range(len(seq) - k + 1):
            subseq = seq[i:i+k]
            if subseq not in counts:
                counts[subseq] = 0
            counts[subseq] = counts[subseq] + 1
        return counts
    

Here, we seem to do less: we still need to consider each substring and we still need to do a comparison, but we avoid having to search through a list. So we can think of this as $n \times k \times c$, where $c$ is the cost of looking a key up in the dictionary.

What exactly is $c$? It's a bit hard to get into at this point. There is a lot of interesting mathematics that goes into this but the end result is that keys are organized and computed in such a way that it is much faster than relying on sequential ordering. The difference here is a constant $c$ against a function that's exponential in the size of the $k$-mers, $4^k$.

Why might we be interested in this? When I was a postdoc, there was a graduate student in my group who was working on a project that used $k$-mer counts to classify species. The innovation in this approach was that it was computationally less expensive to just count $k$-mers (not even trying to figure out their order) than trying to align two different sequences, which many other species classification tools used. Trying to align two large DNA sequences can be very costly.

Sets

The existence of dictionaries as an unordered collection leads to an interesting question: why don't we have unordered collections that aren't key-value structures?

Conceptually, this is the same thing as a set in mathematics. Sets are collections of values that are

  1. distinct: an item is either in the set or it isn't, and there can't be multiples of an item in a set; and
  2. unordered: the items are simply in the set or not, and they aren't in any particular order.

The answer is that there is such a data type: set. And just like in mathematics, we define sets using braces.


    >>> {3, 1, 5}
    {1, 3, 5}
    >>> {'a', 23}
    {23, 'a'}
    

And just as in mathematics, the basic things we want to do with sets are

The way we performed these operations is very similar to lists. However, because lists are ordered, they may contain repeats of elements and equality is only true if the elements of the lists are in the same order.

This notation can be a bit confusing because both dictionaries and sets are defined using braces. One way to think about sets is as a collection of keys without values—keys in dictionaries are also distinct and unordered. In fact, many people approach sets as 'value-less' dictionaries or dictionaries that only have keys. But because they share a bit of the same notation, the following is an important distinction to keep in mind:

Python also has built in set operations, like the ones that you might remember from mathematics. Consider the two sets $A$ and $B$,


    >>> a = set('abracadabra')
    >>> b = set('alacazam')
    >>> a                                 
    {'a', 'r', 'b', 'c', 'd'}
    >>> b                                  
    {'a', 'c', 'l', 'm', 'z'}
    

Then we have the following set operations: