CMSC 14100 — Lecture 9

Tuples

What does enumerate produce? Like range, it appears at first to be special. So we put what it produces into a list and see what we get.


    >>> values = [2, 5, -7, "honesty"]
    >>> enumerate(values)
    <enumerate at 0x10b6a5490>
    >>> list(enumerate(values))
    [(0, 2), (1, 5), (2, -7), (3, 'honesty')]
    >>> t = list(enumerate(values))[2]
    >>> t
    (2, -7)
    >>> type(t)
    tuple
    

We see that enumerate creates tuples. Tuples are another data structure for representing a collection or sequence of values. Tuples are denoted by parentheses instead of square brackets.


    >>> t = (2, 5, "bread", 24.34)
    >>> t
    (2, 5, 'bread', 24.34)
    >>> type(t)
    tuple
    

Like lists, tuples can contain any number of items of any type. The main difference between tuples and lists is that tuples are immutable. That is, tuples cannot be changed in the same way lists can and we can treat them essentially like values. One consequence of this is that tuples can be thought of as having their sizes fixed once they are created.


    >>> t[2] = "egg"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'tuple' object does not support item assignment
    

We can make tuples of any size/dimension we like, including 0 or 1.


    >>> empty = ()
    >>> empty
    ()
    >>> type(empty)
    tuple
    >>> one_item = (3,)
    >>> one_item
    (3,)
    >>> type(one_item)
    tuple
    >>> another_item = (4)
    >>> another_item
    4
    >>> type(another_item)
    int
    

Notice the quirk in the syntax for defining a tuple of one item. We require the trailing comma because (3) would be interpreted as an expression (recall that parentheses are used to group expressions).
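To make the role of the comma concrete, here is a small sketch (the names a, b, c, and longer are only for illustration). It shows that the comma, not the parentheses, is what creates the tuple, and why this matters when we concatenate tuples.

```python
a = (3)    # parentheses only group: this is the int 3
b = (3,)   # the trailing comma makes a one-item tuple
c = 3,     # the comma alone is enough; the parentheses are optional here

print(type(a).__name__, type(b).__name__, type(c).__name__)  # int tuple tuple

# The distinction matters when concatenating: tuple + tuple works,
# but tuple + int raises a TypeError.
t = (2, 5)
longer = t + (9,)
print(longer)  # (2, 5, 9)

try:
    t + (9)
except TypeError as err:
    print("TypeError:", err)
```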

There is a mathematical interpretation for tuples: while lists are an unbounded structure and can be of arbitrary length, a tuple can be thought of as an element of a Cartesian product of sets. Such a product has a fixed size and cannot be unbounded.

For example, a tuple of two floats representing a point (x,y) is the same as an element $(x,y) \in \mathbb R \times \mathbb R$.

(Here's a fun question to ponder: If tuples are elements of a Cartesian product, what is the mathematical definition of a list? Does one even exist? (Hint: yes.))

The fact that tuples are of fixed size means that we should treat and use them differently than lists. Recall that lists are mutable and may be arbitrarily large. This means that they are appropriate for representing sequential data. Tuples are immutable and of fixed size. This means that they are useful for representing compound data, or data that is made up of a fixed number of components.
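One way to picture the distinction is a list of tuples: the list is the sequential part and may grow, while each tuple is one piece of compound data with a fixed shape. The sketch below uses made-up temperature readings; the names and numbers are purely illustrative.

```python
# Each reading is compound data: exactly two components, (city, temperature).
# The values here are invented for illustration.
readings = [("Chicago", 21.5), ("Toronto", 18.0)]

# The list is sequential data: it can grow as more readings arrive.
readings.append(("Montreal", 16.5))

# Find the warmest reading, unpacking each tuple into its two components.
warmest_city, warmest_temp = readings[0]
for city, temp in readings:
    if temp > warmest_temp:
        warmest_city, warmest_temp = city, temp

print(warmest_city, warmest_temp)  # Chicago 21.5
```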

Since we know the size of a tuple in advance, one of the most used features of tuples is the ability to pack and unpack them.


    >>> t = 72, "south"         # packing 
    >>> t
    (72, 'south')
    >>> count, direction = t    # unpacking
    >>> count
    72
    >>> direction
    'south'
    >>> x, y = -3.1, 9.3        # both at once
    >>> x
    -3.1
    >>> y
    9.3
    

In the first case above, we "pack" multiple items into a single tuple with one assignment. The result is the same whether or not you write the parentheses.

The second case is more interesting: we take a tuple and perform the corresponding assignments to names on the left hand side. This requires knowing that our tuple is of the right size and holds the right values. But this is convenient because we don't have to do any fishing via indices.
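To see what "the right size" means in practice, here is a minimal sketch of unpacking with a mismatched number of names. The extra name speed is hypothetical, and the error is caught only so that the script keeps running.

```python
t = (72, "south")

count, direction = t   # sizes match: two names, two components

message = ""
try:
    count, direction, speed = t   # three names, but only two components
except ValueError as err:
    message = str(err)
    print("ValueError:", message)
```

Python refuses to guess which component is missing, so a size mismatch is reported as an error rather than silently assigning something.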

The third case above is a special case of doing both packing and unpacking at once. You can think of it as matching. One example of unpacking we saw already was with enumerate, when it seemed like we had two loop variables. Here's another example.


    >>> triples = [("a", 1, 5.5), ("b", 2, 6.7), ("d", 4, 7.8)]
    >>> for key, pos, val in triples:
    ...    print("key:", key, "pos:", pos, "val:", val)
    key: a pos: 1 val: 5.5
    key: b pos: 2 val: 6.7
    key: d pos: 4 val: 7.8
    

But what if we only want to iterate over the first two components of the tuple? Or maybe the outer two? The limitation that we must have the same number of components is getting in the way. Instead, we use _ as a placeholder that will ignore the corresponding component.


    >>> for key, _, val in triples:
    ...    print("key:", key, "val:", val)
    key: a val: 5.5
    key: b val: 6.7
    key: d val: 7.8
    

Note that there actually isn't anything special about _. Its use as a placeholder is just convention and you really shouldn't be naming anything _ anyway, but it can be treated like a normal variable.


    >>> for key, _, val in triples:
    ...    print("key:", key, "val:", val, "?", _)
    key: a val: 5.5 ? 1
    key: b val: 6.7 ? 2
    key: d val: 7.8 ? 4
    

A final note: a common temptation is to use lists for all compound data. However, it is worth thinking through whether this really makes sense. For example, if we want to represent vectors in $\mathbb R^3$, we know that all of our data will always have three components. We don't need many of the features of Python lists, such as the ability to grow or shrink, so a tuple is the more natural choice.
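As a sketch of what this might look like, here is one possible way to add two vectors in $\mathbb R^3$ represented as tuples. The function name add_vectors is hypothetical, not something defined elsewhere in these notes.

```python
def add_vectors(u, v):
    """Add two vectors in R^3, each given as a tuple of three numbers."""
    ux, uy, uz = u   # unpacking fails loudly if u does not have 3 components
    vx, vy, vz = v
    return (ux + vx, uy + vy, uz + vz)


p = (1.0, 2.0, 3.0)
q = (0.5, -2.0, 4.0)
print(add_vectors(p, q))  # (1.5, 0.0, 7.0)
```

Because the result is a new tuple, neither p nor q is modified, which matches how we think of vector addition mathematically.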

Strings

As we saw earlier, strings are the data type for storing text data. They can be viewed as sequences of single characters (though there is no "character" data type), which means that we can actually use many of the indexing facilities that we saw with lists for working with strings.


    >>> s = "banana"
    >>> s[1]
    'a'
    >>> s[-2]
    'n'
    >>> s[2:5]
    'nan'
    >>> s[::-1]
    'ananab'
    

However, unlike lists, strings are immutable. This means attempting to perform an assignment like the following will fail.


    >>> s[3] = 'z'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'str' object does not support item assignment
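Since a string cannot be modified in place, the usual approach is to build a new string from pieces of the old one. A minimal sketch, using slicing and concatenation (the name changed is only for illustration):

```python
s = "banana"

# Slices of a string are strings, and + concatenates strings,
# so this builds a new string with 'z' in position 3.
changed = s[:3] + "z" + s[4:]

print(changed)  # banzna
print(s)        # banana (the original is untouched)
```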
    

As with lists, there are an enormous number of string methods, which you can find in the textbook or in the official documentation. Let's highlight a few.

The split method will split the string into a list of strings.


    >>> "Don't put    more than one    space between   words".split()
    ["Don't", 'put', 'more', 'than', 'one', 'space', 'between', 'words']
    

By default, split breaks the string on whitespace, no matter how many spaces are between words. Note that this means that anything that's not whitespace, including punctuation, will be included in your "words".

However, an alternate string can be specified. This is particularly useful for comma-separated values. CSV files are a common plain text format for storing tabular data. In a CSV file, each row is a data point and the values for each column are separated by commas.

For example, here is a line from the dataset for the daily ridership on the CTA by bus route.


    "171,05/27/2022,W,1247"
    

The order of the values is the route number, the date, the day type (weekday, Saturday, or Sunday/Holiday), and number of rides. We can split this data up by calling split and passing a comma "," into it. This results in a list that we can then use as we like.


    >>> "171,05/27/2022,W,1247".split(",")
    ['171', '05/27/2022', 'W', '1247']
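Combining split with unpacking gives a compact way to name the fields of a line. This is only a sketch: it assumes every line has exactly four comma-separated values, and the variable names are our own choices. Unpacking works on the list returned by split in the same way it works on a tuple.

```python
line = "171,05/27/2022,W,1247"

# Unpack the four fields into names; this relies on every line
# having exactly four comma-separated values.
route, date, day_type, rides = line.split(",")

# split always produces strings, so numeric fields must be converted.
rides = int(rides)

print(route, date, day_type, rides)  # 171 05/27/2022 W 1247
print(type(route).__name__, type(rides).__name__)  # str int
```

Here the route is left as a string and only the ride count is converted, since that is the field we would want to do arithmetic with.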
    

The join method is roughly the inverse operation. Suppose we have a list of strings, like the one we just created. We would like to put them together into one string, but perhaps using a different string to separate the values. For example, if we wanted to put two hyphens in between, so that the string looks like "Item 1 -- Item 2", we would call


    >>> " -- ".join(['171', '05/27/2022', 'W', '1247'])
    '171 -- 05/27/2022 -- W -- 1247'
    

Notice that the method belongs to the delimiter and we pass in the list to be "joined". The delimiter can be any string, even the empty string.


    >>> "".join(['171', '05/27/2022', 'W', '1247'])
    '17105/27/2022W1247'
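One thing to watch for is that join only accepts strings; passing it a list containing numbers raises a TypeError. A small sketch of the usual workaround, with made-up values, is to convert each item with str first:

```python
values = [171, 1247, 3.5]

# join only accepts strings, so convert each item with str first.
as_strings = []
for v in values:
    as_strings.append(str(v))

row = ",".join(as_strings)
print(row)  # 171,1247,3.5

# Splitting on the same delimiter recovers the list of strings.
print(row.split(","))  # ['171', '1247', '3.5']
```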
    

f-strings

We've seen already that there are ways for us to transform other kinds of data into strings for the purpose of displaying them. This is such a common action that there are built-in mechanisms for formatting data into strings. These are called f-strings.


    >>> num = 24
    >>> f"There are {num} cats in the yard."
    'There are 24 cats in the yard.'
    >>> how_many = "ten thousand"
    >>> f"There are {how_many} cats in the yard."
    'There are ten thousand cats in the yard.'
    

These special strings must be prefixed by f. An especially useful feature of string formatting is the ability to control how values are displayed, for example how many decimal places a number shows or how it is padded.


    >>> price = 243.31597
    >>> f"These cost ${price:.2f} each."
    'These cost $243.32 each.'
    >>> place = 27
    >>> f"Your ticket number is {place:0>4d}."
    'Your ticket number is 0027.'
    

There are many more formatting options than the ones shown here. One convenient resource for this is pyformat.info.
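As one more illustration of format specifications, the sketch below lines up values in columns by giving each one a fixed width. The data is made up; the specifications use < for left alignment and > for right alignment.

```python
rows = [("corn", 3, 0.5), ("potato", 12, 1.25), ("yam", 100, 10.0)]

lines = []
for name, qty, price in rows:
    # <8    left-align the name in a field 8 characters wide
    # >4d   right-align an integer in 4 characters
    # >7.2f right-align a float in 7 characters, with 2 decimal places
    lines.append(f"{name:<8}{qty:>4d}{price:>7.2f}")

for line in lines:
    print(line)
```

Every line comes out the same length (19 characters here), so the columns line up when printed in a fixed-width font.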

"Characters" and text representation

As mentioned earlier, strings have the ability to be compared. There is the obvious comparison, equality, but strings have a more interesting property: they are ordered. This means that we can compare the order of the strings in the same way as numbers:


    >>> "corn" < "potato"
    True
    

This appears to work as we might expect, in alphabetical order. However, things don't take very long to get a bit trickier. The above comparison is obviously true, but what about this one?


    >>> "corn" < "Potato"
    False
    

This should prompt us to question how these comparisons are being made, which should then lead us to questions about how strings are really represented.

As we recall, everything in a computer is represented in some way by bits: 1s and 0s, and these bits form numbers. The same is true for text. Each character is really represented as a number. Python has a built-in function, ord, that lets us see this.


    >>> ord("c")
    99
    >>> ord("P")
    80
    

This explains the discrepancy we saw above. But wait! Assuming this is all very orderly, we have that the uppercase letters come before the lowercase letters, but this puts "A" at 65. How was this decided?

This assignment of numbers to characters is called an encoding. The most widespread of these is ASCII, the American Standard Code for Information Interchange. ASCII defines an encoding for characters using 7 bits, which works out to $2^7$ numbers, giving us 128 characters.
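We can poke at the ASCII layout with ord to see why "corn" < "Potato" came out False. The sketch below also shows one common workaround, comparing lowercased copies; that workaround is a suggestion of ours, and it changes the question being asked rather than the ordering itself.

```python
print(ord("A"), ord("Z"))  # 65 90
print(ord("a"), ord("z"))  # 97 122

# Each lowercase letter sits exactly 32 positions after its uppercase partner.
offset = ord("a") - ord("A")
print(offset)  # 32

# Strings are compared character by character using these numbers,
# so every uppercase letter sorts before every lowercase letter.
result_raw = "corn" < "Potato"
print(result_raw)  # False

# Comparing lowercased copies ignores the difference in case.
result_folded = "corn".lower() < "Potato".lower()
print(result_folded)  # True
```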

An interesting quirk of history is that not all of these characters are actually visible text. For example, there is a "new line" character that tells the computer to go to the next line: this is represented in Python strings as "\n".


    >>> '\n'
    '\n'
    >>> ord('\n')
    10
    >>> print('\n')
    

    >>>
    

Notice that Python faithfully produces the string representation of '\n' when we query it in the REPL, but if we print it, we get an actual new line.
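The same distinction can be seen in a script by using the built-in repr function, which produces the representation that the REPL shows. A small sketch:

```python
s = "one\ntwo"

print(len(s))    # 7: the newline counts as a single character
print(repr(s))   # 'one\ntwo'
print(s)         # prints "one" and "two" on separate lines
```

Even though we type two symbols, a backslash and an n, the string stores only one character for the new line.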

Another interesting character is the "bell" character: when printed, this character forces your computer to beep. This character is encoded as 7, but we can't type it. However, Python has another built-in function, chr, which produces the corresponding character given an integer. So we can try to "print" the bell character in the following way:


    >>> print(chr(7))
    

You should do this with your speakers on.

However, if you do a bit of quick math, you'll see a small issue: how do those who don't speak Western European languages encode the text of their languages?

For a long time, different language groups had different encodings. Western Europeans would need an extended Latin character encoding for characters with accents like é. Other languages like Japanese needed entirely different encodings. There was a time when web browsers would need to provide a menu setting for selecting encodings (you can see the vestiges of this when you see HTML files specify their encodings).

Nowadays, almost everyone uses one encoding: Unicode. Unicode is intended to be a single encoding for every language and collection of symbols. Obviously, every major writing system is supported, but there are significantly more writing systems out there than you may have imagined that have representation in Unicode. Currently, there are close to 150,000 characters defined in Unicode, though the standard has room for over 1,000,000 (1,114,112 code points in total), each of which can be stored in at most four 8-bit bytes.

Python strings are encoded in Unicode. This means that pretty much any language you can type in can have strings represented in Python.


    >>> "ᓄᓇᕗᑦ ᓴᙱᓂᕗᑦ"
    'ᓄᓇᕗᑦ ᓴᙱᓂᕗᑦ'
    

Each of these characters also has a representation as an integer. For example, ord("ア") is 12450 and ord("អ") is 6050. Notice that these are much larger than 128.
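To connect these integers back to bytes, the sketch below converts a few integers to characters with chr and then asks how many bytes each needs in UTF-8. UTF-8 is not named in these notes; it is one common way of storing Unicode text as bytes, and we use it here as an assumption for illustration.

```python
code_points = [65, 233, 12450, 129300]   # A, é, ア, 🤔

byte_counts = []
for code in code_points:
    ch = chr(code)                       # integer -> character
    n_bytes = len(ch.encode("utf-8"))    # bytes needed to store it in UTF-8
    byte_counts.append(n_bytes)
    print(code, n_bytes)

print(byte_counts)  # [1, 2, 3, 4]
```

The ASCII character still fits in a single byte, while larger numbers need two, three, or four bytes.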

This means that methods like upper and lower are actually much more complicated than you might have first thought. Since Python uses Unicode strings, these algorithms are based on the Unicode case algorithms and you can use them on any writing system that has a notion of case.


    >>> "aêβдꭳⰼ".upper()
    'AÊΒДᎣⰌ'
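One sign of this complexity is that changing case can change the length of a string. A small sketch using the German letter ß, whose uppercase form in Python is the two letters SS:

```python
s = "straße"

upper = s.upper()
print(upper)               # STRASSE
print(len(s), len(upper))  # 6 7

# Going back down does not recover the original string.
round_trip = upper.lower()
print(round_trip)          # strasse
print(round_trip == s)     # False
```

So upper and lower are not simply inverses of each other, which would be the case if they only shifted each character by a fixed offset as in ASCII.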
    

But Unicode doesn't just include writing systems. It includes all sorts of symbols that are used in typesetting. Among the most widely used of these today are emoji.


    >>> ord("🤔")
    129300