CMSC 14100 — Lecture 8

A note on types

Since we don't tell Python the type of a variable, Python must figure it out from the value the variable currently holds. Because the value of a variable can change over the course of a program's execution, the type of the assigned value can change too, and the type of the variable changes with it.

One can use the type function to determine the type of a value or variable.


    a = 17
    print(type(a))
    a = "good"
    print(type(a))

Values (and by extension variables) have a single type. But what happens when we need to combine two values of different types? For example, consider the following scenario, where we would like a string that is dependent on some numeric value.


    n = 40
    s = "I have " + n + " cakes."

We get an error here (a TypeError), but not because Python can't work out the type of s. The problem is that + is not defined between a string and an integer, and Python refuses to guess which conversion we meant. This can be frustrating because it seems very obvious to us that we would like to construct a string using the value n. However, from the perspective of the computer, it is not clear that this is the right guess to make. For example, maybe we actually wanted to turn the surrounding strings into integers and add the terms together.
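To see the error without stopping the program, here is a minimal sketch that catches it. It assumes a standard Python 3 interpreter; the name error_name is only for illustration.

```python
n = 40
try:
    s = "I have " + n + " cakes."
except TypeError as err:
    # str + int is not defined, so Python raises a TypeError
    # instead of guessing which conversion we meant.
    error_name = type(err).__name__  # illustrative name
    print(error_name, ":", err)
```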

So in cases like this, it is necessary to explicitly tell Python that we would like to treat a value as a particular type. This is called casting. To cast a value to a type, we call the type's name as if it were a function:


    n = 40
    s = "I have " + str(n) + " cakes."

One should be careful when attempting to cast values. It is possible that there are some values that can't be converted to another type. For instance, int("14") gives us what we might expect, but int("hat") will fail. Or, for a more insidious example, consider int("3.0") or int(3.14).
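The cases above can be tried out with a short sketch like the following. The names converted and failed are illustrative, and going through float first is just one possible way to handle a string like "3.0".

```python
# Conversions that succeed. Note that int(3.14) silently drops the
# fractional part rather than failing.
converted = [int("14"), int(3.14), int(float("3.0"))]
print(converted)  # [14, 3, 3]

# Conversions that fail with a ValueError.
failed = []  # illustrative name
for text in ["hat", "3.0"]:
    try:
        int(text)
    except ValueError:
        failed.append(text)
print(failed)  # ['hat', '3.0']
```

The insidious part is that int("3.0") fails even though the string looks like a number, while int(3.14) succeeds but quietly loses information.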

Along these lines, it's important to recognize that even though two values may look the same to us, Python treats them as values of different types. Why is this? Recall that all of these values have some representation. So the integer 5 has a different representation than the string "5" or even the float 5.0.
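A small sketch of how differently Python treats these three values (the variable names are only for illustration):

```python
values = [5, "5", 5.0]  # illustrative list
type_names = [type(v).__name__ for v in values]
print(type_names)  # ['int', 'str', 'float']

# An integer is never equal to a string, even if they look alike.
int_equals_str = (5 == "5")      # False
# Numeric types are compared by numeric value, so this one is True.
int_equals_float = (5 == 5.0)    # True

# The same operation means different things for different types.
print(5 * 3, "5" * 3, 5.0 * 3)   # 15 555 15.0
```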

Strings

As we saw earlier, strings are the data type for storing text data. They can be viewed as sequences of single characters (though there is no "character" data type), which means that we can actually use many of the list manipulation facilities for working with strings.

This means we can use indices and slices with strings in exactly the same way as lists. For example, if we have the string s = "banana",

s[1] is the string "a". Curiously, since (as mentioned before) there is no "character" type, the result is itself a string: a sequence of length 1.
s[-2] is the string "n".
s[2:5] is the string "nan". Like slicing with lists, you can also leave off the ends.
s[::-1] is the string "ananab". Stepping works in the same way too.
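The examples above can be checked directly. This is a small sketch; the variable names other than s are illustrative.

```python
s = "banana"

one_char = s[1]         # "a", a string of length 1
from_end = s[-2]        # "n", counting back from the end
middle = s[2:5]         # "nan", index 5 is excluded
prefix = s[:3]          # "ban", start left off
suffix = s[3:]          # "ana", end left off
backwards = s[::-1]     # "ananab", step of -1

print(one_char, from_end, middle, prefix, suffix, backwards)
print(type(one_char).__name__, len(one_char))  # str 1
```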

However, unlike lists, strings are immutable. This means attempting to perform an assignment like the following will fail.


    s[3] = "z"
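Here is a sketch of what the failure looks like, along with one common workaround: building a new string out of slices. The name t is illustrative.

```python
s = "banana"
try:
    s[3] = "z"
except TypeError as err:
    # Strings do not support item assignment.
    print("assignment failed:", err)

# Instead of changing s, build a new string around the replacement.
t = s[:3] + "z" + s[4:]
print(t)  # banzna
print(s)  # banana (the original is unchanged)
```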

We saw already that the basic operation on strings is concatenation. A natural extension of this is repetition.


    ("ba" + "ab") * 4
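For reference, a quick sketch of what repetition produces (the names repeated and rule are illustrative):

```python
# Concatenation happens first because of the parentheses,
# then the result "baab" is repeated four times.
repeated = ("ba" + "ab") * 4
print(repeated)       # baabbaabbaabbaab
print(len(repeated))  # 16

# Repetition is handy for simple text decoration.
rule = "-" * 20
print(rule)
```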

As with lists, there are an enormous number of string methods, which you can find in the textbook or in the official documentation. I'd like to highlight two in particular.

The split method will split the string into a list of strings.


    "Don't put    more than one    space between   words".split()

will produce the list


    ["Don't", 'put', 'more', 'than', 'one', 'space', 'between', 'words']

By default, split breaks the string on whitespace (spaces, tabs, and newlines), no matter how much of it there is between words. Note that this means that anything that's not whitespace, including punctuation, will be included in your "words".
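A short sketch of the default behaviour, contrasted with explicitly splitting on a single space. The example strings and names are made up for illustration.

```python
spaced = "one  two   three"  # two spaces, then three spaces

# With no argument, runs of whitespace count as one separator.
default_split = spaced.split()
print(default_split)  # ['one', 'two', 'three']

# With an explicit " ", every single space is a separator,
# so consecutive spaces produce empty strings.
space_split = spaced.split(" ")
print(space_split)    # ['one', '', 'two', '', '', 'three']

# Tabs and newlines are whitespace too; punctuation is not.
mixed = "a\tb\nc".split()
punct = "Hello, world!".split()
print(mixed, punct)   # ['a', 'b', 'c'] ['Hello,', 'world!']
```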

However, an alternate separator string can be specified. This is particularly useful for comma-separated values (CSV). CSV files are a common plain text format for storing tabular data. In a CSV file, each row is a data point and the values for each column are separated by commas.

For example, here is a line from the dataset for the daily ridership on the CTA by bus route.


    "171,05/27/2022,W,1247"

The order of the values is the route number, the date, the day type (weekday, Saturday, or Sunday/Holiday), and number of rides. We can split this data up by calling split and passing a comma "," into it.


    "171,05/27/2022,W,1247".split(",")

This results in a list that we can then use as we like.
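As a sketch of using the resulting list, we might unpack it into named variables and convert the ride count to a number. The variable names below are assumptions chosen to match the column descriptions above.

```python
line = "171,05/27/2022,W,1247"

# split always produces strings, even for fields that look numeric.
fields = line.split(",")
print(fields)  # ['171', '05/27/2022', 'W', '1247']

# Unpack the four fields (names are illustrative).
route, date, day_type, rides = fields
rides = int(rides)  # cast so that we can do arithmetic with it

# The same method splits the date on a different separator.
month, day, year = date.split("/")
print(route, year, day_type, rides + 1)  # 171 2022 W 1248
```

Splitting on commas like this works for simple lines. Files with quoted fields that themselves contain commas need more care, which is what Python's csv module is for.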

The join method is roughly the inverse operation. Suppose we have a list of strings, like the one we just created.


    ['171', '05/27/2022', 'W', '1247']

We would like to put them together into one string, but using a different string to separate the values. For example, if we wanted to put two hyphens in between, so that the string looks like "Item 1 -- Item 2", we would call


    " -- ".join(['171', '05/27/2022', 'W', '1247'])

Notice that the method belongs to the delimiter and we pass in the list to be "joined". This results in the following string.


    "171 -- 05/27/2022 -- W -- 1247"

Format

We've seen already that there are ways for us to transform other kinds of data into strings for the purpose of displaying them. This is such a common action that there are built-in mechanisms for formatting data into strings. One such mechanism is the f-string.


    num = 24
    f"There are {num} cats in the yard."
    how_many = "ten thousand"
    f"There are {how_many} cats in the yard."

These special strings must be prefixed by f, and each expression inside braces is replaced by its value. An especially useful feature of string formatting is the ability to control how values look. For example, how many digits should a number be displayed with?


    price = 243.31597
    f"These cost ${price:.2f} each."
    place = 27
    f"Your ticket number is {place:0>4d}."
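To make the effect of these format specifications concrete, here is a sketch that stores the results and prints them. The names price_text and ticket_text are illustrative.

```python
price = 243.31597
place = 27

# .2f  -> fixed-point notation, rounded to 2 digits after the point
price_text = f"These cost ${price:.2f} each."
# 0>4d -> an integer, right-aligned in a field of width 4, padded with 0
ticket_text = f"Your ticket number is {place:0>4d}."

print(price_text)   # These cost $243.32 each.
print(ticket_text)  # Your ticket number is 0027.
```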

f-strings are currently the preferred way to format strings. An older but similar approach uses the format method.


    num = 24
    "There are {} cats in the yard.".format(num)

There are a lot of ways to use either method for formatting strings. One convenient resource for this is pyformat.info.
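As a rough comparison, the sketch below writes the earlier f-string examples with the format method instead. The same format specifications go inside the braces; the variable names are illustrative.

```python
num = 24
price = 243.31597
place = 27

# Positional placeholder: {} is filled by the first argument.
cats = "There are {} cats in the yard.".format(num)
# Named placeholder: {num} is filled by the keyword argument num.
named = "There are {num} cats in the yard.".format(num=num)
# Format specifications follow a colon, just as in f-strings.
cost = "These cost ${:.2f} each.".format(price)
ticket = "Your ticket number is {:0>4d}.".format(place)

print(cats)
print(cost)    # These cost $243.32 each.
print(ticket)  # Your ticket number is 0027.
```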

"Characters" and text representation

As mentioned earlier, strings can be compared. There is the obvious comparison, equality, but strings have a more interesting property: they are ordered. This means that we can compare the order of strings in the same way as numbers:


    "corn" < "potato"

This appears to work as we might expect, in alphabetical order. However, things don't take very long to get a bit trickier. The above comparison is obviously true, but what about this one?


    "corn" < "Potato"

This should prompt us to question how these comparisons are being made, which should then lead us to questions about how strings are really represented.

As we recall, everything in a computer is represented in some way by bits: 1s and 0s, and these bits form numbers. The same is true for text. Each character is really represented as a number. Python has a built-in function, ord, that lets us see this.

Using ord, we see that ord("c") is 99 but ord("P") is 80. Strings are compared character by character using these numbers, so "corn" < "Potato" is False, which explains the discrepancy above. But wait, assuming this is all very orderly, this puts "A" at 65. How was this decided?
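The sketch below checks these numbers and shows one common way to get a case-insensitive comparison, namely lowercasing both sides first. That workaround and the variable names are additions for illustration, not something the comparison operator does for us.

```python
print(ord("c"), ord("P"), ord("A"))  # 99 80 65

# Comparison uses the character numbers, so uppercase sorts first.
case_sensitive = "corn" < "Potato"                  # False
# Lowercasing both sides gives the alphabetical answer.
case_folded = "corn".lower() < "Potato".lower()     # True

words = ["corn", "Potato", "apple"]  # illustrative list
by_code = sorted(words)
by_letter = sorted(words, key=str.lower)
print(by_code)    # ['Potato', 'apple', 'corn']
print(by_letter)  # ['apple', 'corn', 'Potato']
```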

This assignment of numbers to characters is called an encoding. The most widespread of these is ASCII, the American Standard Code for Information Interchange. ASCII defines an encoding for characters over $2^7$ numbers, which gives us 128 characters.

An interesting quirk of history is that not all of these characters are actually visible text. For example, there is a "new line" character that tells the computer to go to the next line: this is represented in Python strings as "\n" and we can see that ord("\n") is 10.

Another interesting character is the "bell" character: when printed, this character asks your terminal to beep (though many modern terminals ignore it or stay silent). This character is encoded as 7, but there is no key for it on the keyboard. However, Python has another built-in function, chr, which produces the corresponding character given an integer. So we can try to "print" the bell character in the following way:


    print(chr(7))
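Since chr and ord undo each other, we can move back and forth between characters and their numbers. A small sketch, with illustrative variable names:

```python
newline_code = ord("\n")   # 10
bell = chr(7)              # the same character as the escape "\a"

# chr undoes ord, so we can do arithmetic on characters.
next_letter = chr(ord("a") + 1)   # "b"

# The uppercase letters occupy the numbers 65 through 90.
letters = "".join(chr(code) for code in range(65, 91))
print(newline_code, next_letter, letters)
```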

However, if you do a bit of quick math, you'll see an issue: 128 characters are barely enough for English. How do those who write in other languages encode their text?

For a long time, different language groups had different encodings. Western Europeans would need an extended Latin character encoding for characters with accents like é. Other languages like Japanese needed entirely different encodings. There was a time when web browsers would need to provide a menu setting for selecting encodings (you can see the vestiges of this when you see HTML files specify their encodings).

Nowadays, almost everyone uses one encoding: Unicode. Unicode is intended to be a single encoding for every language and collection of symbols. Obviously, every major writing system is supported, but there are significantly more writing systems out there than you may have imagined that have representation in Unicode. Currently, there are close to 150,000 characters defined in Unicode, though the standard leaves room for over 1,000,000, each of which can be stored in at most four 8-bit bytes.

Python strings are encoded in Unicode. This means that pretty much any language you can type in can have strings represented in Python.


    "ᓄᓇᕗᑦ ᓴᙱᓂᕗᑦ"

Each of these characters also has a representation as an integer. For example, ord("ア") is 12450 and ord("អ") is 6050.
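To connect these integers to bytes, the sketch below encodes a few characters as UTF-8, a common byte encoding for Unicode text. Choosing UTF-8 here is an assumption made for illustration; it says nothing about how Python stores strings internally. The characters are written as escapes so that each is exactly one character.

```python
# Illustrative sample: A, e-acute, katakana A, Khmer letter qa, thinking face
chars = ["A", "\u00e9", "\u30a2", "\u17a2", "\U0001F914"]

code_points = [ord(ch) for ch in chars]
byte_counts = [len(ch.encode("utf-8")) for ch in chars]

for ch, code, count in zip(chars, code_points, byte_counts):
    print(ch, code, count)

print(code_points)  # [65, 233, 12450, 6050, 129300]
print(byte_counts)  # [1, 2, 3, 3, 4]
```

Characters with larger numbers need more bytes, up to a maximum of four.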

This means that methods like upper and lower are actually much more complicated than you might have first thought. Since Python uses Unicode strings, these algorithms are based on the Unicode case algorithms and you can use them on any writing system that has a notion of case.
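As a small illustration of why case conversion is more than shifting character numbers around, consider the sketch below. The example words and variable names are chosen only for illustration.

```python
german = "straße"
shouted = german.upper()
print(shouted)                     # STRASSE
print(len(german), len(shouted))   # 6 7 (the string got longer)

# Cyrillic has case, so upper works there as well.
russian = "дом".upper()
print(russian)                     # ДОМ

# Katakana has no notion of case, so the string is unchanged.
kana = "\u30a2".upper()
print(kana == "\u30a2")            # True
```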


    "aêβдꭳⰼ".upper()

But Unicode doesn't just include writing systems. It includes all sorts of symbols that are used in typesetting. Among the most widely used of these today are emoji.


    ord("🤔")
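For the curious, here is a sketch of what we can learn about this character. It uses the standard unicodedata module to look up the character's official name; the variable names are illustrative, and the emoji is written as an escape so that it is exactly one character.

```python
import unicodedata

thinking = "\U0001F914"   # the thinking-face emoji from above

code_point = ord(thinking)
print(code_point)          # 129300
print(hex(code_point))     # 0x1f914

# As far as Python is concerned, this is a string of length 1.
print(len(thinking))       # 1

# chr takes us back from the number to the character.
round_trip = chr(code_point)
emoji_name = unicodedata.name(thinking)
print(round_trip, emoji_name)
```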