CMSC 14100 — Lecture 13

Dictionaries

So far, all of the data types we've seen that are collections of other data have been sequential: lists, tuples, and strings all rely on data being stored in a linear sequence. But this is not the only way to manage collections of data. After all, not all data has a natural linear ordering.

Dictionaries are data structures that associate values with "keys" rather than positions. For this reason, dictionaries are sometimes called key-value structures or mapping structures. Whatever the name, the idea is the same: rather than refer to a value by its position in a list, we refer to it by a key that is mapped to it.

Dictionaries are denoted by braces. We can create an empty dictionary in this way.


    d = {}

We can add items to the dictionary by assignment, in a way that looks very similar to assignment on lists.


    d["A"] = 90

If the key doesn't already exist in the dictionary, the assignment adds it. This is unlike a list, where assigning to a position that doesn't exist gives an out of range error. If we inspect d, we see the following.


    {'A': 90}
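To see the contrast with lists directly, here is a small sketch. The names scores_list and scores are made up for illustration.

```python
# A list rejects assignment to a position that doesn't exist yet...
scores_list = []
try:
    scores_list[0] = 90
except IndexError as err:
    list_error = str(err)
    print("list:", list_error)

# ...but a dictionary simply creates the missing entry.
scores = {}
scores["A"] = 90
print(scores)   # {'A': 90}
```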

Just as lists are represented by a comma-delimited list of items within square brackets and tuples are the same but with parentheses, we represent dictionaries as key-value pairs inside of braces, where each pair is delimited by commas and the key and value are separated by a colon.


    {key_1: value_1, key_2: value_2, ..., key_n: value_n}

Using this notation, we can directly specify a dictionary.


    {"A": 90, "A-": 75, "B+": 60, "B": 50}

It's clear that we can store any value we like, as usual, but what about keys? So far, we've been using strings as keys, but what about other types? Can we use integers or floats? The short answer is yes, but you should be careful. In general, you should use keys that are immutable, such as strings, numbers, and tuples of these. The precise rule is that a key must be hashable, which rules out lists and other dictionaries. You should also be careful not to treat a dictionary as a list if you use integer keys, since dictionaries are not sequential.
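The following sketch tries a few kinds of keys. The dictionaries labels, near, and sparse are made up for illustration. It also shows one possible reason to be careful with floats (rounding can make two "equal" numbers different keys) and that integer keys are just labels, not positions.

```python
labels = {}
labels[90] = "an int key"
labels[3.5] = "a float key"
labels[("hw1", 2)] = "a tuple key"

try:
    labels[["hw1", 2]] = "a list key"
except TypeError as err:
    # Lists are mutable, so Python refuses to use them as keys.
    print("rejected:", err)

# Floats work, but rounding can surprise you.
near = {0.3: "three tenths"}
print((0.1 + 0.2) in near)   # False

# Integer keys are labels, not positions: there is no "item 0" here.
sparse = {10: "ten", 3: "three"}
print(sparse[10])    # ten
print(0 in sparse)   # False
```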

Using the "subscript" notation, we can access dictionary items in the same way as lists, just using a key rather than an index.


    points = d["A"]

If we attempt to access a key that doesn't exist in the dictionary, we get an error (a KeyError), just as we would for an index that is out of range in a list. On the other hand, if we "reassign" a key to another value in our dictionary, the new value replaces the old one, since dictionaries are mutable. This also means that keys are unique: a particular key is associated with only one value at a time (which is why dictionaries are called mapping structures).
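Here is a short sketch of both behaviours, using a small dictionary d like the one above. The key "C" is just an example of a key that was never added.

```python
d = {"A": 90, "B": 50}

# Looking up a key that was never added raises KeyError.
try:
    points = d["C"]
except KeyError as err:
    missing = err.args[0]   # the key that could not be found
    print("no such key:", missing)

# Assigning to an existing key replaces its value; no second "A" appears.
d["A"] = 93
print(d)        # {'A': 93, 'B': 50}
print(len(d))   # 2
```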

Just like lists, we can use in to test membership. Note that this tests whether the key is contained in the dictionary, not the value. Similarly, we can use del to remove items from the dictionary, again by key.
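As a quick illustration of in and del on a dictionary:

```python
d = {"A": 90, "A-": 75, "B+": 60}

print("A" in d)   # True: "A" is a key
print(90 in d)    # False: 90 is a value, and in only looks at keys

del d["A-"]       # removes the key and its value together
print(d)          # {'A': 90, 'B+': 60}
```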

In fact, we almost always interact with the dictionary based on keys. For instance, as with other data types that are collections of data, we can iterate through the dictionary with for. But notice that we iterate through its keys.


    for i in d:
        print(i)

It's important to remember that although this seems to give us some ordering for the dictionary (and there is an underlying one), we shouldn't lean on it, because dictionaries are treated as unordered when compared. That is, two dictionaries with the same key-value pairs are considered equal, no matter what order the pairs appear in.

It's possible to access only the keys or only the values through the dictionary methods keys and values. However, if we want to iterate through both keys and values, it is more useful to use the items method, which will produce key-value pairs in the form of tuples.


    for k, v in d.items():
        print(k, v)
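To compare what the three methods produce, we can turn each result into a list. The variable names here are made up for illustration.

```python
d = {"A": 90, "A-": 75, "B+": 60}

key_list = list(d.keys())       # only the keys
value_list = list(d.values())   # only the values
pair_list = list(d.items())     # (key, value) tuples

print(key_list)     # ['A', 'A-', 'B+']
print(value_list)   # [90, 75, 60]
print(pair_list)    # [('A', 90), ('A-', 75), ('B+', 60)]
```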

A natural question is what order the keys get iterated in. The official answer is that dictionaries preserve the order in which items were inserted. However, you shouldn't rely on this too heavily, since dictionaries are considered equal if they contain the same key-value pairs, no matter what order they're in.
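A small sketch of both facts at once, assuming Python 3.7 or later (where insertion order is guaranteed). The dictionaries first and second are made up for illustration.

```python
first = {}
first["A"] = 90
first["B"] = 50

second = {}
second["B"] = 50
second["A"] = 90

# Iteration follows insertion order...
print(list(first))    # ['A', 'B']
print(list(second))   # ['B', 'A']

# ...but equality ignores it.
print(first == second)   # True
```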

Because dictionaries allow us to "name" certain values, we can treat them as "structures" with named fields. These associate names with particular kinds of values. For example, we can represent homework assignments in the following way.


    hw1 = {"name": "Homework #1",
           "short_name": "hw1",
           "deadline": "2022/10/04",
           "num_submissions": 379}

This idea of a structure is very useful for managing data that is often found in plain text data formats. We will discuss two common ones.

CSV

CSV (Comma Separated Values) files are files that represent tabular data. In CSV files, rows are represented by lines and columns are delimited by commas.

Consider the file cta-bus-2022.csv, which contains daily ridership numbers for each CTA bus route in 2022. This file contains around 24000 lines. Here are the first few.

    route,date,daytype,rides
    1,01/03/2022,W,385
    1,01/04/2022,W,470
    1,01/05/2022,W,289
    1,01/06/2022,W,339
    ...

The first line of a CSV file gives the names for each column.

route is the route number, in plain text (not a number!)
date is the date, in mm/dd/yyyy format
daytype is an indicator: "W" for Weekday, "A" for Saturday, and "U" for Sunday/Holidays (specifically, New Year's, Memorial Day, Independence Day, Labour Day, Thanksgiving, and Christmas).
rides is the number of rides for the route on the given day, as a number.

We can process CSV files with the knowledge we have so far.


    ridership = []
    with open("cta-bus-2022.csv") as f:
        headers = f.readline().strip().split(",") # read first line
        
        for row in f:
            rides = {}
            fields = row.strip().split(",")
            
            for i, header in enumerate(headers):
                rides[header] = fields[i]
            ridership.append(rides)

However, since CSVs are so common, there are built-in libraries specifically for working with CSVs. We can use the csv module to use some of these tools.


    import csv

    ridership = []
    with open("cta-bus-2022.csv") as f:
        reader = csv.DictReader(f)
        for row in reader:
            ridership.append(row)

Here, csv.DictReader reads a CSV file and produces an object that will give you each row of the file as a dictionary. The keys of this dictionary are the column names from the first row of the CSV file, and each key is associated with the corresponding value in the row.

Note that by default, every value is a string. This is because a CSV file is plain text, so all of this data stays a string until we convert it ourselves. For instance, this can mean converting rides to integers and other similar processing. (It would not make sense to convert route to integers though. Why?)
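As a sketch of this kind of processing, the following converts rides to an integer as each row is read. So that it runs without the data file, it reads the first few rows shown earlier from a string through io.StringIO; that stand-in, and the names sample and total, are assumptions for illustration.

```python
import csv
import io

# The first few rows of the file, held in a string instead of on disk.
sample = """route,date,daytype,rides
1,01/03/2022,W,385
1,01/04/2022,W,470
1,01/05/2022,W,289
1,01/06/2022,W,339
"""

ridership = []
for row in csv.DictReader(io.StringIO(sample)):
    row["rides"] = int(row["rides"])   # "385" becomes 385
    ridership.append(row)              # route is left as a string

# Now that rides are numbers, we can do arithmetic with them.
total = sum(row["rides"] for row in ridership)
print(total)   # 1483
```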

The existence of csv.DictReader suggests that there is also a csv.DictWriter for writing CSV files from dictionaries, and there is. The module also provides other readers and writers for other data structures, such as csv.reader and csv.writer, which work with each row as a list.
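Here is a minimal sketch of csv.DictWriter. It writes into an io.StringIO buffer rather than a real file so that we can look at the result; with a real file you would pass the object returned by open(..., "w", newline="") instead. The rows are taken from the sample above, and lineterminator="\n" is set only to keep the output easy to read.

```python
import csv
import io

rows = [
    {"route": "1", "date": "01/03/2022", "daytype": "W", "rides": 385},
    {"route": "1", "date": "01/04/2022", "daytype": "W", "rides": 470},
]

buffer = io.StringIO()   # stands in for a file opened for writing
writer = csv.DictWriter(
    buffer,
    fieldnames=["route", "date", "daytype", "rides"],   # column order
    lineterminator="\n",
)
writer.writeheader()     # writes the first line with the column names
writer.writerows(rows)   # writes one line per dictionary

text = buffer.getvalue()
print(text)
```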

JSON

JSON, JavaScript Object Notation, is (as suggested by the name) a data format originally created for use with the JavaScript programming language, which drives user-facing interactions for almost every website out there (and even some apps). JSON files consist of:

Objects: These are the eponymous "objects" in JSON, which are collections of named attributes that look similar to key-value pairs in Python dictionaries. They are written in the same way, as key: value pairs inside braces, except that the keys are always strings.
Arrays: These are ordered lists of values, which can be objects, other arrays, or literal values. They are treated sequentially and are not keyed.
Values: These are the literal values that can be stored in JSON: strings, numbers, true, false, and null.

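Similar to the csv module, the json module has functions for converting data to and from JSON. json.dumps will convert a Python data structure (not just strings!) into a string containing JSON, as long as it is built from types that JSON can represent, such as dictionaries, lists, strings, numbers, booleans, and None.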


    import json
    json.dumps([3, "st", {"p": 3.4, "a": "aa"}, [4,-4]])
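To make the result concrete, this sketch prints what json.dumps returns. The second call uses a made-up dictionary to show how a few Python values are written in JSON.

```python
import json

as_json = json.dumps([3, "st", {"p": 3.4, "a": "aa"}, [4, -4]])
print(type(as_json).__name__)   # str: the result is one string
print(as_json)                  # [3, "st", {"p": 3.4, "a": "aa"}, [4, -4]]

# True, None, and tuples have no exact JSON counterpart,
# so they come out as true, null, and an array.
other = json.dumps({"ok": True, "missing": None, "pair": (1, 2)})
print(other)   # {"ok": true, "missing": null, "pair": [1, 2]}
```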

Conversely, json.loads will take a string containing JSON and load it into the corresponding data structures in Python.


    json.loads('[3, "st", {"p": 3.4, "a": "aa"}, [4,-4]]')
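The following sketch shows which Python types come back from json.loads, and that dumping and loading again gives back an equal structure. The JSON text and its field names ("scores", "graded", "notes") are made up for illustration.

```python
import json

text = '{"name": "Homework #1", "scores": [9.5, 7, 10], "graded": false, "notes": null}'
hw = json.loads(text)

print(type(hw).__name__)             # dict: a JSON object
print(type(hw["scores"]).__name__)   # list: a JSON array
print(hw["graded"], hw["notes"])     # False None

# Converting to JSON and back gives an equal structure.
round_trip = json.loads(json.dumps(hw))
print(round_trip == hw)   # True
```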

There are also corresponding functions, json.dump and json.load, for dumping and loading to/from files rather than strings. Here, we'll use a list of Divvy stations as of October 25, 2022.


    with open("divvy-station_information-2022-10-25.json") as f:
        station_info = json.load(f)

What we get is a dictionary with the following keys.


    dict_keys(['data', 'last_updated', 'ttl'])

What we actually want is found a few levels deep:


    stations = station_info['data']['stations']

Since this was a JSON array, stations was loaded into a list and we can work with it as usual. One of the downsides of this is that the stations are arbitrarily ordered rather than keyed by some identifier, so we have to search through the list to find certain stations.


    found_stations = []   # indices of stations with "Hyde Park" in their name
    for i, station in enumerate(stations):
        if "Hyde Park" in station['name']:
            found_stations.append(i)
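One way around repeated searching is to build a dictionary keyed by an identifier once, and then look stations up directly. The sketch below uses a tiny made-up list in place of the real stations, and it assumes each station has a "station_id" field. That field name and all of the sample values are assumptions, so check the keys of an actual station before relying on them.

```python
# A made-up stand-in for the real list; only the shape matters here.
stations = [
    {"station_id": "a1", "name": "Example St & 1st Ave"},
    {"station_id": "b2", "name": "Hyde Park Blvd & Example Ave"},
    {"station_id": "c3", "name": "Sample Ave & 2nd St"},
]

# Build the dictionary once: identifier -> station.
by_id = {}
for station in stations:
    by_id[station["station_id"]] = station

# Now a lookup by identifier needs no search through the list.
print(by_id["b2"]["name"])   # Hyde Park Blvd & Example Ave
```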