CMSC 14100 — Lecture 12

Files

So far, we have been dealing with relatively simple input: numbers, strings, even lists can be considered relatively simple. Realistically, we will often want to deal with input that is much, much larger than we can type into an interpreter. Such data will likely have been generated by some other program.

How do we "transport" data from one program to another? How do we keep large amounts of it? The answer is by using files.

Files can be classified very broadly into two main types: text files and binary files. The main difference between the two is that text files are usually human-readable, while binary files are not. Note that this doesn't imply that text data is necessarily meant primarily to be read by people, only that it can be.

An example of this difference can be seen in image files. It's clear that very few people read image files. Most image file formats are binary, so their contents are not readable as text (you can see this by opening a PNG file in a text editor, for example). However, there is an example of a text file format for images sitting in your repository right now.

If you look in hw5/tests, you'll find images in the form of PPM files. If you open one of those in a program like Mac OS Preview, you'll see a very tiny image (zoom in to see it more clearly). But if you open it up in a text editor, you'll find something surprisingly readable compared to a PNG file.
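
To make "surprisingly readable" concrete, here is a rough sketch of what such a file can look like. It assumes the plain-text "P3" variant of PPM; the pixel values are made up for illustration and are not taken from hw5/tests.

```python
# A 2x2 image in the plain-text (P3) PPM format, written as a string.
# The header gives the format name, the width and height, and the
# maximum color value. After that comes one "R G B" triple per pixel.
ppm_text = (
    "P3\n"
    "2 2\n"
    "255\n"
    "255 0 0   0 255 0\n"
    "0 0 255   255 255 255\n"
)

# Because the format is text, ordinary string methods are enough
# to pick it apart.
tokens = ppm_text.split()
width = int(tokens[1])
height = int(tokens[2])
num_pixels = (len(tokens) - 4) // 3

print(width, height, num_pixels)
```

Every value in the image is visible as ordinary digits, which is exactly what a binary format like PNG does not give you.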

The programs we write will work with files in much the same way we (humans) do:

Open a file.
Read the file.
After we're done with the file, close the file.
Do stuff with what we read.
Maybe write/save to a file (requires opening the file).
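
The steps above can be sketched in Python roughly as follows. The file names notes.txt and notes-upper.txt are hypothetical, and the temporary directory is only there so that the sketch can run anywhere without touching your own files.

```python
import os
import tempfile

# Set-up: create a small sample file to work with.
folder = tempfile.mkdtemp()
in_path = os.path.join(folder, "notes.txt")
out_path = os.path.join(folder, "notes-upper.txt")
setup = open(in_path, "w")
setup.write("hello\nworld\n")
setup.close()

f = open(in_path)          # open a file
text = f.read()            # read the file
f.close()                  # close the file once we're done with it

shouted = text.upper()     # do stuff with what we read

out = open(out_path, "w")  # writing requires opening a file too
out.write(shouted)
out.close()
```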

Suppose we have a text file called instructor-emails.txt that we would like to work with. This file contains the following.

amr@cs.uchicago.edu
aelmore@cs.uchicago.edu
timng@cs.uchicago.edu
kalea@uchicago.edu

In Python, files are treated as another type of data. As usual, we can ask what kinds of actions we can perform on this type. Well, in order to work with our file, we have to open it.


    f = open("instructor-emails.txt")

Here, we use the built-in function open to open a file. The argument we provide is a string containing the name of the file we want to open. Python will find the file and create a file value for it, which we assign to the variable f.

A question you might have is how Python finds files. As with many simple programs of this type, we have to remember the directory-based point of view: Python starts from the current working directory. If you're in IPython, this is wherever you ran it from. If you're running Python code in a file, it is the directory you ran Python from, which is often, but not always, the directory containing that file.

This means that Python will always start looking for files assuming it is in that directory. If you want to go anywhere else, you will have to provide a path to that file.
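
As a small sketch of both ideas, the following asks Python where it is currently looking and then opens a file through a path rather than a bare name. The folder name data and the file name emails.txt are hypothetical, and the temporary directory is only there so the sketch can run anywhere.

```python
import os
import tempfile

# The directory Python uses as its starting point for bare file names.
print(os.getcwd())

# Set-up: a folder "data" containing "emails.txt".
base = tempfile.mkdtemp()
os.mkdir(os.path.join(base, "data"))
path = os.path.join(base, "data", "emails.txt")
setup = open(path, "w")
setup.write("timng@cs.uchicago.edu\n")
setup.close()

# The file is not in the starting directory, so we give open a path.
f = open(path)
first_line = f.read()
f.close()
```

Here os.path.join builds the path from its pieces, so we don't have to worry about which separator the operating system uses.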

Once we have opened a file, we can read it. We use the read method on our file.


    data = f.read()

In effect, this "reads" the file and assigns the result, a string, to the variable data. In this view, a file is just one very long string:


    "amr@cs.uchicago.edu\naelmore@cs.uchicago.edu\ntimng@cs.uchicago.edu\nkalea@uchicago.edu"

Note that this reads the entire file all at once. Sometimes that is not feasible, so there are ways to control how far to read in the file. Because of this, Python keeps track of where you've read up to. In our case, since we've already read the entire file, calling read again gives us nothing more: we just get back an empty string.
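
Here is a minimal sketch of that bookkeeping. It relies on the optional argument to read, which limits how many characters are read. The sample file is created in a temporary directory just so the sketch can run on its own.

```python
import os
import tempfile

# Set-up: a shortened copy of the emails file.
path = os.path.join(tempfile.mkdtemp(), "instructor-emails.txt")
setup = open(path, "w")
setup.write("amr@cs.uchicago.edu\naelmore@cs.uchicago.edu\n")
setup.close()

f = open(path)
first = f.read(3)   # read only the first 3 characters
rest = f.read()     # continue from where we stopped, up to the end
again = f.read()    # nothing is left to read
f.close()

print(repr(first))  # 'amr'
print(repr(again))  # ''
```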

After we're done with the file, we need to close it:


    f.close()

Perhaps predictably, this is something that everyone forgets to do. This happens so often that there is a special mechanism that we use to take care of it for us, the with statement.


    with open("instructor-emails.txt") as f:
        emails = f.read()

(What does the with statement really mean? Unfortunately, we can't get into it right now.)

After the with statement is finished, the file gets closed automatically.
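
One way to convince yourself of this is to look at the closed attribute of the file inside and outside the with block. This is only a sketch: the sample file is created in a temporary directory so that the code can run on its own.

```python
import os
import tempfile

# Set-up: a shortened copy of the emails file.
path = os.path.join(tempfile.mkdtemp(), "instructor-emails.txt")
setup = open(path, "w")
setup.write("amr@cs.uchicago.edu\nkalea@uchicago.edu\n")
setup.close()

with open(path) as f:
    emails = f.read()
    inside = f.closed   # False: the file is still open here

outside = f.closed      # True: the file was closed when the block ended
print(inside, outside)
```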



As mentioned earlier, there are ways to work with files that don't involve reading the entire thing in one go. This is often necessary because real files can get large enough that the entire thing can't be stored in memory.

For text data, a very common way to advance through a file is line by line. This is so common that files in Python can be iterated by line using a for loop.


    with open("instructor-emails.txt") as f:
        for line in f:
            print(line)

Something you may recall is that text files include newline characters. You can think of the iteration over a file as Python reading up until the next "\n" it sees on each iteration.

Now, when we have a line, we often want to get rid of these newline characters, so it is common to use the string method strip, which removes whitespace (including newlines) from both ends of the string.
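
To see the newline characters directly, we can collect the lines as they come out of the file and compare them with their stripped versions. This is a small sketch using a two-line sample file created in a temporary directory.

```python
import os
import tempfile

# Set-up: two addresses, with no newline after the last one.
path = os.path.join(tempfile.mkdtemp(), "instructor-emails.txt")
setup = open(path, "w")
setup.write("amr@cs.uchicago.edu\nkalea@uchicago.edu")
setup.close()

raw_lines = []
clean_lines = []
with open(path) as f:
    for line in f:
        raw_lines.append(line)            # includes the "\n", if there is one
        clean_lines.append(line.strip())  # whitespace removed from both ends

print(raw_lines)
print(clean_lines)
```

The leftover "\n" is also why calling print on an unstripped line leaves a blank line after it: print adds a newline of its own.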

We can then take these strings and store them in a more convenient form, like in a data structure.


    emails = []
    with open("instructor-emails.txt") as f:
        for line in f:
            email = line.strip()
            emails.append(email)

Finally, we also have the ability to write files. Just as with reading, we must open a file in order to write to it. We still use the open function, but we pass "w" as a second argument to signal that we intend to write to this file instead of reading from it.


    with open("file.txt", "w") as f:
        # write takes the string to be written; newlines are not added for us
        f.write("some text\n")

An extremely important thing to remember is that opening a file for writing erases its existing contents if the file already exists. Be very careful when you're writing to files! If the file doesn't already exist, then it will be created.
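
Both behaviours can be seen in a short sketch. The file name scratch.txt is hypothetical, and the file lives in a temporary directory so that nothing of yours gets overwritten.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "scratch.txt")
existed_before = os.path.exists(path)   # False: no such file yet

with open(path, "w") as f:              # the file is created here
    f.write("first version\n")

with open(path, "w") as f:              # the old contents are erased here
    f.write("second version\n")

with open(path) as f:
    contents = f.read()

print(existed_before)
print(repr(contents))                   # 'second version\n'
```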

We can put all the things we saw together into one program like the following.


    emails = []
    with open("instructor-emails.txt") as f:
        for line in f:
            email = line.strip()
            emails.append(email)

    cnetids = []
    for email in emails:
        cnetid, domain = email.split("@")
        cnetids.append(cnetid)

    with open("instructor-cnetids.txt", "w") as f:
        for cnetid in cnetids:
            f.write(cnetid + "\n")
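
One thing to watch for in the program above is that email.split("@") is unpacked into exactly two variables, so a blank line in the input file would cause an error. A possible variation, sketched here with a hypothetical helper function extract_cnetids, skips blank lines before splitting.

```python
def extract_cnetids(lines):
    """Return the part before "@" for each non-blank line."""
    cnetids = []
    for line in lines:
        email = line.strip()
        if email == "":
            continue  # skip blank lines instead of trying to split them
        cnetid, domain = email.split("@")
        cnetids.append(cnetid)
    return cnetids


# A list of strings stands in for the lines of a file here.
sample = ["amr@cs.uchicago.edu\n", "\n", "kalea@uchicago.edu"]
result = extract_cnetids(sample)
print(result)  # ['amr', 'kalea']
```

Since iterating over a file produces its lines one at a time, the same function can be given an open file in place of the list.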