CMSC 14100 — Lecture 23

Binary search

Here's another place where the immediate recursive solution gives us something that's not as fast as it could be: searching a sorted list.

Why? Following the recursive structure of the list means we look at each item in order. To go any faster, we have to break out of that pattern and "skip" some elements somehow.

If we think back to binary search trees, we can borrow an idea: try to figure out where to look.


    def binary_search(lst, item):
        """
        Searches a sorted list for the given item.

        Input:
            lst (list[int]): list of integers
            item (int): item to find

        Output: True if item is in lst, False otherwise.
        """
        if lst == []:
            return False
        else:
            mid = len(lst) // 2
            if item == lst[mid]:
                return True
            elif item < lst[mid]:
                return binary_search(lst[0:mid], item)
            else:
                return binary_search(lst[mid+1:], item)
    

This is like our binary search trees, where we can rule out portions of the search space because of how the data is organized—in this case, by sorting it. While the split in a binary search tree depends on the tree's shape, with a sorted list we are guaranteed to cut the search space in half on each step, since we always look at the middle element. This means that we will always be in the $\log_2 n$ situation that is ideal for binary search trees.
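As a quick sanity check, here is the function in action (the definition is repeated so the snippet stands on its own):

```python
def binary_search(lst, item):
    """Return True if item appears in the sorted list lst."""
    if lst == []:
        return False
    else:
        mid = len(lst) // 2
        if item == lst[mid]:
            return True
        elif item < lst[mid]:
            return binary_search(lst[0:mid], item)
        else:
            return binary_search(lst[mid+1:], item)

print(binary_search([2, 5, 7, 11, 13], 7))   # True
print(binary_search([2, 5, 7, 11, 13], 6))   # False
```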

There is one problem with this: we're creating new lists whenever we do some slicing. While we're creating smaller and smaller lists, this cost still adds up.

One trick is to use what we learned in our discussion of accumulation: pass some information along. The observation here is that instead of creating copies of the relevant parts of the list, we just pass the endpoints of the sublist and restrict our search to that range. So instead of re-creating the left half of the list, we just say "look between these two indices". This way, we're not creating entire lists at each step.


    def binary_search(lst, item, low, high):
        """
        Performs binary search for item in lst[low:high] without copying.

        Input:
            lst (list[int]): list of integers
            item (int): item to look for
            low (int): left index of search space (inclusive)
            high (int): right index of search space (exclusive)

        Output: index of item if item is in lst, False otherwise
        """
        if low >= high:
            return False
        else:
            mid = (low + high) // 2
            if item == lst[mid]:
                return mid
            elif item < lst[mid]:
                return binary_search(lst, item, low, mid)
            else:
                return binary_search(lst, item, mid+1, high)
    

Since this approach only passes some integers down to the next function call, we get substantial improvements in speed.
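Calling the index-based version means supplying the initial bounds ourselves: 0 and len(lst). (The definition is repeated so the snippet runs on its own.)

```python
def binary_search(lst, item, low, high):
    """Search lst[low:high] for item; return its index, or False."""
    if low >= high:
        return False
    else:
        mid = (low + high) // 2
        if item == lst[mid]:
            return mid
        elif item < lst[mid]:
            return binary_search(lst, item, low, mid)
        else:
            return binary_search(lst, item, mid + 1, high)

nums = [2, 5, 7, 11, 13]
print(binary_search(nums, 11, 0, len(nums)))   # 3
print(binary_search(nums, 6, 0, len(nums)))    # False
```

One subtlety of returning an index: 0 is a valid index but is falsy, so callers should compare the result against False with `is` rather than relying on truthiness.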

Files

So far, we have been dealing with relatively simple input: numbers, strings, even lists can be considered relatively simple. Realistically, we will often want to deal with input that is much, much larger than we can type into an interpreter. Such data will likely have been generated by some other program.

How do we "transport" data from one program to another? How do we keep large amounts of it? The answer is by using files.

Files can be classified very broadly into two main types: text files and binary files. The main difference between the two is that text files are generally human-readable, while binary files are not. Note that this doesn't imply that text data is necessarily meant to be read by people, only that it can be.

Any programs we write will work with files in much the same way we (humans) do:

  1. Open a file.
  2. Read the file.
  3. After we're done with the file, close the file.
  4. Do stuff with what we read.
  5. Maybe write/save to a file (requires opening the file).

For example, recall that CSV (Comma Separated Values) files are text files that represent tabular data. In CSV files, rows are represented by lines and columns are delimited by commas. This was how the Divvy data that we worked with was stored.

Consider the file divvy-stations-20191231.csv, which contains data about Divvy stations as of December 31, 2019. The first few lines of the file look like:


    id,timestamp,name,total_docks,docks_in_service,available_docks,available_bikes,percent_full,status,latitude,longitude
    2,12/31/2019 11:55:54 PM,Buckingham Fountain,39,38,28,10,26,In Service,41.876511,-87.620548
    3,12/31/2019 11:55:54 PM,Shedd Aquarium,55,54,42,12,22,In Service,41.867226,-87.615355
    4,12/31/2019 11:55:54 PM,Burnham Harbor,23,23,11,12,52,In Service,41.856268,-87.613348
    

In Python, files are treated as another type of data. As usual, we can ask what kinds of actions we can perform on this type. Well, in order to work with our file, we have to open it.


    f = open("file-name.txt")
    

Here, we use the built-in function open to open a file. The argument we provide is a string containing the path to the file we want to open. Python will find the file and create a file object for it, which we assign to the variable f.


    >>> f = open('divvy-stations-20191231.csv')
    >>> f
    <_io.TextIOWrapper name='divvy-stations-20191231.csv' mode='r' encoding='UTF-8'>
    

A question you might have is how Python finds files. The answer is the current working directory—Python starts wherever you ran it from. If you're in IPython, this will be the directory you launched it from. If you're running Python code in a file, it will be the directory you ran the python command from, which is not necessarily the directory containing the file.

This means that Python will always start looking for files assuming it is in that directory. If you want to go anywhere else, you will have to provide a path to that file.
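You can check which directory Python starts from using the os module (a small sketch; the path printed will differ on your machine):

```python
import os

# The "current working directory": where Python starts looking
# for relative paths like 'divvy-stations-20191231.csv'.
print(os.getcwd())

# A path with directory components tells Python where else to look,
# e.g. 'data/divvy-stations-20191231.csv' (hypothetical location).
```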

Once we have opened a file, we can read it. We use the read method on our file.


    >>> data = f.read()
    >>> data
    'id,timestamp,name,total_docks,docks_in_service,available_docks,available_bikes,percent_full,status,latitude,longitude\n2,12/31/2019 11:55:54 PM,Buckingham Fountain,39,38,28,10,26,In Service,41.876511,-87.620548\n3,12/31/2019 ...
    

In effect, this "reads" the entire file and assigns the result, a string, to the variable data. In other words, the contents of the file are just treated as one giant string.

Note that files are read sequentially, and Python remembers how far into the file we have read. Since we just read the entire file, if we try to read it again, we will not "see" anything.


    >>> more_data = f.read()
    >>> more_data
    ''
    

So the read method reads the entire file all at once. Sometimes that is not feasible, such as when dealing with very large files, so Python also provides ways to control how far to read into the file. This is why Python keeps track of where you've read up to.
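One such control: read accepts an optional size argument, the maximum number of characters to return. A sketch using a small throwaway file (the name demo.txt is just for illustration):

```python
# Create a small file to experiment with.
with open('demo.txt', 'w') as f:
    f.write('id,name\n2,Buckingham Fountain\n')

f = open('demo.txt')
chunk = f.read(7)    # read only the first 7 characters
print(chunk)         # 'id,name'
rest = f.read()      # continues from where the last read stopped
print(rest)
f.close()
```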

After we're done with the file, we need to close it. After closing the file, we will be unable to access the file.


    >>> f.close()
    >>> f.read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: I/O operation on closed file.
    

Perhaps predictably, closing files is something that everyone forgets to do. This happens so often that there is a special mechanism that we use to take care of it for us, the with statement.


    with open("divvy-stations-20191231.csv") as f:
        data = f.read()
    

How this works is that the file will be opened and assigned to f. The file stays open while the statements in the indented block execute. Once those statements have finished, the file is closed automatically.

(What exactly does the with statement really mean? Unfortunately, we can't get into it right now)
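We can watch the cleanup happen by checking the file object's closed attribute (demo2.txt is a throwaway name for illustration):

```python
with open('demo2.txt', 'w') as f:
    f.write('hello')
    print(f.closed)   # False -- still open inside the block

print(f.closed)       # True -- closed as soon as the block ended
```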

So it was mentioned earlier that there are ways to work with files that don't involve reading the entire thing in one go. This is often necessary because real files can get large enough that the entire thing can't be stored in memory.

For text data, a very common way to advance through a file is line by line. This makes sense for CSV files—after all, each line is a row. This is so common that file objects in Python can be iterated by line using a for loop.


    >>> with open('divvy-stations-20191231.csv') as f:
    ...     for line in f:
    ...         print(line)
    ... 
    id,timestamp,name,total_docks,docks_in_service,available_docks,available_bikes,percent_full,status,latitude,longitude

    2,12/31/2019 11:55:54 PM,Buckingham Fountain,39,38,28,10,26,In Service,41.876511,-87.620548

    3,12/31/2019 11:55:54 PM,Shedd Aquarium,55,54,42,12,22,In Service,41.867226,-87.615355
    ...
    

Something you may recall is that text files include newline characters. You can think of the iteration over a file as Python reading up until the next "\n" it sees on each iteration. Notice in the above code, there is a blank line in between each line. This is because each line ends with a "\n" and print adds another newline.

Now, when we have a line, we often want to get rid of these newline characters, so it is common to use the string method strip, which removes leading and trailing whitespace (including the "\n").
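For example:

```python
line = '2,12/31/2019 11:55:54 PM,Buckingham Fountain\n'
print(repr(line.strip()))            # newline is gone

# strip trims both ends; rstrip trims only the right end.
print(repr('  padded  '.strip()))    # 'padded'
print(repr('  padded  '.rstrip()))   # '  padded'
```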

We can then take these strings and store them in a more convenient form, like in a data structure.


    >>> stations = []
    >>> with open('divvy-stations-20191231.csv') as f:
    ...     for row in f:
    ...             stations.append(row.strip())
    ... 
    >>> stations[5]
    '6,12/31/2019 11:55:54 PM,Dusable Harbor,39,39,35,4,10,In Service,41.886976,-87.612813'
    

Finally, we also have the ability to write files. In much the same process as reading a file, we must open a file in order to write to it. We still use the open function, but we add a "w" flag to signal that we intend to write to this file rather than read it.


    >>> with open('file.txt', 'w') as f:
    ...     f.write('contents to be written (must be a string)')
    ... 
    41
    

Notice that a number was produced as output. This is the number of characters that were written to the file, which here is the length of the string. Why does this number get reported? It is one way to tell whether our program successfully wrote anything to the file.

An extremely important thing to remember is that opening a file for writing overwrites its existing contents if the file already exists. Be very careful when writing to files! If the file doesn't already exist, then it will be created.
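A small demonstration of this overwrite behavior (demo3.txt is a throwaway name):

```python
with open('demo3.txt', 'w') as f:
    f.write('the original contents')

# Opening for writing again wipes the file before we write anything.
with open('demo3.txt', 'w') as f:
    f.write('new')

with open('demo3.txt') as f:
    print(f.read())   # 'new' -- the original contents are gone
```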

We can put all the things we saw together into one program like the following.


    stations = []
    headers = []
    with open('divvy-stations-20191231.csv') as f:
        # read first line to get headers
        headers = f.readline().strip().split(',')  

        # turn each row into a list
        for row in f:
            stations.append(row.strip().split(','))
        
    with open('stations-lists.txt', 'w') as f:
        # create index list
        for i, header in enumerate(headers):
            f.write(f'{i}: {header}\n')

        # write each list out
        for station in stations:
            f.write(f'{station}\n')