CMSC 14100 — Lecture 17

Modeling with classes

Many large software systems can be modelled by classes and objects that interact with each other. We'll walk through a larger example of this using Divvy data. To keep things manageable, we'll be considering a snapshot of trips on December 31, 2019. We'll be working with two datasets: Divvy trips and Divvy stations as of 2019-12-31, taken from the City of Chicago's Open Data Portal.

The goal is to answer the following question: What is the total duration and distance of all Divvy trips taken on this date? The problem itself is not too hard: For each trip, figure out the distance between the start and stop stations and get the duration of the trip. Then we just sum over these two pieces of information.

All of this data is provided in CSV files. CSV stands for comma-separated values: a CSV file is just a text file in which each line is a row and the values in each row are separated into columns by commas. The first row is reserved for column names (this might start to sound familiar...).

Here are the first few lines of each dataset.


    id,timestamp,name,total_docks,docks_in_service,available_docks,available_bikes,percent_full,status,latitude,longitude
    2,12/31/2019 11:55:54 PM,Buckingham Fountain,39,38,28,10,26,In Service,41.876511,-87.620548
    3,12/31/2019 11:55:54 PM,Shedd Aquarium,55,54,42,12,22,In Service,41.867226,-87.615355
    4,12/31/2019 11:55:54 PM,Burnham Harbor,23,23,11,12,52,In Service,41.856268,-87.613348
    

    trip_id,starttime,stoptime,bikeid,tripduration,from_station_id,from_station_name,to_station_id,to_station_name,usertype
    25962904,12/31/2019 11:57:17 PM,12/31/2019 11:59:18 PM,5930,120,256,Broadway & Sheridan Rd,240,Sheridan Rd & Irving Park Rd,Subscriber
    25962903,12/31/2019 11:57:11 PM,01/01/2020 12:05:45 AM,2637,514,623,Michigan Ave & 8th St,52,Michigan Ave & Lake St,Subscriber
    25962902,12/31/2019 11:57:05 PM,01/01/2020 12:05:46 AM,863,520,623,Michigan Ave & 8th St,52,Michigan Ave & Lake St,Subscriber
    

Examining the headers of the CSV files, we begin to see that there are some natural classes that arise from these two sets of data. The headers provide possible attributes: things like an ID, locations, timestamps, and so on.

First, we'll need to read the data from the files and process it. There are built-in tools in Python to facilitate this for many common file types. For instance, we can use the csv module to store the data in a list of dictionaries.
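As a minimal sketch of that step, csv.DictReader from the standard library yields one dictionary per row, keyed by the header line. The helper name read_csv_dicts and the filename in the comment are assumptions made for illustration. Two rows from the stations file are embedded as text so the example runs on its own.

```python
import csv
import io

# Header and two rows copied from the stations file; io.StringIO lets us
# treat this text like an open file.
sample = io.StringIO(
    "id,timestamp,name,total_docks,docks_in_service,available_docks,"
    "available_bikes,percent_full,status,latitude,longitude\n"
    "2,12/31/2019 11:55:54 PM,Buckingham Fountain,39,38,28,10,26,"
    "In Service,41.876511,-87.620548\n"
    "3,12/31/2019 11:55:54 PM,Shedd Aquarium,55,54,42,12,22,"
    "In Service,41.867226,-87.615355\n"
)

def read_csv_dicts(f):
    """Read an open CSV file into a list of dictionaries, one per row."""
    return list(csv.DictReader(f))

sample_dicts = read_csv_dicts(sample)

# With a file on disk (hypothetical filename), this might look like:
#     with open("divvy_stations.csv", newline="") as f:
#         station_dicts = read_csv_dicts(f)
```

Every value in the resulting dictionaries is a string, including the ones that look like numbers.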


    >>> station_dicts[305]
    {'id': '328',
     'timestamp': '12/31/2019 11:55:54 PM',
     'name': 'Ellis Ave & 58th St',
     'total_docks': '19',
     'docks_in_service': '19',
     'available_docks': '12',
     'available_bikes': '7',
     'percent_full': '37',
     'status': 'In Service',
     'latitude': '41.788746',
     'longitude': '-87.601334'}
    >>> trip_dicts[1376]
    {'trip_id': '25961451',
     'starttime': '12/31/2019 12:22:55 PM',
     'stoptime': '12/31/2019 12:32:46 PM',
     'bikeid': '1105',
     'tripduration': '591',
     'from_station_id': '328',
     'from_station_name': 'Ellis Ave & 58th St',
     'to_station_id': '417',
     'to_station_name': 'Cornell Ave & Hyde Park Blvd',
     'usertype': 'Subscriber'}
    

This is the same thing we've seen: dictionaries can be used to represent an entity or object that is "real" in some sense. But representing data as dictionaries can lead to the same issues as representing stacks as lists. That is, we're always left with working with a dictionary underneath. This means that whenever we try to write functions for manipulating such entities, we always have to ensure that we're providing the "right" dictionaries to use.

Here, we're going to be juggling two different kinds of dictionaries: dictionaries representing stations and dictionaries representing trips. Since they have similar fields, we have to be extra careful that we're working with the right data at the right time.

That approach might work if all you're doing is computing one thing and you never expect to use that data again. However, it's often the case that you'll revisit a dataset or model multiple times for different reasons. In cases like these, it's worth doing some work to set up a model to facilitate those kinds of computations.

So instead, we can use classes to model this information. The keys we've been using in the dictionaries now have a very obvious analogue as attributes in a class. But now we can also associate methods with our class as well. This way, we won't have to worry about what happens if someone calls a function with an arbitrary dictionary.

As a result we can get a fairly obvious class structure by looking at the CSV headers and just using those fields as attributes. We'll also throw in some string representations to make things convenient for us to read.


    class DivvyStation:
        """
        Represents a Divvy station
        """
        def __init__(self, station_id, name, docks, lat, lon):
            """
            Constructor for DivvyStation
            
            Input:
                station_id (int): unique identifier for the station
                name (str): friendly name for station
                docks (int): number of docks at station
                lat (float): latitude of station
                lon (float): longitude of station
            """
            self.station_id = station_id
            self.name = name
            self.docks = docks
            self.lat = lat
            self.lon = lon

        def __repr__(self):
            """
            Internal string representation for DivvyStation
            """
            return (f"<DivvyStation: station_id={self.station_id},"
                    f" name='{self.name}'>")

        def __str__(self):
            """
            String value representation for DivvyStation
            """
            return f"Divvy Station #{self.station_id} ({self.name})"
    

In this definition, we've made a few decisions. First, we don't include all of the information. Namely, we have omitted a lot of the "real-time" information: things like the number of bikes that are available. This is reasonable, since we're not really interested in (or capable of) using such information.

This has the obvious benefits we discussed: being able to specify functions that work with our class specifically. But using classes allows us to control the kinds of actions we can take much more closely. For instance, we can ensure in our constructor that the data that is being supplied makes sense.
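One way this might look is sketched below. CheckedStation is a hypothetical variant of DivvyStation, and the particular checks (non-negative docks, coordinates within the valid range) are assumptions about what "makes sense" for this data, not requirements taken from the dataset.

```python
class CheckedStation:
    """
    Hypothetical variant of DivvyStation whose constructor rejects
    values that cannot describe a real station.
    """
    def __init__(self, station_id, name, docks, lat, lon):
        if docks < 0:
            raise ValueError(f"docks must be non-negative, got {docks}")
        if not -90 <= lat <= 90:
            raise ValueError(f"latitude out of range: {lat}")
        if not -180 <= lon <= 180:
            raise ValueError(f"longitude out of range: {lon}")
        self.station_id = station_id
        self.name = name
        self.docks = docks
        self.lat = lat
        self.lon = lon

# A sensible station is accepted...
ok = CheckedStation(328, 'Ellis Ave & 58th St', 19, 41.788746, -87.601334)

# ...while a nonsensical one is rejected before any object exists.
try:
    CheckedStation(999, 'Nowhere', -5, 41.0, -87.0)
    rejected = False
except ValueError:
    rejected = True
```

With a plain dictionary there is no single place to put checks like these, so bad values can travel a long way before anything notices.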

Ideally, we would have code that could read a file and construct the desired objects. This is something that's not too hard to write. We could also write a function that takes the dictionaries we have and constructs objects out of them.


    def station_dict_to_object(station_dicts):
        """
        Converts a list of Divvy station dictionaries into DivvyStations.

        Input:
            station_dicts (list[dict[str,Any]]): list of dictionaries

        Output (dict[int,DivvyStation]): dictionary mapping station IDs 
            to DivvyStations
        """
        stations = {}
        for station in station_dicts:
            st_id = int(station['id'])
            name = station['name']
            docks = int(station['total_docks'])
            lat = float(station['latitude'])
            lon = float(station['longitude'])
            stations[st_id] = DivvyStation(st_id, name, docks, lat, lon)
        return stations
    

One thing to be aware of is that when reading the information in from files, you're really dealing with a big string, so the values that you read are treated as strings at first. You will need to make some decisions when further processing the information to make sure the values are all of the correct type, or that you use them to create the right secondary structures.
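A small illustration of why the conversion matters, using two values as they appear in the station dictionary for station 328 (the variable names are made up for this example):

```python
# Values exactly as they come out of the file: everything is a string.
row = {'total_docks': '19', 'latitude': '41.788746'}

# On strings, + concatenates instead of adding.
as_text = row['total_docks'] + row['total_docks']

# After converting, + behaves numerically.
as_number = int(row['total_docks']) + int(row['total_docks'])

# Coordinates need float rather than int.
lat = float(row['latitude'])
```

Here as_text is the string '1919' while as_number is the integer 38, which is the kind of silent mistake that converting once, up front, helps us avoid.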

Here, we make the decision to put our DivvyStation objects into a dictionary, associated with the station ID as the key. This is because we will need to refer to specific stations in the next step, when we create objects for our trips.


    >>> stations = station_dict_to_object(station_dicts)
    >>> stations[328]
    <DivvyStation: station_id=328, name='Ellis Ave & 58th St'>
    >>> stations[328].lat, stations[328].lon
    (41.788746, -87.601334)
    >>> print(stations[328])
    Divvy Station #328 (Ellis Ave & 58th St)
    

We can also do the same kind of translation from dictionary to class for trips. However, we can again take the opportunity to think through some of the data structures and objects we might want to use to model the attributes in this class.

For instance, how should we represent the timestamp? A timestamp as a string does not afford any interesting uses without further processing, so why not process it into a more useful form first? Python provides a datetime class for representing dates and times, as well as constructing them from strings.
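As a quick sketch, here is how the two timestamps from the trip dictionary shown earlier might be parsed with datetime.strptime. The format string is our reading of how the timestamps in the file are laid out (month/day/year, 12-hour clock with AM/PM).

```python
from datetime import datetime

# Format matching timestamps such as '12/31/2019 12:22:55 PM'
date_format = '%m/%d/%Y %I:%M:%S %p'

start = datetime.strptime('12/31/2019 12:22:55 PM', date_format)
stop = datetime.strptime('12/31/2019 12:32:46 PM', date_format)

# Subtracting two datetimes gives a timedelta, which we can
# convert to a number of seconds.
elapsed = stop - start
elapsed_seconds = elapsed.total_seconds()
```

For this trip elapsed_seconds comes out to 591.0, which agrees with the tripduration value of '591' in the same dictionary. That kind of arithmetic is not available on the raw strings.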

We also have DivvyStations we can use. Instead of keeping the station ID and name, we can refer to the station object itself, giving us access to its attributes and methods.

    
    class DivvyTrip:
        """
        Represents a Divvy trip.
        """
        def __init__(self, trip_id, starttime, stoptime, bikeid,
                     tripduration, from_station, to_station, usertype):
            """
            Constructor

            Input:
                trip_id (int): unique identifier for trip
                starttime (datetime): timestamp for start of trip
                stoptime (datetime): timestamp for end of trip
                bikeid (int): identifier for bike used
                tripduration (int): duration of trip, in seconds
                from_station (DivvyStation): station where the trip started
                to_station (DivvyStation): station where the trip ended
                usertype (str): whether the user is a one-off or subscriber
            """
            self.trip_id = trip_id
            self.starttime = starttime
            self.stoptime = stoptime
            self.bike_id = bikeid
            self.tripduration = tripduration
            self.from_station = from_station
            self.to_station = to_station
            self.usertype = usertype

        def __repr__(self):
            """
            Internal string representation for DivvyTrip
            """
            # The commas belong inside the f-strings; a comma between the
            # string pieces would build a tuple instead of one string.
            return (f"<DivvyTrip: trip_id={self.trip_id},"
                    f" from_station={self.from_station.station_id},"
                    f" to_station={self.to_station.station_id}>")

        def __str__(self):
            """
            String representation for DivvyTrip
            """
            return (f"Divvy Trip #{self.trip_id}"
                    f" from {self.from_station.name}"
                    f" to {self.to_station.name}")
    

    from datetime import datetime

    def trip_dict_to_object(trip_dicts, stations):
        """
        Converts a list of Divvy trip dictionaries into DivvyTrips.

        Input:
            trip_dicts (list[dict[str,Any]]): list of dictionaries
            stations (dict[int,DivvyStation]): dictionary of DivvyStations

        Output (list[DivvyTrip]): list of DivvyTrips
        """
        date_format = '%m/%d/%Y %I:%M:%S %p'
        trips = []
        for trip in trip_dicts:
            trip_id = int(trip['trip_id'])
            start = datetime.strptime(trip['starttime'], date_format)
            stop = datetime.strptime(trip['stoptime'], date_format)
            bike_id = int(trip['bikeid'])
            duration = int(trip['tripduration'])
            from_station = stations[int(trip['from_station_id'])]
            to_station = stations[int(trip['to_station_id'])]
            user = trip['usertype']
            trips.append(DivvyTrip(trip_id, start, stop, bike_id, duration, 
                                   from_station, to_station, user))
        return trips
    

    >>> trips = trip_dict_to_object(trip_dicts, stations)
    >>> trips[243]
    <DivvyTrip: trip_id=25962605, from_station=236, to_station=84>
    >>> print(trips[243])
    Divvy Trip #25962605 from Sedgwick St & Schiller St to Milwaukee Ave & Grand Ave
    >>> trips[243].starttime
    datetime.datetime(2019, 12, 31, 18, 35, 36)
    >>> trips[243].to_station
    <DivvyStation: station_id=84, name='Milwaukee Ave & Grand Ave'>
    >>> trips[243].to_station.lat
    41.891578
    

One interesting benefit of taking the time to set up classes and objects for modeling is that instead of referring to an identifier (i.e. the station ID), we refer to an object. This is a very important reason for why Python stores references to objects: a trip can hold a reference to a station, and that station object has attributes and methods that we can take advantage of.
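To make the idea of references concrete, here is a small sketch. Station and Trip are stripped-down, hypothetical stand-ins for DivvyStation and DivvyTrip, kept short so the example runs on its own.

```python
class Station:
    """Minimal stand-in for DivvyStation (illustration only)."""
    def __init__(self, station_id, name):
        self.station_id = station_id
        self.name = name

class Trip:
    """Minimal stand-in for DivvyTrip (illustration only)."""
    def __init__(self, from_station, to_station):
        self.from_station = from_station
        self.to_station = to_station

stations = {328: Station(328, 'Ellis Ave & 58th St'),
            417: Station(417, 'Cornell Ave & Hyde Park Blvd')}
trip = Trip(stations[328], stations[417])

# The trip does not hold a copy of the station: both names refer to
# the very same object.
same_object = trip.from_station is stations[328]

# So a change made through the dictionary is visible through the trip.
stations[328].name = 'Ellis Ave & 58th St (relocated)'
seen_from_trip = trip.from_station.name
```

Here same_object is True, and seen_from_trip reflects the updated name. No station data is duplicated across the trips that start or end at the same station.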

How does this help us? Recall that we needed the distance of a trip. In order to do this, we need not only which stations are the endpoints, but also their location. So we need a few pieces of information and a relatively complicated formula. Using classes makes managing this much simpler.

The distance of a trip is based on its start and end stations, so we should consider adding a method to the station class that computes this quantity. Consider the following method, which will be added to the DivvyStation class.


    # Assumes `import math` appears at the top of the module.
    def distance_to(self, other):
        """
        Computes the direct distance (drawing a line across the surface
        of the earth) from this station to the given station.

        Input:
            other (DivvyStation): the destination Divvy station

        Output (float): distance between the two stations in metres
        """
        # Haversine formula; DivvyStation stores its coordinates in
        # degrees in the attributes lat and lon.
        diff_latitude = math.radians(other.lat - self.lat)
        diff_longitude = math.radians(other.lon - self.lon)

        a = (math.sin(diff_latitude/2) * math.sin(diff_latitude/2) +
             math.cos(math.radians(self.lat)) *
             math.cos(math.radians(other.lat)) *
             math.sin(diff_longitude/2) * math.sin(diff_longitude/2))
        d = 2 * math.asin(math.sqrt(a))

        # 6371000.0 is the approximate mean radius of the earth in metres
        return 6371000.0 * d
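
As a rough sanity check on this formula, the sketch below repeats the same computation as a standalone function (the name haversine_metres is made up for this example) and tries it on cases where we have some idea of what to expect.

```python
import math

def haversine_metres(lat1, lon1, lat2, lon2):
    """
    Standalone version of the computation in distance_to: great-circle
    distance in metres between two points given in degrees.
    """
    diff_lat = math.radians(lat2 - lat1)
    diff_lon = math.radians(lon2 - lon1)
    a = (math.sin(diff_lat / 2) ** 2 +
         math.cos(math.radians(lat1)) *
         math.cos(math.radians(lat2)) *
         math.sin(diff_lon / 2) ** 2)
    return 6371000.0 * 2 * math.asin(math.sqrt(a))

# A point is at distance zero from itself.
zero = haversine_metres(41.788746, -87.601334, 41.788746, -87.601334)

# One degree of latitude is a bit over 111 km on a sphere of this radius.
one_degree = haversine_metres(41.0, -87.0, 42.0, -87.0)

# Buckingham Fountain to Shedd Aquarium, using the coordinates from
# the first rows of the stations file.
fountain_to_shedd = haversine_metres(41.876511, -87.620548,
                                     41.867226, -87.615355)
```

The last value comes out at a little over a kilometre, which is plausible for two stations along the lakefront.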
    

Then we can make use of this method in a DivvyTrip object, since we have access to the station objects. We add the following method to the DivvyTrip class.

    
    def get_distance(self):
        """
        Computes the distance for this trip based on the origin station
        and destination stations.

        Output (float): distance (metres) between from_station and to_station 
        """
        return self.from_station.distance_to(self.to_station)
    

Then we can compute our desired quantity.


    total_distance = 0
    for trip in trips:
        total_distance = total_distance + trip.get_distance()
    

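Notice that this now removes the need for us to look up stations by ID or pull coordinates out of dictionaries every time we want the distance of a trip.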

Instead, the classes and methods allow us to express this quantity and these actions very naturally. And as we work with the dataset, we can add more functionality to the classes, providing a richer model of the Divvy system than if we were just focused on answering the one question we started with.
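
The other half of our original question, the total duration, follows the same pattern. The sketch below uses SimpleNamespace objects as hypothetical stand-ins for DivvyTrip objects so that it runs on its own. The three durations are taken from the first rows of the trips file shown earlier.

```python
from datetime import timedelta
from types import SimpleNamespace

# Stand-ins for DivvyTrip objects; only the attribute we need is set.
trips = [SimpleNamespace(trip_id=25962904, tripduration=120),
         SimpleNamespace(trip_id=25962903, tripduration=514),
         SimpleNamespace(trip_id=25962902, tripduration=520)]

# Same accumulation as for distance, written with sum().
total_duration = sum(trip.tripduration for trip in trips)

# A timedelta gives a more readable form than a raw count of seconds.
readable = str(timedelta(seconds=total_duration))
```

For these three trips total_duration is 1154 seconds, and readable is '0:19:14'. With the full list of DivvyTrip objects, the same expression answers the question for the whole day.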