Cleaning a GTFS using gtfstools

Introducing GTFS and gtfstools

On this page, we will use some functions from gtfstools, along with the tidyverse, to clean a GTFS feed. We will explore some of the structure of the GTFS and prepare it for the calculations we will perform on subsequent pages.

library(gtfstools)
library(dplyr)

read_gtfs reads a GTFS feed and holds it in your environment as a dt_gtfs object: a named list of data.tables, one per GTFS table.

gtfs <- read_gtfs("./data/gtfs 11_2018.zip")
class(gtfs)
[1] "dt_gtfs" "gtfs"    "list"   
summary(gtfs)
A gtfs object with the following tables and respective numbers of entries in each:
        agency       calendar calendar_dates         routes         shapes 
             1              4              6            107         280190 
    stop_times          stops      transfers          trips 
        745329           4726           1266          15484 

The data

For this exercise, we are using a GTFS published by the Maryland Transit Administration (MTA), containing information about service from September 2018 into February 2019. The service dates are stored, appropriately, in the “calendar” table of the dt_gtfs.

gtfs$calendar$start_date
[1] "2018-09-02" "2018-09-02" "2018-09-02" "2018-09-02"
gtfs$calendar$end_date
[1] "2019-02-02" "2019-02-02" "2019-02-02" "2019-02-02"

The feed contains several other tables, each with many rows describing transit service during fall and winter 2018-19. One of them is “routes”, which contains information about route names and types.

unique(gtfs$routes$route_type)
[1] 3 1 0 2

Filtering by route type

In the GTFS Reference, light rail is assigned type “0”, subway is assigned type “1”, intercity rail is assigned type “2”, and bus is assigned type “3”. Knowing this, we can use the filter_by_route_type function in gtfstools to filter the entire dt_gtfs to just the information about bus service.

##filter down gtfs to just bus routes
gtfs <- filter_by_route_type(gtfs, route_type = 3)
##from 4 types to 1
unique(gtfs$routes$route_type) 
[1] 3
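
As a quick reference, the codes listed above can be kept in a small lookup vector so that numeric route types print with readable labels. This is a minimal base R sketch; the name route_type_labels is made up for illustration, and the vector covers only the four codes that appear in this feed.

```r
# Lookup for the four route_type codes seen in this feed
# (labels follow the description in the text above)
route_type_labels <- c(
  "0" = "light rail",
  "1" = "subway",
  "2" = "intercity rail",
  "3" = "bus"
)

# The codes in the order returned by unique(gtfs$routes$route_type) before filtering
codes <- c(3, 1, 0, 2)

# Index by character so that code 0 is looked up by name, not by position
labels <- unname(route_type_labels[as.character(codes)])
print(labels)
```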

Agencies assign their own route_ids, and these can change with each newly published GTFS. In this feed, however, each route_id corresponds to a single route_short_name, which does not change unless a route is eliminated or renamed.

length(unique(gtfs$routes$route_id))
[1] 102
length(unique(gtfs$routes$route_short_name))
[1] 102
#it's 1:1
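
Equal counts of unique values are a good sign, but strictly speaking they do not prove that each route_id pairs with exactly one route_short_name. A slightly stricter check, sketched here on a small made-up table (the ids and names are hypothetical, not MTA values), also looks at the distinct pairs. The helper is_one_to_one is our own and is not part of gtfstools.

```r
# Hypothetical routes table standing in for gtfs$routes
routes <- data.frame(
  route_id = c("r1", "r2", "r3"),
  route_short_name = c("21", "22", "26"),
  stringsAsFactors = FALSE
)

# TRUE when every value of x pairs with exactly one value of y, and vice versa
is_one_to_one <- function(x, y) {
  pairs <- unique(data.frame(x = x, y = y, stringsAsFactors = FALSE))
  # After de-duplicating pairs, no value may repeat on either side
  !anyDuplicated(pairs$x) && !anyDuplicated(pairs$y)
}

one_to_one <- is_one_to_one(routes$route_id, routes$route_short_name)
print(one_to_one)
```

On the real feed, the same helper could be called with gtfs$routes$route_id and gtfs$routes$route_short_name.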

Filtering buses

In the MTA bus system, routes 95 and greater are commuter and supplementary services. A number of these commuter routes serve Washington D.C. For this analysis, we are only interested in regular services that operate in Baltimore City and extend into the surrounding Baltimore and Anne Arundel Counties.

In the next chunk, we will convert a range of numbers covering the extraneous bus routes to a character vector. Then, we will subset the rows of the routes table whose route_short_name matches a value in that vector. We can pass the values from the resulting route_id column to a new character vector, which we can then use with the gtfstools filter_by_route_id function. Setting keep = FALSE drops the listed routes instead of keeping them.

##create vector of route short names
comm_names <- as.character(95:850)
##subset route ids that have a match in the comm_names vector
route_ids <- gtfs$routes[which(gtfs$routes$route_short_name %in% comm_names), "route_id"]
##pass the route ids to a new vector
comm_ids <- c(route_ids$route_id)
##filter by route id
gtfs_fil <- filter_by_route_id(gtfs, comm_ids, keep = FALSE) ##seeya

length(unique(gtfs_fil$routes$route_id))
[1] 55
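
The same selection could also be written as a numeric comparison, which avoids having to choose an upper bound such as 850. The sketch below runs on a small made-up routes table (the ids and short names are hypothetical); short names that are not numbers become NA and are treated as routes to keep, which matches how the %in% approach above handles them.

```r
# Hypothetical routes table standing in for gtfs$routes
routes <- data.frame(
  route_id = c("r1", "r2", "r3", "r4"),
  route_short_name = c("21", "95", "310", "Red"),
  stringsAsFactors = FALSE
)

# Non-numeric short names coerce to NA; the warning is expected, so silence it
route_num <- suppressWarnings(as.integer(routes$route_short_name))

# Commuter and supplementary routes: numeric short name of 95 or greater
is_commuter <- !is.na(route_num) & route_num >= 95

comm_ids <- routes$route_id[is_commuter]
print(comm_ids)
```

The resulting comm_ids vector would be passed to filter_by_route_id with keep = FALSE, as in the chunk above.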

Times and days of the week

GTFS stores arrival and departure times as character strings in HH:MM:SS format.

head(gtfs_fil$stop_times$arrival_time)
[1] "20:43:00" "20:43:57" "20:44:53" "20:45:33" "20:46:15" "20:47:21"

The convert_time_to_seconds function converts all arrival and departure times from HH:MM:SS format to seconds after midnight, storing the results in new columns such as arrival_time_secs. This will make later calculations much easier. It also helps iron out a quirk of transit scheduling: a single service day can run past midnight, so GTFS times can exceed 24:00:00.

gtfs_fil <- convert_time_to_seconds(gtfs_fil)
head(gtfs_fil$stop_times$arrival_time_secs)
[1] 74580 74637 74693 74733 74775 74841
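
To make the arithmetic concrete, here is a minimal base R sketch of the same conversion. The helper hms_to_secs is our own illustrative name, not a gtfstools function, and the third input time is made up to show an hour value above 24.

```r
# Convert "HH:MM:SS" strings to integer seconds after midnight
hms_to_secs <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.integer(p)
    # hours * 3600 + minutes * 60 + seconds
    p[1] * 3600L + p[2] * 60L + p[3]
  }, integer(1))
}

# First two values come from the output above; "25:10:00" is a hypothetical
# trip that runs 70 minutes past midnight on the same service day
secs <- hms_to_secs(c("20:43:00", "20:43:57", "25:10:00"))
print(secs)
```

Because the hour field is treated as a plain number, a time such as 25:10:00 simply becomes a value larger than 86400 and still sorts after every earlier stop on the same service day.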

Finally, we will split the GTFS into weekday (Monday to Friday) and weekend (Saturday and Sunday) service. Service is often pared back on the weekends, and it may be useful to exclude those trips from our subsequent calculations.

Agencies assign service IDs to differentiate days of the week. There might be additional service IDs that pertain to special holiday or event services.

gtfs_fil$calendar
   service_id monday tuesday wednesday thursday friday saturday sunday
1:          1      1       1         1        1      1        0      0
2:          2      0       0         0        0      0        1      0
3:          3      0       0         0        0      0        0      1
   start_date   end_date
1: 2018-09-02 2019-02-02
2: 2018-09-02 2019-02-02
3: 2018-09-02 2019-02-02

gtfstools handles this type of filtering with the filter_by_weekday function. We’ll store character vectors for the respective day groupings and then use them to filter the GTFS globally. With combine = "and", a service must run on every listed day; with combine = "or", running on any one of them is enough. We could also use the service IDs to filter individual data tables, like gtfs_fil$trips.

m_f <- c("monday", "tuesday", "wednesday", "thursday", "friday")
sa_su <- c("saturday", "sunday")

gtfs_fil_m_f <- filter_by_weekday(gtfs_fil, m_f, combine = "and")
gtfs_fil_sa_su <- filter_by_weekday(gtfs_fil, sa_su, combine = "or")

unique(gtfs_fil_m_f$trips$service_id)
[1] "1"
unique(gtfs_fil_sa_su$trips$service_id)
[1] "2" "3"
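
To see what the combine argument is doing, the sketch below rebuilds the calendar table printed earlier as a plain data frame and applies the two logical tests by hand. This is our reading of the documented behavior, not the package's internal code.

```r
# The calendar table shown above, rebuilt as a base R data frame
calendar <- data.frame(
  service_id = c("1", "2", "3"),
  monday     = c(1, 0, 0),
  tuesday    = c(1, 0, 0),
  wednesday  = c(1, 0, 0),
  thursday   = c(1, 0, 0),
  friday     = c(1, 0, 0),
  saturday   = c(0, 1, 0),
  sunday     = c(0, 0, 1),
  stringsAsFactors = FALSE
)

m_f <- c("monday", "tuesday", "wednesday", "thursday", "friday")
sa_su <- c("saturday", "sunday")

# "and": the service runs on every listed day
runs_all <- rowSums(calendar[, m_f] == 1) == length(m_f)
# "or": the service runs on at least one listed day
runs_any <- rowSums(calendar[, sa_su] == 1) > 0

weekday_ids <- calendar$service_id[runs_all]
weekend_ids <- calendar$service_id[runs_any]
print(weekday_ids)
print(weekend_ids)
```

Applying the "and" test to the weekend days would select nothing in this table, since no single service_id runs on both Saturday and Sunday. That is why the weekend grouping uses "or".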