library(gtfstools)
library(dplyr)
Cleaning a GTFS using gtfstools
Introducing GTFS and gtfstools
On this page, we use some functions from gtfstools as well as the tidyverse to clean a GTFS. We will explore some of the structure of the GTFS and prepare it for calculations we will perform on subsequent pages.
read_gtfs reads a GTFS and holds it in your environment as a dt_gtfs: a list of data.tables, one for each table in the feed.
gtfs <- read_gtfs("./data/gtfs 11_2018.zip")
class(gtfs)
[1] "dt_gtfs" "gtfs" "list"
summary(gtfs)
A gtfs object with the following tables and respective numbers of entries in each:
        agency       calendar calendar_dates         routes         shapes
             1              4              6            107         280190
    stop_times          stops      transfers          trips
        745329           4726           1266          15484
The data
For this exercise, we are using a GTFS published by the Maryland Transit Administration (MTA), containing information about service from September 2018 into February 2019. That information is contained, appropriately, in the “calendar” table of the dt_gtfs.
gtfs$calendar$start_date
[1] "2018-09-02" "2018-09-02" "2018-09-02" "2018-09-02"
gtfs$calendar$end_date
[1] "2019-02-02" "2019-02-02" "2019-02-02" "2019-02-02"
There are a bunch of other tables here, containing rows upon rows of information about transit service circa Fall and Winter, 2018-19. One of the tables is “routes”, which contains information about route names and types.
unique(gtfs$routes$route_type)
[1] 3 1 0 2
Filtering by route type
In the GTFS Reference, light rail is assigned type “0”, subway is assigned type “1”, intercity rail is assigned type “2”, and bus is assigned type “3”. Knowing this, we can use the filter_by_route_type function in gtfstools to filter the entire dt_gtfs to just the information about bus service.
##filter down gtfs to just bus routes
gtfs <- filter_by_route_type(gtfs, route_type = 3)
##from 4 types to 1
unique(gtfs$routes$route_type)
[1] 3
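The numeric codes are easy to mix up, so it can help to keep a small lookup on hand. The sketch below uses only base R; route_type_labels is a hypothetical name of our own, not something gtfstools provides, and it covers only the four codes that appear in this feed.

```r
# Labels for the four route_type codes that appear in this feed,
# following the GTFS Reference. `route_type_labels` is a hypothetical
# helper name, not part of gtfstools.
route_type_labels <- c(
  "0" = "light rail",
  "1" = "subway",
  "2" = "intercity rail",
  "3" = "bus"
)

# The codes returned by unique(gtfs$routes$route_type) before filtering
codes <- c(3, 1, 0, 2)

# Index by name, so the numeric codes are converted to character first
labels <- unname(route_type_labels[as.character(codes)])
print(labels)
```

Indexing by name rather than by position matters here: route_type_labels[0] would return an empty vector, while route_type_labels["0"] returns the light rail label.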
Agencies assign their own route_ids, and these can change with each newly published GTFS. In this feed, however, each route_id corresponds to one route_short_name, which does not change unless a route is eliminated or renamed.
length(unique(gtfs$routes$route_id))
[1] 102
length(unique(gtfs$routes$route_short_name))
[1] 102
#it's 1:1
Filtering buses
In the MTA bus system, routes 95 and greater are commuter and supplementary services. A number of these commuter routes serve Washington, D.C. For this analysis, we are only interested in regular services that operate in Baltimore City and extend into the surrounding Baltimore and Anne Arundel Counties.
In the next chunk, we will convert a range of numbers representing the extraneous bus routes to a character vector. Then, we will subset the rows of the routes table whose route_short_name has a match in that character vector. We can pass the values from the resulting route_id column to a new character vector, which we can subsequently use with the filter_by_route_id function from gtfstools.
##create vector of route short names
comm_names <- as.character(95:850)
##subset route ids that have a match in the comm_names vector
route_ids <- gtfs$routes[which(gtfs$routes$route_short_name %in% comm_names), "route_id"]
##pass the route ids to a new vector
comm_ids <- c(route_ids$route_id)
##filter by route id
gtfs_fil <- filter_by_route_id(gtfs, comm_ids, keep = FALSE) ##seeya
length(unique(gtfs_fil$routes$route_id))
[1] 55
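To see the subsetting logic on its own, here is a minimal sketch in base R on a toy routes table. The route_ids and short names below are made up for illustration and do not come from the MTA feed; the last step only mimics what keep = FALSE does to the routes table, not what filter_by_route_id does to the rest of the GTFS.

```r
# Toy stand-in for gtfs$routes; ids and short names are made up
routes <- data.frame(
  route_id = c("11681", "11682", "11683", "11684"),
  route_short_name = c("21", "95", "R1", "410"),
  stringsAsFactors = FALSE
)

# Short names of the commuter and supplementary routes, as character strings
comm_names <- as.character(95:850)

# route_ids whose short name has a match in comm_names
comm_ids <- routes$route_id[routes$route_short_name %in% comm_names]

# Dropping those ids mirrors keep = FALSE for the routes table
kept <- routes[!(routes$route_id %in% comm_ids), ]
print(comm_ids)
print(kept$route_short_name)
```

Because the match is on character strings, a short name has to be spelled exactly like an element of comm_names: "95" matches, but a zero-padded "095" would not.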
Times and days of the week
GTFS stores arrival and departure times as character strings in HH:MM:SS format.
head(gtfs_fil$stop_times$arrival_time)
[1] "20:43:00" "20:43:57" "20:44:53" "20:45:33" "20:46:15" "20:47:21"
The convert_time_to_seconds function allows us to convert all arrival and departure times from HH:MM:SS format to seconds after midnight. This is going to make later calculations much easier. It also helps iron out some quirks that come with transit scheduling, since a single day of service can often extend beyond 24 hours.
gtfs_fil <- convert_time_to_seconds(gtfs_fil)
head(gtfs_fil$stop_times$arrival_time_secs)
[1] 74580 74637 74693 74733 74775 74841
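The arithmetic behind the conversion is simple enough to check by hand. The sketch below is a base R illustration of the idea, not the gtfstools implementation; hms_to_secs is a hypothetical helper name, and the "25:10:00" value is an invented example of a trip that runs past midnight.

```r
# Hypothetical helper, not part of gtfstools:
# convert "HH:MM:SS" strings to seconds after midnight
hms_to_secs <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.integer(p)
    p[1] * 3600L + p[2] * 60L + p[3]
  }, integer(1))
}

# Two times from the output above, plus an invented after-midnight time
secs <- hms_to_secs(c("20:43:00", "20:43:57", "25:10:00"))
print(secs)
```

The first two values reproduce the first two entries of arrival_time_secs above. The third shows why seconds are convenient: an hour of 25 is not a valid clock time, but as a count of seconds it simply sorts after every earlier stop on the same service day.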
Finally, we will break the GTFS down into weekday (Monday to Friday) and weekend (Saturday and Sunday) service. Service is often pared back on the weekends and it may be useful to exclude those trips from our subsequent calculations.
Agencies assign service IDs to differentiate days of the week. There might be additional service IDs that pertain to special holiday or event services.
gtfs_fil$calendar
   service_id monday tuesday wednesday thursday friday saturday sunday
1:          1      1       1         1        1      1        0      0
2:          2      0       0         0        0      0        1      0
3:          3      0       0         0        0      0        0      1
   start_date   end_date
1: 2018-09-02 2019-02-02
2: 2018-09-02 2019-02-02
3: 2018-09-02 2019-02-02
gtfstools handles this type of filtering, once again, with a function. We’ll store character vectors for the respective service groupings and then use them to filter the GTFS globally. We could also use the service IDs to filter individual data tables, like gtfs_fil$trips.
m_f <- c("monday", "tuesday", "wednesday", "thursday", "friday")
sa_su <- c("saturday", "sunday")
gtfs_fil_m_f <- filter_by_weekday(gtfs_fil, m_f, combine = "and")
gtfs_fil_sa_su <- filter_by_weekday(gtfs_fil, sa_su, combine = "or")
unique(gtfs_fil_m_f$trips$service_id)
[1] "1"
unique(gtfs_fil_sa_su$trips$service_id)
[1] "2" "3"
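The choice between "and" and "or" matters for the weekend group. The sketch below replays the logic in base R on a hand-typed copy of the three calendar rows shown earlier. It is an illustration of how the two options select service IDs under our reading of them, not the gtfstools internals, and the variable names are our own.

```r
# Hand-typed copy of the three calendar rows shown earlier
calendar <- data.frame(
  service_id = c("1", "2", "3"),
  monday = c(1, 0, 0), tuesday = c(1, 0, 0), wednesday = c(1, 0, 0),
  thursday = c(1, 0, 0), friday = c(1, 0, 0),
  saturday = c(0, 1, 0), sunday = c(0, 0, 1),
  stringsAsFactors = FALSE
)

m_f <- c("monday", "tuesday", "wednesday", "thursday", "friday")
sa_su <- c("saturday", "sunday")

# "and": the service must run on every listed day
weekday_ids <- calendar$service_id[rowSums(calendar[, m_f]) == length(m_f)]

# "or": the service must run on at least one listed day
weekend_ids <- calendar$service_id[rowSums(calendar[, sa_su]) > 0]

# "and" on the weekend days matches nothing, because Saturday and
# Sunday service have separate service IDs in this feed
both_weekend_ids <- calendar$service_id[rowSums(calendar[, sa_su]) == length(sa_su)]

print(weekday_ids)
print(weekend_ids)
print(length(both_weekend_ids))
```

Since Saturday and Sunday are scheduled under different service IDs here, requiring both days at once would leave the weekend GTFS empty, which is why the weekend filter uses "or".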