Messing with MTA Subway Turnstile Data
So I haven’t been as good at blogging for the past month. My last post was on week 3 at RC, yikes. In the meantime, I have:
- implemented Hough transform line and shape detection
- built a small What is the color of ___? app
- mucked around with CSS3 animations
- attempted the Matasano Crypto Challenge of which, I am still stuck on set 1, challenge 6.
And most recently poked around some MTA Subway data.
Doing stuff with the MTA data
For the uninitiated, the Metropolitan Transit authority, MTA runs the subway and buses in NYC. And they have an API. Whattt.
Turnstile data!
I didn’t play with the API though. I was more interested in their Turnstile data which has weekly logs of cumulative entries and exits per turnstile per subway station at some time interval.
Which means, perfect for subway traffic analysis. Only that the data was in 200,000 line long, 20MB txt files in two different formats. For one week’s worth of numbers. Nnnnggghhh.
Scraping, parsing, cleaning with SQL!
I spent quite a bit of time writing code to parse the data and to dump it into a database. Let me just say that the data was not at all pretty. Logging intervals vary by stations and happen and different times of the day. The turnstile counters also get reset every now and then on different days.
I wrote a handy command line tool to scrape MTA data so you can specify a date range and it will grab the files from that range and dump the raw data in a db. You can find it on Github here and a sample dataset for Sept 2015 here:
I also wrote a script to clean the data and do some simple cleaning. More details on the README:
Because of all the wonky different formatting, writing simple tests to make sure that my code was working turned out to pay dividends, especially when I started adding edge cases and working with new things like SQLite. This was also my first time writing SQL queries and learning things like:
- how to select and group stuff!
- queries are hella slow because it’s reading from disk and should be minimized
- learning how to think in tables. grouping data and working with those chunks speeds things up a lot.
Thinking in tables
I talked to from Jesse, an RC alum who was super helpful with thinking about the data. The main thing I was trying to do was to find the number of entries/exits for a particular timeframe (which wasn’t explicit because MTA only gives you the running cumulative numbers at every entry).
Instead of having to look through the entire database for the previous entry to find the difference (which took way too long, about an hour with 200,000 rows), he suggested grouping the data by unique turnstile (about 4000+ of them), sorting by date, then lookup for the previous entry would just be the one before, i.e. O(1). In total that would only be 1 query to get list of unique turnstiles, followed by 4000+ queries for each turnstile, which turned out to be waaay faster (under 5 minutes).
Also, working with Python has been a pleasure, especially when it comes to writing simple assert tests, something I haven’t been able to make into a habit with Javascript.
I have useful data, now what?
Now that we have cleaner data, we can do stuff with it! Some preliminary data.
Note: Take it with a grain of salt because I haven’t checked for outliers
Top 10 most popular stations for Sept 2015*
Edit: turns out I was working on the wrong database with cumulative entries instead when I queried this (facepalm). Have fixed with data from the right database this time around. Still a little wonky, will have to learn how to filter out outliers later.
Edit 2: also turns out I have a lot of outliers. Did a simple pruning by standard deviation and ended up with this data. Still had to remove a couple of clear outliers by hand, but data makes sense now!
Note the repetition in 42 ST station names – I’m not sure how or why this is, so ideas are welcome.
I’m assuming that 42 ST-PA BUS TE
is the one that runs on the ACE line and 42 ST-TIMES SQ
is on the 123 line. The subsequent lines are just lines that they are connected to?
ID UNIT STATION LINENAME TOTAL_ENTRIES
---------- ---------- ------------ ---------- -------------
728605 R170 14 ST-UNION 456LNQR 2675454
733260 R046 42 ST-GRD CN 4567S 1789939
679390 R022 34 ST-HERALD BDFMNQR 1732995
714660 R084 59 ST-COLUMB 1ABCD 1729229
713284 R033 42 ST-TIMES 1237ACENQR 1584018
637233 R012 34 ST-PENN S ACE 1563049
738398 R179 86 ST 456 1562741
765251 R055 MAIN ST 7 1525426
711512 R293 34 ST-PENN S 123ACE 1356457
635141 R011 42 ST-PA BUS ACENQRS123 1342426
All stations on the G line in descending popularity
Yes, I take the G a lot, and kind of like how small and unpopular it is! Look at the ridership difference between METROPOLITAN AV
and 42 ST
above. Seems like there are a lot of people who live in LIC/Astoria that take the G all the way up. I’m surprised that HOYT
doesn’t rank higher because it transfers to the ACE
, but then again, most people would have gone on elsewhere and transferred without going through the turnstiles. We’re losing out on a lot of transfer data here.
UNIT STATION LINENAME TOTAL_ENTRIES TOTAL_EXITS
---------- --------------- ---------- ------------- -----------
R268 METROPOLITAN AV GL 406604 204217
R359 COURT SQ EMG 393372 206356
R258 4 AVE DFGMNR 317127 285127
R288 7 AV-PARK SLOPE FG 272005 60317
R204 CHURCH AVE FG 270687 169046
R129 BERGEN ST FG 270588 166326
R220 CARROLL ST FG 268265 121279
R217 HOYT/SCHERMER ACG 240597 213857
R256 NASSAU AV G 218548 113352
R239 GREENPOINT AVE G 218107 155785
R269 BEDFORD/NOSTRAN G 198039 110727
R241 15 ST-PROSPECT FG 146397 63455
R286 MYRTLE-WILLOUGH G 144565 79652
R289 FT HAMILTON PKY FG 134171 38559
R317 CLINTON-WASH AV G 132916 19985
R270 SMITH-9 ST FG 126743 142212
R287 CLASSON AVE G 122914 116782
R318 FULTON ST G 114912 110825
R299 BROADWAY G 102202 85013
R316 FLUSHING AVE G 64595 64798
R360 VAN ALSTON-21ST G 45890 50158
I also have other snippets of data on the Github README.
What’s next
Gonna work on extracting more interesting data and a visualizer!
If you like trains, check out fun facts from the MTA and a previously created turnstile visualizer.
Also, missing the Summer 2s already, but on the flipside we had a huge new batch of Fall 2 RCers join us this Monday! It’s already past the halfway point of RC and time feels like it’s running out too fast.
Back to code!
*From dates of 2015-08-29 to 2015-09-25