EDA: NYC MTA Turnstile Data
Table of Contents
- Data Import
- Data Cleaning
- Analysis
- Which station has the most number of units?
- What is the total number of entries & exits across the subway system for February 1, 2013?
- What station was the busiest on February 1, 2013? What turnstile was the busiest on that date?
- What stations have seen the most usage growth/decline in 2013?
- What dates are the least busy? Could you identify days on which stations were not operating at full capacity or closed entirely?
- What hour is the busiest for station CANAL ST in Q1 2013?
import pandas as pd
import os
import itertools
import time
import datetime
import matplotlib.pyplot as plt
def format_old_file(filename):
    """Reshape a pre-10/18/14 file (several readings per row) into one observation per row."""
    column_names = ['C/A', 'UNIT', 'SCP', 'DATE', 'TIME', 'DESC', 'ENTRIES', 'EXITS']
    # chunk columns 3..42 into eight groups of five (one reading each)
    remainders = itertools.zip_longest(*[iter(range(3, 43))] * 5, fillvalue=0)
    # prepend the shared C/A, UNIT, SCP columns to each reading group
    column_values = [[0, 1, 2, *row] for row in remainders]
    # read the file once per reading group, then stack the results
    frames = [pd.read_csv(filename, header=None, usecols=usecols, names=column_names)
              for usecols in column_values]
    df = pd.concat(frames, axis=0)
    df.reset_index(drop=True, inplace=True)
    return df
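To make the usecols trick above concrete, here's a quick sanity check (a sketch, safe to skip) of what column_values evaluates to: the three shared columns plus one five-column reading group at a time.
remainders = itertools.zip_longest(*[iter(range(3, 43))] * 5, fillvalue=0)
column_values = [[0, 1, 2, *row] for row in remainders]
print(column_values[0])    # [0, 1, 2, 3, 4, 5, 6, 7]
print(column_values[1])    # [0, 1, 2, 8, 9, 10, 11, 12]
print(len(column_values))  # 8 reading groups per raw row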
Pull in 2013 Data
All txt files for the year of 2013 were downloaded from http://web.mta.info/developers/turnstile.html and placed in a local directory called ‘mta_data’. Data files created prior to 10/18/14 do not have one observation per row, and thus require some preprocessing and transformation (function above). Following this step, all the resulting dataframes from each weekly txt file are concatenated into one large dataframe containing all data for 2013. In all, it takes ~1 min to load in and process all the files (on my Mac at least).
Within the mta_data folder, a subdirectory called ‘2013’ contains all 2013 txt files. Additionally, within the mta_data directory there are two supporting data files:
- The most recent MTA data file, used to answer question 1
- Remote-Booth-Station.xls, used to map the older data files to station names
According to the MTA website, the data contains the following fields. Data prior to 10/18/14 contain a subset of the same fields; fields not present in the older data are marked with an asterisk.
- C/A = Control Area (A002)
- UNIT = Remote Unit for a station (R051)
- SCP = Subunit Channel Position represents a specific address for a device (02-00-00)
- STATION* = Represents the station name the device is located at
- LINENAME* = Represents all train lines that can be boarded at this station. Normally lines are represented by one character. LINENAME 456NQR represents service for the 4, 5, 6, N, Q, and R trains.
- DIVISION* = Represents the division the station originally belonged to: BMT, IRT, or IND
- DATE = Represents the date (MM-DD-YY)
- TIME = Represents the time (hh:mm:ss) for a scheduled audit event
- DESC = Represents the “REGULAR” scheduled audit event (normally occurs every 4 hours). 1. Audits may occur more often than every 4 hours due to planning or troubleshooting activities. 2. Additionally, there may be a “RECOVR AUD” entry: this refers to a missed audit that was recovered.
- ENTRIES = The cumulative entry register value for a device
- EXITS = The cumulative exit register value for a device
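For intuition, a single pre-10/18/14 row packs several readings together: a C/A, UNIT, SCP prefix followed by repeated DATE, TIME, DESC, ENTRIES, EXITS groups. A raw row looks roughly like this (synthetic values, schema per the field list above):
A002,R051,02-00-00,01-05-13,03:00:00,REGULAR,3954621,1364243,01-05-13,07:00:00,REGULAR,3954638,1364249,...
format_old_file above splits each such row into one observation per reading.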
data_dir = '../../mta_data/2013/'
filenames = [file for file in sorted(os.listdir(data_dir)) if 'turnstile' in file]
t = time.time()
df = pd.concat([format_old_file(data_dir + file) for file in filenames], axis=0)
print(f"Imported data in {time.time() - t} seconds")
print(f"Number of rows: {len(df)}")
Imported data in 46.12816309928894 seconds
Number of rows: 12380824
Pull in new data file
We also need to pull the latest data file, which maps each unit to the station it belongs to; this mapping is not included in the 2013 data files. It will be merged in later to get the station-to-turnstile mapping.
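As a quick peek at the newer format (a sketch, assuming the same turnstile_210612.txt file read in Question 1), the post-10/18/14 files already have one observation per row and carry the STATION, LINENAME, and DIVISION fields:
# read only the header row of the newer file
new_cols = pd.read_csv('../../mta_data/turnstile_210612.txt', nrows=0).columns
print(list(new_cols))  # expected: C/A, UNIT, SCP, STATION, LINENAME, DIVISION, DATE, TIME, DESC, ENTRIES, EXITS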
Cleaning and Quality Check
Null Rows
na_rows = df.isna().any(axis=1).sum()
print(f"There are {na_rows} rows with na values ({round(na_rows/len(df) * 100, 2)} %)")
There are 916414 rows with na values (7.4 %)
Most of the na values are likely an artifact of the parsing logic required for the older file format. Moreover, there is no effective or reasonable assumption that can be made to fill in the missing values and, in fact, filling them incorrectly could make matters worse. With over 90% (10 million+) of the records intact, we will simply drop the rows containing na values.
df.dropna(inplace=True)
Negative Entries/Exits
df.describe()
| | ENTRIES | EXITS |
|---|---|---|
| count | 1.146441e+07 | 1.146441e+07 |
| mean | 5.034486e+06 | 2.971423e+06 |
| std | 3.292653e+07 | 3.299601e+07 |
| min | -9.314769e+08 | -8.789686e+08 |
| 25% | 3.364450e+05 | 1.932042e+05 |
| 50% | 1.982436e+06 | 1.216322e+06 |
| 75% | 4.972726e+06 | 3.579545e+06 |
| max | 9.168487e+08 | 9.797130e+08 |
A quick description of the data shows negative values for ENTRIES and EXITS. This doesn’t make sense given that the columns are meant to be cumulative. The exact reason for the negative values is unclear (it may have something to do with DOOR OPEN/DOOR CLOSE events), but a quick check of some records suggests the counters still accurately track the number of entries/exits between readings even when the absolute value is abnormal. So we can still calculate the change between readings and then check whether any of those changes come out negative (which we’d have to fix).
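As one way to do that spot check (a sketch; the turnstile picked is arbitrary), pull the raw readings for a device with a negative register value and eyeball the reading-to-reading differences:
# sketch: inspect readings for one turnstile whose ENTRIES register is negative
sample = df.loc[df['ENTRIES'] < 0, ['C/A', 'UNIT', 'SCP']].iloc[0]
mask = ((df['C/A'] == sample['C/A']) & (df['UNIT'] == sample['UNIT'])
        & (df['SCP'] == sample['SCP']))
df.loc[mask, ['DATE', 'TIME', 'ENTRIES', 'EXITS']].head(10)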
Dates
Combining the date and time columns into a separate datetime column will simplify analysis and reduce the number of columns. Additionally, as all analysis is being confined to 2013, we will ensure only dates in that year are represented.
df['DateTime'] = pd.to_datetime(df['DATE'] + ' ' + df['TIME'])
df = df.drop(columns=['DATE','TIME'])
df = df[(df.DateTime >= '2013-01-01 00:00:00') & (df.DateTime <= '2013-12-31 23:59:59')]
TurnStile and Entries/Exits change columns
We’ll create a TurnStile column that combines C/A, UNIT, and SCP, since that uniquely identifies a turnstile within a station and will simplify the groupby statements. Additionally, we’ll calculate the change in entries and exits at each reading for a turnstile, which will be needed for later analysis.
df = df.reset_index(drop=True)
df = df.assign(TurnStile=df['C/A'] + '-' + df['UNIT'] + '-' + df['SCP'])
df_sort = df.sort_values(['TurnStile','DateTime'])
df['EntriesChange'] = df_sort.groupby('TurnStile')['ENTRIES'].diff()
df['ExitsChange'] = df_sort.groupby('TurnStile')['EXITS'].diff()
# fill na values which mark the first measurement of the turnstile
df = df.fillna({'EntriesChange':0, 'ExitsChange':0})
Quality Check on Change columns
Doing a describe shows that some of the change values are negative. This doesn’t make sense, so we will have to amend the data.
df.describe()
| | ENTRIES | EXITS | EntriesChange | ExitsChange |
|---|---|---|---|---|
| count | 1.137625e+07 | 1.137625e+07 | 1.137625e+07 | 1.137625e+07 |
| mean | 5.043418e+06 | 2.975773e+06 | 8.206008e+02 | 6.460294e+02 |
| std | 3.300180e+07 | 3.307180e+07 | 9.819976e+05 | 1.025034e+06 |
| min | -9.314769e+08 | -8.789650e+08 | -9.314769e+08 | -9.797130e+08 |
| 25% | 3.366480e+05 | 1.932310e+05 | 1.000000e+00 | 1.000000e+00 |
| 50% | 1.982904e+06 | 1.216513e+06 | 5.200000e+01 | 3.700000e+01 |
| 75% | 4.973487e+06 | 3.579991e+06 | 2.100000e+02 | 1.450000e+02 |
| max | 9.168487e+08 | 9.797130e+08 | 9.168486e+08 | 9.719247e+08 |
tot_rows_neg = len(df.loc[(df['EntriesChange'] < 0) | (df['ExitsChange'] < 0)])
print(f"Total rows with negative change values: {tot_rows_neg}")
print(f"Percentage of negative change values rows out of total: {round(tot_rows_neg / len(df) * 100, 2)}%")
Total rows with negative change values: 2304
Percentage of negative change values rows out of total: 0.02%
These rows with negative values account for a whopping 0.02% of all the data. For such a small sample, it’s not worth writing a possibly complicated algorithm to reconstruct reasonable values, so we’ll just drop them.
However, the negative values imply the inverse is also happening: abnormally large positive jumps. For an example, see the slice below. These are not as easy to identify. We could use the IQR or a z-score to flag and delete them (a sketch of the IQR approach follows the slice), but we’d risk losing data for stations that see genuine jumps from holidays or special events. To capture those as best we can, we’ll simply set an upper limit of 10,000: if a row has more than 10,000 entries or exits in a reading period of 4 hours or less, we’ll consider that too high. Over 4 hours, that’s a rate of ~41 people per minute through one turnstile.
df.loc[
(df.TurnStile == "R644-R135-01-00-00")
& (df.DateTime.dt.date.astype(str) == "2013-11-26")
].head()
| | C/A | UNIT | SCP | DESC | ENTRIES | EXITS | DateTime | TurnStile | EntriesChange | ExitsChange |
|---|---|---|---|---|---|---|---|---|---|---|
| 10334734 | R644 | R135 | 01-00-00 | REGULAR | 4039.0 | 1310543.0 | 2013-11-26 08:00:00 | R644-R135-01-00-00 | 0.0 | 65.0 |
| 10334735 | R644 | R135 | 01-00-00 | LGF-MAN | 334108.0 | 1440925.0 | 2013-11-26 12:25:04 | R644-R135-01-00-00 | 0.0 | 0.0 |
| 10363112 | R644 | R135 | 01-00-00 | RECOVR AUD | 334063.0 | 1440813.0 | 2013-11-26 08:00:00 | R644-R135-01-00-00 | 330024.0 | 130270.0 |
| 10363113 | R644 | R135 | 01-00-00 | LOGON | 334108.0 | 1440925.0 | 2013-11-26 12:25:15 | R644-R135-01-00-00 | 0.0 | 0.0 |
| 10389840 | R644 | R135 | 01-00-00 | REGULAR | 4040.0 | 1310668.0 | 2013-11-26 12:00:00 | R644-R135-01-00-00 | -330023.0 | -130145.0 |
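For reference, an IQR-based upper fence (not what we use here, just a sketch of the alternative mentioned above) would look something like this:
# sketch: flag changes above Q3 + 1.5*IQR instead of a fixed cap
q1_val, q3_val = df['EntriesChange'].quantile([0.25, 0.75])
upper_fence = q3_val + 1.5 * (q3_val - q1_val)
print(f"IQR upper fence for EntriesChange: {upper_fence}")
# ~523.5 given the quartiles in the describe above — far below the 10,000 cap,
# so it would clip a lot of legitimate event-day spikes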
# drop negative values
df = df[~((df['EntriesChange'] < 0) | (df['ExitsChange'] < 0))]
# drop records with values greater than 10,000
df = df[~((df['EntriesChange'] > 10000) | (df['ExitsChange'] > 10000))]
df.describe()
ENTRIES | EXITS | EntriesChange | ExitsChange | |
---|---|---|---|---|
count | 1.137349e+07 | 1.137349e+07 | 1.137349e+07 | 1.137349e+07 |
mean | 5.042996e+06 | 2.975560e+06 | 1.572528e+02 | 1.234072e+02 |
std | 3.299406e+07 | 3.306292e+07 | 2.516331e+02 | 2.265497e+02 |
min | -9.314769e+08 | -8.789650e+08 | 0.000000e+00 | 0.000000e+00 |
25% | 3.369660e+05 | 1.934675e+05 | 1.000000e+00 | 1.000000e+00 |
50% | 1.983519e+06 | 1.216895e+06 | 5.200000e+01 | 3.700000e+01 |
75% | 4.973984e+06 | 3.580515e+06 | 2.100000e+02 | 1.450000e+02 |
max | 9.168487e+08 | 9.797130e+08 | 9.890000e+03 | 9.571000e+03 |
Much better. Now we’re ready to dig into the questions.
Question 1
Which station has the most number of units?
We can use the most recent data file to get the current number of units per station, since the 2013 data does not contain the station column.
new_df = pd.read_csv('../../mta_data/turnstile_210612.txt', engine='python')
new_df['TurnStile'] = new_df['C/A'] + '-' + new_df['UNIT'] + '-' + new_df['SCP']
fig, ax = plt.subplots(figsize=(25, 8))
largest_stations = (
new_df.drop_duplicates("TurnStile")
.groupby(["STATION"])
.size()
.sort_values(ascending=False)
.head(10)
)
largest_stations.plot(kind="bar", ax=ax)
ax.set(
title="Top 10 Largest Stations by Turnstile Count",
xlabel="Station",
ylabel="Number of TurnStiles",
)
ax.legend().set_visible(False)
ANSWER
print(f"Answer: {pd.Series(largest_stations)[:1]}")
Answer: STATION
34 ST-PENN STA 102
dtype: int64
34 ST-PENN STA has the most turnstiles, with 102.
Question 2
What is the total number of entries & exits across the subway system for February 1, 2013?
feb113_df = df.loc[df.DateTime.dt.date.astype(str) == '2013-02-01']
feb113_df.head()
| | C/A | UNIT | SCP | DESC | ENTRIES | EXITS | DateTime | TurnStile | EntriesChange | ExitsChange |
|---|---|---|---|---|---|---|---|---|---|---|
| 753397 | A002 | R051 | 02-00-00 | REGULAR | 3974913.0 | 1370326.0 | 2013-02-01 15:00:00 | A002-R051-02-00-00 | 207.0 | 32.0 |
| 753403 | A002 | R051 | 02-00-01 | REGULAR | 3792617.0 | 821317.0 | 2013-02-01 15:00:00 | A002-R051-02-00-01 | 166.0 | 25.0 |
| 753418 | A002 | R051 | 02-03-01 | REGULAR | 3612280.0 | 5634445.0 | 2013-02-01 19:00:00 | A002-R051-02-03-01 | 621.0 | 377.0 |
| 753454 | A002 | R051 | 02-03-06 | REGULAR | 4959945.0 | 439721.0 | 2013-02-01 07:00:00 | A002-R051-02-03-06 | 46.0 | 4.0 |
| 753460 | A002 | R051 | 02-05-00 | REGULAR | 907.0 | 0.0 | 2013-02-01 15:00:00 | A002-R051-02-05-00 | 0.0 | 0.0 |
ANSWER
tot_entries = feb113_df.EntriesChange.sum()
tot_exits = feb113_df.ExitsChange.sum()
print(f"Total number of entries across the whole subway system on February 1, 2013: {tot_entries}")
print(f"Total number of exits across the whole subway system on February 1, 2013: {tot_exits}")
Total number of entries across the whole subway system on February 1, 2013: 5818588.0
Total number of exits across the whole subway system on February 1, 2013: 4516096.0
On February 1st, 2013, 5.81 million people entered the subway and only 4.51 million came out. Terrifying. Although it makes sense: a lot of people actually leave through the emergency exit gates, which doesn’t get counted. That’s a gap of roughly 20%. Interesting.
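To put a number on that gap (using the totals computed above):
# share of entries with no matching recorded exit
gap_pct = round((tot_entries - tot_exits) / tot_entries * 100, 1)
print(f"{gap_pct}% of entries had no matching exit")  # ~22.4%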
Question 3
What station was the busiest on February 1, 2013? What turnstile was the busiest on that date?
To answer this we first have to map the stations to their turnstiles, since the Feb 2013 data does not contain the station field. MTA provides a Remote-Booth-Station.xls file to help with this; it lives in the mta_data directory (see above) and gets merged with the Feb 2013 data.
Note: Some booths and/or units do not have records in the excel file. Possibly they were removed in the intervening years.
stat_map = pd.read_excel('../../mta_data/Remote-Booth-Station.xls')
feb113_df = feb113_df.merge(
stat_map, left_on=["C/A", "UNIT"], right_on=["Booth", "Remote"], how="left"
)
print(
    f"{feb113_df.isna().any(axis=1).sum()} rows do not have a mapped station available "
    + f"({round(feb113_df.isna().any(axis=1).sum() / len(feb113_df) * 100, 2)}%)"
)
558 rows do not have a mapped station available (1.83%)
ANSWER
busiest_stations = pd.DataFrame((feb113_df.groupby('Station')['EntriesChange'].sum() +
feb113_df.groupby('Station')['ExitsChange'].sum()
), columns=['Entries+Exits']
).sort_values(by="Entries+Exits",ascending=False).head()
busiest_stations
| Station | Entries+Exits |
|---|---|
| 34 ST-PENN STA | 348286.0 |
| 42 ST-GRD CNTRL | 327422.0 |
| 34 ST-HERALD SQ | 222322.0 |
| 86 ST | 208671.0 |
| 14 ST-UNION SQ | 208269.0 |
Busiest Station: 34 ST-PENN STA
Total Entries + Exits on Feb 1st 2013: 348286
busiest_turnstiles = pd.DataFrame(
(feb113_df.groupby('TurnStile')['EntriesChange'].sum() +
feb113_df.groupby('TurnStile')['ExitsChange'].sum()
), columns=['Entries+Exits']).sort_values(by="Entries+Exits",ascending=False).head()
busiest_turnstiles
| TurnStile | Entries+Exits |
|---|---|
| R240-R047-00-00-00 | 13529.0 |
| N601-R319-00-00-00 | 13066.0 |
| N063A-R011-00-00-00 | 12188.0 |
| R221-R170-01-00-00 | 12060.0 |
| R249-R179-01-00-09 | 11692.0 |
# station that the busiest turnstile is located in
tstile_station = feb113_df.loc[
    feb113_df.TurnStile == busiest_turnstiles.iloc[0].name
]["Station"].iloc[0]
print(f"Busiest Turnstile: {busiest_turnstiles.iloc[0].name}")
print(f"Located in Station: {tstile_station}")
print(f"Total Entries + Exits on Feb 1st 2013: {int(busiest_turnstiles.iloc[0]['Entries+Exits'])}")
Busiest Turnstile: R240-R047-00-00-00
Located in Station: 42 ST-GRD CNTRL
Total Entries + Exits on Feb 1st 2013: 13529
Question 4
What stations have seen the most usage growth/decline in 2013?
Checking the percent change from just the first to the last date risks picking up periodicity or anomalies on those particular dates. Instead, we can take the average quarterly traffic for each station and compare Q1 to Q4 to see how usage changed over the year.
# merge in station mapping to get the station for each turnstile
df = df.merge(stat_map, left_on=["C/A","UNIT"],right_on=["Booth","Remote"],how='left')
df['Entries+Exits'] = df['EntriesChange'] + df['ExitsChange']
# calculate Q1 and Q4 mean traffic for each station
q1 = (
df.loc[df["DateTime"].dt.month.isin([1, 2, 3])]
.groupby(["Station"])["Entries+Exits"]
.mean()
.to_frame("Q1Mean")
)
q4 = (
df.loc[df["DateTime"].dt.month.isin([10, 11, 12])]
.groupby(["Station"])["Entries+Exits"]
.mean()
.to_frame("Q4Mean")
)
q_df = q1.join(q4)
q_df["PercentChange"] = round(
((q_df["Q4Mean"] - q_df["Q1Mean"]) / q_df["Q1Mean"]) * 100, 3
)
Stations with largest growth in 2013
q_df.sort_values('PercentChange', ascending=False).head(3)
| Station | Q1Mean | Q4Mean | PercentChange |
|---|---|---|---|
| BEACH 90 ST | 0.050420 | 43.670715 | 86513.585 |
| AQUEDUCT TRACK | 0.370667 | 42.954366 | 11488.408 |
| BEACH 44 ST | 1.610232 | 29.146561 | 1710.084 |
# mean Entries+Exits per reading, by station and calendar month
monthly = df.groupby(['Station', df['DateTime'].dt.month])['Entries+Exits'].mean()
aqueduct_df = monthly.loc['AQUEDUCT TRACK'].to_frame('Aqueduct Track')
beach44_df = monthly.loc['BEACH 44 ST'].to_frame('Beach 44 ST')
beach90_df = monthly.loc['BEACH 90 ST'].to_frame('Beach 90 ST')
beach44_df.join(beach90_df).join(aqueduct_df).fillna(0).plot(kind='line')
Stations with largest decline in 2013
q_df.sort_values('PercentChange').head(3)
| Station | Q1Mean | Q4Mean | PercentChange |
|---|---|---|---|
| HOWARD BCH-JFK | 118.424049 | 49.343952 | -58.333 |
| WHITEHALL ST | 217.121889 | 90.847422 | -58.158 |
| DYCKMAN ST | 511.564791 | 230.912285 | -54.862 |
howard_df = monthly.loc['HOWARD BCH-JFK'].to_frame('Howard Beach JFK')
whitehall_df = monthly.loc['WHITEHALL ST'].to_frame('Whitehall St')
dyckman_df = monthly.loc['DYCKMAN ST'].to_frame('Dyckman St')
howard_df.join(whitehall_df).join(dyckman_df).fillna(0).plot(kind='line')
ANSWER
Stations with largest growth
- Beach 90 St: +86,513%
- Aqueduct Track: +11,488%
- Beach 44 St: +1710%
Fun Fact: A quick Google search shows that the Beach St stations were entirely or partially closed due to damage from Hurricane Sandy. They reopened in May 2013, which corresponds to the sharp increase in overall traffic in month 5. Similarly, Aqueduct Track was closed from 2011 to 2013 while it was rebuilt for better access to the Resorts World Casino, reopening in August 2013.
Stations with largest decline
- Howard Beach JFK: -58.3%
- Whitehall St: -58.1%
- Dyckman St: -54.8%
Fun Fact: Again, Hurricane Sandy comes into play. Because numerous other subway stations were closed by hurricane damage, Howard Beach JFK served as the alternative route while they were rebuilt. As a result, Howard Beach JFK had very high ridership numbers early in the year, and as the closed stations reopened, its traffic fell back to normal levels. Whitehall St and Dyckman St, on the other hand, declined because they underwent rehabilitation work in the latter part of 2013.
Question 5
What dates are the least busy? Could you identify days on which stations were not operating at full capacity or closed entirely?
Least Busy Days
df['DATE'] = df['DateTime'].dt.date
least_busy_days = pd.DataFrame(df.groupby('DATE')['Entries+Exits'].sum().sort_values().head(3)).reset_index()
least_busy_days['DOW'] = pd.to_datetime(least_busy_days['DATE']).dt.day_name()
least_busy_days
| | DATE | Entries+Exits | DOW |
|---|---|---|---|
| 0 | 2013-12-25 | 3452962.0 | Wednesday |
| 1 | 2013-01-01 | 3557056.0 | Tuesday |
| 2 | 2013-11-28 | 4281943.0 | Thursday |
Not too surprising that the least traveled days happen to be the biggest holidays of the year, with Christmas, New Year’s Day, and Thanksgiving taking first, second, and third place, respectively.
Closed Stations
Closed stations would have no entries or exits at any turnstile for the day.
In other words: Completely Closed = total ridership for the day is zero
closed_stations = (
df.groupby(["Station", "DATE"])["Entries+Exits"]
.sum()
.to_frame("DailyTraffic")
.query("DailyTraffic == 0")
)
closed_stations = (
closed_stations.reset_index().groupby("DATE").agg(lambda x: x.tolist())
)
closed_stations["NumberOfStationsClosed"] = closed_stations["Station"].apply(
lambda x: len(x)
)
Number of days which have a station closed: 152
Number of stations closed for the day throughout 2013: 213
closed_stations.sort_values('NumberOfStationsClosed', ascending=False).drop(columns=['DailyTraffic']).head()
| DATE | Station | NumberOfStationsClosed |
|---|---|---|
| 2013-05-18 | [BEACH 105 ST, BEACH 90 ST, BEACH 98 ST, ORCHA... | 5 |
| 2013-05-19 | [BEACH 105 ST, BEACH 44 ST, BEACH 98 ST, ORCHA... | 4 |
| 2013-04-07 | [190 ST, AQUEDUCT TRACK, CRESCENT ST, FOREST P... | 4 |
| 2013-03-31 | [AQUEDUCT TRACK, BEACH 90 ST, CLEVELAND ST, CR... | 4 |
| 2013-01-06 | [14TH STREET, 9TH STREET, KINGSTON AVE, TWENTY... | 4 |
5 stations were closed on May 18th, the most of any day in the year.
Partial Capacity
Stations not operating at full capacity would have no entries or exits at only a portion of their turnstiles for the day.
In other words: Partial = some, but not all, turnstiles are zero
# daily traffic per turnstile; zero = closed all day
closed_turnstiles = (df.groupby(['Station','TurnStile','DATE'])['Entries+Exits'].sum()
                     .to_frame('DailyTSTraffic').query("DailyTSTraffic == 0"))
# number of closed turnstiles per station per day
num_closed_ts = (closed_turnstiles.reset_index().groupby(['Station','DATE']).count()
                 .drop(columns=['TurnStile']).rename(columns={'DailyTSTraffic':'NumberClosed'}))
partial_operating = df[['Station','TurnStile','DateTime','DATE']].merge(
    num_closed_ts.reset_index(), on=['Station','DATE'], how='left')
partial_operating = partial_operating.query("NumberClosed > 0")
# total turnstiles per station per day, to compute the share still operating
station_turnstiles = (partial_operating.drop_duplicates(subset=['TurnStile','DATE'])
                      .groupby(['Station','DATE']).size().to_frame('TotalTurnStiles'))
partial_operating = partial_operating.merge(station_turnstiles, on=['Station','DATE'],
                                            how='left')[['Station','DATE','NumberClosed','TotalTurnStiles']]
partial_operating['OperatingCapacity'] = round(100 * (partial_operating['TotalTurnStiles']
    - partial_operating['NumberClosed']) / partial_operating['TotalTurnStiles'], 2)
partial_operating = (partial_operating.drop_duplicates(subset=['Station','DATE'])
                     .reset_index().groupby('DATE').agg(lambda x: x.tolist()))
partial_operating['NumberOfPartialOperatingStations'] = partial_operating['Station'].apply(len)
partial_operating[['Station','OperatingCapacity','NumberOfPartialOperatingStations']].sort_values(
    'NumberOfPartialOperatingStations', ascending=False).head()
| DATE | Station | OperatingCapacity | NumberOfPartialOperatingStations |
|---|---|---|---|
| 2013-08-18 | [LEXINGTON AVE, 5 AVE-59 ST, 49 ST-7 AVE, 34 S... | [80.0, 85.71, 90.91, 96.3, 94.74, 85.71, 44.44... | 171 |
| 2013-11-17 | [LEXINGTON AVE, 5 AVE-59 ST, 49 ST-7 AVE, 34 S... | [80.0, 92.86, 90.91, 96.3, 97.37, 82.14, 33.33... | 170 |
| 2013-10-20 | [LEXINGTON AVE, 5 AVE-59 ST, 49 ST-7 AVE, 42 S... | [80.0, 85.71, 95.45, 96.0, 98.15, 97.37, 87.5,... | 168 |
| 2013-10-06 | [LEXINGTON AVE, 5 AVE-59 ST, 49 ST-7 AVE, 42 S... | [90.0, 92.86, 90.91, 96.0, 98.15, 83.93, 66.67... | 166 |
| 2013-11-03 | [LEXINGTON AVE, 5 AVE-59 ST, 49 ST-7 AVE, 34 S... | [80.0, 92.86, 90.91, 98.15, 87.5, 66.67, 17.65... | 161 |
ANSWER
Least Busy Dates
Not too surprising, the biggest holidays are the least busy days:
- Christmas (2013-12-25)
- New Year’s Day (2013-01-01)
- Thanksgiving (2013-11-28)
Completely closed and partially closed stations
Refer to the above data frames to see the list of stations that are either completely or partially closed on certain dates.
Bonus
What hour is the busiest for station CANAL ST in Q1 2013?
# filter data down to CANAL ST in Q1
canalq12013 = df.loc[(df['Station'] == 'CANAL ST') & (df['DateTime'].dt.month.isin([1, 2, 3]))]
# resample each turnstile to hourly frequency; hours without a reading become NaN
canalq12013 = (canalq12013.groupby(['TurnStile'])
               .resample('60T', on='DateTime').mean().reset_index())
canalq12013['Hour'] = canalq12013['DateTime'].dt.hour
# linearly interpolate the empty hours, then average traffic by hour of day
(canalq12013.interpolate(method='linear').reset_index()
 .groupby('Hour').mean().sort_values('Entries+Exits', ascending=False)['Entries+Exits'].head())
Hour
20 627.749974
19 626.143669
18 588.330532
17 551.228967
21 540.642590
Name: Entries+Exits, dtype: float64
ANSWER
The busiest hour for CANAL ST in Q1 2013 is hour 20 (8pm), followed closely by hours 19 and 18. So it looks like 5-9pm sees the highest traffic, which makes sense for the evening rush.