NYC taxi data

In this research, we prepare NYC taxi data for analysis.

The prepared data sets are available at mob4cast: Multidimensional time series prediction with passenger/taxi flow data sets.

Yellow taxi trip records

Processing (pickup location, dropoff location, pickup datetime) information

Download data sets from TLC trip record data. For example, download the data set yellow_tripdata_2019-07.csv and process it as follows,

import pandas as pd

data = pd.read_csv('yellow_tripdata_2019-07.csv')

data = data.drop(['VendorID', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'RatecodeID',
                  'store_and_fwd_flag', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount',
                  'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge'], axis = 1)
data

We focus on precessing trip data that covers the information of pickup location, dropoff location, and pickup datetime. Other columns would be dropped.

tpep_pickup_datetime	PULocationID	DOLocationID
0	2019-07-01 00:51:04	193	193
1	2019-07-01 00:46:04	234	25
2	2019-07-01 00:25:09	132	42
3	2019-07-01 00:33:32	132	142
4	2019-07-01 00:00:55	107	114
...	...	...	...
6310414	2019-07-29 16:34:53	137	7
6310415	2019-07-29 16:07:57	162	81
6310416	2019-07-29 16:01:31	70	89
6310417	2019-07-29 16:58:00	209	71
6310418	2019-07-29 16:16:00	87	33
6310419 rows × 3 columns

In this research, we only take into account the trip data in Manhattan, NYC. There are 69 pickup/dropoff zones, and the numbers are given as follows,

import numpy as np

manhattan = np.array([127, 128, 153, 120, 243, 244, 116, 42, 152, 166, 41, 74, 194, 24, 151,
                      238, 75, 239, 43, 236, 263, 143, 142, 262, 50, 237, 141, 140, 202, 163,
                      48, 230, 161, 162, 229, 246, 100, 68, 186, 233, 164, 170, 90, 234, 137,
                      158, 249, 107, 113, 224, 114, 125, 79, 211, 4, 144, 231, 148, 232, 13,
                      261, 45, 209, 87, 12, 88, 105, 104, 103])

Figure 1. TCL taxi zones in Manhattan, NYC, USA. The source is from taxi_zone_map_manhattan.jpg.

Extract trip data in Manhattan, NYC. (How do I select rows from a DataFrame based on column values?)

data = data.loc[data['PULocationID'].isin(manhattan)]
data = data.loc[data['DOLocationID'].isin(manhattan)]
data

The output would be

tpep_pickup_datetime	PULocationID	DOLocationID
4	2019-07-01 00:00:55	107	114
5	2019-07-01 00:18:23	125	45
6	2019-07-01 00:28:09	142	68
7	2019-07-01 00:57:07	246	141
10	2019-07-01 00:17:29	170	142
...	...	...	...
6310306	2019-07-29 15:35:50	140	74
6310336	2019-07-29 16:09:00	234	142
6310353	2019-07-29 16:31:00	229	74
6310360	2019-07-27 13:06:37	151	113
6310386	2019-07-29 16:41:00	233	238
5263707 rows × 3 columns

Extract (day, hour) information from pickup datetime.

data.insert(3, "day", pd.DatetimeIndex(data['tpep_pickup_datetime']).day.values, True)
data.insert(4, "hour", pd.DatetimeIndex(data['tpep_pickup_datetime']).hour.values, True)
data = data.drop(['tpep_pickup_datetime'], axis = 1)
data

Aggregating hourly (pickup location, dropoff location) taxi flow

trip = data.groupby(['PULocationID', 'DOLocationID', 'day', 'hour'])['day'].count().reset_index(name = 'count')
trip

The output would be

PULocationID	DOLocationID	day	hour	count
0	4	4	1	0	1
1	4	4	1	3	1
2	4	4	1	6	1
3	4	4	1	9	1
4	4	4	1	10	1
...	...	...	...	...	...
1212975	263	263	31	19	18
1212976	263	263	31	20	11
1212977	263	263	31	21	11
1212978	263	263	31	22	3
1212979	263	263	31	23	3
1212980 rows × 5 columns

Here, these five columns refer to pickup location, dropoff location, day, hour, and count, respectively. The last column is the trip count of each specific dimension setting.

import numpy as np

np.savez_compressed('yellow_tripdata_7.npz', trip.values)

One can use np.load('yellow_tripdata_7.npz')['arr_0'] to open the .npz file.

Green taxi trip records

import pandas as pd
import numpy as np

data = pd.read_csv('green_tripdata_2019-07.csv')

data = data.drop(['VendorID', 'lpep_dropoff_datetime', 'store_and_fwd_flag', 'RatecodeID', 'passenger_count',
                  'trip_distance', 'fare_amount', 'fare_amount', 'extra', 'mta_tax', 'tip_amount',
                  'tolls_amount', 'ehail_fee', 'improvement_surcharge', 'total_amount', 'payment_type',
                  'trip_type', 'congestion_surcharge'], axis = 1)

manhattan = np.array([127, 128, 153, 120, 243, 244, 116, 42, 152, 166, 41, 74, 194, 24, 151,
                      238, 75, 239, 43, 236, 263, 143, 142, 262, 50, 237, 141, 140, 202, 163,
                      48, 230, 161, 162, 229, 246, 100, 68, 186, 233, 164, 170, 90, 234, 137,
                      158, 249, 107, 113, 224, 114, 125, 79, 211, 4, 144, 231, 148, 232, 13,
                      261, 45, 209, 87, 12, 88, 105, 104, 103])
data = data.loc[data['PULocationID'].isin(manhattan)]
data = data.loc[data['DOLocationID'].isin(manhattan)]

data.insert(3, "day", pd.DatetimeIndex(data['lpep_pickup_datetime']).day.values, True)
data.insert(4, "hour", pd.DatetimeIndex(data['lpep_pickup_datetime']).hour.values, True)
data = data.drop(['lpep_pickup_datetime'], axis = 1)

trip = data.groupby(['PULocationID', 'DOLocationID', 'day', 'hour'])['day'].count().reset_index(name = 'count')
np.savez_compressed('green_tripdata_7.npz', trip.values)
trip

The output would be

PULocationID	DOLocationID	day	hour	count
0	12	236	9	7	1
1	13	13	7	20	1
2	13	48	12	16	1
3	13	239	23	13	1
4	24	24	1	8	1
...	...	...	...	...	...
67750	263	263	23	13	1
67751	263	263	26	9	1
67752	263	263	29	17	1
67753	263	263	29	18	2
67754	263	263	31	9	1
67755 rows × 5 columns

High volume for-hire vehicle trip records

Download data sets from TLC trip record data. For example, download the data set fhvhv_tripdata_2019-07.csv and process it as follows,

import pandas as pd

data = pd.read_csv('fhvhv_tripdata_2019-07.csv')
data = data.drop(['hvfhs_license_num', 'dispatching_base_num', 'dropoff_datetime', 'SR_Flag'], axis = 1)
data

We focus on precessing trip data that covers the information of pickup location, dropoff location, and pickup datetime. Other columns would be dropped. The output would be

pickup_datetime	PULocationID	DOLocationID
0	2019-07-01 00:12:33	228	89
1	2019-07-01 00:41:26	97	188
2	2019-07-01 00:18:50	81	220
3	2019-07-01 00:05:12	78	69
4	2019-07-01 00:29:01	69	239
...	...	...	...
20303306	2019-07-31 23:46:39	138	25
20303307	2019-07-31 23:34:36	231	50
20303308	2019-07-31 23:15:17	244	169
20303309	2019-07-31 23:31:03	169	119
20303310	2019-07-31 23:54:59	42	41
20303311 rows × 3 columns

There are a large amount of trip records. We only take into account the pickup and dropoff in Manhattan.

import numpy as np

manhattan = np.array([127, 128, 153, 120, 243, 244, 116, 42, 152, 166, 41, 74, 194, 24, 151,
                      238, 75, 239, 43, 236, 263, 143, 142, 262, 50, 237, 141, 140, 202, 163,
                      48, 230, 161, 162, 229, 246, 100, 68, 186, 233, 164, 170, 90, 234, 137,
                      158, 249, 107, 113, 224, 114, 125, 79, 211, 4, 144, 231, 148, 232, 13,
                      261, 45, 209, 87, 12, 88, 105, 104, 103])
data = data.loc[data['PULocationID'].isin(manhattan)]
data = data.loc[data['DOLocationID'].isin(manhattan)]

Extract (day, hour) information from pickup datetime.

data.insert(3, "day", pd.DatetimeIndex(data['pickup_datetime']).day.values, True)
data.insert(4, "hour", pd.DatetimeIndex(data['pickup_datetime']).hour.values, True)
data = data.drop(['pickup_datetime'], axis = 1)
data

The output would be

PULocationID	DOLocationID	day	hour
5	48	42	1	0
19	141	87	1	0
21	141	141	1	0
22	246	170	1	0
23	100	263	1	0
...	...	...	...	...
20303280	148	186	31	23
20303281	164	230	31	23
20303282	163	79	31	23
20303307	231	50	31	23
20303310	42	41	31	23
6442962 rows × 4 columns

Aggregating hourly (pickup location, dropoff location) taxi flow

trip = data.groupby(['PULocationID', 'DOLocationID', 'day', 'hour'])['day'].count().reset_index(name = 'count')
trip

The output would be

PULocationID	DOLocationID	day	hour	count
0	4	4	1	0	1
1	4	4	1	1	1
2	4	4	1	3	1
3	4	4	1	7	1
4	4	4	1	9	1
...	...	...	...	...	...
1650811	263	263	31	19	2
1650812	263	263	31	20	6
1650813	263	263	31	21	3
1650814	263	263	31	22	3
1650815	263	263	31	23	2
1650816 rows × 5 columns

Here, these five columns refer to pickup location, dropoff location, day, hour, and count, respectively. The last column is the trip count of each specific dimension setting.

import numpy as np

np.savez_compressed('fhvhv_tripdata_7.npz', trip.values)

For-hire vehicle trip records

import pandas as pd
import numpy as np

data = pd.read_csv('fhv_tripdata_2019-07.csv')
data = data.drop(['dispatching_base_num', 'dropoff_datetime', 'SR_Flag'], axis = 1)

manhattan = np.array([127, 128, 153, 120, 243, 244, 116, 42, 152, 166, 41, 74, 194, 24, 151,
                      238, 75, 239, 43, 236, 263, 143, 142, 262, 50, 237, 141, 140, 202, 163,
                      48, 230, 161, 162, 229, 246, 100, 68, 186, 233, 164, 170, 90, 234, 137,
                      158, 249, 107, 113, 224, 114, 125, 79, 211, 4, 144, 231, 148, 232, 13,
                      261, 45, 209, 87, 12, 88, 105, 104, 103])
data = data.loc[data['PULocationID'].isin(manhattan)]
data = data.loc[data['DOLocationID'].isin(manhattan)]

data.insert(3, "day", pd.DatetimeIndex(data['pickup_datetime']).day.values, True)
data.insert(4, "hour", pd.DatetimeIndex(data['pickup_datetime']).hour.values, True)
data = data.drop(['pickup_datetime'], axis = 1)

trip = data.groupby(['PULocationID', 'DOLocationID', 'day', 'hour'])['day'].count().reset_index(name = 'count')
np.savez_compressed('fhv_tripdata_7.npz', trip.values)
trip

The output would be

PULocationID	DOLocationID	day	hour	count
0	4.0	4.0	1	9	1
1	4.0	4.0	1	13	1
2	4.0	4.0	24	10	1
3	4.0	4.0	24	14	1
4	4.0	4.0	29	5	1
...	...	...	...	...	...
21131	263.0	263.0	1	13	1
21132	263.0	263.0	2	14	2
21133	263.0	263.0	6	18	1
21134	263.0	263.0	14	9	1
21135	263.0	263.0	24	7	1
21136 rows × 5 columns

NYC taxi flow tensor

So far, we have each .npz file for saving taxi flow information of different kinds of vehicles. By using the data from July to December of 2019, we can process data with the following codes:

import numpy as np

manhattan = np.array([127, 128, 153, 120, 243, 244, 116, 42, 152, 166, 41, 74, 194, 24, 151,
                      238, 75, 239, 43, 236, 263, 143, 142, 262, 50, 237, 141, 140, 202, 163,
                      48, 230, 161, 162, 229, 246, 100, 68, 186, 233, 164, 170, 90, 234, 137,
                      158, 249, 107, 113, 224, 114, 125, 79, 211, 4, 144, 231, 148, 232, 13,
                      261, 45, 209, 87, 12, 88, 105, 104, 103])

# Yellow taxi
tensor = np.zeros((69, 69, 31 * 24))
data = np.load('yellow_tripdata_7.npz')['arr_0']
for j in range(data.shape[0]):
    tensor[np.where(manhattan == data[j, 0])[0][0], np.where(manhattan == data[j, 1])[0][0],
                int(data[j, 3] + (data[j, 2] - 1) * 24)] = data[j, 4]
for i in range(8, 13):
    if i == 8 or i == 10 or i == 12:
        sub_tensor1 = np.zeros((69, 69, 31 * 24))
        data = np.load('yellow_tripdata_{}.npz'.format(i))['arr_0']
        for j in range(data.shape[0]):
            sub_tensor1[np.where(manhattan == data[j, 0])[0][0], np.where(manhattan == data[j, 1])[0][0],
                        int(data[j, 3] + (data[j, 2] - 1) * 24)] = data[j, 4]
    else:
        sub_tensor1 = np.zeros((69, 69, 30 * 24))
        data = np.load('yellow_tripdata_{}.npz'.format(i))['arr_0']
        data = np.delete(data, np.where(data[:, 2] == 31), 0)
        for j in range(data.shape[0]):
            sub_tensor1[np.where(manhattan == data[j, 0])[0][0], np.where(manhattan == data[j, 1])[0][0],
                        int(data[j, 3] + (data[j, 2] - 1) * 24)] = data[j, 4]
    tensor = np.append(tensor, sub_tensor1, axis = 2)
np.savez_compressed('yellow_taxi_tensor.npz', tensor)

It turns out these taxi flow data are multidimensional tensor. Now, the data are available at mob4cast: Multidimensional time series prediction with passenger/taxi flow data sets.

Putting all taxi trips of different vehicles together, we have a fourth-order tensor of size .

This is a large-scale data set, one can use GPU to speed up calculations using cupy. Please replace import numpy as np as import cupy as np.

Reference