Data

We are strong advocates of open-source and reproducible research.

In this project, we adapt several publicly available data sets for our spatiotemporal traffic data imputation and forecasting experiments. The data take the form of multivariate time series matrices and multidimensional time series tensors.

Quick look

Working data sets

There are many well-suited real-world data sets (most of them traffic data) that can be used for spatiotemporal data modeling tasks. A carefully selected subset is listed below:

Multivariate time series
- Birmingham parking data set
- California PeMS traffic speed data set (large-scale)
- Guangzhou urban traffic speed data set
- Hangzhou metro passenger flow data set
- London urban movement speed data set (other cities are also available at Uber movement project)
- Portland highway traffic data set (including traffic volume/speed/occupancy, see data documentation)
- Seattle freeway traffic speed data set
- NYC real-time traffic speed data
Multidimensional time series
- New York City (NYC) taxi data set
- Pacific surface temperature data set

These data sets are easy to use in your own experiments. To view or use them, first download them into the ../datasets/ folder, and then run the following code in your Python console:

import scipy.io

tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/tensor.mat')
tensor = tensor['tensor']

This is a simple example using the Guangzhou urban traffic speed data set. The traffic speed observations are formatted as a third-order tensor of size 214-by-61-by-144, covering 214 road segments, 61 days, and 144 time steps per day.
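
Matrix-based models usually expect a two-dimensional input, so a tensor of this shape is often unfolded into a (road segment x time) matrix. The sketch below uses a random array as a stand-in for the Guangzhou tensor (no real data is loaded) and assumes the day and time-of-day axes are flattened in row-major order.

```python
import numpy as np

# Random stand-in with the same shape as the Guangzhou tensor:
# 214 road segments x 61 days x 144 time steps per day.
rng = np.random.default_rng(0)
tensor = rng.random((214, 61, 144))

# Unfold into a (road segment x time) matrix. With NumPy's default
# row-major order, day d and step t map to column d * 144 + t.
n_roads, n_days, n_steps = tensor.shape
mat = tensor.reshape(n_roads, n_days * n_steps)

print(mat.shape)  # (214, 8784)
```

Each row of mat is then the full time series of one road segment, with days laid out one after another.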

In particular, if you are interested in large-scale traffic data, we recommend PeMS-4W/8W/12W and UTD19. The PeMS data can be downloaded from Zenodo and placed in the data sets folder (example data path: ../datasets/California-data-set/pems-4w.csv). You can then open the data with pandas:

import pandas as pd

data = pd.read_csv('../datasets/California-data-set/pems-4w.csv', header = None)

For model evaluation, we mask certain entries of the “observed” data as missing values and then perform imputation for these “missing” values.
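
The masking step can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in the experiments: the data are a random stand-in, the 40% missing rate and the seed are arbitrary, hidden entries are encoded as zeros, the row-mean fill is only a placeholder for a real imputation model, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1000)  # arbitrary seed, for reproducibility only

# Random stand-in for fully observed data (road segment x time point);
# values lie in [1, 2), so zero can safely encode "missing".
dense_mat = rng.random((214, 1440)) + 1.0
missing_rate = 0.4  # assumed fraction of entries to hide

# Keep an entry when its uniform draw is at least the missing rate.
keep = rng.random(dense_mat.shape) >= missing_rate
sparse_mat = dense_mat * keep  # hidden entries become 0

# Positions that are known in dense_mat but hidden in sparse_mat.
pos_test = (dense_mat != 0) & (sparse_mat == 0)

# Placeholder imputation: fill hidden entries with the row mean of kept entries.
row_mean = sparse_mat.sum(axis=1) / np.maximum(keep.sum(axis=1), 1)
mat_hat = np.where(keep, sparse_mat, row_mean[:, None])

# Score the imputation on the hidden positions only.
rmse = np.sqrt(np.mean((dense_mat[pos_test] - mat_hat[pos_test]) ** 2))
print(f"hidden fraction: {pos_test.mean():.3f}, RMSE: {rmse:.3f}")
```

Because the hidden values are still available in dense_mat, any imputation model can be scored the same way by replacing the row-mean fill with its own estimate.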

Recommended data sets

Uber movement data (see the detailed speeds calculation methodology)
- NYC movement speed data set (relevant data include traffic volume counts (2014-2019) and real-time traffic speed and travel time data in New York City, USA)
- Seattle movement speed data set
UTD19: Largest multi-city traffic data set
pNEUMA: A large-scale data set of naturalistic trajectories of half a million vehicles that have been collected by a one-of-a-kind experiment (see a detailed introduction to the data set in this paper: On the new era of urban traffic monitoring with massive drone data: The pNEUMA large-scale field experiment)
highD dataset: A drone dataset of naturalistic vehicle trajectories on German highways for validation of highly automated driving systems (this paper introduces the dataset and the methods used)

Data processing examples

London movement speed data set

Example code for processing movement speed data as a multivariate time series. The result is a spatiotemporal matrix consisting of 200,000+ road segments (i.e., 70,000+ road segments with different directions) and 30 x 24 time points. This data set is downloaded from the Uber movement project. To get the data file movement-speeds-hourly-london-2019-4.csv, choose London as the city, speeds as the product type, and 2019 Quarter 2 as the time period.

import numpy as np
import pandas as pd

data = pd.read_csv('../datasets/London-data-set/movement-speeds-hourly-london-2019-4.csv')

# One row per directed road segment (OSM way plus start and end nodes).
road = data.drop_duplicates(['osm_way_id', 'osm_start_node_id', 'osm_end_node_id'])
road = road.drop(['year', 'month', 'day', 'hour', 'utc_timestamp', 'segment_id', 'start_junction_id',
                  'end_junction_id', 'speed_mph_mean', 'speed_mph_stddev'], axis = 1)

# Tensor of size (road segment, day of month, hour of day); entries without a record stay 0.
num_day = data.day.max()
tensor = np.zeros((road.shape[0], num_day, 24))
k = 0
for i in range(road.shape[0]):
    # All hourly records of the i-th road segment.
    temp = data[(data['osm_way_id'] == road.osm_way_id.iloc[i])
                & (data['osm_start_node_id'] == road.osm_start_node_id.iloc[i])
                & (data['osm_end_node_id'] == road.osm_end_node_id.iloc[i])]
    for j in range(temp.shape[0]):
        tensor[k, temp.day.iloc[j] - 1, temp.hour.iloc[j]] = temp.speed_mph_mean.iloc[j]
    k += 1
    if (k % 1000) == 0:
        print(k)  # progress indicator
# Flatten day and hour into a single time axis: (road segment, day x hour).
mat = tensor.reshape([road.shape[0], num_day * 24])
np.save('../datasets/London-data-set/hourly_speed_mat.npy', mat)

del data, road, tensor

Pacific surface temperature data set

Example code for reshaping the temperature data into a third-order tensor of size 30-by-84-by-399, where the last dimension covers 399 months.

import numpy as np
import pandas as pd

data = pd.read_csv('../datasets/Temperature-data-set/data.tsv', sep = '\t', header = None)
mat = data.values

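num_month = 399
tensor = np.zeros((30, 84, num_month))
for t in range(num_month):
    # Each month occupies 30 consecutive rows; the + 1 offsets and the
    # column slice skip the first row and the first column of the file.
    tensor[:, :, t] = mat[t * 30 + 1 : (t + 1) * 30 + 1, 1 :]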

np.save('../datasets/Temperature-data-set/tensor.npy', tensor)

NYC data set

The NYC bike data set is from the paper “Deep spatio-temporal residual networks for citywide crowd flows prediction” (AAAI 2017).

[In]

import h5py

h5 = h5py.File('../datasets/NYC-data-set/nyc_bike.h5', 'r')
data = h5['data']
data.shape

[Out]

(4392, 2, 16, 8)
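
To use an array of this shape alongside the other tensors in this project, it can be rearranged so that time is the last axis. The sketch below works on a random stand-in rather than the real file, and it assumes (based on the cited paper, not on anything stated here) that the axes are time interval, flow type, grid row, and grid column.

```python
import numpy as np

# Random stand-in with the same shape as the array read from nyc_bike.h5.
# Assumed axis order: (time interval, flow type, grid row, grid column).
rng = np.random.default_rng(0)
data = rng.random((4392, 2, 16, 8))

# Move time to the last axis, then flatten the 16 x 8 grid into 128 cells.
# Result: (flow type, grid cell, time interval).
tensor = np.moveaxis(data, 0, -1).reshape(2, 16 * 8, 4392)

print(tensor.shape)  # (2, 128, 4392)
```

With this layout, tensor[f] is a (grid cell x time) matrix for flow type f, and grid cell (r, c) sits at row r * 8 + c.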

Benchmark data sets
- Data sets are available at https://github.com/liangzhehan/DMSTGCN.
- Three traffic speed data sets: PeMSD4, PeMSD8, England.
- Three files in each data set: train.npz, val.npz, and test.npz.

For example, use the following code:

import numpy as np

train = np.load('../benchmark/PEMSD4/train.npz')
val = np.load('../benchmark/PEMSD4/val.npz')
test = np.load('../benchmark/PEMSD4/test.npz')
# Keep the first time step and the first variable of every sample, stack the
# three splits along the sample axis, and transpose to (road segment, time point).
dense_mat = np.concatenate([train['x'][:, 0, :, 0], val['x'][:, 0, :, 0],
                            test['x'][:, 0, :, 0]], axis = 0).T

The train.npz file contains the arrays x and y, each of size (10181, 12, 307, 2): 10181 samples (time points), 12 time steps per sample, 307 road segments, and 2 variables (i.e., speed and volume).
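
The indexing [:, 0, :, 0] in the code above keeps the first time step and the first variable of every sample. The toy example below illustrates why this yields a continuous series, under the assumption (not confirmed by the data documentation) that consecutive samples are sliding windows shifted by one time step. The sizes and the make_windows helper are hypothetical.

```python
import numpy as np

def make_windows(series, window):
    """Stack overlapping windows: (samples, window, nodes, variables)."""
    n_samples = series.shape[0] - window + 1
    return np.stack([series[i:i + window] for i in range(n_samples)], axis=0)

rng = np.random.default_rng(0)
series = rng.random((50, 5, 2))       # 50 time points, 5 nodes, 2 variables
x = make_windows(series, window=12)   # shape (39, 12, 5, 2)

# First time step and first variable of every window, as (node, time point).
recovered = x[:, 0, :, 0].T

print(x.shape, recovered.shape)  # (39, 12, 5, 2) (5, 39)
```

Under this assumption, recovered equals the first variable of the original series for its first 39 time points; the last window - 1 time points are not covered by this slice.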
