Data

We are strong advocates of open-source and reproducible research.

In this project, we adapt several publicly available data sets for our spatiotemporal traffic data imputation and forecasting experiments. These data take the form of multivariate time series matrices and multidimensional time series tensors.

Quick look

Working data sets

There are many well-suited real-world data sets (most of them traffic data) that can be used for spatiotemporal data modeling tasks. Some carefully selected ones are as follows.

Using these data in your experiments is straightforward. For example, to view or use these data sets, download them into the ../datasets/ folder in advance, and then run the following code in your Python console:

import scipy.io

tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/tensor.mat')
tensor = tensor['tensor']

This is a simple example on the Guangzhou urban traffic speed data set. The traffic speed observations are formatted as a third-order tensor of size 214-by-61-by-144, i.e., 214 road segments, 61 days, and 144 time steps per day.
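
Since many matrix-based models expect a multivariate time series matrix rather than a tensor, the daily and within-day axes can be unfolded into one temporal axis. A minimal sketch, using a random stand-in array of the same shape instead of the loaded data:

```python
import numpy as np

# Stand-in for the Guangzhou speed tensor loaded above:
# 214 road segments x 61 days x 144 time steps per day.
tensor = np.random.rand(214, 61, 144)

# Unfold (day, time step) into a single temporal axis of length 61 * 144 = 8784.
mat = tensor.reshape(214, 61 * 144)
print(mat.shape)  # (214, 8784)
```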

In particular, if you are interested in large-scale traffic data, we recommend PeMS-4W/8W/12W and UTD19. For the PeMS data, download the files from Zenodo and place them in the data set folder (data path example: ../datasets/California-data-set/pems-4w.csv). Then you can open the data with Pandas:

import pandas as pd

data = pd.read_csv('../datasets/California-data-set/pems-4w.csv', header = None)
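
PeMS data are typically recorded at 5-minute resolution, i.e., 288 time steps per day (an assumption worth checking against the downloaded file). Under that assumption, the matrix from the CSV can be folded into a (sensor, day, time step) tensor, sketched here on a random stand-in array:

```python
import numpy as np

# Stand-in for data.values from the CSV above:
# 100 sensors over 28 days at an assumed 288 five-minute steps per day.
mat = np.random.rand(100, 28 * 288)

num_day = mat.shape[1] // 288
tensor = mat.reshape(mat.shape[0], num_day, 288)
print(tensor.shape)  # (100, 28, 288)
```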

For model evaluation, we mask certain entries of the “observed” data as missing values and then perform imputation for these “missing” values.
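
One possible sketch of this protocol (the exact masking scheme varies across experiments) removes entries uniformly at random at a chosen missing rate:

```python
import numpy as np

rng = np.random.default_rng(0)
dense_mat = rng.random((214, 8784))  # stand-in for fully observed data

missing_rate = 0.2
mask = rng.random(dense_mat.shape) > missing_rate  # True = observed, False = masked
sparse_mat = dense_mat * mask  # masked entries become zero

# An imputation model is then fitted on sparse_mat and evaluated
# only on the positions where mask is False.
```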

Data processing examples

  • London movement speed data set

Example code for processing the movement speed data as a multivariate time series matrix. This is a spatiotemporal matrix consisting of 200,000+ directed road segments (i.e., 70,000+ roads, with each direction counted separately) and 30 × 24 time points. Note that this data set is downloaded from the Uber Movement project. To get the data file movement-speeds-hourly-london-2019-4.csv, choose London as the city, speeds as the product type, and 2019 Quarter 2 as the time period.

import numpy as np
import pandas as pd

data = pd.read_csv('../datasets/London-data-set/movement-speeds-hourly-london-2019-4.csv')

# Each unique (way, start node, end node) triple identifies one directed road segment.
road = data.drop_duplicates(['osm_way_id', 'osm_start_node_id', 'osm_end_node_id'])
road = road.drop(['year', 'month', 'day', 'hour', 'utc_timestamp', 'segment_id', 'start_junction_id',
                  'end_junction_id', 'speed_mph_mean', 'speed_mph_stddev'], axis = 1)

# Fill a (road segment, day, hour) tensor; unobserved entries remain zero.
tensor = np.zeros((road.shape[0], max(data.day.values), 24))
for i in range(road.shape[0]):
    temp = data[(data['osm_way_id'] == road.osm_way_id.iloc[i])
                & (data['osm_start_node_id'] == road.osm_start_node_id.iloc[i])
                & (data['osm_end_node_id'] == road.osm_end_node_id.iloc[i])]
    for j in range(temp.shape[0]):
        tensor[i, temp.day.iloc[j] - 1, temp.hour.iloc[j]] = temp.speed_mph_mean.iloc[j]
    if ((i + 1) % 1000) == 0:
        print(i + 1)  # progress indicator

# Unfold the (day, hour) axes into a single temporal axis and save the matrix.
mat = tensor.reshape([road.shape[0], max(data.day.values) * 24])
np.save('../datasets/London-data-set/hourly_speed_mat.npy', mat)

del data, road, tensor

  • Temperature data set

Example code for processing the temperature data into a third-order tensor of size 30 × 84 × 399.
import numpy as np
import pandas as pd

data = pd.read_csv('../datasets/Temperature-data-set/data.tsv', sep = '\t', header = None)
mat = data.values

num_month = 399
tensor = np.zeros((30, 84, num_month))
for t in range(num_month):
    # Each month occupies 30 consecutive rows; the slice skips the first
    # row and first column, which appear to hold labels.
    tensor[:, :, t] = mat[t * 30 + 1 : (t + 1) * 30 + 1, 1 :]

np.save('../datasets/Temperature-data-set/tensor.npy', tensor)

  • NYC data set

The NYC bike data set is from Deep spatio-temporal residual networks for citywide crowd flows prediction (AAAI 2017).

[In]

import h5py

h5 = h5py.File('../datasets/NYC-data-set/nyc_bike.h5', 'r')
data = h5['data']
data.shape

[Out]

(4392, 2, 16, 8)
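
Following the grid format of the AAAI 2017 paper, the four axes are presumably 4392 hourly time slots, 2 flow channels (e.g., inflow/outflow), and a 16 × 8 spatial grid. A sketch (on a random stand-in array) of flattening this into a multivariate time series matrix with one row per channel/grid-cell pair:

```python
import numpy as np

# Stand-in for h5['data'][...]: (time slot, channel, grid height, grid width).
data = np.random.rand(4392, 2, 16, 8)

# One row per (channel, cell), one column per time slot: 2 * 16 * 8 = 256 rows.
mat = data.reshape(data.shape[0], -1).T
print(mat.shape)  # (256, 4392)
```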

  • Benchmark data sets

For example, the following code loads the PEMSD4 training, validation, and test splits and concatenates them into a dense matrix:

import numpy as np

train = np.load('../benchmark/PEMSD4/train.npz')
val = np.load('../benchmark/PEMSD4/val.npz')
test = np.load('../benchmark/PEMSD4/test.npz')
dense_mat = np.append(np.append(train['x'][:, 0, :, 0], val['x'][:, 0, :, 0], axis = 0),
                      test['x'][:, 0, :, 0], axis = 0).T

This train.npz includes two arrays, x and y, each of size (10181, 12, 307, 2): 10181 samples, a window of 12 time steps, 307 road segments, and 2 variables (i.e., speed and volume).
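
Once such a dense matrix is assembled, the masking-and-imputation protocol described earlier applies here as well. Evaluation commonly uses MAPE and RMSE over the masked entries; a sketch with illustrative helper names (the repository's own metric definitions may differ slightly):

```python
import numpy as np

def compute_mape(true, pred):
    # Mean absolute percentage error; assumes no zero entries in `true`.
    return np.mean(np.abs(true - pred) / np.abs(true))

def compute_rmse(true, pred):
    return np.sqrt(np.mean((true - pred) ** 2))

true = np.array([10.0, 20.0, 40.0])
pred = np.array([11.0, 18.0, 40.0])
print(compute_mape(true, pred))  # ≈ 0.0667
print(compute_rmse(true, pred))  # ≈ 1.2910
```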