# Data

We are strong advocates of open-source and reproducible research.

In this project, we adapt several publicly available data sets for our spatiotemporal traffic data imputation and forecasting experiments. The data take the form of multivariate time series matrices and multidimensional time series tensors.

## Quick look

### Working data sets

Many real-world data sets (most of them traffic data) are well suited to spatiotemporal data modeling tasks. We have selected the following:

**Multivariate time series**

- Birmingham parking data set
- California PeMS traffic speed data set (large-scale)
- Guangzhou urban traffic speed data set
- Hangzhou metro passenger flow data set
- London urban movement speed data set (other cities are also available at Uber movement project)
- Portland highway traffic data set (including traffic volume/speed/occupancy, see data documentation)
- Seattle freeway traffic speed data set
- NYC real-time traffic speed data

**Multidimensional time series**

These data sets are easy to use in your experiments. For example, to view or use one of them, first download it into the `../datasets/` folder, and then run the following code in your Python console:

```
import scipy.io

# loadmat returns a dict; the array is stored under the key 'tensor'
tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/tensor.mat')
tensor = tensor['tensor']
```

This is a simple example using the Guangzhou urban traffic speed data set. The traffic speed observations are formatted as a third-order tensor of size 214-by-61-by-144, that is, 214 road segments, 61 days, and 144 time steps per day.
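As a minimal sketch of how the tensor and matrix forms relate, the snippet below unfolds a third-order tensor of the same size into a matrix by merging the day and time-step axes. The array is random placeholder data, not the Guangzhou data itself, and the variable names are illustrative assumptions.

```python
import numpy as np

# Placeholder tensor with the same shape as the Guangzhou data:
# (road segments, days, time steps per day). Values are random, not real speeds.
rng = np.random.default_rng(0)
tensor = rng.uniform(10, 60, size=(214, 61, 144))

# Merge the day and time-step axes to get a (road segment x time) matrix.
mat = tensor.reshape(tensor.shape[0], -1)
print(mat.shape)  # (214, 8784)

# Reshaping back recovers the original tensor exactly.
restored = mat.reshape(214, 61, 144)
print(np.array_equal(restored, tensor))  # True
```

With this row-major unfolding, column `d * 144 + t` of the matrix holds time step `t` of day `d`.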

In particular, if you are interested in large-scale traffic data, we recommend **PeMS-4W/8W/12W** and UTD19. For the PeMS data, download the files from Zenodo and place them in the data set folder (example path: `../datasets/California-data-set/pems-4w.csv`). Then use `pandas` to open the data:

```
import pandas as pd
data = pd.read_csv('../datasets/California-data-set/pems-4w.csv', header = None)
```

For model evaluation, we mask certain entries of the “observed” data as missing values and then perform imputation for these “missing” values.
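The masking procedure can be sketched as follows. This is only an illustration under stated assumptions: the data are random placeholders, the 40% missing rate is arbitrary, zeros are used to mark missing entries, and the row-mean fill is a naive stand-in for an actual imputation model.

```python
import numpy as np

rng = np.random.default_rng(1)
# Placeholder "observed" data (50 series, 288 time steps); values are made up.
dense_mat = rng.uniform(10, 60, size=(50, 288))

missing_rate = 0.4  # assumed fraction of entries to hide
mask = rng.random(dense_mat.shape) < missing_rate  # True = held out as "missing"

# Hide the held-out entries; zeros mark missing values in this sketch.
sparse_mat = dense_mat.copy()
sparse_mat[mask] = 0

# Naive baseline: fill each missing entry with the mean of that row's observed entries.
obs_count = (~mask).sum(axis=1, keepdims=True)
row_mean = sparse_mat.sum(axis=1, keepdims=True) / np.maximum(obs_count, 1)
imputed = np.where(mask, row_mean, sparse_mat)

# Score the imputation on the held-out entries only.
err = imputed[mask] - dense_mat[mask]
mape = np.mean(np.abs(err) / dense_mat[mask])
rmse = np.sqrt(np.mean(err ** 2))
print(f'MAPE: {mape:.4f}, RMSE: {rmse:.4f}')
```

Because the true values of the masked entries are known, any imputation model can be scored the same way by replacing the row-mean step.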

### Recommended data sets

- Uber movement data (see the detailed speeds calculation methodology)
- UTD19: Largest multi-city traffic data set
- pNEUMA: A large-scale data set of naturalistic trajectories of half a million vehicles that have been collected by a one-of-a-kind experiment (see a detailed introduction to the data set in this paper: On the new era of urban traffic monitoring with massive drone data: The pNEUMA large-scale field experiment)
- highD dataset: A drone dataset of naturalistic vehicle trajectories on German highways for validation of highly automated driving systems (this paper introduced the dataset and the used methods)

## Data processing examples

**London movement speed data set**

Example code for processing movement speed as multivariate time series data. The result is a spatiotemporal matrix consisting of 200,000+ road segments (i.e., 70,000+ road segments with different directions) and 30 x 24 time points. Note that this data set is downloaded from the Uber movement project. To get the data file `movement-speeds-hourly-london-2019-4.csv`, choose the city as `London`, the product type as `speeds`, and the time period as `2019 Quarter 2`.

```
import numpy as np
import pandas as pd

data = pd.read_csv('../datasets/London-data-set/movement-speeds-hourly-london-2019-4.csv')

# One row per directed road segment (way id + start node + end node)
road = data.drop_duplicates(['osm_way_id', 'osm_start_node_id', 'osm_end_node_id'])
road = road.drop(['year', 'month', 'day', 'hour', 'utc_timestamp', 'segment_id', 'start_junction_id',
                  'end_junction_id', 'speed_mph_mean', 'speed_mph_stddev'], axis = 1)

# Tensor of size (road segment, day, hour); entries without a record stay zero
tensor = np.zeros((road.shape[0], max(data.day.values), 24))
k = 0
for i in range(road.shape[0]):
    # All hourly records of the i-th road segment
    temp = data[(data['osm_way_id'] == road.osm_way_id.iloc[i])
                & (data['osm_start_node_id'] == road.osm_start_node_id.iloc[i])
                & (data['osm_end_node_id'] == road.osm_end_node_id.iloc[i])]
    for j in range(temp.shape[0]):
        tensor[k, temp.day.iloc[j] - 1, temp.hour.iloc[j]] = temp.speed_mph_mean.iloc[j]
    k += 1
    if (k % 1000) == 0:
        print(k)  # progress indicator

# Unfold into a (road segment, day x hour) matrix and save it
mat = tensor.reshape([road.shape[0], max(data.day.values) * 24])
np.save('../datasets/London-data-set/hourly_speed_mat.npy', mat)
del data, road, tensor
```
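The row-by-row loop above can be slow for a table with many road segments. The following is a hedged sketch of a vectorized alternative, shown on a tiny made-up table that reuses the same column names; it has not been run against the actual Uber movement file. If several records share the same segment, day, and hour, which one is kept is not guaranteed here.

```python
import numpy as np
import pandas as pd

# Tiny synthetic table with the same column names as above (values are made up).
data = pd.DataFrame({
    'osm_way_id':        [1, 1, 1, 2, 2],
    'osm_start_node_id': [10, 10, 11, 20, 20],
    'osm_end_node_id':   [11, 11, 10, 21, 21],
    'day':               [1, 2, 1, 1, 3],
    'hour':              [0, 23, 5, 12, 7],
    'speed_mph_mean':    [20.0, 25.0, 18.0, 30.0, 28.0],
})

keys = ['osm_way_id', 'osm_start_node_id', 'osm_end_node_id']
# Integer index per directed road segment; sort=False numbers the segments
# in order of first appearance, as drop_duplicates would list them.
seg = data.groupby(keys, sort=False).ngroup().to_numpy()
num_seg = seg.max() + 1
num_day = data['day'].max()

# Fill the (segment, day, hour) tensor in one indexed assignment.
tensor = np.zeros((num_seg, num_day, 24))
tensor[seg, data['day'].to_numpy() - 1, data['hour'].to_numpy()] = data['speed_mph_mean'].to_numpy()

mat = tensor.reshape(num_seg, num_day * 24)
print(mat.shape)  # (3, 72)
```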

```
import numpy as np
import pandas as pd

# Reshape the tab-separated temperature data into a third-order tensor
data = pd.read_csv('../datasets/Temperature-data-set/data.tsv', sep = '\t', header = None)
mat = data.values
num_month = 399
tensor = np.zeros((30, 84, num_month))
for t in range(num_month):
    # Each month is a block of 30 rows; the first row and first column are skipped
    tensor[:, :, t] = mat[t * 30 + 1 : (t + 1) * 30 + 1, 1 :]
np.save('../datasets/Temperature-data-set/tensor.npy', tensor)
```

**NYC data set**

The NYC bike data set comes from the paper "Deep spatio-temporal residual networks for citywide crowd flows prediction" (AAAI 2017).

[In]

```
import h5py
h5 = h5py.File('../datasets/NYC-data-set/nyc_bike.h5', 'r')
data = h5['data']
data.shape
```

[Out]

```
(4392, 2, 16, 8)
```
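As a rough sketch, an array of this shape can be rearranged into the matrix and tensor forms used elsewhere in this document. The axis interpretation below (time step, flow type, grid rows, grid columns) is an assumption based on the shape and should be checked against the data source; the array itself is random placeholder data rather than the contents of `nyc_bike.h5`.

```python
import numpy as np

# Placeholder array with the same shape as the output above; values are random.
rng = np.random.default_rng(2)
data = rng.integers(0, 100, size=(4392, 2, 16, 8)).astype(float)

# Assumed axes: (time step, flow type, grid rows, grid columns).
flow = data[:, 0]  # first flow type, shape (4392, 16, 8)

# Matrix form: one row per grid cell, one column per time step.
mat = flow.reshape(flow.shape[0], -1).T
print(mat.shape)  # (128, 4392)

# Tensor form: (grid rows, grid columns, time steps).
tensor = np.moveaxis(flow, 0, -1)
print(tensor.shape)  # (16, 8, 4392)
```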

**Benchmark data sets**

- Data sets are available at https://github.com/liangzhehan/DMSTGCN.
- Three traffic speed data sets: PeMSD4, PeMSD8, England.
- Three files in each data set: `train.npz`, `val.npz`, and `test.npz`.

For example, use the following code:

```
import numpy as np

train = np.load('../benchmark/PEMSD4/train.npz')
val = np.load('../benchmark/PEMSD4/val.npz')
test = np.load('../benchmark/PEMSD4/test.npz')
# Take index 0 on the second and fourth axes of each split, stack the splits
# along the first axis, and transpose so that rows are road segments.
dense_mat = np.concatenate([train['x'][:, 0, :, 0], val['x'][:, 0, :, 0],
                            test['x'][:, 0, :, 0]], axis = 0).T
```

This `train.npz` includes `[x]` and `[y]`, and each is of size `(10181, 12, 307, 2)`. There are 10181 time points, 12 hours, 307 road segments, and 2 parameters (i.e., speed and volume).