Updated On : Jul-09,2022 Tags mxnet, time-series, lstm…

MXNet: LSTM Networks for Time-Series Data (Regression Tasks)

Time-series data, as the name suggests, has a time (temporal) component. It generally consists of one or more variables measured at a fixed time interval. The most common example is a stock price recorded every minute, hour, or day. Time-series data can also exhibit trend and seasonality. When making a prediction for an example in a time series, we generally use one or more previous examples as input. Because time-series data is ordered, we cannot shuffle individual observations, and we need networks that can capture this order to make better predictions. Research has shown that Recurrent Neural Networks (RNNs) and their variants are good at capturing sequence order, and they often give better results on time-series tasks than many other network architectures.

In this tutorial, we explain how to create Recurrent Neural Networks (RNNs) consisting of LSTM layers using the Python deep learning library MXNet to solve a time-series regression task. The dataset used in the tutorial is the Tetouan City Power Consumption dataset available from the UCI ML Datasets Repository. It is a multivariate dataset with variables such as temperature, humidity, wind speed, diffuse flows, and power consumption, measured every 10 minutes. We'll prepare our network to predict the power consumption of the city. As we are predicting a continuous variable, this is a regression task. At the end of the tutorial, we have also given a few suggestions on how to improve network performance further.

Below, we have listed important sections of the Tutorial to give an overview of the material.

Important Sections Of Tutorials

  1. Prepare Data
    • 1.1 Download Data
    • 1.2 Load Data
    • 1.3 Reorganize Data for Regression Task
    • 1.4 Scale Target Values
    • 1.5 Create Datasets and Data Loaders
  2. Define LSTM Regression Network
  3. Train Network
  4. Evaluate Network Performance
  5. Visualize Predictions
  6. Further Recommendations

Below, we have imported the necessary Python libraries and printed the versions that we have used in our tutorial.

import mxnet

print("MXNet Version : {}".format(mxnet.__version__))
MXNet Version : 1.9.1

1. Prepare Data

In this section, we have organized our data so that it can be given to a neural network for processing. As mentioned earlier, when working with time-series data, the most useful inputs for predicting a current or future value are the data features of a few previous examples. In our case, to predict the current example's target value, we'll look at the data features of the 30 previous examples. Don't worry if this is not clear yet; it'll become clear when we implement it below.

1.1 Download Data

Below, we have downloaded Tetouan City power consumption dataset. The dataset has information about power distribution units from three different electricity distribution networks and a few other data variables for Tetouan city located in Morocco. This is the dataset that we'll use for our task. Next, we'll load it in memory to give an idea about its contents.

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00616/Tetuan%20City%20power%20consumption.csv
--2022-06-03 04:40:12--  https://archive.ics.uci.edu/ml/machine-learning-databases/00616/Tetuan%20City%20power%20consumption.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4222390 (4.0M) [application/x-httpd-php]
Saving to: ‘Tetuan City power consumption.csv’

Tetuan City power c 100%[===================>]   4.03M  7.77MB/s    in 0.5s

2022-06-03 04:40:13 (7.77 MB/s) - ‘Tetuan City power consumption.csv’ saved [4222390/4222390]

%ls
'Tetuan City power consumption.csv'   __notebook__.ipynb

1.2 Load Data

Below, we have loaded our dataset into the main memory as a pandas dataframe. After loading the dataset, we have set DateTime column as an index of the dataframe. The dataset has the below-mentioned data variables recorded every 10 minutes.

  1. Date-time
  2. Temperature
  3. Humidity
  4. Wind Speed
  5. General diffuse flows
  6. Diffuse flows
  7. Zone 1 Power Consumption (units)
  8. Zone 2 Power Consumption
  9. Zone 3 Power Consumption

After loading the dataset, we have also displayed the first few rows of the dataset to give an idea about its contents. We can notice that the power units recorded for the three zones have quite a high range compared to the other columns.

Apart from displaying data, we have also plotted a line chart showing the power consumption of the three zones for December 2017. We can notice from the chart that the data is clearly seasonal.

We have also plotted another line chart showing the consumption of Zone 1 for a single day to give an idea of how demand varies throughout the day. The demand is low at the beginning of the day and then starts rising from 6-7 AM till 12 PM. Then, it stays almost the same till 5-6 PM and rises again till 9 PM. After 9 PM, demand drops till morning.

import pandas as pd

data_df = pd.read_csv("Tetuan City power consumption.csv")
data_df["DateTime"] = pd.to_datetime(data_df["DateTime"])
data_df = data_df.set_index('DateTime')
data_df.columns = [col.strip() for col in data_df.columns]

print("Columns : {}".format(data_df.columns.values.tolist()))
print("Dataset Shape : {}".format(data_df.shape))

data_df.head()
Columns : ['Temperature', 'Humidity', 'Wind Speed', 'general diffuse flows', 'diffuse flows', 'Zone 1 Power Consumption', 'Zone 2  Power Consumption', 'Zone 3  Power Consumption']
Dataset Shape : (52416, 8)
Temperature Humidity Wind Speed general diffuse flows diffuse flows Zone 1 Power Consumption Zone 2 Power Consumption Zone 3 Power Consumption
DateTime
2017-01-01 00:00:00 6.559 73.8 0.083 0.051 0.119 34055.69620 16128.87538 20240.96386
2017-01-01 00:10:00 6.414 74.5 0.083 0.070 0.085 29814.68354 19375.07599 20131.08434
2017-01-01 00:20:00 6.313 74.5 0.080 0.062 0.100 29128.10127 19006.68693 19668.43373
2017-01-01 00:30:00 6.121 75.0 0.083 0.091 0.096 28228.86076 18361.09422 18899.27711
2017-01-01 00:40:00 5.921 75.7 0.081 0.048 0.085 27335.69620 17872.34043 18442.40964
import matplotlib.pyplot as plt

data_df.loc["2017-12"].plot(y=["Zone 1 Power Consumption", "Zone 2  Power Consumption", "Zone 3  Power Consumption"], figsize=(18, 7), grid=True);
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black');

import matplotlib.pyplot as plt

data_df.loc["2017-12-1"].plot(y="Zone 1 Power Consumption", figsize=(18, 7), color="tomato", grid=True);
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black');

1.3 Reorganize Data for Regression Task

In this section, we are actually organizing our data to be given to the network. We have decided that data of columns ['Temperature', 'Humidity', 'Wind Speed', 'general diffuse flows', 'diffuse flows'] will be used as data features (X) and data of column Zone 1 Power Consumption will be our target variable (Y).

As discussed earlier, we'll be using the data features of the 30 previous examples to predict the target value of the current example. As data is recorded every 10 minutes, the previous 30 examples cover 5 hours of data. So, we'll be looking at data from the previous 5 hours to predict the current value. We have declared the variable lookback and set its value to 30.

In order to prepare data, we are moving a window of size lookback through data one step at a time. The examples that fall into the window will be taken as data features and the target value of the example after the window will be our target value. The window will keep moving by one step at a time recording data features (X_organized) and target values (Y_organized). Organizing data in this way will help LSTM layers to capture sequences in data.

Once the data is organized this way, we have divided it into train (first 50k examples) and test (remaining examples) sets. We have wrapped the feature arrays into MXNet nd arrays, as required by MXNet networks; the target arrays are converted after scaling in the next section. We have also printed the shapes of the train and test sets for reference.

import numpy as np
from mxnet import nd

feature_cols = ['Temperature', 'Humidity', 'Wind Speed', 'general diffuse flows', 'diffuse flows']
target_col = 'Zone 1 Power Consumption'

X = data_df[feature_cols].values
Y = data_df[target_col].values

n_features = X.shape[1]
lookback = 30 ## 5 hours lookback to make prediction

X_organized, Y_organized = [], []
for i in range(0, X.shape[0]-lookback, 1):
    X_organized.append(X[i:i+lookback])
    Y_organized.append(Y[i+lookback])

X_organized, Y_organized = np.array(X_organized), np.array(Y_organized)
X_train, Y_train, X_test, Y_test = X_organized[:50000], Y_organized[:50000], X_organized[50000:], Y_organized[50000:]
X_train, X_test = nd.array(X_train, dtype=np.float32),nd.array(X_test, dtype=np.float32)


X_organized.shape, Y_organized.shape, X_train.shape, Y_train.shape, X_test.shape, Y_test.shape
((52386, 30, 5), (52386,), (50000, 30, 5), (50000,), (2386, 30, 5), (2386,))
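
To make the windowing easier to follow, here is a minimal sketch of the same idea on a toy array. The helper name make_windows and the toy values are hypothetical and only for illustration; the logic mirrors the loop above with a lookback of 3 instead of 30.

```python
import numpy as np

def make_windows(features, targets, lookback):
    """Pair each window of `lookback` consecutive rows with the target that follows it."""
    X_win, Y_win = [], []
    for i in range(len(features) - lookback):
        X_win.append(features[i:i + lookback])   # rows i .. i+lookback-1
        Y_win.append(targets[i + lookback])      # target right after the window
    return np.array(X_win), np.array(Y_win)

# Toy series: 6 time steps, 2 features per step (hypothetical values)
toy_X = np.arange(12).reshape(6, 2)
toy_Y = np.array([10, 11, 12, 13, 14, 15])

X_win, Y_win = make_windows(toy_X, toy_Y, lookback=3)
print(X_win.shape, Y_win.shape)  # (3, 3, 2) (3,)
print(Y_win)                     # [13 14 15]
```

Six time steps with a lookback of 3 give three windows, just as 52416 rows with a lookback of 30 give the 52386 examples shown above.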

1.4 Scale Target Values

Here, we have normalized the values of our target variable. The reason for normalizing is that the target variable has quite a high range compared to the feature columns, as highlighted earlier. This can make it hard for the gradient descent optimization algorithm to converge. To make the task of our optimization algorithm easier, we have normalized the target variable.

We have first calculated the mean and standard deviation of target values. Then, we subtracted the mean from the train and test target values. After subtracting, we have divided them by standard deviation. We have printed the new range of data as well.

When making predictions for our task, we'll reverse this process to make an actual prediction.

mean, std = Y_train.mean(), Y_train.std()

print("Mean : {:.2f}, Standard Deviation : {:.2f}".format(mean, std))
Y_train_scaled, Y_test_scaled = (Y_train - mean)/std , (Y_test-mean)/std
Y_train_scaled, Y_test_scaled = nd.array(Y_train_scaled, dtype=np.float32),nd.array(Y_test_scaled, dtype=np.float32)

Y_train_scaled.min(), Y_train_scaled.max(), Y_test_scaled.min(), Y_test_scaled.max()
Mean : 32492.51, Standard Deviation : 7137.51
(
 [-2.6055043]
 <NDArray 1 @cpu(0)>,

 [2.761731]
 <NDArray 1 @cpu(0)>,

 [-2.002983]
 <NDArray 1 @cpu(0)>,

 [1.6084197]
 <NDArray 1 @cpu(0)>)
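
The scaling and its later reversal can be sketched with plain NumPy as below. The target values are hypothetical; the point is only that the statistics come from the training split and that multiplying by the standard deviation and adding the mean recovers the original units.

```python
import numpy as np

y_train = np.array([10.0, 20.0, 30.0, 40.0])   # hypothetical training targets
y_test = np.array([25.0, 50.0])                # hypothetical test targets

mean, std = y_train.mean(), y_train.std()      # statistics from the training split only

y_test_scaled = (y_test - mean) / std          # forward transform
y_test_restored = y_test_scaled * std + mean   # inverse transform, as later applied to predictions

print(np.allclose(y_test_restored, y_test))    # True
```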

1.5 Create Datasets and Data Loaders

In this section, we have first wrapped the train and test arrays into ArrayDataset objects. Then, we created data loaders from these dataset objects. The data loaders let us loop through data in batches during training. We have set the batch size to 128 and shuffle to False because this is time-series data and we don't want to disturb its ordering.

from mxnet.gluon.data import ArrayDataset, DataLoader

train_dataset = ArrayDataset(X_train, Y_train_scaled)
test_dataset  = ArrayDataset(X_test, Y_test_scaled)

train_loader = DataLoader(train_dataset, batch_size=128, shuffle=False)
test_loader  = DataLoader(test_dataset,  batch_size=128, shuffle=False)
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break
(128, 30, 5) (128,)
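
As a quick sanity check, the number of batches per epoch follows from the dataset sizes and the batch size. The short sketch below assumes the data loader keeps the final, smaller batch rather than dropping it.

```python
import math

n_train, n_test, batch_size = 50000, 2386, 128

train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)
last_train_batch = n_train - (train_batches - 1) * batch_size   # size of the final partial batch

print(train_batches, test_batches, last_train_batch)  # 391 19 80
```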

2. Define LSTM Regression Network

In this section, we have defined a simple RNN that we'll use for our regression task. The network consists of two LSTM layers and one dense layer. The first two layers of the network are LSTM layers with 256 output units each. We have stacked two LSTM layers to better capture the order present in time-series data. MXNet provides the LSTM() constructor through the 'gluon.rnn' sub-module. It lets us stack more than one LSTM layer. We have set the num_layers parameter to 2 to tell the constructor to stack two LSTM layers. The output of the second LSTM layer at the last time step is given to a dense layer that has one output unit. The output of the dense layer is the prediction of our network.

After defining the network, we initialized it and performed a forward pass through it using random data for verification purposes. We have also printed the summary of output shapes of layers and parameters count per layer.

Please note that we have not covered the inner workings of LSTM layers here, as we have assumed that the reader has a little background on them. If you want to learn about them, we recommend that you go through the below link, as it'll help you understand them better.

Below, we have included another link for people who are new to MXNet and want to learn how to design networks using it. It can be considered an MXNet starter tutorial.

from mxnet.gluon import nn, rnn

hidden_dim = 256
n_layers = 2

class LSTMRegressor(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMRegressor, self).__init__(**kwargs)
        self.lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers, layout="NTC", input_size=n_features)
        self.dense = nn.Dense(1)

    def forward(self, x):
        x = self.lstm(x)

        return self.dense(x[:, -1])

model = LSTMRegressor()

model
LSTMRegressor(
  (lstm): LSTM(5 -> 256, NTC, num_layers=2)
  (dense): Dense(None -> 1, linear)
)
from mxnet import init, initializer

model.initialize(initializer.Xavier())
preds = model(nd.random.randn(10,lookback, n_features))

preds.shape
(10, 1)
model.summary(nd.random.randn(10,lookback, n_features))
--------------------------------------------------------------------------------
        Layer (type)                                Output Shape         Param #
================================================================================
               Input                                 (10, 30, 5)               0
              LSTM-1                               (10, 30, 256)          795648
             Dense-2                                     (10, 1)             257
     LSTMRegressor-3                                     (10, 1)               0
================================================================================
Parameters in forward computation graph, duplicate included
   Total params: 795905
   Trainable params: 795905
   Non-trainable params: 0
Shared params in forward computation graph: 0
Unique parameters in model: 795905
--------------------------------------------------------------------------------
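
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the usual LSTM parameterization with four gates, each having input-to-hidden weights, hidden-to-hidden weights, and two bias vectors; the helper name lstm_layer_params is hypothetical.

```python
def lstm_layer_params(input_size, hidden_size):
    # 4 gates, each with input-to-hidden weights, hidden-to-hidden weights and two bias vectors
    return 4 * hidden_size * (input_size + hidden_size + 2)

n_features, hidden_dim = 5, 256

layer1 = lstm_layer_params(n_features, hidden_dim)   # first layer reads the 5 input features
layer2 = lstm_layer_params(hidden_dim, hidden_dim)   # second layer reads the first layer's output
dense = hidden_dim * 1 + 1                           # 256 weights plus one bias

print(layer1 + layer2, dense)  # 795648 257
```

Both numbers match the LSTM-1 and Dense-2 rows of the summary, and most of the parameters sit in the second LSTM layer.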

3. Train Network

In this section, we have trained our network. To train it, we have defined a helper function. The function takes a Trainer object (which holds the network parameters), a train data loader, a validation data loader, and the number of epochs. The function executes the training loop once per epoch. For each epoch, it loops through the training data in batches using the train data loader. For each batch, it performs a forward pass to make predictions, calculates the loss, calculates gradients, and updates network parameters using the gradients. It records the loss of each batch and prints the average loss of all batches at the end of the epoch. A second helper function calculates the validation loss.

from mxnet import autograd
from tqdm import tqdm

def CalcValLoss(model, val_loader):
    losses = []
    for X_batch, Y_batch in val_loader:
        val_loss = loss_func(model(X_batch), Y_batch)
        val_loss = val_loss.mean().asscalar()
        losses.append(val_loss)
    print("Valid Loss : {:.3f}".format(np.array(losses).mean()))

def TrainModelInBatches(trainer, train_loader, val_loader, epochs):
    for i in range(1, epochs+1):
        losses = [] ## Record loss of each batch
        for X_batch, Y_batch in tqdm(train_loader):
            with autograd.record():
                preds = model(X_batch) ## Forward pass to make predictions
                train_loss = loss_func(preds.squeeze(), Y_batch) ## Calculate Loss
            train_loss.backward() ## Calculate Gradients

            train_loss = train_loss.mean().asscalar()
            losses.append(train_loss)

            trainer.step(len(X_batch)) ## Update weights

        print("Train Loss : {:.3f}".format(np.array(losses).mean()))
        CalcValLoss(model, val_loader)

Below, we train our network by calling the training routine. We have set the number of epochs to 15 and the learning rate to 0.001. Then, we have initialized the model, the L2 loss (MXNet's L2Loss, which is the squared error scaled by 1/2), the Adam optimizer, and the Trainer object (which holds the network parameters). At last, we have called our training function with the necessary parameters to perform training. Over the 15 epochs, the training loss falls from 0.100 to 0.074 and the validation loss from 0.051 to 0.040, with most of the training improvement coming in the first two epochs. This suggests that the network is learning the regression task reasonably well.

from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=15
learning_rate = 0.001

model = LSTMRegressor()
model.initialize(initializer.Xavier())
loss_func = loss.L2Loss()
adam = optimizer.Adam(learning_rate=learning_rate) ## Separate name so the optimizer module is not shadowed on re-run

trainer = gluon.Trainer(model.collect_params(), adam)

TrainModelInBatches(trainer, train_loader, test_loader, epochs)
100%|██████████| 391/391 [04:03<00:00,  1.60it/s]
Train Loss : 0.100
Valid Loss : 0.051
100%|██████████| 391/391 [04:30<00:00,  1.44it/s]
Train Loss : 0.078
Valid Loss : 0.048
100%|██████████| 391/391 [04:46<00:00,  1.37it/s]
Train Loss : 0.078
Valid Loss : 0.047
100%|██████████| 391/391 [04:54<00:00,  1.33it/s]
Train Loss : 0.078
Valid Loss : 0.049
100%|██████████| 391/391 [04:47<00:00,  1.36it/s]
Train Loss : 0.077
Valid Loss : 0.045
100%|██████████| 391/391 [04:53<00:00,  1.33it/s]
Train Loss : 0.078
Valid Loss : 0.044
100%|██████████| 391/391 [04:53<00:00,  1.33it/s]
Train Loss : 0.076
Valid Loss : 0.043
100%|██████████| 391/391 [04:50<00:00,  1.35it/s]
Train Loss : 0.077
Valid Loss : 0.044
100%|██████████| 391/391 [04:57<00:00,  1.31it/s]
Train Loss : 0.077
Valid Loss : 0.047
100%|██████████| 391/391 [04:57<00:00,  1.31it/s]
Train Loss : 0.079
Valid Loss : 0.043
100%|██████████| 391/391 [04:57<00:00,  1.31it/s]
Train Loss : 0.079
Valid Loss : 0.045
100%|██████████| 391/391 [04:57<00:00,  1.31it/s]
Train Loss : 0.077
Valid Loss : 0.043
100%|██████████| 391/391 [04:54<00:00,  1.33it/s]
Train Loss : 0.081
Valid Loss : 0.041
100%|██████████| 391/391 [04:54<00:00,  1.33it/s]
Train Loss : 0.077
Valid Loss : 0.039
100%|██████████| 391/391 [04:50<00:00,  1.35it/s]
Train Loss : 0.074
Valid Loss : 0.040
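
To relate the printed loss values to the more familiar mean squared error, here is a small NumPy sketch. It relies on the documented definition of MXNet's L2Loss with default weight, which is half of the squared difference per example; the prediction and label values are hypothetical.

```python
import numpy as np

preds = np.array([0.5, -1.0, 2.0])    # hypothetical scaled predictions
labels = np.array([0.0, -1.0, 1.0])   # hypothetical scaled targets

l2_per_sample = 0.5 * (preds - labels) ** 2   # per-example L2 loss with default weight
mse = np.mean((preds - labels) ** 2)          # ordinary mean squared error

print(l2_per_sample.mean(), mse)
print(np.isclose(2 * l2_per_sample.mean(), mse))  # True
```

Under this definition, the printed train and validation losses are half of the MSE on the scaled targets.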

4. Evaluate Network Performance

In this section, we are evaluating the performance of our trained network on test data.

Below, we have first made predictions on the test dataset using our trained network. Then, we have de-normalized predictions using the training mean and standard deviation that we had calculated earlier. This will bring predictions into the actual range.

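Then, in the next cell, we have calculated the metrics 'mean squared error (MSE)', 'R2 score' and 'mean absolute error (MAE)' on test predictions. R2 score is a commonly used metric for regression tasks. We have calculated it using the r2_score function available from scikit-learn. Its best possible value is 1, and it can be negative when a model fits worse than simply predicting the mean of the target; values near 1 are considered signs of a model that generalizes well. Note that r2_score expects its arguments in the order (y_true, y_pred) and, unlike MSE and MAE, is not symmetric; the cell below passes the predictions first, so the printed score is computed with the two roles swapped. With a score of 0.88, our model appears to be doing a good job at the task.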

If you want to know how r2 score works internally then we would suggest the below link. It explains the majority of metrics available from sklearn in-depth.

test_preds = model(X_test) ## Make Predictions on test dataset
test_preds  = (test_preds*std) + mean ## Upscaling Predictions

test_preds[:5]
[[28072.309]
 [28616.97 ]
 [29025.395]
 [29711.068]
 [30092.531]]
<NDArray 5x1 @cpu(0)>
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

print("Test  MSE : {:.2f}".format(mean_squared_error(test_preds.asnumpy().squeeze(), Y_test)))
print("Test  R^2 Score : {:.2f}".format(r2_score(test_preds.asnumpy().squeeze(), Y_test)))
print("Test  MAE : {:.2f}".format(mean_absolute_error(test_preds.asnumpy().squeeze(), Y_test)))
Test  MSE : 4108683.31
Test  R^2 Score : 0.88
Test  MAE : 1565.02
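
For readers who want to see what the R2 score measures, below is a minimal pure-Python sketch of the standard definition, 1 - SS_res / SS_tot. The function name r2 and the sample values are hypothetical; it is an illustration of the formula, not scikit-learn's implementation.

```python
def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # spread of the true values around their mean
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
good = [1.1, 1.9, 3.2, 3.8]   # close to the actual values
bad = [4.0, 3.0, 2.0, 1.0]    # reversed trend

print(r2(actual, good))                        # close to 1
print(r2(actual, bad))                         # negative
print(r2(actual, good) == r2(good, actual))    # False: argument order matters
```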

5. Visualize Predictions

In this section, we are visualizing predictions next to actual values to have an even deeper look at the performance of our network.

Below, we have first added test predictions to our dataframe. Then, we visualized the original zone 1 consumption units and predicted units as a line chart. We can notice from the chart that our model is better at capturing peaks compared to downward movements. Overall, it has done a good job at capturing the seasonality. We can still improve the model by trying a few suggestions we have given next.

data_df_final = data_df[50000:].copy()

data_df_final["Zone 1 Power Consumption Prediction"] = [None]*lookback + test_preds.asnumpy().squeeze().tolist()

data_df_final.tail()
Temperature Humidity Wind Speed general diffuse flows diffuse flows Zone 1 Power Consumption Zone 2 Power Consumption Zone 3 Power Consumption Zone 1 Power Consumption Prediction
DateTime
2017-12-30 23:10:00 7.010 72.4 0.080 0.040 0.096 31160.45627 26857.31820 14780.31212 33240.578125
2017-12-30 23:20:00 6.947 72.6 0.082 0.051 0.093 30430.41825 26124.57809 14428.81152 29519.865234
2017-12-30 23:30:00 6.900 72.8 0.086 0.084 0.074 29590.87452 25277.69254 13806.48259 23045.025391
2017-12-30 23:40:00 6.758 73.0 0.080 0.066 0.089 28958.17490 24692.23688 13512.60504 22795.671875
2017-12-30 23:50:00 6.580 74.1 0.081 0.062 0.111 28349.80989 24055.23167 13345.49820 22790.503906
data_df_final.plot(y=["Zone 1 Power Consumption", "Zone 1 Power Consumption Prediction"],figsize=(18,7));
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black');

6. Further Recommendations

  1. Train the network for more epochs to see whether it is improving further.
  2. Try a different number of output units for the LSTM layers.
  3. Stack more LSTM layers. This increases training time, so weigh the cost first.
  4. Try adding more dense layers after LSTM layers.
  5. Try adding dropout.
  6. Try different activation functions.
  7. Create a network architecture that predicts more than one future value. Our network predicts only a single future value. You can design a network that can predict 5-10 or more days. The data also needs to be reorganized so that each example has more than one target value.
  8. Add datetime-related features like day, day of the week, month, hour, AM/PM, month start, month-end, quarter start, quarter-end, year start, year-end, etc.
  9. Try learning rate schedulers during training.
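
As a starting point for suggestion 7, the data could be reorganized so that each window is paired with several future target values. The sketch below is only an illustration on toy data: the helper name make_multistep_windows and the horizon parameter are hypothetical, and the network's final dense layer would also need as many output units as the horizon.

```python
import numpy as np

def make_multistep_windows(features, targets, lookback, horizon):
    """Pair each window of `lookback` rows with the next `horizon` target values."""
    X_win, Y_win = [], []
    for i in range(len(features) - lookback - horizon + 1):
        X_win.append(features[i:i + lookback])
        Y_win.append(targets[i + lookback:i + lookback + horizon])
    return np.array(X_win), np.array(Y_win)

# Toy series: 10 time steps, 2 features per step (hypothetical values)
toy_X = np.arange(20).reshape(10, 2)
toy_Y = np.arange(100, 110)

X_multi, Y_multi = make_multistep_windows(toy_X, toy_Y, lookback=4, horizon=3)
print(X_multi.shape, Y_multi.shape)  # (4, 4, 2) (4, 3)
print(Y_multi[0])                    # [104 105 106]
```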
Sunny Solanki
