Module 7 - Deep Sequence Modeling

Module 7 – Part I

DNN vs RNN for Timeseries


Intuition

(Figure: standard NN vs. recurrent NN)

Dr. Pedram Jahangiry


Road map!
• Module 1- Introduction to Deep Forecasting
• Module 2- Setting up Deep Forecasting Environment
• Module 3- Exponential Smoothing
• Module 4- ARIMA models
• Module 5- Machine Learning for Time series Forecasting
• Module 6- Deep Neural Networks
• Module 7- Deep Sequence Modeling (RNN, LSTM)
• Module 8- Prophet and Neural Prophet

Dr. Pedram Jahangiry


What is Sequence Data?
• Sequence data refers to any data that has a specific order or sequence to it!

Sequence Data
• Time Series Data
  • Regular TS
  • Irregular TS
• Text Data
• Audio Data
• Video Data

Dr. Pedram Jahangiry


Time series Tasks
TS tasks
• Forecasting
  • Qualitative
  • Quantitative
• Classification
• Clustering
• Anomaly/Event Detection

Dr. Pedram Jahangiry


Understanding DNNs & RNNs for Time Series Forecasting

• Comparing feed-forward networks with sequential models


• Key ideas: data transformation, memory

(Figure: standard NN vs. recurrent NN)

Dr. Pedram Jahangiry


Feature Engineering in DNN
• Use lagged values (e.g., 12 lags) as independent input features
• Each observation: vector of 12 features representing consecutive time steps
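As a quick illustration (mine, not from the slide), a lagged design matrix of this kind could be built with pandas; the helper `make_lagged_features`, the series `y`, and the column names are hypothetical.

```python
import pandas as pd

def make_lagged_features(y: pd.Series, n_lags: int = 12) -> pd.DataFrame:
    """Turn a univariate series into a supervised table:
    n_lags lag columns as inputs, the current value as the target."""
    df = pd.DataFrame({"target": y})
    for lag in range(1, n_lags + 1):
        df[f"lag_{lag}"] = y.shift(lag)   # value observed `lag` steps earlier
    return df.dropna()                    # drop rows without a full lag window

# Each resulting row is one observation: 12 "independent" features + 1 target.
```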

Dr. Pedram Jahangiry


It’s All about shapes! DNN

Dr. Pedram Jahangiry


It’s All about shapes! DNN

Dr. Pedram Jahangiry


Batch Training in DNNs
• Data is shuffled into batches (e.g., batch size = 16)
• Temporal order among different observations is lost
• Within each observation, the sequential order of lags is preserved

Dr. Pedram Jahangiry


Learning in DNNs
• Model learns to map fixed windows of past values to a target
• Temporal relationships are implicitly modeled through feature patterns

Dr. Pedram Jahangiry


How RNNs Process Time Series Data
• Sequential Processing:
• Input is the time series itself (one feature)
• RNN unrolls over a sequence (e.g., sequence length = 12)
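To make the shapes concrete, here is a minimal sketch (an illustration, not the course code): the same 12-step window that a DNN would see as 12 flat features becomes a 3-D tensor of shape (samples, timesteps = 12, features = 1) for the RNN. The data and layer sizes are placeholders.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X_flat = np.random.rand(100, 12).astype("float32")   # placeholder: 100 windows of 12 lags
X_seq = X_flat[..., np.newaxis]                       # reshape to (100, 12, 1) for the RNN

model = keras.Sequential([
    keras.Input(shape=(12, 1)),    # sequence length 12, one feature per time step
    layers.SimpleRNN(32),          # hidden state is carried across the 12 steps
    layers.Dense(1),               # one-step-ahead forecast
])
print(model(X_seq[:4]).shape)      # (4, 1)
```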

(Figure: recurrent NN)

Dr. Pedram Jahangiry


How RNNs Process Time Series Data
• Hidden State Mechanism:
• Hidden state carries information from previous time steps
• Explicitly models temporal dependencies

(Figure: recurrent NN)

Dr. Pedram Jahangiry


How RNNs Process Time Series Data
• Key Difference from DNNs:
• DNNs treat lagged inputs as independent features
• RNNs connect time steps via hidden states, preserving order

(Figure: standard NN vs. recurrent NN)

Dr. Pedram Jahangiry


Memory in RNNs
• Sequence Length:
• Sets the potential memory span/length (how many past time steps are seen)
• Limited by practical issues (e.g., vanishing gradients limit long-term retention)
• Hidden State Size:
• Determines the capacity/depth of the memory
• Larger hidden states can capture richer, more complex patterns

Dr. Pedram Jahangiry


Feature Engineering in RNN?

Dr. Pedram Jahangiry


It’s All about shapes! RNN

Dr. Pedram Jahangiry


All about shapes!

Dr. Pedram Jahangiry


Key Comparisons (DNN vs RNN)
• DNN:
• Simple and effective for short-term dependencies via engineered features
• Uses engineered lagged features as independent inputs
• Shuffling within batches loses order between samples

• RNN:
• Designed for sequential data
• Processes sequences one time step at a time with a hidden state
• Explicitly captures the order and dependencies in the data
• Memory is influenced by both sequence length and hidden state size

Dr. Pedram Jahangiry


RNN performance (raw data vs pre-processed data)

Dr. Pedram Jahangiry


Module 7 – Part II
Deep Sequence Modeling
Recurrent Neural Networks (RNN)

Dr. Pedram Jahangiry


Sequence Modeling
To model sequence data efficiently, we need a new architecture that:
• Preserves the order
• Accounts for long-term dependencies
• Handles variable-length inputs
• Shares parameters across the sequence

Dr. Pedram Jahangiry


What is RNN (Recurrent Neural Network)?

• The architecture of RNNs is inspired by the way biological intelligence processes information incrementally while maintaining an internal model of what it is processing.

• This ability to remember previous inputs and incorporate them into the current output allows RNNs to model sequential data.

• An RNN maintains a state that contains information relative to what it has seen so far.

• RNNs can be thought of as neural networks with an internal loop, which allows them to process sequences of varying lengths and learn from temporal dependencies.

Dr. Pedram Jahangiry


Perceptron vs Recurrent Cell

Perceptron Recurrent Cell

Dr. Pedram Jahangiry


Unrolling the Recurrent Cell

(Figure: the recurrent cell unrolled through time. Inputs $X_0, X_1, \dots, X_t$ feed repeated copies of the same cell, producing outputs $\hat{y}_0, \hat{y}_1, \dots, \hat{y}_t$.)

Dr. Pedram Jahangiry


Dense Layer vs Recurrent Layer

Dense Layer Recurrent Layer

Dr. Pedram Jahangiry


Inside the Recurrent Cell
$\text{output}_t = f(\text{input}_t, \text{state}_t)$

$s_{t+1} = \text{activation}(W X_t + U s_t + b)$

where $W$ weights the current input, $U$ weights the previous state, and the same $W$, $U$, $b$ are reused at every time step.
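A minimal NumPy sketch of this recurrence (written here for illustration, in the spirit of the "naive RNN" from Deep Learning with Python; the tanh activation, dimensions, and random weights are assumptions):

```python
import numpy as np

timesteps, input_dim, state_dim = 12, 1, 32
inputs = np.random.rand(timesteps, input_dim)   # X_0 ... X_t (placeholder data)

W = np.random.rand(state_dim, input_dim)        # input-to-state weights
U = np.random.rand(state_dim, state_dim)        # state-to-state weights
b = np.zeros(state_dim)

state = np.zeros(state_dim)                     # initial state s_0 = 0
outputs = []
for x_t in inputs:                              # process one time step at a time
    state = np.tanh(W @ x_t + U @ state + b)    # s_{t+1} = activation(W x_t + U s_t + b)
    outputs.append(state)                       # here the output is the new state itself

outputs = np.stack(outputs)                     # shape (timesteps, state_dim)
```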

Dr. Pedram Jahangiry Deep learning with Python, Francois Chollet


RNN architectures

(Figure: common RNN input/output architectures, including a many-to-many case where inputs and outputs are not aligned.)

Dr. Pedram Jahangiry


How does RNN learn representations?
• Backpropagation Through Time (BPTT): compute $\frac{\partial J}{\partial P}$, where $P$ are the parameters.
• The total loss is the sum of the per-step losses, $J = J_0 + J_1 + \dots + J_t$, so $\frac{\partial J}{\partial W} = \frac{\partial J_0}{\partial W} + \frac{\partial J_1}{\partial W} + \dots$
• $\frac{\partial J_0}{\partial W} = \frac{\partial J_0}{\partial \hat{y}_0}\,\frac{\partial \hat{y}_0}{\partial S_0}\,\frac{\partial S_0}{\partial W}$
• $\frac{\partial J_1}{\partial W} = \frac{\partial J_1}{\partial \hat{y}_1}\,\frac{\partial \hat{y}_1}{\partial S_1}\,\frac{\partial S_1}{\partial W}$, where the dependence of $S_1$ on $W$ also flows through the previous state via $\frac{\partial S_1}{\partial S_0}\,\frac{\partial S_0}{\partial W}$
• In general: $\frac{\partial J_t}{\partial W} = \sum_{k=0}^{t} \frac{\partial J_t}{\partial \hat{y}_t}\,\frac{\partial \hat{y}_t}{\partial S_t}\,\frac{\partial S_t}{\partial S_k}\,\frac{\partial S_k}{\partial W}$

(Figure: the unrolled RNN with per-step losses $J_0, J_1, \dots, J_t$ summing to the total loss $J$; states $S_1, S_2, \dots, S_t$ link consecutive steps, and the same $W$ is applied to every input $X_0, X_1, \dots, X_t$.)

Dr. Pedram Jahangiry


Vanishing Gradient Problem
• As the time horizon gets bigger, this product of derivatives gets longer and longer.
• We are multiplying many small numbers → smaller and smaller gradients → parameters that are barely updated by distant time steps and therefore unable to capture long-term dependencies.

$\frac{\partial J_t}{\partial W} = \sum_{k=0}^{t} \frac{\partial J_t}{\partial \hat{y}_t}\,\frac{\partial \hat{y}_t}{\partial S_t}\,\frac{\partial S_t}{\partial S_k}\,\frac{\partial S_k}{\partial W}$, with $S_t = \text{activation}(W X_{t-1} + U S_{t-1})$

$\frac{\partial S_{10}}{\partial S_0} = \frac{\partial S_{10}}{\partial S_9}\,\frac{\partial S_9}{\partial S_8}\,\frac{\partial S_8}{\partial S_7}\,\frac{\partial S_7}{\partial S_6}\cdots\frac{\partial S_1}{\partial S_0}$

(Figure: the unrolled RNN with states $S_1, S_2, \dots, S_t$ and the same $W$ applied to every input $X_0, X_1, \dots, X_t$.)
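A tiny numeric illustration (my own, not from the slide): if each one-step factor $\partial S_{k+1}/\partial S_k$ has magnitude around 0.5, the chained product across only 10 steps is already close to zero.

```python
factor = 0.5            # assumed typical magnitude of one one-step state derivative
chain = factor ** 10    # corresponds to dS_10/dS_0 as a product of 10 such factors
print(chain)            # 0.0009765625 -> almost no gradient reaches the early steps
```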

Dr. Pedram Jahangiry


A simple timeseries with multiple features example
• A temperature forecasting example: deep-learning-with-python-notebooks
• Predicting the temperature 24 hours in the future
• Target: temperature
• Features: 14 different variables, including pressure, humidity, wind direction, etc.
• Data recorded every 10 minutes from 2009-2016

(Figures: temperature over 2009-2016; temperature over the first 10 days, i.e., 10 × 24 × 6 = 1,440 observations.)

Dr. Pedram Jahangiry Deep learning with Python, Francois Chollet


Preparing the data
• Given the previous 5 days (120 hours) of data, sampled once per hour, can we predict the temperature 24 hours after the end of the sequence?
• Data batches:
• Sequence length = 120
• [1,2,3,…,120][144]
• [2,3,4,…,121][145]
• [3,4,5,…,122][146]
• Batch size: 256 of these (window, target) samples are shuffled and batched
• Sample shape: (256, 120, 14)
• Target shape: (256,)
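A hedged sketch of how such windows can be built with `keras.utils.timeseries_dataset_from_array` (TensorFlow/Keras 2.6+), roughly following the Chollet notebook. The arrays and the train split index below are placeholders, not the actual Jena weather data.

```python
import numpy as np
from tensorflow import keras

# Placeholder arrays standing in for the normalized Jena weather data.
raw_data = np.random.rand(50_000, 14).astype("float32")   # 14 features, one row per 10 minutes
temperature = raw_data[:, 1]                               # pretend column 1 is temperature
num_train_samples = 25_000                                 # placeholder split point

sampling_rate = 6         # raw data every 10 minutes -> keep one sample per hour
sequence_length = 120     # previous 5 days of hourly samples
delay = sampling_rate * (sequence_length + 24 - 1)   # target = temperature 24 h after window end
batch_size = 256

train_dataset = keras.utils.timeseries_dataset_from_array(
    data=raw_data[:-delay],
    targets=temperature[delay:],
    sampling_rate=sampling_rate,
    sequence_length=sequence_length,
    shuffle=True,
    batch_size=batch_size,
    start_index=0,
    end_index=num_train_samples,
)

for samples, targets in train_dataset.take(1):
    print(samples.shape, targets.shape)   # (256, 120, 14) (256,)
```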

Dr. Pedram Jahangiry Deep learning with Python, Francois Chollet


Naïve forecaster: common-sense baseline
• Temperature 24 hours from now = Temperature right now
• This is a random walk (no drift) forecaster.

• Performance:
• Validation MAE = 2.44 degrees Celsius
• Test MAE = 2.62 degrees Celsius
• The baseline model is off by about 2.5 degrees on average. Not bad!!
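For reference, a sketch of how this baseline can be scored on a dataset built as above. It assumes the temperature column is already in degrees Celsius (in the Chollet notebook the features are normalized, so the prediction is de-normalized first) and that the column index is 1.

```python
import numpy as np

TEMP_COL = 1   # assumed index of the temperature column in each sample

def naive_mae(dataset):
    """Predict 'temperature in 24 hours' as 'temperature at the last step of the window'."""
    total_abs_err, n_seen = 0.0, 0
    for samples, targets in dataset:
        preds = samples[:, -1, TEMP_COL]                        # last time step of each window
        total_abs_err += np.sum(np.abs(preds.numpy() - targets.numpy()))
        n_seen += samples.shape[0]
    return total_abs_err / n_seen

# print(naive_mae(val_dataset))   # about 2.44 degrees C in the Chollet notebook
```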

Dr. Pedram Jahangiry Deep learning with Python, Francois Chollet


Let’s try DNN (Deep Neural Networks)

• Test MAE = 2.62 degrees Celsius


• No improvement!!
• Flattening time series data is not a good idea!
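The densely connected model on this slide is, roughly, the following sketch (layer sizes follow the Chollet notebook, but treat them as assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(120, 14))          # (sequence_length, num_features)
x = layers.Flatten()(inputs)                   # flattening discards the temporal structure
x = layers.Dense(16, activation="relu")(x)
outputs = layers.Dense(1)(x)                   # temperature 24 hours ahead
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
# model.fit(train_dataset, validation_data=val_dataset, epochs=10)
```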

Dr. Pedram Jahangiry Deep learning with Python, Francois Chollet


Let’s try CNN (Convolutional Neural Networks)

• Motivation: Maybe a temporal convnet could reuse the same representations across
different days, much like a spatial convnet can reuse the same representations across
different locations in an image!
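A hedged sketch of such a temporal convnet (close in spirit to the Chollet notebook; the filter counts and window sizes are assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(120, 14))
x = layers.Conv1D(8, 24, activation="relu")(inputs)   # reuse one 24-hour pattern everywhere
x = layers.MaxPooling1D(2)(x)                         # pooling coarsens the time axis
x = layers.Conv1D(8, 12, activation="relu")(x)
x = layers.MaxPooling1D(2)(x)
x = layers.Conv1D(8, 6, activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x)                # order information is largely lost here
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
```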

Dr. Pedram Jahangiry Deep learning with Python, Francois Chollet


CNN performance
• Test MAE = 3.10 degrees Celsius
• Even worse than the densely connected model!!
• CNN treats every segment of the data the same way!
• Pooling layers are destroying order information.

Dr. Pedram Jahangiry Deep learning with Python, Francois Chollet


Let’s try a simple RNN

• Baseline Test MAE = 2.62


• Simple RNN Test MAE = 2.51
• beats the naïve forecaster.
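A sketch of the simple RNN model (the 16 units are an assumption):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(120, 14))
x = layers.SimpleRNN(16)(inputs)     # one hidden state threaded through all 120 steps
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
```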

Dr. Pedram Jahangiry Deep learning with Python, Francois Chollet


Beyond RNN
RNNs can handle the following sequence modeling criteria:
• Preserve the order
• Handle variable-length inputs
• Share parameters across the sequence

RNN limitations:
• Do not account for long-term dependencies (they only remember short-term history)
• Vanishing gradient problem

Dr. Pedram Jahangiry


Module 7 – Part III
Deep Sequence Modeling
(Gated cells, LSTM)

Dr. Pedram Jahangiry


Beyond RNN
RNNs can handle the following sequence modeling criteria:
• Preserve the order
• Handle variable-length inputs
• Share parameters across the sequence

RNN limitations:
• Do not account for long-term dependencies (they only remember short-term history)
• Vanishing gradient problem

Dr. Pedram Jahangiry


How to solve vanishing gradient problem
1. Use an activation function that prevents the gradient from shrinking too quickly

$S_t = \text{activation}(W X_{t-1} + U S_{t-1})$

Dr. Pedram Jahangiry


How to solve vanishing gradient problem
1. Use an activation function that prevents the gradient from shrinking too quickly
2. Use weight initialization techniques that ensure the initial weights are not too small
3. Use gradient clipping, which caps the magnitude of the gradients; this mainly guards against exploding gradients, the flip side of the same problem (a one-line example follows this list)
4. Use batch normalization, which normalizes the input to each layer and helps reduce the range of activation values and thus the likelihood of vanishing gradients
5. Use an optimization algorithm that is more resilient to vanishing gradients, such as Adam or RMSprop
6. Gated cells: use some sort of skip connection, which allows gradients to bypass some of the layers in the network and thus prevents them from becoming too small
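For item 3, gradient clipping is a one-line change in Keras; a minimal sketch (the threshold 1.0 is arbitrary), which also uses RMSprop as suggested in item 5:

```python
from tensorflow import keras

# clipnorm rescales any gradient whose L2 norm exceeds 1.0, guarding against explosion.
optimizer = keras.optimizers.RMSprop(learning_rate=1e-3, clipnorm=1.0)
# model.compile(optimizer=optimizer, loss="mse", metrics=["mae"])
```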

Dr. Pedram Jahangiry


Gated cells
• Instead of using a simple RNN cell, let's use a more complex cell with gates that control the flow of information.
• Think of a conveyor belt running parallel to the sequence being processed:
• Information can jump on → be transported to a later timestep → jump off when needed.
• This is what a gated cell does! It is analogous to the residual connections we saw before.

• Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are two examples
of gated cells that can keep track of information throughout many timesteps.

Dr. Pedram Jahangiry


Inside the LSTM cell

(Figure: a simple RNN cell vs. an LSTM cell.)

• Long-term memory: the carry track $C_{t-1} \to C_t$ runs across timesteps.
• Forget gate: forgets irrelevant information from the previous state.
• Input gate: adds relevant information and selectively updates the cell state.
• Output gate: outputs a filtered version of the cell state.
• Short-term memory: the previous hidden state $h_{t-1}$ becomes the new short-term memory $h_t$.
Dr. Pedram Jahangiry


LSTM details
(Figure: the LSTM cell in detail, showing the input $X_t$, the previous hidden state $h_{t-1}$ and cell state $C_{t-1}$, the forget gate $f_t$, input gate $i_t$, output gate $o_t$, the candidate state, and the updated $C_t$ and $h_t$.)
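For reference, the standard LSTM update equations behind this figure (this is the usual textbook formulation with sigmoid gates and a tanh candidate; the slide itself only shows the diagram):

```latex
\begin{aligned}
f_t &= \sigma(W_f X_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i X_t + U_i h_{t-1} + b_i) && \text{input gate} \\
o_t &= \sigma(W_o X_t + U_o h_{t-1} + b_o) && \text{output gate} \\
\tilde{C}_t &= \tanh(W_c X_t + U_c h_{t-1} + b_c) && \text{candidate state} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{new carry (long-term memory)} \\
h_t &= o_t \odot \tanh(C_t) && \text{new hidden state (short-term memory)}
\end{aligned}
```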

Dr. Pedram Jahangiry


LSTM takeaway
• LSTM uses gates to regulate the information flow (allows past information to be
reinjected later)
• This new cell state (carry) can better capture longer term dependencies
• LSTM fights the vanishing gradient problem

Dr. Pedram Jahangiry


Let’s try LSTM on the temperature example
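In code, this is the dense model with the Flatten/Dense stack swapped for a single LSTM layer; a sketch (the 16 units follow the Chollet notebook, but treat the sizes as assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(120, 14))
x = layers.LSTM(16)(inputs)          # gated cell with a separate carry track C_t
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
# model.fit(train_dataset, validation_data=val_dataset, epochs=10)
```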

Dr. Pedram Jahangiry Deep learning with Python, Francois Chollet


LSTM performance
• Baseline Test MAE = 2.62
• Simple LSTM Test MAE = 2.53
• Also beats the naïve forecaster.
• Overfitting?

Dr. Pedram Jahangiry Deep learning with Python, Francois Chollet


Can we do better?

Dr. Pedram Jahangiry


Improving the simple LSTM model
• We can improve the performance of the simple LSTM
model by:
1. Recurrent dropout: use dropout to fight overfitting in the recurrent layers (in addition to dropout for the dense layers)
2. Stacking recurrent layers: increase model complexity to boost representation power
3. Using bidirectional RNNs: process the same information in a different order; mostly used in NLP

Dr. Pedram Jahangiry Deep learning with Python, Francois Chollet


Regular vs Recurrent Dropout

Dr. Pedram Jahangiry


Recurrent Dropout
• The same dropout pattern should be applied at every timestep

• Baseline Test MAE = 2.62


• Simple RNN, Test MAE = 2.51
• Simple LSTM, Test MAE = 2.53
• LSTM with dropout, Test MAE = 2.45
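A sketch of what "LSTM with dropout" can look like in Keras (the rates are illustrative): `recurrent_dropout` applies the same mask at every time step, while the plain `Dropout` layer regularizes the output head.

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(120, 14))
x = layers.LSTM(32, recurrent_dropout=0.25)(inputs)   # same dropout mask reused at each step
x = layers.Dropout(0.5)(x)                            # regular dropout before the output layer
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
```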

Dr. Pedram Jahangiry Deep learning with Python, Francois Chollet


Stacking Recurrent Layers
• Let’s train a dropout-regularized, stacked GRU model.
• The GRU is a slightly simpler (hence faster) version of the LSTM architecture.

• Baseline Test MAE = 2.62


• Simple RNN, Test MAE = 2.51
• Simple LSTM, Test MAE = 2.53
• Stacking GRU, Test MAE = 2.39
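A hedged sketch of a dropout-regularized, stacked GRU; note `return_sequences=True` on every recurrent layer except the last, so the next layer receives the full sequence of hidden states (layer sizes are assumptions).

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(120, 14))
x = layers.GRU(32, recurrent_dropout=0.5, return_sequences=True)(inputs)  # pass the whole sequence up
x = layers.GRU(32, recurrent_dropout=0.5)(x)                              # top layer returns the final state
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
```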

Dr. Pedram Jahangiry Deep learning with Python, Francois Chollet


Bidirectional RNN
• Bidirectional RNNs process the input sequence both chronologically and antichronologically.
• Idea: capture patterns (representations) that might be overlooked by a unidirectional RNN.

• For the temperature example, the bidirectional LSTM strongly underperforms even the common-sense baseline.

• Baseline Test MAE = 2.62
• Simple RNN, Test MAE = 2.51
• Simple LSTM, Test MAE = 2.53
• Stacking GRU, Test MAE = 2.39
• Bidirectional RNN, Test MAE = 2.79
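In Keras, the bidirectional variant is a thin wrapper around the recurrent layer; a sketch (layer size assumed):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(120, 14))
x = layers.Bidirectional(layers.LSTM(16))(inputs)   # one LSTM runs forward in time, one backward
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
```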

Dr. Pedram Jahangiry Deep learning with Python, Francois Chollet


Final message
• Deep learning is more an art than a science: there are too many moving parts!
• Number of units in each recurrent layer
• Number of stacked layers
• Amount of dropout and recurrent dropout
• Number of dense layers
• Sequence length (horizon)!
• Optimizers, learning rates, etc.
• ….
• Apply RNNs to datasets where the past is a good predictor of the future, not to the stock market!

Dr. Pedram Jahangiry


Road map!
✓ Module 1- Introduction to Deep Forecasting
✓ Module 2- Setting up Deep Forecasting Environment
✓ Module 3- Exponential Smoothing
✓ Module 4- ARIMA models
✓ Module 5- Machine Learning for Time series Forecasting
✓ Module 6- Deep Neural Networks
✓ Module 7- Deep Sequence Modeling (RNN, LSTM)
• Module 8- Prophet and Neural Prophet

Dr. Pedram Jahangiry
