STOCK PRICE PREDICTOR
Authors:
Deepak, Chandigarh University
Rahul Dhiman, Chandigarh University
TABLE OF CONTENTS
1. DEFINITION
1.1 Project Overview
1.2 Problem Statement
1.3 Metrics
2. ANALYSIS
2.1 Data Exploration
2.2 Exploratory Visualization
2.3 Algorithms and Techniques
2.4 Benchmark Model
3. METHODOLOGY
3.1 Data Preprocessing
3.2 Implementation
3.3 Refinement
4. RESULT
4.1 Model Evaluation and Validation
4.2 Justification
5. CONCLUSION
5.1 Free-Form Visualization
5.2 Reflection
5.3 Improvement
References
Chapter 1.
DEFINITION
1.1 Project Overview
Investment firms, hedge funds and even individuals have
been using financial models to better understand market
behavior and make profitable investments and trades. A
wealth of information is available in the form of historical
stock prices and company performance data, suitable for
machine learning algorithms to process.
Can we actually predict stock prices with machine learning?
Investors make educated guesses by analyzing data. They
read the news, study the company's history, industry trends
and lots of other data points that go into making a
prediction. The prevailing theory is that stock prices are
totally random and unpredictable, but that raises the
question of why top firms like Morgan Stanley and Citigroup
hire quantitative analysts to build predictive models. We
picture a trading floor filled with adrenaline-infused men in
loose ties running around yelling into phones, but these days
you are more likely to see rows of machine learning experts
quietly sitting in front of computer screens. In fact, about
70% of all orders on Wall Street are now placed by software;
we are living in the age of the algorithm.
This project seeks to utilize a deep learning model, the Long
Short-Term Memory (LSTM) neural network, to predict stock
prices. For time-series data, recurrent neural networks (RNNs)
come in handy, and recent research has shown that LSTM
networks are the most popular and useful variants of RNNs.
I will use Keras to build an LSTM that predicts stock prices
from historical closing price and trading volume, and visualize
both the predicted price values over time and the optimal
parameters for the model.
1.2 Problem Statement
The challenge of this project is to accurately predict the
future closing value of a given stock across a given period
of time. For this project I will use Long Short-Term Memory
networks [1], usually just called "LSTMs", to predict the
closing price of the S&P 500 [2] using a dataset of past
prices.
Goals
1. Explore stock prices.
2. Implement a basic model using linear regression.
3. Implement an LSTM using the Keras library.
4. Compare the results and submit the report.
1.3 Metrics
For this project, performance will be measured using the
Mean Squared Error (MSE) and Root Mean Squared Error
(RMSE), calculated on the difference between the predicted
and actual adjusted close prices of the target stock, together
with the delta between the performance of the benchmark
model (Linear Regression) and our primary model (Deep
Learning).
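For reference, with y_i the actual adjusted close price, \hat{y}_i the predicted price and n the number of samples, the two metrics are defined as

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}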
Chapter 2.
ANALYSIS
2.1 Data Exploration
The data used in this project is that of Alphabet Inc. [3] from
January 1, 2005 to June 30, 2017; this is a series of data
points indexed in time order, i.e., a time series. My goal was
to predict the closing price for any given date after training.
For ease of reproducibility and reusability, all data was
pulled from the Google Finance Python API [4].
The prediction has to be made for the closing (adjusted
closing) price of the data. Since Google Finance already
adjusts the closing prices for us [5], we just need to make
a prediction for the "CLOSE" price.
The dataset is of the following form:

Date        Open     High     Low      Close    Volume
30-Jun-17   943.99   945.00   929.61   929.68   2287662
29-Jun-17   951.35   951.66   929.60   937.82   3206674
28-Jun-17   950.66   963.24   936.16   961.01   2745568

Table: The whole dataset can be found in 'Google.csv' in
the project root folder [6].
Note: I did not observe any abnormality in the dataset, i.e.,
no feature is empty or contains an incorrect value such as a
negative price.
The mean, standard deviation, maximum and minimum of
the data were found to be the following:

Feature   Open       High       Low        Close      Volume
Mean      382.5141   385.8720   378.7371   382.3502   4205707.8896
Std       213.4865   214.6022   212.0801   213.4359   3877483.0077
Max       1005.49    1008.61    1008.61    1004.28    41182889
Min       87.74      89.29      86.37      87.58      521141
We can infer from this dataset that the Date, High and Low
values are not important features. It does not matter what
the highest or lowest trading price of the stock was on a
particular day; what matters are the opening and closing
prices. If at the end of the day the closing price is higher
than the opening price, then we have some profit; otherwise
we saw losses. The volume of shares traded is also important:
a rising market should see rising volume, i.e., an increasing
price on decreasing volume shows a lack of interest and is a
warning of a potential reversal, while a price drop (or rise)
on large volume is a stronger signal that something in the
stock has fundamentally changed.
Therefore I removed the Date, High and Low features from
the dataset in the preprocessing step. The mean, standard
deviation, maximum and minimum of the preprocessed
data were found to be the following:
Feature   Mean      Std       Max   Min
Open      0.3212    0.23261   1.0   0.0
Close     0.3215    0.2328    1.0   0.0
Volume    0.09061   0.0953    1.0   0.0
2.2 Exploratory Visualization
To visualize the data I have used the matplotlib [7] library. I
have plotted the closing stock price against the number of
items (number of trading days) available.
Following is a snapshot of the plotted data:
Fig: Closing price of Alphabet Inc. X-axis: trading days;
Y-axis: closing price in USD.
Through this plot we can see continuous growth in
Alphabet Inc. The major fall in prices between trading days
600 and 1000 is likely due to the Global Financial Crisis of
2008-2009.
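The plot itself appears as an image in the original report; a minimal sketch of how such a plot can be produced (file and column names are taken from the tables above, and may differ from the project code):

import pandas as pd
import matplotlib.pyplot as plt

# Load the raw data pulled from the Google Finance API (file name from the report).
df = pd.read_csv('Google.csv')

# The raw file is newest-first, so reverse it to plot in chronological order.
closing = df['Close'].values[::-1]

plt.plot(closing)
plt.xlabel('Trading Days')
plt.ylabel('Closing Price In USD')
plt.title('Alphabet Inc. closing price')
plt.show()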
2.3 Algorithms and Techniques
The goal of this project was to study time-series data and
explore as many options as possible to accurately predict
the stock price. Through my research I came to know about
Recurrent Neural Networks (RNNs) [8], which are used
specifically for sequence and pattern learning: they are
networks with loops in them, allowing information to persist
and thus giving them the ability to memorize the data.
However, RNNs suffer from the vanishing gradient problem,
which prevents them from learning from past data as
expected. This problem is remedied by Long Short-Term
Memory networks [9], usually referred to as LSTMs. These
are a special kind of RNN, capable of learning long-term
dependencies.
In addition to adjusting the architecture of the neural
network, the following full set of parameters can be tuned
to optimize the prediction model (a brief Keras sketch
follows the list):
• Input Parameters
• Preprocessing and Normalization (see the Data Preprocessing section)
• Neural Network Architecture
• Number of Layers (how many layers of nodes in the model; used 3)
• Number of Nodes (how many nodes per layer; tested 1, 3, 8, 16, 32, 64, 100, 128)
• Training Parameters
• Training / Test Split (how much of the dataset to train versus test the model on; kept constant at 82.95% and 17.05% for the benchmark and LSTM models)
• Validation Set (kept constant at 0.05 of the training set)
• Batch Size (how many samples are processed in a single training step; kept at 1 for the basic LSTM model and at 512 for the improved LSTM model)
• Optimizer Function (which function to optimize by minimizing error; used "Adam" throughout)
• Epochs (how many times to run through the training process; kept at 1 for the basic and at 20 for the improved LSTM)
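The exact architecture and training code are shown in Chapter 3; purely as an illustration of where these knobs enter a Keras run (layer count and values here are examples drawn from the list above, not the project's actual code):

from keras.models import Sequential
from keras.layers import LSTM, Dense

# Illustrative sketch only: where each tunable parameter above lives.
model = Sequential()
model.add(LSTM(128, input_shape=(50, 3)))  # number of nodes per layer (tested 1-128)
model.add(Dense(1))
model.compile(optimizer='adam',            # optimizer function
              loss='mean_squared_error')

# Training parameters (x_train / y_train come from the train/test split above):
# model.fit(x_train, y_train,
#           batch_size=512,         # 1 for the basic LSTM, 512 for the improved one
#           epochs=20,              # 1 for the basic LSTM, 20 for the improved one
#           validation_split=0.05,  # validation set carved out of the training data
#           verbose=2)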
2.4 Benchmark Model
For this project I have used a Linear Regression model as
the primary benchmark. One of my goals is to understand
the relative performance and implementation differences of
machine learning versus deep learning models. This linear
regressor was based on the examples presented in
Udacity's Machine Learning for Trading course and was
used for MSE and RMSE error-rate comparison, utilizing the
same dataset as the deep learning models.
Following are the predicted results from the benchmark
model:
Fig: Benchmark (Linear Regression) predictions. X-axis:
trading days; Y-axis: closing price in USD. Green line:
adjusted close price; blue line: predicted close price.
Train Score: 0.1852 MSE (0.4303 RMSE)
Test Score: 0.08133781 MSE (0.28519784 RMSE)
Chapter 3.
METHODOLOGY
3.1 Data Preprocessing
Acquiring and preprocessing the data for this project occurs
in the following sequence, much of which has been
modularized into the preprocess.py file for importing and
use across all notebooks (a sketch of these steps in code
follows the list):
• Request the data from the Google Finance Python API and
save it in the google.csv file in the following format.
• Remove the unimportant features (Date, High and Low)
from the acquired data and reverse the order of the data,
i.e., from January 03, 2005 to June 30, 2017.
Item   Open     Close    Volume
0      98.80    101.46   15860692
1      100.77   97.35    13762396
2      96.82    96.85    8239545
3      97.72    94.37    10389803
• Normalized the data using the MinMaxScaler helper
function from Scikit-Learn:
Item   Open       Close      Volume
0      0.012051   0.015141   0.377248
1      0.014198   0.010658   0.325644
2      0.009894   0.010112   0.189820
3      0.010874   0.007407   0.242701
• Stored the normalized data in google_preprocessed.csv
file for future reusability.
• Split the dataset into training (68.53%) and test (31.47%)
sets for the linear regression model. The split was of the
following shape:
x_train (2155, 1)
y_train (2155, 1)
x_test (990, 1)
y_test (990, 1)
• Split the dataset into training (82.95%) and test (17.05%)
sets for the LSTM model. The split was of the following
shape:
x_train (2589, 50, 3)
y_train (2589,)
x_test (446, 50, 3)
y_test (446,)
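As referenced above, the following is a minimal sketch of these preprocessing steps; the function name and exact window construction are assumptions (the 50-step windows and split fractions are taken from the shapes listed), so the actual preprocess.py may differ:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess(csv_path='google.csv', window=50):
    # Illustrative sketch of the preprocessing pipeline described above.
    df = pd.read_csv(csv_path)

    # Keep only the features used by the models and restore chronological order.
    df = df[['Open', 'Close', 'Volume']].iloc[::-1].reset_index(drop=True)

    # Scale every feature into [0, 1] and store the result for reuse.
    data = MinMaxScaler().fit_transform(df.values)
    pd.DataFrame(data, columns=df.columns).to_csv('google_preprocessed.csv', index=False)

    # Build overlapping 50-step sequences; the target is the normalized
    # closing price on the day following each window (column 1 is 'Close').
    x, y = [], []
    for i in range(len(data) - window):
        x.append(data[i:i + window])
        y.append(data[i + window, 1])
    x, y = np.array(x), np.array(y)

    # 82.95% / 17.05% train/test split, as used for the LSTM model.
    split = int(len(x) * 0.8295)
    return (x[:split], y[:split]), (x[split:], y[split:])

(x_train, y_train), (x_test, y_test) = preprocess()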
3.2 Implementation
Once the data has been downloaded and preprocessed, the
implementation process proceeds consistently through all
three models as follows.
I have thoroughly specified all the steps to build, train and
test each model and its predictions in the notebook itself.
Some code implementation insights:
Benchmark model:
Step 1: Split into training and test sets.
Here I am calling a function defined in 'stock_data.py'
which splits the data for the linear regression model.
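A hypothetical sketch of what such a split function might look like, assuming the single input feature is the normalized opening price and the target is the normalized closing price (this assumption is not stated in the report):

import pandas as pd

def split_for_linear_regression(csv_path='google_preprocessed.csv', train_fraction=0.6853):
    # Hypothetical sketch: one input feature (normalized Open), one target (normalized Close).
    df = pd.read_csv(csv_path)
    x = df[['Open']].values
    y = df[['Close']].values
    split = int(len(x) * train_fraction)
    return x[:split], y[:split], x[split:], y[split:]

x_train, y_train, x_test, y_test = split_for_linear_regression()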
Step 2: Build the model using the scikit-learn
linear_model [10] library.
Here I am calling a function defined in
'LinearRegressionModel.py' which builds the model for
the project.
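A minimal scikit-learn sketch of such a builder (illustrative, not the project's actual code):

from sklearn.linear_model import LinearRegression

def build_model(x_train, y_train):
    # Fit an ordinary least-squares regressor on the training data.
    model = LinearRegression()
    model.fit(x_train, y_train)
    return model

model = build_model(x_train, y_train)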
Step 3: Now it's time to predict the prices for the given test
dataset, using a prediction function also defined in
'LinearRegressionModel.py'.
Step 4: Finally, calculate the test score and plot the results
of the benchmark model.
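Steps 3 and 4 might look roughly like the following, reusing the names from the sketches above (the plotting details are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

# Predict on the held-out test set and report MSE / RMSE.
predictions = model.predict(x_test)
mse = mean_squared_error(y_test, predictions)
print('Test Score: %.8f MSE (%.8f RMSE)' % (mse, np.sqrt(mse)))

# Plot actual versus predicted (normalized) closing prices.
plt.plot(y_test, label='Adjusted close price')
plt.plot(predictions, label='Predicted close price')
plt.xlabel('Trading Days')
plt.ylabel('Closing Price (normalized)')
plt.legend()
plt.show()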
Improved LSTM model:
Step 1: Split into training and test sets.
Note: The same set of training and testing data is used for
the improved LSTM as for the basic LSTM.
Step 2: Build an improved LSTM model.
Here I am calling a function defined in 'lstm.py' which
builds the improved LSTM model for the project.
NOTE: The function uses the Keras Long Short-Term
Memory [11] library to implement the LSTM model. For my
improved LSTM model I have increased the batch_size
from 1 to 512 and the number of epochs from 1 to 20.
Also, in the function I have increased the number of nodes
in the hidden layer from 100 to 128 and have added a
dropout of 0.2 to all the layers.
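As a hedged sketch only, an architecture consistent with that description might look like this; the number of stacked LSTM layers and the return_sequences settings are assumptions, not taken from lstm.py:

from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

def build_improved_model(window=50, n_features=3):
    # Sketch only: layer stack is an assumption consistent with the description above.
    model = Sequential()
    # First recurrent layer: 128 nodes (up from 100), returning full sequences.
    model.add(LSTM(128, input_shape=(window, n_features), return_sequences=True))
    model.add(Dropout(0.2))
    # Second recurrent layer collapses the sequence into a single vector.
    model.add(LSTM(128, return_sequences=False))
    model.add(Dropout(0.2))
    # Linear output: the predicted (normalized) closing price.
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model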
Step 3: We now need to train our model.
I have used a built-in library function to train the model.
Step 4: Now it's time to predict the prices for the given test
dataset.
I have used a built-in function to predict the outcomes of
the model.
Step 5: Finally, calculate the test score and plot the results
of the improved LSTM model.
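A sketch of Steps 3 to 5, reusing build_improved_model and the arrays from the preprocessing sketch in Section 3.1 (names are illustrative):

import numpy as np
from sklearn.metrics import mean_squared_error

model = build_improved_model()

# Step 3: train with the improved hyperparameters.
model.fit(x_train, y_train, batch_size=512, epochs=20,
          validation_split=0.05, verbose=2)

# Step 4: predict on the test windows, passing the same batch size.
predictions = model.predict(x_test, batch_size=512).flatten()

# Step 5: report MSE and RMSE on the normalized closing prices.
mse = mean_squared_error(y_test, predictions)
print('Test Score: %.8f MSE (%.8f RMSE)' % (mse, np.sqrt(mse)))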
3.3 Refinement
For this project I have worked on fine tuning
parameters of LSTM to get better predictions. I
did the improvement by testing and analyzing
each parameter and then selecting the final
value for each of them.
To improve LSTM i have done following:
● Increased the number of hidden nodes from 100 to 128.
● Added Dropout of 0.2 at each layer of LSTM
● Increased batch size from 1 to 512
● Increased epochs from 1 to 20
● Added verbose = 2
● Made predictions with the same batch size
This improved my mean squared error on the test set from
0.01153170 MSE to 0.00093063 MSE.
The difference in the predicted plots can be seen as follows:
Fig: Plot of adjusted close and predicted close prices for the
basic LSTM model
Fig: Plot of adjusted close and predicted close prices for the
improved LSTM model
Chapter 4.
RESULT
4.1 Model Evaluation and Validation
With each model I have refined and fine tuned my
predictions and have reduced mean squared error
significantly.
● For my first model using linear regression model:
● Train Score: 0.1852 MSE (0.4303 RMSE)
● Test Score: 0.08133781 MSE
(0.28519784 RMSE)
Fig: Plot of Linear Regression Model
● For my second model using basic Long-Short Term
memory model:
● Train Score: 0.00089497 MSE (0.02991610
RMSE)
● Test Score: 0.01153170 MSE (0.10738577
RMSE)
Fig: Plot of basic Long-Short Term Memory
model
● For my third and final model, using improved Long-
Short Term memory model:
● Train Score: 0.00032478 MSE (0.01802172
RMSE)
● Test Score: 0.00093063 MSE (0.03050625
RMSE)
Fig: Plot of Improved Long-Short Term
Memory Model
Robustness Check :
For checking the robustness of my final model I used
unseen data, i.e., data of Alphabet Inc. from July 1,
2017 to July 20, 2017. On predicting the values of this
unseen data I got a decent result. The results are as
follows:
Test Score: 0.3897 MSE (0.6242 RMSE)
4.2 Justification
Comparing the benchmark model (Linear Regression)
to the final improved LSTM model, the Mean Squared
Error improves from 0.08133781 MSE (0.28519784
RMSE) for the Linear Regression model to 0.00093063
MSE (0.03050625 RMSE) for the improved LSTM. This
significant decrease in error rate clearly shows that my
final model has surpassed both the basic and the
benchmark models.
Also, the average delta between the actual and
predicted adjusted closing price values was:
Delta Price: 0.000931 (RMSE × Adjusted Close range)
which is less than one cent :)
Chapter 5.
CONCLUSION
5.1 Free-Form Visualization
I have already discussed all the important features of
the dataset and their visualization in the sections
above. But to conclude my report I would choose my
final model's visualization, the improved LSTM obtained
by fine-tuning parameters. I was very impressed to see
how close I got to the actual data, with a mean squared
error of just 0.0009. It was an 'Aha!' moment for me, as
I had to poke around a lot (really A LOT!), but it was fun
working on this project.
Fig: Plot of Improved Long-Short Term
Memory Model
5.2 Reflection
To recap, the process undertaken in this project:
● Set Up Infrastructure
○ iPython Notebook
○ Incorporate required libraries (Keras, TensorFlow,
Pandas, Matplotlib, Sklearn, Numpy)
○ Git project organization
● Prepare Dataset
○ Incorporate data of Alphabet Inc company
○ Process the requested data into Pandas
Dataframe
○ Develop function for normalizing data
○ Dataset used with an 80/20 split on training and
test data across all models
● Develop Benchmark Model
○ Set up basic Linear Regression model with Scikit-
Learn
○ Calibrate parameters
● Develop Basic LSTM Model
○ Set up basic LSTM model with Keras utilizing
parameters from Benchmark Model
● Improve LSTM Model
○ Develop, document, and compare results using
additional labels for the LSTM model
● Document and Visualize Results
○ Plot actual, benchmark predicted values, and
LSTM predicted values per time series
○ Analyze and describe results for the report.
I started this project with the hope of learning a
completely new algorithm, i.e., Long Short-Term
Memory, and of exploring real time-series datasets.
The final model really exceeded my expectations and
has worked remarkably well. I am greatly satisfied with
these results.
The major problem I faced during the implementation
of the project was exploring the data; it was the
toughest task. Converting the data from raw format to
preprocessed data and then splitting it into training
and test sets requires a great deal of patience and a
very precise approach. I also had to work around a lot
to successfully use the data for the two models, i.e.,
Linear Regression and Long Short-Term Memory, as
both of them have different input sizes. I read many
research papers to get this final model right, and I
think it was all worth it :)
5.3 Improvement
Before starting my journey as a Machine Learning
Nanodegree graduate I had no prior experience in
Python. At the beginning of this course I had to google
how to do everything in Python, but now I have not
only built 7 projects in Python, I have also explored
many libraries along the way and can use them very
comfortably. This is all because of the highly
interactive videos and forums provided by Udacity. I
am really happy and satisfied with taking up this
course.
And as there is scope for improvement in every
individual, so is the case with this project. This project
predicts closing prices with a very low Mean Squared
Error, but there are still some things lacking. The two
most important ones are:
● There is no user interaction or interface provided
in this project. A UI could be added where users can
check predicted values for future dates.
● The stocks used for this project are only those of
Alphabet Inc; more S&P 500 companies could surely be
added to the list to make this project more
comprehensive.
I would definitely like to add these improvements to
this project in the future.
References:
[1] Long Short-Term Memory networks
[2] S&P 500 companies
[3] Alphabet Inc.
[4] Google Finance Python API
[5] Adjusted closing prices (Google Finance)
[6] Google.csv (project root folder)
[7] Matplotlib
[8] Recurrent Neural Network
[9] Long-Short Term Memory
[10] scikit-learn linear_model
[11] Keras Long Short-Term Memory