Hands-on Practice on Financial AI Session
Session 3.
Financial Market Prediction Using Machine Learning
AI for Finance (IE471)
KAIST Financial Engineering Lab.
[3-1] Introduction
1 Binary Classification Problem in Financial Markets
Viewed in the simplest terms, the movement of a financial market index
can be reduced to two classes: Up (≒ Bullish) or Down (≒ Bearish).
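A minimal sketch of this Up / Down labeling, assuming a plain list of daily closing values. The closing values are hypothetical, and treating an unchanged day as Down is an assumption of this sketch, not something the slides specify.

```python
def label_moves(closes):
    """Return 1 (Up) when the index closes above the previous day, else 0 (Down)."""
    return [1 if today > prev else 0 for prev, today in zip(closes, closes[1:])]


closes = [1930.5, 1925.4, 1904.3, 1917.6]  # hypothetical index closes
labels = label_moves(closes)
print(labels)  # [0, 0, 1]
```

The label list is one element shorter than the price list because the first day has no previous close to compare against.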
[3-1] Idea
1 Predicting the Next-Day Fluctuation of KOSPI Using Other Global Market Indices
Caution
▪ This row is the fluctuation of the next day (2016-01-06)
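The alignment in the caution above can be sketched as follows: the features observed on day t are paired with the KOSPI move on day t+1. The function name, the feature name `sp500_return`, and all values are hypothetical and used only for illustration.

```python
def make_supervised(dates, features, target_closes):
    """Pair the features of day t with the target's Up/Down move on day t+1."""
    rows = []
    for t in range(len(dates) - 1):  # the last day has no next-day label
        label = 1 if target_closes[t + 1] > target_closes[t] else 0
        rows.append((dates[t], features[t], label))
    return rows


dates = ["2016-01-04", "2016-01-05", "2016-01-06"]
features = [{"sp500_return": -0.015}, {"sp500_return": 0.002}, {"sp500_return": -0.013}]
kospi_close = [1918.8, 1930.5, 1925.4]  # hypothetical closes

rows = make_supervised(dates, features, kospi_close)
for row in rows:
    print(row)
```

The row dated 2016-01-05 carries the label for 2016-01-06, so no information from the predicted day leaks into its features.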
[3-1] Machine Learning Models
1 XGBoost
▪ Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable
tree boosting system. In Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and Data
Mining (pp. 785-794).
▪ In machine learning, boosting is a family of ensemble
meta-algorithms that convert weak learners into strong ones,
primarily reducing bias and also variance in supervised
learning.
▪ XGBoost has been a dominant algorithm in applied machine
learning and Kaggle competitions for structured (tabular)
data.
▪ XGBoost is an implementation of gradient-boosted decision
trees designed for speed and performance.
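The boosting idea can be sketched with a toy example: each weak learner (a one-split decision stump) is fit to the residuals left by the learners before it. This is plain gradient boosting for squared loss on made-up data, not XGBoost itself, which adds regularization, second-order gradient information, and many engineering optimizations. All names and values are illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides of the split non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly add a stump fit to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8]  # made-up targets
model = boost(xs, ys)
print([round(model(x), 2) for x in xs])
```

Each stump alone is a weak model, but the sum of many small corrections fits the data far better than the initial mean prediction.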
Source: StatQuest with Josh Starmer. XGBoost Part 2 (of 4): Classification (Retrieved: 2022.03.10.)
▪ https://www.youtube.com/watch?v=8b1JEDvenQU
[3-1] Code Description
1 Preparation
▪ For data preprocessing and visualization
▪ For constructing a machine learning model
▪ Setting seeds for scoring
[3-1] Code Description
1 Preprocessing Data
[3-1] Code Description
1 Training a Classification Model and Evaluating Its Performance
▪ Training data
▪ Saving classification results
▪ Evaluating performance and plotting the confusion matrix
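The evaluation step can be sketched without any library: count the four outcome types and derive accuracy from them. The labels below are made up, and the layout [[TN, FP], [FN, TP]] is a choice of this sketch.

```python
def confusion_matrix(y_true, y_pred):
    """Return counts as [[TN, FP], [FN, TP]] for labels 0 (Down) and 1 (Up)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def accuracy(matrix):
    """Share of predictions on the diagonal of the confusion matrix."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up actual moves
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up predictions

cm = confusion_matrix(y_true, y_pred)
print(cm)            # [[3, 1], [1, 3]]
print(accuracy(cm))  # 0.75
```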
[3-2] Introduction
1 Sentiment Analysis?
Sentiment analysis is a natural language processing technique
used to determine whether data is positive, negative or neutral.
[3-2] Idea
1 Sentiment Analysis: Example
Ex. NAVER Sentiment Movie Corpus:
fin1234
- This movie is very funny! I recommend this!
eng5678
- The dubbed voice is so annoying.
By conducting sentiment analysis, we can classify text data into three
sentiment classes: positive, negative, and neutral.
[3-2] Idea
2 Sentiment Analysis: Idea
Question
Can we incorporate the positive / negative sentiment
toward a particular stock into the prediction?
Goal of this session
We will add the positive / negative sentiment score
to the stock price prediction model of the previous session.
[3-2] Machine Learning Models
1 RNN (Recurrent Neural Network)
▪ A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes
form a directed graph along a temporal sequence.
▪ It is known to be well suited to data that arrives sequentially, such as time-series data.
An unrolled recurrent neural network
Source: Stanford University. CS231n: Convolutional Neural Networks for Visual Recognition. (Retrieved: 2022.03.10.)
▪ http://cs231n.stanford.edu/
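The recurrence behind the unrolled diagram can be sketched with a scalar vanilla RNN, where the same weights are reused at every time step: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b). The weight values are arbitrary assumptions chosen for illustration.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a scalar vanilla RNN over a sequence and return every hidden state."""
    h = 0.0  # initial hidden state
    hidden_states = []
    for x in inputs:
        # The new state depends on the current input and the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        hidden_states.append(h)
    return hidden_states


states = rnn_forward([1.0, 0.0, 0.0])
print([round(s, 4) for s in states])
```

Only the first input is non-zero, yet the later hidden states stay positive: the first input keeps influencing the state through the recurrent connection, with an effect that shrinks at every step.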
[3-2] Machine Learning Models
2 LSTM (Long Short-Term Memory)
▪ Because of the vanishing gradient problem, vanilla RNN
models are not used very often.
▪ Long Short-Term Memory (LSTM) (Hochreiter and
Schmidhuber, 1997) is an improved RNN model that
mitigates the vanishing gradient problem.
▪ LSTM models are very powerful in sequence prediction
problems because they are able to store past information.
This is important in our case because the previous prices of
a stock are key inputs for predicting its future price.
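How an LSTM stores past information can be sketched with a single scalar LSTM step using the standard gate equations. The `params` layout (gate name mapped to input weight, recurrent weight, bias) and all weight values are assumptions made for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step; params maps a gate name to (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, bias = params[name]
        return activation(w_x * x + w_h * h_prev + bias)

    f = gate("forget", sigmoid)       # how much of the old cell state to keep
    i = gate("input", sigmoid)        # how much new information to write
    o = gate("output", sigmoid)       # how much of the cell state to expose
    g = gate("candidate", math.tanh)  # the new information itself
    c = f * c_prev + i * g            # additive cell-state update
    h = o * math.tanh(c)
    return h, c


# A large forget-gate bias keeps the cell state almost unchanged across steps.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 1.0
for x in [0.3, -0.2, 0.5]:
    h, c = lstm_step(x, h, c, params)
print(round(c, 4))
```

Because the cell state is updated additively and scaled by the forget gate, information (and gradient) can pass through many steps without shrinking as quickly as in a vanilla RNN.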
[Figure panels: RNN, LSTM]
[3-2] Machine Learning Models
3 Package for Machine Learning: PyTorch
PyTorch is an open source machine learning library based on the Torch library, used for applications such as
computer vision and natural language processing, primarily developed by Meta's AI Research lab (FAIR).
Advantages
▪ It is easy to install.
▪ It consists of intuitive and concise code that is easy to understand and debug.
▪ It is highly compatible with Python libraries (NumPy, SciPy, Cython and so on).
[3-2] Prior Research
1 Sentiment Analysis for Predicting Stock Prices: Mittal, A., & Goel, A. (2012).
What they did…
▪ They used Twitter data to predict public mood, then used the predicted mood and the previous days' DJIA (Dow Jones
Industrial Average) values to predict stock market movements.
▪ They obtained 75.56% accuracy on the Twitter feeds and DJIA values from the period June 2009 to December 2009.
[3-2] Prior Research
2 Sentiment Analysis for Predicting Stock Prices: Nguyen, T. H., Shirai, K., & Velcin, J. (2015)
What they did…
▪ This paper evaluates the effectiveness of sentiment analysis in the stock prediction task via a
large-scale experiment.
▪ Their method achieved 9.83% better accuracy than the historical price method, and 3.03% better than the human
sentiment method.
[3-2] Data
1 Data
We will use…
▪ Tesla (TSLA) stock price data from 2nd January 2020 to 31st December 2020.
▪ 200 tweets per day from Twitter that include the keyword 'TSLA' (the ticker symbol of Tesla) or 'Tesla'.
[3-2] Code Description
1 Original Stock Prediction Model (Python File: 3-2-1)
▪ For data preprocessing and visualization
▪ For constructing a machine learning model
[3-2] Code Description
3 Loading Dataset
▪ There are six columns in the loaded
dataset.
▪ We will use five features: High price, low
price, open price, close price, and volume.
[3-2] Code Description
4 Scaling and Converting Data
▪ We collected Tesla’s stock price
data in 2020.
▪ We split these data into the
training set and test set. We set the
data from January 2020 to
September 2020 as training data
and the data from October 2020 to
December 2020 as test data.
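A minimal sketch of the chronological split and scaling described above. Min-max scaling with the minimum and maximum taken from the training set only is an assumption of this sketch; the slides do not state which scaler is used. Prices and function names are hypothetical.

```python
def chronological_split(rows, first_test_date):
    """Split (date, close) rows by date; ISO date strings sort correctly as text."""
    train = [row for row in rows if row[0] < first_test_date]
    test = [row for row in rows if row[0] >= first_test_date]
    return train, test


def min_max_scale(values, lo, hi):
    """Map values so that lo becomes 0.0 and hi becomes 1.0."""
    return [(v - lo) / (hi - lo) for v in values]


rows = [("2020-01-02", 100.0), ("2020-06-15", 120.0), ("2020-09-30", 110.0),
        ("2020-10-01", 140.0), ("2020-12-31", 130.0)]  # hypothetical closes

train, test = chronological_split(rows, "2020-10-01")
train_close = [close for _, close in train]
lo, hi = min(train_close), max(train_close)  # fitted on training data only

train_scaled = min_max_scale(train_close, lo, hi)
test_scaled = min_max_scale([close for _, close in test], lo, hi)
print(train_scaled)  # [0.0, 1.0, 0.5]
print(test_scaled)   # [2.0, 1.5]
```

Scaled test values can fall outside [0, 1] when test prices exceed the training range, which is expected when the scaler never sees the test period.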
[3-2] Code Description
5 Constructing LSTM Model
Reference: https://9bow.github.io/PyTorch-tutorials-kr-0.3.1/beginner/blitz/autograd_tutorial.html
[3-2] Stock Price Prediction Using LSTM
6 Setting Hyperparameters and Training Data
Input_size (Input Size)
▪ 5 features (high price, low price, open price, close price, volume)
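What an input size of 5 means for the model input can be sketched by cutting the daily rows into fixed-length windows. The window length, the column order [high, low, open, close, volume], and the choice of the next day's close as the target are assumptions of this sketch; the values are made up.

```python
CLOSE = 3  # index of the close price in [high, low, open, close, volume]


def make_windows(rows, seq_len):
    """Turn daily feature rows into (window, next_close) training pairs."""
    samples = []
    for start in range(len(rows) - seq_len):
        window = rows[start:start + seq_len]     # seq_len days x 5 features
        target = rows[start + seq_len][CLOSE]    # close of the following day
        samples.append((window, target))
    return samples


# Six made-up trading days, each with the 5 features.
rows = [[10.0 + d, 8.0 + d, 9.0 + d, 9.5 + d, 1000.0 * (d + 1)] for d in range(6)]
samples = make_windows(rows, seq_len=3)

window, target = samples[0]
print(len(samples), len(window), len(window[0]))  # 3 3 5
print(target)  # 12.5
```

The last number in the printed shape, 5, is the value that the LSTM's input size must match.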
[3-2] Code Description
7 Test
Reverse transformation (converting scaled predictions back to the original price scale)
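The reverse transformation can be sketched as the inverse of min-max scaling, assuming the same minimum and maximum that were used to scale the data. The scaler type, the bounds, and the predicted values are hypothetical.

```python
def scale(value, lo, hi):
    """Min-max scaling: lo maps to 0.0 and hi maps to 1.0."""
    return (value - lo) / (hi - lo)


def inverse_scale(scaled, lo, hi):
    """Undo min-max scaling to recover a value in price units."""
    return scaled * (hi - lo) + lo


lo, hi = 100.0, 120.0                # hypothetical bounds from the training data
predicted_scaled = [0.25, 0.5, 1.5]  # hypothetical model outputs
predicted_prices = [inverse_scale(s, lo, hi) for s in predicted_scaled]
print(predicted_prices)  # [105.0, 110.0, 130.0]
```

Errors such as RMSE are easier to interpret after this step because they are expressed in price units rather than in scaled units.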
[3-2] Code Description
1 NLTK (Python File: 3-2-2)
▪ NLTK (Natural Language Toolkit) is a platform for building Python programs that work with human language
data.
▪ It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of
text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and
wrappers for industrial-strength NLP libraries.
[3-2] Code Description
1 Cleaning the Raw Tweets (Python File: 3-2-2)
▪ For preprocessing text data
▪ For visualizing text data
Remove special characters
▪ Delete special characters such as #, !, and periods.
Lowercase all tweets
▪ Lowercasing makes stopwords easier to detect.
▪ In computing, stopwords are words that are
filtered out before or after processing
natural language data (text).
▪ For example, words such as I, my, me, and over,
as well as postpositions and suffixes, appear
often in sentences but rarely contribute to the
actual semantic analysis.
Tokenize tweets
▪ Tokenization is the splitting of a string into
several pieces (tokens).
▪ Ex. "Hello, World." → 'Hello', ',', 'World', '.'
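The cleaning steps above can be sketched with the standard library alone. The stopword set here is a tiny illustrative list, not NLTK's English stopword list, and the sample tweet is made up. Unlike the tokenization example above, this sketch removes punctuation first and then splits on whitespace, so punctuation never becomes a token.

```python
import re

# Small illustrative stopword set (assumption; NLTK ships a much longer list).
STOPWORDS = {"i", "my", "me", "over", "the", "is", "a", "this"}


def clean_tweet(text):
    """Remove special characters, lowercase, tokenize, and drop stopwords."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # replace special characters
    tokens = text.lower().split()                # lowercase, split on whitespace
    return [token for token in tokens if token not in STOPWORDS]


tokens = clean_tweet("I love my #Tesla! $TSLA is over the moon.")
print(tokens)  # ['love', 'tesla', 'tsla', 'moon']
```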
[3-2] Code Description
2 Lemmatizing Tweets (Python File: 3-2-2)
Lemmatize tweets
▪ Lemmatization in linguistics is the process of
grouping together the inflected forms of a
word so they can be analyzed as a single item,
identified by the word's lemma, or dictionary
form.
▪ Ex. watched → watch
[3-2] Code Description
3 Frequency Analysis (Python File: 3-2-2)
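A frequency analysis over cleaned tweets can be sketched with `collections.Counter`. The tokenized tweets are made up, and the helper name `top_words` is hypothetical.

```python
from collections import Counter


def top_words(tokenized_tweets, n=3):
    """Count every token across all tweets and return the n most frequent."""
    counts = Counter(token for tweet in tokenized_tweets for token in tweet)
    return counts.most_common(n)


tweets = [
    ["tesla", "stock", "rally"],
    ["tesla", "delivery", "record"],
    ["stock", "tesla"],
]
print(top_words(tweets, n=2))  # [('tesla', 3), ('stock', 2)]
```

The resulting counts are the usual input for a bar chart or word cloud of the most frequent terms.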
[3-2] Code Description
1 TextBlob (Python File: 3-2-3)
▪ TextBlob is based on NLTK and has many features to facilitate text processing.
▪ For a polarity task, TextBlob outputs a float in the range [-1.0, 1.0], where -1.0 is the most negative polarity and 1.0 the most positive.
A score of exactly 0 stands for a neutral evaluation; it occurs, for example, when the statement contains none of the words known to the analyzer.
[3-2] Code Description
2 Calculate Sentiment Score with TextBlob (Python File: 3-2-3)
Dictionary-based scores
Ex.
▪ Recommend = Positive Word (+1)
▪ Disappointed = Negative Word (-1)
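The dictionary-based idea can be sketched with a toy lexicon that averages the scores of the known words in a tweet. The lexicon entries and the averaging rule are assumptions of this sketch; TextBlob's own lexicon and scoring rules are richer than this.

```python
# Toy lexicon for illustration only (not TextBlob's lexicon).
LEXICON = {"recommend": 1.0, "funny": 1.0, "disappointed": -1.0, "annoying": -1.0}


def lexicon_score(tokens):
    """Average the lexicon scores of known words; 0.0 when no word is known."""
    hits = [LEXICON[token] for token in tokens if token in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


print(lexicon_score(["i", "recommend", "this"]))        # 1.0
print(lexicon_score(["so", "annoying"]))                # -1.0
print(lexicon_score(["tesla", "car"]))                  # 0.0
print(lexicon_score(["funny", "but", "disappointed"]))  # 0.0
```

The last two cases show two different routes to a score of 0: no known words at all, or positive and negative words that cancel out.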
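Average sentiment score of a particular day:
\[ \bar{s} = \frac{1}{n} \sum_{i=1}^{n} s_i \]
where s_i is the sentiment score of tweet #i on that day and n is the number of tweets on that day.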
Remove neutral tweets
▪ Neutral tweets pull the average daily sentiment
score toward zero, so the average may be
underestimated.
▪ Sentiment scores that include too many neutral
tweets may not help improve stock price
prediction results.
▪ Therefore, in this session we delete neutral tweets,
i.e., tweets whose sentiment scores are exactly 0.
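The effect of removing neutral tweets on the daily average can be sketched as follows. The scores are made up, and returning 0.0 for a day with no remaining tweets is an assumption of this sketch.

```python
def daily_average(scores, drop_neutral=False):
    """Average the sentiment scores of one day, optionally skipping exact zeros."""
    if drop_neutral:
        scores = [s for s in scores if s != 0.0]
    return sum(scores) / len(scores) if scores else 0.0


scores = [0.5, 0.0, 0.0, -0.1, 0.0, 0.4]  # made-up scores for one day

print(round(daily_average(scores), 4))                     # 0.1333
print(round(daily_average(scores, drop_neutral=True), 4))  # 0.2667
```

With half of the tweets neutral, the average doubles once they are removed, which illustrates the underestimation described above.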
[Figure panels: average sentiment scores before removing neutral tweets; average sentiment scores after removing neutral tweets]
▪ We will use the average sentiment scores after removing neutral tweets to predict Tesla’s stock
prices.
[3-2] Code Description
1 Difference from the Original Stock Price Prediction Model (Python File: 3-2-4)
Changed dataset shape
Changed torch tensor size
Changed input size
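The change in input size can be sketched by appending each day's average sentiment score to the five price features, which gives six features per day. Treating a day without a score as neutral (0.0) is an assumption of this sketch; the dates and values are made up.

```python
def add_sentiment(price_rows, daily_sentiment):
    """Append each day's average sentiment score as a sixth feature."""
    merged = []
    for date, features in price_rows:
        score = daily_sentiment.get(date, 0.0)  # assumption: missing day = neutral
        merged.append((date, features + [score]))
    return merged


price_rows = [
    ("2020-01-02", [10.0, 8.0, 9.0, 9.5, 1000.0]),   # high, low, open, close, volume
    ("2020-01-03", [11.0, 9.0, 9.5, 10.5, 1200.0]),
]
daily_sentiment = {"2020-01-02": 0.21}

merged = add_sentiment(price_rows, daily_sentiment)
input_size = len(merged[0][1])
print(input_size)  # 6
```

The model's input size then has to change from 5 to 6 to match the new feature vector.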
[3-2] Code Description
2 Comparison of two results
[Figure panels: without sentiment analysis results; with sentiment analysis results]
▪ Adding the sentiment analysis results may improve the prediction results.
▪ However, additional features can also lead to worse results than the original model, for example because the added
sentiment index is not informative for training or because of overfitting.
[3-2] Results & Conclusion
2 Conclusion
▪ The recurrent neural network (RNN) model can be used as a time-series forecasting model.
▪ LSTM (Long Short-Term Memory) models are improved versions of RNN that mitigate the vanishing
gradient problem.
▪ There can be various features to consider when we predict stock prices.
▪ Sentiment analysis is a natural language processing technique used to determine whether data is positive, negative or
neutral (narrow definition).
▪ More effective sentiment analysis is possible by preprocessing collected raw text data such as
tweets.
▪ We can conduct sentiment analysis in English using the NLTK and TextBlob Python libraries.
Reference
▪ Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).
▪ Kim, Y. B., Lee, S. H., Kang, S. J., Choi, M. J., Lee, J., & Kim, C. H. (2015). Virtual world currency value fluctuation prediction system based on user sentiment analysis. PloS one, 10(8), e0132944.
▪ Kuzminykh, N. (2020). Sentiment Analysis in Python With TextBlob. https://stackabuse.com/sentiment-analysis-in-python-with-textblob/
▪ Lee, K. (2017). Understanding RNN and LSTM. https://ratsgo.github.io/natural%20language%20processing/2017/03/09/rnnlstm/.
▪ Li, F., Krishna, R., and Xu, D. (2021). CS231n: Convolutional Neural Networks for Visual Recognition Stanford - Spring 2021 http://cs231n.stanford.edu/
▪ Mittal, A., & Goel, A. (2012). Stock prediction using Twitter sentiment analysis. Stanford University, CS229 (2011, http://cs229.stanford.edu/proj2011/GoelMittal-StockMarketPredictionUsingTwitterSentimentAnalysis.pdf), 15.
▪ Natural Language Toolkit (https://www.nltk.org/) (Retrieved on: 2022.03.10.)
▪ Nguyen, T. H., Shirai, K., & Velcin, J. (2015). Sentiment analysis on social media for stock movement prediction. Expert Systems with Applications, 42(24), 9603-9611.
▪ Olah, C. (2015). Understanding LSTM networks.
▪ Park, L. (2015). NAVER Sentiment Movie Corpus. https://github.com/e9t/nsmc.
▪ Singh, G. (2019). Updated Text Preprocessing techniques for Sentiment Analysis. https://towardsdatascience.com/updated-text-preprocessing-techniques-for-sentiment-analysis-549af7fe412a
▪ TextBlob (https://textblob.readthedocs.io/en/dev/#) (Retrieved on: 2022.03.10.)
Questions & Answers