Forecasting Cryptocurrency Returns From Sentiment Signals: An Analysis of BERT Classifiers and Weak Supervision
Corresponding Author: Duygu Ider, Unter-den-Linden 6, 10099 Berlin, Germany. Tel. +49
15257162290; E-mail [email protected]
(Permanent Address: Sulzbacher Str. 12, 80803 Munich, Germany)
2. Literature Review
A large body of literature examines the predictability of financial markets (Rapach & Zhou,
2013, pp. 328–383; Granger, 1992). Motivations for corresponding research include testing the
informational efficiency of specific markets (Nordhaus, 1987), benchmarking novel forecasting
methods (Yu, Wang, & Lai, 2008), or devising algorithmic trading strategies (Brownlees,
Cipollini, & Gallo, 2011). Recent studies increasingly rely on ML or deep learning for financial
forecasting (Sezer, Gudelek, & Ozbayoglu, 2020). In the area of cryptocurrency forecasting,
those techniques are especially popular and have shown better results than alternative approaches
(Lahmiri & Bekiros, 2019; McNally, Roche, & Caton, 2018; Chen, Xu, Jia, & Gao, 2021).
Traditional financial markets have been established over the last century with an abundance
of structured data available for research or commercial use (Wang, Lu, & Zhao, 2019). For
cryptocurrency markets, structured data is not as readily available and the need for alternative
data sources plays a key role in predicting returns. NLP approaches to text data for
cryptocurrency return forecasting include sentiment analysis (Nasekin & Chen, 2020), semantic
analysis (Kraaijeveld & De Smedt, 2020; Ortu, Uras, Conversano, Bartolucci, & Destefanis,
2022), and topic modeling (Loginova, Tsang, van Heijningen, Kerkhove, & Benoit, 2021).
We focus on sentiment analysis, a sub-field of NLP, which aims to identify the polarity of a
piece of text (Pang, Lee, & others, 2008). Sentiment analysis methods have developed from
lexicon-based approaches to state-of-the-art transformer models (Mishev, Gjorgjevikj,
Vodenska, Chitkushev, & Trajanov, 2020). Given vast empirical evidence that sentiment scores
extracted from investment-related social media and other online platforms reflect the subjective
perception of investors, many studies investigate the potential value of adding corresponding
features into (cryptocurrency) return forecasting models. Examining many cryptocurrencies,
Nasekin and Chen (2020) show that sentiment from StockTwits extracted by BERT with added
domain-specific tokens contributes to return predictability. Chen et al. (2019) use public
sentiment from Reddit and StockTwits and confirm the value of sentiment features in a return
prediction model, with the condition that the text classifier uses a domain-specific lexicon.
Polasik et al. (2015) find that Bitcoin prices are highly driven by news volume and news
sentiment. Vo et al. (2019) show that sentiment scores extracted from news articles of the past
seven days increase the predictive performance of an LSTM model in predicting cryptocurrency
price directions. Ortu et al. (2022) conclude that features based on BERT-based emotion
classification of GitHub and Reddit comments significantly improve the hourly and daily return
direction predictability of Bitcoin and Ethereum.
A general challenge in forecasting using text data concerns the absence of labeled data. The
training of a predictive model requires known outcomes of the target variable. Accordingly,
developing a sentiment classifier requires text data with actual sentiment labels. Yet, news and
social media data are naturally unlabeled. Prior work has considered various approaches to
remedy the lack of labels or circumvent the challenge. These include manual labeling text by
domain experts (Li, Bu, Li, & Wu, 2020; Malo, Sinha, Korhonen, Wallenius, & Takala, 2014;
Van de Kauter, Breesch, & Hoste, 2015; Cortis, et al., 2017). Involving human labor, this
approach lacks scalability. A more feasible, established alternative is to rely on sentiment
dictionaries (Taboada, Brooke, Tofiloski, Voll, & Stede, 2011). However, the state-of-the-art in
sentiment classification has moved far beyond lexicon-based approaches (Zhang, Wang, & Liu,
2018), which are criticized for being highly domain-dependent and lacking contextual
understanding (Basiri & Kabiri, 2020). Therefore, the prevailing approach for extracting features
from text data involves using pretrained embeddings from, e.g., Word2Vec (Rybinski, 2021) or
BERT (Jiang, Lyu, Yuan, Wang, & Ding, 2022). Such embeddings are powerful but their
understanding of language is based on the corpus on which they were (pre)trained. Without
further finetuning, embeddings cannot accommodate peculiarities in a target corpus such as
financial news or social media postings. To address the unlabeled data challenge, we introduce a
recently proposed NLP approach called weak supervision to the forecasting literature. Weak
supervision facilitates using unlabeled text data that represents the specific language in a target
domain such as cryptocurrencies for finetuning pretrained embeddings and text classifiers like
BERT. To the best of our knowledge, weak supervision has not previously been used in financial forecasting, and in particular not for cryptocurrency-related forecasting.
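The pseudo-labeling step that underlies weak supervision can be sketched as follows: a zero-shot classifier scores unlabeled texts, and its confident predictions are kept as training labels for finetuning. The score dictionaries, the example headlines, and the 0.7 confidence threshold are hypothetical illustrations; the exact selection rule used in the study is not reproduced here.

```python
def select_pseudo_labels(scored_texts, threshold=0.7):
    """Turn zero-shot classifier scores into pseudo-labels for finetuning.

    scored_texts: iterable of (text, {label: score}) pairs, e.g. the output
    of a zero-shot sentiment classifier. Only predictions whose top score
    reaches the (hypothetical) confidence threshold are kept, so the
    finetuning set contains the classifier's most reliable weak labels.
    """
    pseudo_labeled = []
    for text, scores in scored_texts:
        top_label = max(scores, key=scores.get)
        if scores[top_label] >= threshold:
            pseudo_labeled.append((text, top_label))
    return pseudo_labeled

# Hypothetical zero-shot outputs for two cryptocurrency headlines.
zsc_output = [
    ("Bitcoin rallies past resistance",
     {"positive": 0.91, "negative": 0.03, "neutral": 0.06}),
    ("Weekly market report published",
     {"positive": 0.40, "negative": 0.10, "neutral": 0.50}),
]
print(select_pseudo_labels(zsc_output))
# Only the confident first prediction survives as a pseudo-label.
```

The resulting (text, label) pairs can then be fed to any standard finetuning loop in place of human annotations.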
3. Data
3.1. Text Data
This study focuses on sentiment analysis and financial prediction for Bitcoin and Ethereum,
which are currently the two largest cryptocurrencies by market capitalization. Their popularity
and relatively long period of existence motivate their choice for this study.
Bitcoin and Ethereum each have their own text dataset and are handled separately. Both datasets consist of news articles, Tweets, and Reddit posts covering the period from 01/08/2019 until 15/02/2022, scraped using different Python libraries, as shown in Table A1 in the online appendix.
The raw text datasets are filtered and prepared for analysis. The preparation differs across
data sources, due to their unique format and varying ways of accessing them. News articles in
the GoogleNews RSS feed are filtered to include only those published by CoinDesk and
CoinTelegraph, and contain the name or code of the respective currency, i.e., “Bitcoin” or
“BTC” and “Ethereum” or “ETH”, adding up to, on average, 20 news articles per day and
currency. There is no further filtering of the news data, since it contains no spam samples.
Twitter data is filtered by searching for the name and code of each currency among all Tweets posted that day. We then filter results based on a minimum threshold number of
retweets, number of likes, number of followers of the corresponding account, and whether the
account ID is verified. This helps to eliminate scams, advertisements, or irrelevant samples and
emphasizes more publicly viewed posts published by relatively important accounts. The
underlying assumption is that a post with a larger reach better reflects the community sentiment
about cryptocurrency price changes.
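The Twitter filtering step can be sketched as follows; the threshold values are illustrative assumptions, not the cut-offs used in the study.

```python
def filter_tweets(tweets, min_retweets=10, min_likes=50,
                  min_followers=1000, require_verified=False):
    """Keep Tweets from accounts with sufficient reach.

    Each Tweet is a dict with engagement metadata; all threshold defaults
    are hypothetical placeholders for the study's actual cut-offs.
    """
    kept = []
    for t in tweets:
        if (t["retweets"] >= min_retweets
                and t["likes"] >= min_likes
                and t["followers"] >= min_followers
                and (t["verified"] or not require_verified)):
            kept.append(t)
    return kept

sample = [
    {"text": "BTC to the moon!", "retweets": 2, "likes": 5,
     "followers": 120, "verified": False},                   # low reach: dropped
    {"text": "Ethereum merge timeline update", "retweets": 450,
     "likes": 3000, "followers": 250000, "verified": True},  # kept
]
print(len(filter_tweets(sample)))  # 1
```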
For the Reddit data, the r/Bitcoin and r/ethereum subreddits are scraped and further filtered
based on the number of likes and comments. Here, the assumption is that heavily commented and discussed posts are more likely to contain important content. Also, posts that are very short or contain only a URL are eliminated, as they would be meaningless for training or prediction with the sentiment classification models.
There are common text samples between the Bitcoin and Ethereum datasets, especially in
news and Tweets. The samples that contain both Bitcoin’s and Ethereum’s name or code are
included in both datasets.
The data is split into training and test datasets, such that the training data is from 01/08/2019
until 31/07/2021, and the test data is from 01/08/2021 until 15/02/2022. The test period covers
both a bullish and bearish pattern, to prevent the prediction and investment gain outcomes from
being biased due to a single-direction price movement during the test period. Figure 1 shows the
daily closing prices of Bitcoin and Ethereum during the training and test periods. The training
and test datasets are standardized using the mean and standard deviation characteristics of the
training data. Standardization is applied to all features with continuous values.
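The standardization step can be sketched with numpy; the feature values below are made up, and the key point is that the test split reuses the training mean and standard deviation so no information from the test period leaks into the features.

```python
import numpy as np

def standardize(train, test):
    """Z-score both splits using statistics from the training split only."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    return (train - mu) / sigma, (test - mu) / sigma

# Made-up continuous features: rows are days, columns are features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])
train_z, test_z = standardize(train, test)
```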
Figure 1: Training and test data split for Bitcoin and Ethereum price data
Table 2 shows the performance of all models based on the 970 test samples in terms of
accuracy, unweighted macro-averaged precision, recall, and the F1 score. We chose the
unweighted macro-averaging approach to avoid prioritizing neutral cases, which represent 59.3%
of the test set, whereas the negative and positive classes cover 13.2% and 27.5%, respectively.
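Unweighted macro-averaging can be sketched in plain Python: each class contributes equally regardless of its share of the test set, which is why the dominant neutral class does not swamp the score. The toy labels below are made up.

```python
def macro_scores(y_true, y_pred, labels=("positive", "negative", "neutral")):
    """Per-class precision/recall/F1, averaged with equal class weights."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```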
BERT-Unfrozen, trained on the ZSC predictions as pseudo-labels, has a classification
accuracy of 78.9%, which is not much lower than the 82.2% accuracy of FinBERT, which is
trained on the actual labels. In terms of accuracy, Ensemble ZU-nb is the best model with 79.5%
accuracy and performs the closest to FinBERT.
Table 3 shows the confusion matrices for the predictions of FinBERT and BERT-Unfrozen,
compared to the actual labels. We note that BERT-Unfrozen successfully distinguishes between
positive and negative samples, as does FinBERT. Most of the inaccuracy comes from incorrect
labeling of negative or positive samples as neutral and vice versa.
                 (a) FinBERT Predictions   (b) BERT-Unfrozen Predictions
                 Pos.   Neg.   Neu.        Pos.   Neg.   Neu.
Real Labels Pos. 22     0.72   4.5         20     0.93   6.5
            Neg. 0.1    11     2.1         0.41   9.2    3.6
            Neu. 6.8    3.6    49          7.8    1.9    50
Table 3: Confusion matrices (a) for FinBERT and (b) for BERT-Unfrozen predictions, in percentages
So far, we have measured accuracy based on the test samples for which at least eight of the
16 expert labelers agreed on a single label. Table 4 presents classifier performance on the 433 test samples on which all 16 expert labelers agreed. There, the accuracy of FinBERT and BERT-Unfrozen
increases substantially to 92.6% and 91.7%, respectively. Ensemble ZFU even outperforms
FinBERT based on the F1 score, indicating a better combined precision-recall performance, and
achieves a very similar accuracy score of 92.4%.
Table 5 portrays confusion matrices for the predictions of FinBERT and Ensemble ZFU on
the selected samples. Given the substantial performance increase after restricting the sample to texts on which all labelers agree, the cases that the models misclassify appear to be those on which even the human labelers disagree. This happens mostly between negative-neutral and positive-neutral labels. Plausibly, drawing a clear boundary between these label pairs is difficult, even for a human. In sum, the experiment suggests that BERT-based sentiment
classification models finetuned on weak labels of ZSC predictions can perform competitively to
a model trained on the same data with the actual labels. We consider this an encouraging result
and further evidence for the potential value of the weak-supervision approach.
                 (a) FinBERT Predictions   (b) Ensemble ZFU Predictions
                 Pos.   Neg.   Neu.        Pos.   Neg.   Neu.
Real Labels Pos. 21     1.6    2.1         20     1.4    3.5
            Neg. 0      12     1.4         0      12     1.2
            Neu. 1.2    1.2    60          1.6    0      61
Table 5: Confusion matrices (a) for FinBERT and (b) for Ensemble ZFU predictions, in percentages
Table 6 provides the sentiment classifier comparison. First, we note that BERT-Context
performs best among the three finetuned BERT models in accuracy and F1 score. Given that
BERT-Context can process some cryptocurrency-specific vocabulary, whereas BERT-Frozen
and BERT-Unfrozen can only use subword embeddings for cryptocurrency jargon, this result
agrees with our expectations. More importantly, Table 6 reemphasizes the merit of weak
supervision for finetuning BERT. BART ZSC and FinBERT perform sentiment classification
with 58.7% and 58.0% accuracy. BERT-Context, which is trained on a dataset that incorporates
the predictions of BART ZSC as pseudo-labels, achieves a higher accuracy of 61.1%. This shows that, compared to using an unsupervised model or a model trained on a different kind of data, it is beneficial to finetune text classifiers using weak labels.
Table 6 also includes the best-performing ensemble, Ensemble ZUCF-pb, which is given by a polarity-biased majority vote from BART ZSC, BERT-Unfrozen, BERT-Context, and FinBERT. Ensemble ZUCF-pb outperforms all single models as well as all other considered ensembles not included in Table 6 in terms of both accuracy and F1 score. The
confusion matrix in Table 7 shows that this ensemble can distinguish positive and negative
classes rather successfully, whereby most errors come from the neutral class.
                   Ensemble ZUCF-pb
                   Pos.   Neg.   Neu.
Manual Labels Pos. 35     3.5    11
              Neg. 1.3    15     3
              Neu. 7.2    4.3    20
Table 7: Confusion matrix for Ensemble ZUCF-pb based on the manually classified labels, in percentages
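The paper does not spell out the tie-breaking rule, so the sketch below implements one plausible reading of a polarity-biased majority vote: ties between a polar label and neutral are resolved in favor of the polar label. The label names and the rule itself are assumptions.

```python
from collections import Counter

def polarity_biased_vote(votes):
    """Majority vote over single-model predictions, biased toward polarity.

    One plausible rule: when a polar label (positive/negative) ties with
    neutral for the top vote count, the polar label wins.
    """
    counts = Counter(votes)
    best = max(counts.values())
    tied = [label for label, c in counts.items() if c == best]
    polar = [label for label in tied if label != "neutral"]
    if polar:
        return polar[0]  # prefer a polar label on ties with neutral
    return tied[0]

print(polarity_biased_vote(["positive", "neutral", "neutral", "positive"]))  # positive
print(polarity_biased_vote(["neutral", "neutral", "negative"]))              # neutral
```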
We acknowledge that the above results are based on the specific, small, manually labeled
subset and our specific sampling protocol. Therefore, the following analysis clarifies the merit of
weak supervision in a financial forecasting context while using the entire test set. Given that
Ensemble ZUCF-pb is the best sentiment classifier, we use its predictions to compute sentiment
features for the financial prediction models, in the upcoming Section 5.
Each model is built twice: once trained on the entire feature set and once with all features except the sentiment features. Fitting the 22 models of Table 10 for each of the two feature sets, repeated for both cryptocurrencies, yields 88 models in total. Using these models, we compare models with and without sentiment features, as well as
the respective performances of regression and classification models.
Ensemble models synthesize different machine learning models and are commonly built to
raise forecast accuracy. We implement voting and stacking ensembles, which are included in
Table 10. Although models such as XGBoost or random forest are also ensemble models, we
refer only to the voting and stacking ensembles as ensembles in the rest of the paper. The
ensemble regressors combine all regression models. Likewise, the ensemble classifier integrates
all classifiers. The voting classifier predicts a class based on majority voting, whereas a voting
regressor takes an unweighted average of the single model predictions. A stacking ensemble
regressor combines single models by fitting a linear regression to their outputs.
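The two combination rules can be sketched with numpy, using stand-in base-model predictions: the voting regressor takes an unweighted average, and the stacking regressor fits a linear regression (here via least squares) on the single-model outputs. Base predictions and targets below are made up.

```python
import numpy as np

# Columns: next-day return predictions of three stand-in base regressors.
base_preds = np.array([
    [0.010, 0.014, 0.006],
    [-0.004, -0.002, -0.009],
    [0.002, 0.005, -0.001],
    [0.008, 0.011, 0.004],
])
y = np.array([0.012, -0.005, 0.001, 0.009])  # made-up realized returns

# Voting regressor: unweighted average of the single-model predictions.
vote_pred = base_preds.mean(axis=1)

# Stacking regressor: linear regression on the base outputs (with intercept),
# fitted by ordinary least squares.
X = np.column_stack([np.ones(len(y)), base_preds])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
stack_pred = X @ coef
```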
We perform hyperparameter tuning separately for each Bitcoin and Ethereum model using
stratified five-fold cross-validation with three repeats on the training data and select
hyperparameters based on the balanced accuracy score. For regressors, this metric is computed
by converting the predicted return values to binary classes. The selected hyperparameters with
maximal cross-validated performance are used to fit final models on the entire training dataset.
These models are tested on multiple test periods using an investment simulation.
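For regressors, the tuning metric requires binarizing predicted returns before scoring. A minimal sketch of that conversion and of balanced accuracy (the mean of per-class recalls) follows; treating a nonnegative return as the "up" class is an assumption.

```python
def balanced_accuracy_from_returns(true_returns, pred_returns):
    """Binarize returns by sign, then average the recall of each class.

    Assumption: a nonnegative return counts as the 'up' class.
    """
    y_true = [r >= 0 for r in true_returns]
    y_pred = [r >= 0 for r in pred_returns]
    recalls = []
    for cls in (True, False):
        preds_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
        if preds_for_cls:
            recalls.append(sum(p == cls for p in preds_for_cls) / len(preds_for_cls))
    return sum(recalls) / len(recalls)

# Two up days and two down days; the model gets one of each right.
score = balanced_accuracy_from_returns([0.01, -0.02, 0.03, -0.01],
                                       [0.02, 0.01, -0.01, -0.03])
print(score)  # 0.5
```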
Figure 2: Ideal scenario in which a trader buys (sells) Bitcoin at local minima (maxima).
Most cryptocurrency trading platforms apply a percentage cost per transaction. Such
transaction costs are important to consider in our trading simulation to make it closer to a real-
life scenario. Transaction costs typically vary between 0.1% to 0.5% in the literature (Kim, 2017;
Alessandretti, ElBahrawy, Aiello, & Baronchelli, 2018; Żbikowski, 2016, pp. 161-168). We
consider a transaction cost of 0.2%.
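A stylized version of the cost-aware simulation: starting from cash, switch between cash and coins on the model's direction signal, paying 0.2% per transaction, and value the wallet at the final price. The all-in/all-out trading rule and the price series are simplifying assumptions, not the paper's exact protocol.

```python
def simulate_trading(prices, up_signals, cash=1000.0, fee=0.002):
    """All-in/all-out trading with a proportional fee per transaction.

    prices: daily closing prices; up_signals: the model's direction calls.
    Simplifying assumption: the trader is always fully in cash or in coins.
    """
    coins = 0.0
    for price, up in zip(prices, up_signals):
        if up and cash > 0:          # buy on a predicted rise
            coins = cash * (1 - fee) / price
            cash = 0.0
        elif not up and coins > 0:   # sell on a predicted fall
            cash = coins * price * (1 - fee)
            coins = 0.0
    if coins > 0:                    # liquidate at the final price
        cash = coins * prices[-1] * (1 - fee)
    return cash

# Made-up two-day example: buy at 100, sell at 110, 0.2% fee each way.
print(round(simulate_trading([100.0, 110.0], [True, False]), 2))  # 1095.6
```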
The input amount is 1000 USD for each of the 14 test frames of 60-days. The output amount
refers to the value of a trader’s wallet in USD on the last day of each test period. Table 11 reports
the result of the benchmarks. The values represent averages over the 14 test frames. Values in
parentheses report the corresponding standard deviation.
Model Type | Model | Feature Set | Train CV Acc. | Test Acc. | Output Amount* (USD) | Gain Scaled by Hold Scenario* | t-value Hold Scenario | Gain Scaled by Random Scenario* | t-value Random Scenario | Total Trading Cost* (USD) | Num. Tran.*
reg | ridge | all | 0.661 | 0.583 | 1204.90 (133.80) | 0.254 (0.255) | 0.049 | 0.572 (0.291) | 0.000011 | 29.47 (13.43) | 13 (5)
reg | mlp | no sent. | 0.639 | 0.583 | 1181.38 (119.45) | 0.235 (0.27) | 0.078 | 0.544 (0.296) | 0.000022 | 32.36 (6.06) | 14 (2)
reg | mlp | all | 0.648 | 0.573 | 1165.31 (120.65) | 0.226 (0.306) | 0.11 | 0.54 (0.361) | 0.000038 | 33.35 (8.13) | 15 (3)
reg | svm | all | 0.678 | 0.613 | 1174.86 (155.87) | 0.213 (0.215) | 0.1 | 0.518 (0.227) | 0.000044 | 40.23 (7.29) | 18 (2)
reg | ridge | no sent. | 0.641 | 0.558 | 1183.63 (202.02) | 0.211 (0.186) | 0.1 | 0.518 (0.202) | 0.000096 | 29.78 (11.07) | 13 (4)
reg | ens. vote | no sent. | 0.728 | 0.568 | 1124.43 (109.37) | 0.192 (0.326) | 0.23 | 0.497 (0.39) | 0.00014 | 33.84 (14.63) | 16 (6)
clf | lr | no sent. | 0.703 | 0.553 | 1163.76 (293.58) | 0.158 (0.07) | 0.21 | 0.459 (0.127) | 0.0017 | 25.56 (8.54) | 10 (3)
reg | ens. stack | all | 0.444 | 0.528 | 1085.10 (89.97) | 0.142 (0.274) | 0.44 | 0.424 (0.292) | 0.00046 | 8.39 (8.25) | 4 (4)
clf | per | no sent. | 0.632 | 0.568 | 1053.40 (69.88) | 0.121 (0.312) | 0.67 | 0.396 (0.338) | 0.0012 | 27.61 (9.30) | 13 (4)
reg | svm | no sent. | 0.636 | 0.553 | 1050.86 (93.05) | 0.122 (0.327) | 0.7 | 0.395 (0.353) | 0.0015 | 25.32 (8.92) | 12 (3)
* Mean and standard deviation, the latter inside parentheses, are used to represent columns that refer to a distribution.
Table 11: BITCOIN - Performance metrics and trading output of the ten best-performing models, in descending order
Model Type | Model | Feature Set | Train CV Acc. | Test Acc. | Output Amount* (USD) | Gain Scaled by Hold Scenario* | t-value Hold Scenario | Gain Scaled by Random Scenario* | t-value Random Scenario | Total Trading Cost* (USD) | Num. Tran.*
clf | mlp | all | 0.689 | 0.588 | 1293.98 (263.42) | 0.267 (0.202) | 0.039 | 0.501 (0.244) | 0.0009 | 27.52 (13.93) | 11 (5)
clf | svm | no sent. | 0.717 | 0.573 | 1285.68 (273.32) | 0.266 (0.242) | 0.049 | 0.494 (0.271) | 0.0013 | 35.43 (18.94) | 14 (7)
reg | ens. stack | all | 0.245 | 0.573 | 1243.33 (253.93) | 0.211 (0.147) | 0.094 | 0.447 (0.259) | 0.0026 | 39.76 (9.77) | 16 (3)
reg | mlp | all | 0.606 | 0.588 | 1181.46 (155.42) | 0.191 (0.285) | 0.19 | 0.416 (0.367) | 0.004 | 24.91 (14.25) | 11 (5)
reg | ridge | no sent. | 0.627 | 0.568 | 1188.31 (196.79) | 0.192 (0.284) | 0.19 | 0.412 (0.341) | 0.005 | 10.31 (10.00) | 4 (4)
clf | mlp | no sent. | 0.688 | 0.553 | 1177.64 (169.20) | 0.173 (0.238) | 0.21 | 0.396 (0.31) | 0.005 | 25.53 (11.74) | 11 (5)
reg | ridge | all | 0.630 | 0.573 | 1165.61 (188.26) | 0.169 (0.28) | 0.27 | 0.386 (0.342) | 0.0082 | 13.17 (6.94) | 6 (3)
reg | svm | all | 0.649 | 0.548 | 1155.45 (166.78) | 0.163 (0.284) | 0.3 | 0.38 (0.35) | 0.0087 | 8.26 (10.49) | 4 (4)
reg | ens. stack | no sent. | 0.546 | 0.568 | 1137.06 (143.07) | 0.15 (0.289) | 0.38 | 0.367 (0.371) | 0.012 | 19.63 (18.05) | 8 (7)
reg | tree | all | 0.736 | 0.518 | 1095.41 (73.55) | 0.134 (0.377) | 0.65 | 0.359 (0.489) | 0.025 | 13.00 (6.69) | 6 (3)
* Mean and standard deviation, the latter inside parentheses, are used to represent columns that refer to a distribution.
Table 12: ETHEREUM - Performance metrics and trading output of the ten best-performing models, in descending order
Out of the ten best-performing models for Bitcoin and Ethereum, four and six models,
respectively, incorporate sentiment features. Also, the highest test accuracy is achieved by
models that include sentiment features for both cryptocurrencies. These results indicate that the
sentiment features are contributing to the predictive performance of the models, and more
importantly to trading returns above the random baseline. Given that the features are generated
from aggregated sentiment extracted from information with a relatively long lag to the actual closing price, these results are promising.
Curr. | Model Type | Feature Set | Mean Train Acc. | Mean Test Acc. | Mean Output Amount | Outperf. Hold Scenario | Signif. Hold Scenario | Outperf. Random Scenario | Signif. Random Scenario
BTC | clf | all | 74.0% | 50.2% | 990.40 | 62.5% | 0.0% | 100.0% | 81.2%
BTC | clf | no sent. | 76.0% | 51.4% | 961.62 | 18.8% | 6.2% | 100.0% | 37.5%
BTC | reg | all | 69.6% | 52.9% | 1057.91 | 84.6% | 7.7% | 100.0% | 92.3%
BTC | reg | no sent. | 69.6% | 52.3% | 1033.42 | 76.9% | 0.0% | 100.0% | 92.3%
ETH | clf | all | 78.0% | 51.3% | 1057.01 | 62.5% | 6.2% | 100.0% | 25.0%
ETH | clf | no sent. | 77.1% | 52.6% | 1060.39 | 56.2% | 6.2% | 100.0% | 12.5%
ETH | reg | all | 64.3% | 53.0% | 1073.28 | 53.8% | 0.0% | 100.0% | 46.2%
ETH | reg | no sent. | 66.2% | 52.4% | 1039.79 | 84.6% | 0.0% | 100.0% | 23.1%
Boldface indicates the better performing model per category (with/without sentiment)
Table 13: Summary table comparing all models with and without sentiment features
Table 14 reveals that regressors tend to outperform classifiers for both cryptocurrencies.
They tend to have higher mean test accuracy scores and are more likely to outperform the B&H
baseline. Differences in the target variables may explain this result. Regressors train on the
magnitude and direction of next-day returns, which facilitates differentiating between cases with
small and large magnitudes of positive or negative returns. Classifiers, on the other hand, train
on a binary target, which could result in information loss and inferior performance. Another
striking difference between regression and classification results is that regressors show higher
accuracy when sentiment features are included. Classifiers do not show this pattern. Instead,
accuracy is slightly higher when sentiment features are excluded. When considering trading
performance, however, we observe a higher share of models that incorporate sentiment features
to outperform the random benchmark significantly. This means that, on average, sentiment
features can add predictive information to the feature set. This relation does not hold for each of
the models but it is valid when analyzing all models collectively.
Bitcoin models with sentiment features are more likely to outperform the holding scenario
compared to those without. The same holds for Ethereum classifiers but not for Ethereum
regressors, which outperform the B&H benchmark at a higher rate without sentiment. Only a
negligible fraction of all models outperform the hold scenario significantly at a 90% confidence
level. This indicates that profit differences between model-based trading and the B&H baseline
are typically not substantial enough to conclude that the respective profit distributions differ.
Using model predictions in decision-making helps traders be less affected by the rapidly fluctuating prices of cryptocurrencies while still adding some value compared to simply holding assets over the long term. For traders planning to invest large volumes, every slight reduction in risk through more accurate price predictions represents a profit opportunity.
By testing the value of sentiment on various models, we show that sentiment features
contribute to higher investment gains, on average. It is important to note that this does not imply that including news and social media sentiment features consistently leads to higher profits. Tables A3-A6 in the online appendix show that this does not hold for all models and cryptocurrencies. However, Table 14 shows that, taken collectively, models with sentiment features improve decision-making in cryptocurrency trading.
This suggests that public sentiment extracted from various online sources improves the
predictive power of machine learning models to estimate future price direction changes of
Bitcoin and Ethereum.
6. Conclusion
A key contribution of the paper is related to weak supervision. Textual data has become a
common ingredient in (financial) forecasting models. While early approaches used dictionaries
to extract features from text data, the use of pretrained embeddings has meanwhile become a de
facto standard. In NLP, it is common practice to finetune a pretrained model on a corpus from
the target domain. Finetuning accounts for linguistic peculiarities in that domain and, more
generally, the fact that pretraining was carried out using text from an entirely different field that
may have nothing in common with the target domain. An adaptation of a pretrained model to its
target task (or dataset) is intuitively useful, but proved challenging in many forecasting settings,
because most text data comes without labels. Weak supervision addresses this challenge.
Forecasters can gather text data they judge important, use a cutting-edge pretrained NLP model,
and finetune it before extracting features for their forecasting model. Using a unique multi-
source dataset related to cryptocurrencies, we demonstrate the effectiveness of this approach in a
sentiment classification context.
Beyond weak supervision and its empirical assessment, the paper also contributes to the
empirical literature on cryptocurrency forecasting. Aggregating the sentiment classification from
finetuned BERT using weak supervision, we investigate the added value of using investor
sentiment for cryptocurrency return forecasting. The overall outcome is that sentiment features
improve forecast accuracy across many tested forecasting models. Notably, we find strong
evidence that several models, typically those that incorporate sentiment features, forecast
cryptocurrency returns better than random and yield a positive trading profit after transaction
costs. This suggests that Bitcoin and Ethereum markets cannot be considered efficient; at least
for the period under study.
References
Alessandretti, L., ElBahrawy, A., Aiello, L. M., & Baronchelli, A. (2018). Anticipating cryptocurrency
prices using machine learning. Complexity, 2018.
Antweiler, W., & Frank, M. Z. (2004). Is all that talk just noise? The information content of internet stock
message boards. The Journal of Finance, 59, 1259–1294.
Araci, D. (2019). FinBERT: Financial sentiment analysis with pre-trained language models. Cornell
University arXiv Computation and Language Repository, arXiv:1908.10063v1.
Basiri, M. E., & Kabiri, A. (2020). HOMPer: A new hybrid system for opinion mining in the Persian
language. Journal of Information Science, 46, 101–117.
Benediktsson, J., & Cappello, B. (2022). TA-Lib (Technical Analysis Library). Retrieved May 15, 2022,
from GitHub: https://github.com/mrjbq7/ta-lib
Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of
Computational Science, 2, 1–8.
Brownlees, C. T., Cipollini, F., & Gallo, G. M. (2011). Intra-daily volume modeling and prediction for
algorithmic trading. Journal of Financial Econometrics, 9, 489–518.
Chang, M.-W., Ratinov, L.-A., Roth, D., & Srikumar, V. (2008). Importance of Semantic Representation:
Dataless Classification. AAAI, 2, pp. 830–835.
Chatterjee, S., & Hadi, A. S. (2006). Regression Analysis by Example (4th ed.). Hoboken, New Jersey:
John Wiley & Sons.
Chen, C. Y.-H., Després, R., Guo, L., & Renault, T. (2019). What makes cryptocurrencies special?
Investor sentiment and return predictability during the bubble. Humboldt University of Berlin,
International Research Training Group 1792 "High Dimensional Nonstationary Time Series".
Chen, W., Xu, H., Jia, L., & Gao, Y. (2021). Machine learning model for Bitcoin exchange rate prediction
using economic and technology determinants. International Journal of Forecasting, 37, 28–43.
Clemen, R. T. (1989). Combining forecasts: A review and annotated bibliography. International Journal
of Forecasting, 5, 559–583.
Cortis, K., Freitas, A., Daudert, T., Huerlimann, M., Zarrouk, M., Handschuh, S., & Davis, B. (2017).
Semeval-2017 task 5: Fine-grained sentiment analysis on financial microblogs and news. 11th
International Workshop on Semantic Evaluations (SemEval-2017): Proceedings of the Workshop
(pp. 519-535). Association for Computational Linguistics (ACL).
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional
transformers for language understanding. Cornell University arXiv Computation and Language
Repository, arXiv:1810.04805v2.
Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. The Journal of
Finance, 25, 383–417.
Fernandez-Rodriguez, F., Gonzalez-Martel, C., & Sosvilla-Rivero, S. (2000). On the profitability of
technical trading rules based on artificial neural networks: Evidence from the Madrid stock
market. Economics Letters, 69, 89–94.
Granger, C. W. (1992). Forecasting stock market prices: Lessons for forecasters. International Journal of
Forecasting, 8, 3–13.
Hiew, J. Z., Huang, X., Mou, H., Li, D., Wu, Q., & Xu, Y. (2019). BERT-based financial sentiment index
and LSTM-based stock return predictability. Cornell University arXiv Statistical Finance
Repository, arXiv:1906.09024v2.
Huang, J.-Z., Huang, W., & Ni, J. (2019). Predicting bitcoin returns using high-dimensional technical
indicators. The Journal of Finance and Data Science, 5, 140–155.
Jang, H., & Lee, J. (2017). An empirical study on modeling and prediction of bitcoin prices with bayesian
neural networks based on blockchain information. IEEE Access, 6, 5427–5437.
Jasic, T., & Wood, D. (2004). The profitability of daily stock market indices trades based on neural
network predictions: Case study for the S&P 500, the DAX, the TOPIX and the FTSE in the
period 1965–1999. Applied Financial Economics, 14, 285–297.
Jiang, C., Lyu, X., Yuan, Y., Wang, Z., & Ding, Y. (2022). Mining semantic features in current reports
for financial distress prediction: Empirical evidence from unlisted public firms in China.
International Journal of Forecasting, 1086-1099.
Jordan, J. S. (1983). On the efficient markets hypothesis. Econometrica: Journal of the Econometric
Society, 1325–1343.
Kearney, C., & Liu, S. (2014). Textual sentiment in finance: A survey of methods and models.
International Review of Financial Analysis, 33, 171–185.
Kennedy, P. (2008). A Guide to Econometrics (Vol. 6). Malden: Blackwell Publishing.
Kim, T. (2017). On the transaction cost of Bitcoin. Finance Research Letters, 23, 300–305.
Kraaijeveld, O., & De Smedt, J. (2020). The predictive power of public Twitter sentiment for forecasting
cryptocurrency prices. Journal of International Financial Markets, Institutions and Money, 65,
101188.
Lahmiri, S., & Bekiros, S. (2019). Cryptocurrency forecasting with deep learning chaotic neural
networks. Chaos, Solitons & Fractals, 118, 35–40.
Leow, E. K., Nguyen, B. P., & Chua, M. C. (2021). Robo-advisor using genetic algorithm and BERT
sentiments from tweets for hybrid portfolio optimisation. Expert Systems with Applications, 179,
115060.
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., . . . Zettlemoyer, L. (2019).
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation,
and comprehension. Cornell University arXiv Computation and Language Repository,
arXiv:1910.13461v1.
Li, Y., Bu, H., Li, J., & Wu, J. (2020). The role of text-extracted investor sentiment in Chinese stock price
prediction with the enhancement of deep learning. International Journal of Forecasting, 36,
1541–1562.
Lin, S.-Y., Kung, Y.-C., & Leu, F.-Y. (2022). Predictive intelligence in harmful news identification by
BERT-based ensemble learning model with text sentiment analysis. Information Processing &
Management, 59, 102872.
Lin, X., Yang, Z., & Song, Y. (2011). Intelligent stock trading system based on improved technical
analysis and Echo State Network. Expert Systems with Applications, 38, 11347–11354.
Lison, P., Hubin, A., Barnes, J., & Touileb, S. (2020). Named entity recognition without labelled data: A
weak supervision approach. Cornell University arXiv Computation and Language Repository,
arXiv:2004.14723v1.
Loginova, E., Tsang, W. K., van Heijningen, G., Kerkhove, L.-P., & Benoit, D. F. (2021). Forecasting
directional bitcoin price returns using aspect-based sentiment analysis on online text data.
Machine Learning, 1–24.
Mallqui, D. C., & Fernandes, R. A. (2019). Predicting the direction, maximum, minimum and closing
prices of daily Bitcoin exchange rate using machine learning techniques. Applied Soft Computing,
75, 596–606.
Malo, P., Sinha, A., Korhonen, P., Wallenius, J., & Takala, P. (2014). Good debt or bad debt: Detecting
semantic orientations in economic texts. Journal of the Association for Information Science and
Technology, 65, 782–796.
McNally, S., Roche, J., & Caton, S. (2018). Predicting the price of bitcoin using machine learning. 26th
Euromicro International Conference on Parallel, Distributed and Network-based Processing
(PDP), (pp. 339–343). Cambridge, UK.
Mekala, D., & Shang, J. (2020). Contextualized weak supervision for text classification. Proceedings of
the 58th Annual Meeting of the Association for Computational Linguistics, (pp. 323–333).
Mishev, K., Gjorgjevikj, A., Vodenska, I., Chitkushev, L. T., & Trajanov, D. (2020). Evaluation of
sentiment analysis in finance: From lexicons to transformers. IEEE Access, 8, 131662–131682.
Nasekin, S., & Chen, C. Y.-H. (2020). Deep learning-based cryptocurrency sentiment construction.
Digital Finance, 2, 39–67.
Nordhaus, W. D. (1987). Forecasting efficiency: concepts and applications. The Review of Economics and
Statistics, 667–674.
Ortu, M., Uras, N., Conversano, C., Bartolucci, S., & Destefanis, G. (2022). On technical trading and
social media indicators for cryptocurrency price classification through deep learning. Expert
Systems with Applications, 116804.
Pang, B., Lee, L., & others. (2008). Opinion mining and sentiment analysis. Foundations and Trends in
Information Retrieval, 2, 1–135.
Polasik, M., Piotrowska, A. I., Wisniewski, T. P., Kotkowski, R., & Lightfoot, G. (2015). Price
fluctuations and the use of bitcoin: An empirical inquiry. International Journal of Electronic
Commerce, 20, 9–49.
Polikar, R. (2012). Ensemble learning. In Ensemble Machine Learning (pp. 1–34). Springer.
Popa, C., & Rebedea, T. (2021). BART-TL: Weakly-supervised topic label generation. Proceedings of
the 16th Conference of the European Chapter of the Association for Computational Linguistics:
Main Volume, (pp. 1418–1425).
Rapach, D., & Zhou, G. (2013). Forecasting stock returns. In Handbook of Economic Forecasting (Vol. 2,
pp. 328–383). Elsevier.
Rybinski, K. (2021). Ranking professional forecasters by the predictive power of their narratives.
International Journal of Forecasting, 186–204.
Sebastião, H., & Godinho, P. (2021). Forecasting and trading cryptocurrencies with machine learning
under changing market conditions. Financial Innovation, 7, 1–30.
Selvin, S., Vinayakumar, R., Gopalakrishnan, E. A., Menon, V. K., & Soman, K. P. (2017). Stock price
prediction using LSTM, RNN and CNN-sliding window model. 2017 International Conference
on Advances in Computing, Communications and Informatics (ICACCI), (pp. 1643–1647). Udupi,
India.
Sezer, O. B., Gudelek, M. U., & Ozbayoglu, A. M. (2020). Financial time series forecasting with deep
learning: A systematic literature review: 2005–2019. Applied Soft Computing, 90, 106181.
Shrestha, N. (2020). Detecting multicollinearity in regression analysis. American Journal of Applied
Mathematics and Statistics, 8, 39–42.
Stavroyiannis, S. (2018). Value-at-risk and related measures for the Bitcoin. The Journal of Risk Finance,
127–136.
Stevenson, M., Mues, C., & Bravo, C. (2021). The value of text for small business default prediction: A
deep learning approach. European Journal of Operational Research, 758–771.
Taboada, M., Brooke, J., Tofiloski, M., Voll, K., & Stede, M. (2011). Lexicon-based methods for
sentiment analysis. Computational Linguistics, 37, 267–307.
Timmermann, A., & Granger, C. W. (2004). Efficient market hypothesis and forecasting. International
Journal of Forecasting, 20, 15–27.
Van de Kauter, M., Breesch, D., & Hoste, V. (2015). Fine-grained analysis of explicit and implicit
sentiment in financial news articles. Expert Systems with Applications, 42, 4999–5010.
Vo, A.-D., Nguyen, Q.-P., & Ock, C.-Y. (2019). Sentiment analysis of news for effective cryptocurrency
price prediction. International Journal of Knowledge Engineering, 5, 47–52.
Wang, H., Lu, S., & Zhao, J. (2019). Aggregating multiple types of complex data in stock market
prediction: A model-independent framework. Knowledge-Based Systems, 164, 193–204.
Xian, Y., Lampert, C. H., Schiele, B., & Akata, Z. (2018). Zero-shot learning - A comprehensive
evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 41, 2251–2265.
Xiong, W., Du, J., Wang, W. Y., & Stoyanov, V. (2019). Pretrained encyclopedia: Weakly supervised
knowledge-pretrained language model. Cornell University arXiv Computation and Language
Repository, arXiv:1912.09637v1.
Yang, J., Rivard, H., & Zmeureanu, R. (2005). On-line building energy prediction using adaptive artificial
neural networks. Energy and Buildings, 37, 1250–1259.
Yin, W., Hay, J., & Roth, D. (2019). Benchmarking zero-shot text classification: Datasets, evaluation and
entailment approach. Cornell University arXiv Computation and Language Repository,
arXiv:1909.00161v1.
Yu, L., Wang, S., & Lai, K. K. (2008). Forecasting crude oil price with an EMD-based neural network
ensemble learning paradigm. Energy Economics, 30, 2623–2635.
Żbikowski, K. (2016). Application of machine learning algorithms for Bitcoin automated trading. In
Machine Intelligence and Big Data in Industry (pp. 161–168). Springer.
Zhang, L., Wang, S., & Liu, B. (2018). Deep learning for sentiment analysis: A survey. Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8, e1253.
Zhou, Z.-H. (2018). A brief introduction to weakly supervised learning. National Science Review, 5, 44–
53.