Alternative Data Textual Data
Article
Building News Measures from Textual Data and
an Application to Volatility Forecasting
Massimiliano Caporin 1,∗ and Francesco Poli 2
1 Department of Statistical Sciences, University of Padova/via Cesare Battisti, 241, 35121 Padova PD, Italy
2 Department of Economics and Management, University of Padova/via del Santo, 33, 35123 Padova PD,
Italy; [email protected]
* Correspondence: [email protected]; Tel.: +39-0498274199
Abstract: We retrieve news stories and earnings announcements of the S&P 100 constituents from two
professional news providers, along with ten macroeconomic indicators. We also gather data from Google
Trends about these firms’ assets as an index of retail investors’ attention. Thus, we create an extensive
and innovative database that contains precise information with which to analyze the link between news
and asset price dynamics. We detect the sentiment of news stories using a dictionary of sentiment-related
words and negations and propose a set of more than five thousand information-based variables that
provide natural proxies for the information used by heterogeneous market players. We first shed light on
the impact of information measures on daily realized volatility and select them by penalized regression.
Then, we perform a forecasting exercise and show that the model augmented with news-related variables
provides superior forecasts.
Keywords: volatility; news; Google Trends; sentiment analysis; big data; lasso; regularization
1. Introduction
Traditional “efficient markets” thinking suggests that asset prices should completely and
instantaneously reflect movements in underlying fundamentals, while an opposite view indicates
that asset prices and fundamentals are continuously disconnected. One hypothesis that explains the
success of the GARCH class of models is the mixture of distributions hypothesis (MDH). (See Clark 1973;
Epps and Epps 1976; Tauchen and Pitts 1983; and Lamoureux and Lastrapes 1990, among many others.)
According to the MDH, a serially correlated mixing variable that measures the rate at which information
arrives to the market explains the GARCH effects on asset returns. The validity of the MDH remains
an open debate; there is no agreement about how quickly and in what form responses to news occur.
We shed light on the link between news information and volatility, focusing on three questions: What is
the relative importance of types of news? Are investors more influenced by the volume of information
or by variations in it? Do news and an index of investors’ attention help to forecast volatility?
Our first contribution is to create an extensive and innovative database that contains information
useful in answering the three questions. From two news providers, Factset-StreetAccount and
Thomson Reuters-Thomson One, we retrieve news stories and earnings announcements of the S&P
100 constituents, along with ten macroeconomic announcements. Both news providers assign news
stories a topic—Thomson Reuters also gives its news stories a level of importance—while earnings
and macro-announcements report both released figures and consensus forecasts, allowing diversions
from expectations to be computed. In addition, we gather Google Trends1 information about the assets
and use them as a proxy for retail investors’ attention. Google restricts access to daily data for intervals
longer than ten months but allows daily data to be gathered for shorter intervals. Exploiting the
series of daily data associated with each month and the series of monthly data associated with the
whole sample, we reconstruct the daily relative search volume for the whole sample. The collected
news reports are dated with to-the-minute precision, while Google Trends data are aggregated by day. The sample
contains data for the ten-year period from February 2005 to February 2015.
As a second contribution, we detect the sentiment of news stories using the sentiment-related
word lists developed by Loughran and McDonald (2011) and introduce a set of negations, both with
the aim of creating a method that can be used to extract the sentiment of a financial text with more
confidence and independent of its type, length, and audience.
Our third contribution is to propose a set of news measures that provide natural proxies for
retail investors’ attention and for the information heterogeneous market players use. This study goes
beyond how information has been used so far: starting from the reasoning that investors’ perception
and, as a consequence, their reaction to news disclosures can differ based on how information varies
over time and the reasoning that investors digest and react to news at differing speeds, we look at
how the information stream fluctuates over the day and across days, weeks, and months. We end
with a large set of news measures, each representing a different type of information that can cause
a different market reaction.
As a final contribution, we shed light on the impact of news on volatility and address the three
questions posed above using the information-related variables we develop. We use the database to
explain realized volatility, selecting the most important indicators with LASSO (least absolute
shrinkage and selection operator), an estimation method for linear models that performs variable
selection and shrinks coefficients and is commonly employed in big data analysis.
Then we employ news and Google Trends to forecast volatility in an out-of-sample analysis.
Empirical analyses favor the MDH and show that earnings announcements and news stories are
the most important drivers of daily realized volatility, followed by macroeconomic news and Google
Trends, and that earnings and upgrades/downgrades are the topics of news stories that are most
relevant to explaining volatility. In addition, the analyses show that it is important both to look at
variations of the volume of information across time and to build measures based on the aggregation
of information over various time horizons since the measures imply varying reactions from market
players. By including news-based information, we can improve volatility forecasting substantially.
2. Literature Review
1 Google Trends is a public web facility of Google Inc. based on Google Search that shows how often a particular search-term
is entered relative to the total search volume across various regions of the world and in various languages.
Econometrics 2017, 5, 35 3 of 46
account for the effects of macroeconomic news announcements, arguing that allowing volatility to differ
on days that contain news releases can disentangle calendar and announcement effects. McMillan and
García (2013) forecast intra-day volatility for the IBEX 35 Index futures using volume and the number of
transactions as proxies for information flows and show that introducing the proxy improves the volatility
forecast for several volatility models at various frequencies. Zhang et al. (2014) employ the number of
news stories that appeared in Baidu News2 as a proxy for information arrival and use a sample of SME
Price Index3 in China to validate the MDH. Their empirical results reveal a positive impact of internet
information on the conditional volatility of stock returns. This link has also been documented for the US
stock market (Kim and Kon 1994; Gallo and Pacini 2000), the UK stock market (Omran and McKenzie
2000), and the Australian stock market (Brailsford 1996).
More generally with regard to the relationship between news flows and asset price dynamics,
the last few decades of research have produced a tremendous number of empirical studies, but these
studies have by no means reached consensus. While some of these papers focus on the impact of
macroeconomic news, others explore the idea that assets react to firm-specific news releases.
2 Baidu News is a service of the Chinese web services company Baidu. Baidu News provides links to a selection of local, national,
and international news, and presents news stories in a searchable format within minutes of their publication on the web.
3 The SME Price Index functions as the market indicator of China’s small and medium-size enterprises listed on the SME Board.
4 Reuters NewsScope Sentiment Engine and Thomson Reuters News Analytics are tools that provide sentiment and linguistic
analytics, such as novelty and relevance indicators, for each news article. The indicators are produced based on automated
linguistic pattern recognition of news texts.
5 RavenPack News Analytics is a service of RavenPack.com, a provider of news analytics and machine-readable content,
that provides event and sentiment information to financial services clients.
firm-specific news sentiment on intraday volatility persistence, even after controlling for the potential
effects of macroeconomic news. Firm-specific news sentiment apparently accounts for a greater proportion
of overall volatility persistence than macroeconomic news sentiment does, and negative news has a greater
impact on volatility than positive news does. Riordan et al. (2013) suggest that negative newswire
messages from Reuters NewsScope Sentiment Engine, compared to positive ones, are associated with higher
adverse selection costs, are more informative, and have a more significant impact on high-frequency
asset price discovery and liquidity. Smales (2015) use Thomson Reuters News Analytics sentiment scores
to create aggregate daily news sentiment indicators and find that positive and negative news result in
above and below average returns, respectively, and that neutral news days are indistinguishable from
days without news. Allen et al. (2017) use the Thomson Reuters News Analytics data set to construct
a series of daily sentiment scores for the Dow Jones Industrial Average (DJIA) stock index component
companies, and study the relationship between these financial news sentiment scores and the stock prices
of these companies using entropy metrics, which permit an analysis of the amount of information within
the sentiment series, its relationship to the DJIA and an indication of how the relationship changes over
time. Allen et al. (2015a, 2015b) explore the impact of the Thomson Reuters News Analytics series on the
DJIA constituents’ asset pricing and volatility. The relation between news and price dynamics has been
studied for other types of assets, as well. For instance, Borovkova and Mahakena (2015) investigate
the impact of news sentiment on returns, price jumps and volatility of natural gas futures. They find
significant relationships between news sentiment and the dynamic characteristics of natural gas futures
prices and document, among other findings, an asymmetric effect of (positive vs negative) news on
volatility. In the books Mitra and Mitra (2011) and Mitra and Yu (2016), several articles deal with many
related research questions. In the field of bag-of-words methods in financial contexts, Loughran and
McDonald (2011) show that word lists developed for other disciplines misclassify common words in
financial texts and develop alternative positive and negative word lists and four other word lists that
reflect tone in financial texts. They show that the proportion of negative words in annual 10-K reports6 is
associated with lower returns.
Unlike previous studies that use or focus on only macroeconomic or firm-specific
information, Bajgrowicz et al. (2016) consider macro, pre-scheduled company-specific announcements
and stories from news agencies like Reuters and Dow Jones News Service and relate them to jumps in
the US stock market.
6 A Form 10-K is an annual report required by the U.S. Securities and Exchange Commission (SEC), that gives a comprehensive
summary of a company’s financial performance.
3. Database Construction
Our first contribution lies in the extraction of information collected from two news providers,
FactSet-StreetAccount and Thomson Reuters-Thomson One, and from Google Trends. Here we describe
our novel dataset and the procedures used to extract the data.
3.1. Dataset
A large set of firm-specific and macroeconomic news is available from the two news providers
FactSet-StreetAccount and Thomson Reuters-Thomson One. As Gloß-Klußmann and Hautsch (2011)
point out, recording and analyzing the overall news flow for a specific asset is challenging since the
amount of news, the number of news sources, and the speed of information dissemination are all
rapidly increasing. Because of the huge amount of information published in all modern media, news
is overlaid with substantial noise from irrelevant information. Since we rely on two professional
news providers that provide only firm-specific news classified by their professionals as relevant to the
firm, we assume that relevant news stories are effectively disentangled from irrelevant ones and that
the impact of noise is adequately reduced. Our approach differs substantially in this regard from work
that analyzes newspaper articles that are not selected a priori.7
StreetAccount, owned by the financial data and software company FactSet, is a news provider
that supplies investment professionals with news summaries. StreetAccount data includes real-time
company updates, portfolio and sector filtering, email alerts, and market summaries. Content can be
customized for portfolio, index, sector, market, time of day (e.g., overnight summaries), and category
(e.g., top stories, market summaries, economic stories, M&A stories). Writers, all of whom are financial
professionals, include former portfolio managers, traders, analysts, and economists who use their
collective market expertise to scan all possible sources for corporate news and report only those stories
that they consider new and material. Comprehensive U.S. and European company coverage and
coverage of a smaller but relevant list of Canadian and Asia Pacific companies extend to thousands of
companies. Firm-specific and macroeconomic news are available in StreetAccount.
Thomson Reuters is a world-leading source of information for businesses and professionals,
and Thomson One, one of its core products, is a database that provides financial market news from
Reuters and leading third-party sources. Thomson One data results from the incorporation of 400
real-time global sources and newswires and more than 6,000 global and regional sources, including
The Economist, Barron’s, Le Monde, The Washington Post, PR Newswire, Business Wire, and The Wall
Street Journal. Comprehensive global coverage of 57,000 publicly listed companies spanning more
than 120 markets tracked and corresponding to 99 percent of global market capitalization includes the
constituents of all major indices and extends to the frontier/emerging markets of Central and Eastern
Europe, Asia, the Middle East, and Africa. Firm-specific news is available in a variety of formats,
each corresponding to a type of information: Significant Developments, Company Events, and Earnings
Surprises. Macroeconomic news is available as well from Thomson One.
Following the growing popularity of the internet as a search tool, the use of such sources as
Google to find information on a certain stock seems to be closely linked to stock market participation.
(See, e.g., Preis et al. (2010).) However, as Da et al. (2011) point out, Google is likely to be representative
of general internet search behavior, so the quantity of queries for a term is a measure for retail investors’
activity, rather than for professional investors’ activity. Therefore, we use Google Trends’ public data
as a proxy for retail investors’ attention.
7 From the pioneering works of Tetlock (2007) and Tetlock et al. (2008) to the more recent studies of, for example, (Birz and
Lott 2011; Dougal et al. 2012; García 2013; Solomon et al. 2014; and Kraussl and Mirgorodskaya 2016), authors have
employed general economic or company-specific news articles from newspapers or specific sections/columns to explain
asset price dynamics but have not made selections based on articles’ relevance or novelty.
We gather news about ten US macroeconomic indicators and firm-specific news and Google
Trends for the S&P 100 Index companies since they are highly capitalized and attention-grabbing
companies. We excluded from the database: (1) stocks whose news stories were not available from
either provider for the period February 2005–February 2015, and (2) stocks that entered the S&P 100
Index or were created after February 2005. Eleven stocks were excluded, and the remaining eighty-nine
stocks are listed in Table A1 in Appendix A.
The information that constitutes the database can be classified into five types:
Firm-specific StreetAccount news stories include trading-floor conjectures, court rulings, FDA and
EU drug approvals, FTC antitrust decisions, SEC filings, brokerage firm upgrades and downgrades,
newspaper and television stories, stories released by social media, and company press releases,
including perspectives, corporate conference calls, and presentations. News stories are classified into eleven
topics, which are listed in Table A2 in Appendix A. News stories are filtered for relevance and redundancy so
each news story is included only once.
Firm-specific Thomson Reuters news stories are available from Significant Developments, a news
analysis, tagging, and filtering service of Thomson One that screens press releases and provides
concise summaries and categorizations of important company events on a near real-time basis.
Customized reports can be created for a portfolio of companies, regions, industries, and news
topics. Each story is organized into one or more of thirty-six topics and is given one of four levels of
significance/importance: low, medium, high, and top, where each level implies a filter which eliminates
all news stories with a lower significance; for instance, low corresponds to all news stories, while
medium corresponds to news stories from medium to top. The thirty-six topics are listed in Table A2
in Appendix A. Assignment of degree of significance is based on the expected effect that the event
will have on the company’s operational and/or financial performance. Like StreetAccount news,
Thomson Reuters news stories are filtered for relevance and redundancy. Firm-specific company
events are also available from Thomson One and consist of a comprehensive list of current and past
events—primarily earnings releases, conference calls, news conferences, and shareholders’ meetings.
They are not categorized by topic, and they are short descriptions of the events that do not allow
sentiment to be extracted. For these reasons we do not use company events to construct news measures,
but they are part of our database and are reported here.
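The cumulative importance filter described above for Significant Developments can be illustrated with a short Python sketch. It is only an illustration under our own assumptions: the function name, the dictionary layout of a story, and the example headlines are hypothetical and do not come from Thomson One.

```python
# Hypothetical sketch of the cumulative importance filter; all names are illustrative.
# Levels are ordered from least to most significant, as described in the text.
IMPORTANCE_LEVELS = ["low", "medium", "high", "top"]


def filter_by_importance(stories, min_level):
    """Keep the stories whose importance is at least `min_level`."""
    threshold = IMPORTANCE_LEVELS.index(min_level)
    return [
        story for story in stories
        if IMPORTANCE_LEVELS.index(story["importance"]) >= threshold
    ]


# Made-up stories, one per importance level.
stories = [
    {"headline": "A", "importance": "low"},
    {"headline": "B", "importance": "medium"},
    {"headline": "C", "importance": "high"},
    {"headline": "D", "importance": "top"},
]

low_filter = filter_by_importance(stories, "low")        # all four stories
medium_filter = filter_by_importance(stories, "medium")  # stories B, C and D
```

With this convention the filter "low" returns every story, so a measure built on "low" coincides with one built on all Thomson Reuters news stories.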
Firm-specific earnings announcements incorporate both the company’s reported actual EPS and
the consensus forecast figure, given as the mean of a set of surveys at the time of reporting, so investors
and analysts can determine whether the company has met, exceeded, or fallen short of the street’s
expectations. Earnings announcements are recovered from StreetAccount news stories that contain the
quarterly EPS announcements and their consensus forecast, which we compare to compute earnings
surprises. Thomson One reports earnings surprises too; although the figures are highly reliable since
they are computed by the provider, data are available with day precision instead of minute precision
and are limited to the period July 2013 – June 2015. As a consequence of these limitations, we do not
use Thomson Reuters earnings surprises in this study.
Ten US macroeconomic indicators are available from Thomson One, and they are listed in Table 1.
Google Trends summarizes the searches performed through the Google website and shows
how many web searches have been done for a particular keyword in a particular period of time in
a particular geographical area relative to the total number of web searches performed through Google
in the same period and area. Absolute values of the index are not publicly available since Google
normalizes the index to 100 in the period in which it reaches the maximum level. Data are gathered
using IP addresses only if the number of searches exceeds a certain threshold. Repeated queries from
a single IP address within a short time are eliminated. Google Trends have been available almost
in real time since the end of January 2004. For each stock, we look at the number of search queries
for the name of the company but do not include search queries for the company’s products or other
related expressions since it is likely that investors search for the company’s name when they look for
information about it. We also exclude search queries for tickers since, in many cases, they correspond
to acronyms for other institutions or have other meanings. Google restricts the access to daily data
for intervals longer than ten months but allows daily data (relative to the maximum) to be gathered
for shorter intervals. For the ten-year period, we reconstruct the daily search volume for the whole
sample from the set of the daily series for each month and the monthly aggregated series for the whole
sample, following a procedure detailed in Section 5.3.
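To fix ideas, the sketch below shows one simple way in which month-by-month daily series could be placed on a common scale using a monthly series. It is an assumption made for illustration and not necessarily the procedure of Section 5.3: each month's daily values are rescaled so that their average equals the monthly index. The function name and the dictionary layout are hypothetical.

```python
# Illustrative assumption: rescale each month's daily series so that its mean
# matches the monthly index. This is NOT necessarily the authors' procedure.
def rescale_daily_by_month(daily_by_month, monthly):
    """Put month-specific daily series on a common scale.

    daily_by_month: dict mapping a month label to its list of daily values,
                    each list normalized to a maximum of 100 within the month.
    monthly:        dict mapping a month label to the monthly index, normalized
                    to a maximum of 100 over the whole sample.
    """
    rescaled = {}
    for month, days in daily_by_month.items():
        mean_level = sum(days) / len(days)
        if mean_level == 0:
            # No recorded searches in the month: keep the series at zero.
            rescaled[month] = [0.0 for _ in days]
            continue
        factor = monthly[month] / mean_level
        rescaled[month] = [value * factor for value in days]
    return rescaled


# Toy input with two-day "months" to keep the arithmetic visible.
daily = {"2005-02": [50, 100], "2005-03": [100, 100]}
monthly = {"2005-02": 30, "2005-03": 60}
comparable = rescale_daily_by_month(daily, monthly)
```

In the toy input the February values become 20 and 40 and the March values become 60 and 60, so observations from different months can be compared.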
The dataset’s time range is 4 February 2005 to 25 February 2015, and all data is available with
minute precision, except for Google Trends, which are daily.
News is available in various data formats, depending on the provider. Using the software
Python™, we extracted from each news story a set of elements that depend on the type of news,
its data format, and its provider. For StreetAccount and Thomson Reuters news stories we obtain
stock, date with minute precision time, headline, topic (also importance for Thomson Reuters
news stories), and text. For company events we derive stock, date, time, and event description.
For earnings announcements we extract stock, date, time, actual EPS, and consensus forecast EPS.
For macro-announcements we isolate type of macro-indicator (e.g., GDP), date, time, actual figure,
and consensus forecast. With regard to Google Trends, we collect for each stock the set of the daily
series for each month and the monthly aggregated series for the whole sample.
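The extracted elements can be thought of as simple records. The following Python sketch shows one possible layout; the class names, field names, ticker, and values are hypothetical and serve only to make the list of elements concrete.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class NewsStory:
    """One StreetAccount or Thomson Reuters news story (hypothetical layout)."""
    stock: str
    timestamp: datetime               # date and time with minute precision
    headline: str
    topic: str
    text: str
    importance: Optional[str] = None  # available for Thomson Reuters stories only


@dataclass(frozen=True)
class Announcement:
    """An earnings or macro announcement with its consensus forecast."""
    subject: str        # stock for earnings, indicator name (e.g., "GDP") for macro
    timestamp: datetime
    actual: float
    consensus: float

    @property
    def surprise(self) -> float:
        """Raw (not yet standardized) deviation from expectations."""
        return self.actual - self.consensus


# Made-up values for illustration only.
story = NewsStory("XYZ", datetime(2010, 3, 15, 9, 31), "Example headline",
                  "earnings", "Example text.")
eps = Announcement("XYZ", datetime(2010, 4, 20, 16, 5), actual=1.10, consensus=1.05)
```

Keeping the actual figure and the consensus forecast in the same record makes the deviation from expectations directly available for later standardization.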
8 Guidance news is almost coincident with earnings-related news; conjecture news describes possible and uncertain events and
are presumably perceived as not important; corporate actions news is about companies’ internal events, which usually have
minor relevance to investors; management changes and syndicate news stories are rare and even non-existent for some stocks.
(events concerning regulatory agencies, internal investigations, and any type of charges brought
by regulatory bodies). The earnings pre-announcements topic merges three topics: positive earnings
pre-announcements (higher than expected), negative earnings pre-announcements (lower than expected),
and other earnings pre-announcements (neutral with respect to expectations). The financial topic merges
equity issues, bond issues, share repurchases, and equity investments, all of which are events that have
an impact on the company’s balance sheet. The other topics were discarded for reasons similar to
those that led to our discarding some of StreetAccount news’ topics. By jointly exploiting topics and
importance, we get (n. importance) × (n. topics) = 4 × 6 = 24 classifications for Thomson Reuters
news stories.
Topics from different providers that appear to have the same meaning are usually similar
but can also differ significantly, because they depend on the criteria each provider’s analysts use to
categorize news. For instance, StreetAccount’s earnings-related
news is a broader concept than Thomson Reuters’ earnings pre-announcements, since the former
comprises earnings pre-announcements released by the company, consensus forecasts released
by the provider, and EPS announcements, while the latter consists only of the company’s earnings
pre-announcements. As another example, StreetAccount’s regulatory news topic apparently does not
include company-internal investigations, unlike Thomson Reuters’ regulatory/company investigation.
Table 2 lists the topics that are included in our dataset for each data provider.
Measure                               Min    Quant 5%   Median   Quant 95%   Max       Mean    Std Dev
StreetAcc. n. news stories per day    0.00   0.00       0.00     1.00        7.00      0.23    0.60
T. Reuters n. news stories per day    0.00   0.00       0.00     1.00        3.00      0.10    0.33
StreetAcc. n. words per day           0.00   0.00       0.00     102.00      1079.00   18.18   66.68
T. Reuters n. words per day           0.00   0.00       0.00     68.90       390.00    8.23    31.80
Notes: Summary statistics of StreetAccount news stories (topic: all) and Thomson Reuters news stories (topic: all;
importance: low).
4. Sentiment Detection
Our second contribution consists of detecting the sentiment of news stories. Sentiment indicates
whether the content of a document—in our case, a news story—is good, bad, or neutral in relation to
the issue it addresses. We use the sentiment-related word lists developed by Loughran and McDonald
(2011) and introduce a set of negations, with the aim of creating a method for extracting the sentiment
of a financial text with more confidence and independent of its type, length, and audience.
Loughran and McDonald (2011) develop six word lists (negative, positive, uncertainty, litigious,
strong modal, and weak modal) and show that a higher proportion of negative words is associated
with lower returns. Their lists are tailored for financial texts; for example, they do not contain words
like liability, earnings, and tax, which are expected to appear in both positive and negative contexts.
The authors account for negation with six words (no, not, none, neither, never, and nobody), but only
if they precede words that are classified as positive. The methodology is applied to US companies’
10-K filings, which have a formal tone and are unlikely to contain many negations. We deal
instead with news written by news providers, whose language we expect to be less constrained
than that of 10-K filings. The procedure of Loughran and McDonald (2011) is therefore not adequate
for extracting the sentiment of news stories, for two reasons. First, unlike 10-Ks, which are filed
with the SEC, news stories do not necessarily have a formal tone. Second, 10-Ks are long enough that,
if negated words occur and their sentiment is incorrectly identified, the effect on the whole
document is negligible, whereas our news stories are seldom longer than a few dozen words.
Negations can appear in various forms and can invert the meaning of whole phrases, as well
as single words. The phrase whose meaning is changed is called the negation scope. Negations can
also flip the meaning of sentences, as in “the company has invented a new product for the first and
last time.” Identifying negation scopes, implicit negations, and linguistic peculiarities like sarcasm
and irony still presents many problems. Approaches like heuristic rules and machine learning that
perform natural language processing can bring significant improvements, but they are beyond the scope
of the present work.
Remaining within the bag-of-words framework and avoiding the numerical complexities implied
by the aforementioned approaches, we invert the sentiment each time a word, whether positive or
negative, is preceded by a negation; and in place of the short negation list of six single words, we use
twenty-eight single words, twenty-four sequences of two words, and six sequences of three words.
We believe that this modification allows the sentiment of a financial text to be extracted with more
confidence and independent of its type, length, and audience:
• single words: no, not, none, never, nothing, nobody, nowhere, neither, nor, hardly, scarcely, seldom, barely,
few, little, rarely, instead, can’t, cannot, don’t, doesn’t, didn’t, mustn’t, won’t, despite, overly, too, less
• two-word sequences: can not, do not, did not, short of, not every, not all, not much, not many, not
always, not so, instead of, far from, not to, never to, no way, out of, not very, not enough, too few, too little,
no big, not big, no significant, not significant
• three-word sequences: not at all, by no means, in no way, in place of, in spite of, in lieu of
1. Positive words are given a value of 1 and negative words a value of −1; the value is inverted in
case of negation.
2. The values of all words with a sentiment are summed to get the sentiment sum (Sent_Sum):
\[ \text{Sent\_Sum} = \sum_{i=1}^{N} s_i \tag{1} \]
where i is the word index, N is the number of words with a sentiment in a text, and s_i is the
sentiment of the word indexed by i.
3. Sent_Sum is divided by the number of words with a sentiment to obtain a standardized quantity
that we call relative sentiment (Rel_Sent) and that is between −1 and 1 by construction:
\[ \text{Rel\_Sent} = \frac{\text{Sent\_Sum}}{N} \tag{2} \]
4. If Rel_Sent is larger than 0.05 or smaller than −0.05, we associate a positive (1) or a negative
sentiment (−1) to the news, respectively; otherwise a neutral sentiment (0) is given.
News sentiment is therefore neutral when a significant number of positive and negative words
are detected and their proportion is roughly the same:
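\[ \text{Text\_Sent} = \begin{cases} -1 & \text{if } \text{Rel\_Sent} < -0.05 \\ 0 & \text{if } -0.05 \le \text{Rel\_Sent} \le 0.05 \\ 1 & \text{if } \text{Rel\_Sent} > 0.05 \end{cases} \tag{3} \]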
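Steps 1 to 4 can be sketched in Python as follows. The sketch rests on several assumptions of ours: the positive and negative word sets are tiny stand-ins rather than the Loughran and McDonald (2011) lists, "preceded by a negation" is read as "the one, two, or three words immediately before the sentiment word form a negation", and a text with no sentiment words is treated as neutral. All function and variable names are hypothetical.

```python
import re

# Toy stand-ins for illustration only, NOT the Loughran-McDonald word lists.
POSITIVE = {"gain", "improve", "strong"}
NEGATIVE = {"loss", "decline", "weak"}

# Negation lists as given in the text.
SINGLE = ("no not none never nothing nobody nowhere neither nor hardly scarcely "
          "seldom barely few little rarely instead can't cannot don't doesn't "
          "didn't mustn't won't despite overly too less").split()
TWO = ["can not", "do not", "did not", "short of", "not every", "not all",
       "not much", "not many", "not always", "not so", "instead of", "far from",
       "not to", "never to", "no way", "out of", "not very", "not enough",
       "too few", "too little", "no big", "not big", "no significant",
       "not significant"]
THREE = ["not at all", "by no means", "in no way", "in place of", "in spite of",
         "in lieu of"]
NEGATIONS = set(SINGLE) | set(TWO) | set(THREE)


def tokenize(text):
    """Lower-case the text and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower().replace("\u2019", "'"))


def is_negated(tokens, i):
    """True if the 1, 2 or 3 tokens right before position i form a negation."""
    return any(i - n >= 0 and " ".join(tokens[i - n:i]) in NEGATIONS
               for n in (1, 2, 3))


def text_sentiment(text, threshold=0.05):
    """Return 1, -1 or 0 following Equations (1) to (3)."""
    tokens = tokenize(text)
    scores = []
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            value = 1
        elif token in NEGATIVE:
            value = -1
        else:
            continue
        scores.append(-value if is_negated(tokens, i) else value)  # step 1
    if not scores:  # assumption: no sentiment words means neutral
        return 0
    rel_sent = sum(scores) / len(scores)  # steps 2 and 3
    if rel_sent > threshold:              # step 4
        return 1
    if rel_sent < -threshold:
        return -1
    return 0
```

For example, under these toy lists "Results were not strong and the decline continued" is scored as negative, because the negation inverts "strong", while "A gain offset by a loss" is scored as neutral.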
Unlike mainstream text-analysis techniques, which look either only at headlines
or only at text, we use both headlines and text: we first apply the sentiment extraction to the headline.
The procedure stops if a positive or negative sentiment is detected; otherwise, the whole text is
analyzed. This method is more complete than looking at headlines only while also being more efficient
than looking directly at the text since it allows us to use small pieces of text rather than long ones when
it is possible to infer sentiment from headlines only.
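The headline-first logic can be sketched as below. The sketch assumes that any function mapping a piece of text to 1, −1, or 0 can serve as the scorer; the `toy_scorer` shown here is a placeholder of ours and not the sentiment extraction described above, and all names are hypothetical.

```python
def headline_then_text(headline, text, scorer):
    """Score the headline first; analyze the full text only if it is neutral."""
    headline_sentiment = scorer(headline)
    if headline_sentiment != 0:
        return headline_sentiment
    return scorer(text)


def toy_scorer(passage):
    """Placeholder scorer returning 1, -1 or 0 from two tiny word sets."""
    words = passage.lower().split()
    positives = sum(word in {"gain", "strong"} for word in words)
    negatives = sum(word in {"loss", "weak"} for word in words)
    return (positives > negatives) - (positives < negatives)


# The headline is neutral, so the text decides the sentiment.
result = headline_then_text("Company reports results",
                            "Weak demand led to a loss", toy_scorer)
```

Whether "the whole text" includes the headline is not specified here; the sketch passes only the body of the story to the second call.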
The following subsections describe the procedures we followed to build news-related variables
from the dataset. They cover: (1) the concepts used to build variables from news stories, (2) the
standardized surprises obtained from earnings and macro-announcements, (3) the reconstruction of the daily Google Search
Index for the whole sample, and (4) the news measures we propose for a daily analysis
of asset price dynamics.
\[ \text{News\_BI}_t(M, k) = \sum_{j=1}^{M} n_{t,j}^{k}, \qquad k \ge 1 \tag{4} \]
where t is the time period over which the measure is computed, M is the number of subintervals
into which t can be split, and n_{t,j} is the number of news stories disclosed between (t−1 + (j−1)/M)
and (t−1 + j/M), that is, in the j-th subinterval. t can range from a few minutes to a day or a series of
days. If t is a day, it will be split into a series of intraday intervals, such as five minutes, ten minutes,
or fifteen minutes. If t is a longer period (say, a week or a month), it is reasonable to divide it
into a series of days, such as one-day, two-day, or five-day intervals.
• Quantity variation: variation across periods of the quantity of news stories (or words).
This concept takes into account the chance that investors’ reactions are triggered not only by the
release of information, but more generally by increases in the quantity of information. The market
can become accustomed to news releases such that it perceives them as informative only when
they are released at a higher (lower) rate than usual, in which case it waits for the rate of
information arrival to increase (decrease) before making a decision.
• News persistence/interaction: when the quantity of news is above a threshold in each of
two consecutive periods. Since providers do not supply redundant news,9 this event denotes
persistence in the release of news stories that are related in each period to a different issue.
• Sentiment inversion: when the sentiment of the reference period is opposite to that of previous
periods.
• Quantity variation conditional on sentiment: positive quantity variation conditional on the
sentiment of the reference period and negative quantity variation conditional on the sentiment of
the previous period. The sentiment of the period with a higher quantity of information is likely to
have the greater influence on investors’ attention.
9 News providers claim to supply only novel news stories, so we expect them not to report the same information more than once.
Econometrics 2017, 5, 35 13 of 46
• Sentiment conditional on quantity: sentiment of the reference period conditional on the quantity
of information released during the same and during longer periods. Investors may base their
decisions on the sentiment of the reference period, but their attention may depend on the quantity
of information that is released during periods of the same duration or during longer periods.
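As an illustration of the news burst index in Equation (4), the following minimal sketch computes the index from a list of subinterval counts. The function name and the example counts are hypothetical; only the formula comes from the text. With k > 1, the index is larger when the same number of stories is concentrated in fewer subintervals.

```python
from typing import Sequence


def news_burst_index(counts: Sequence[int], k: int) -> int:
    """Equation (4): sum over the M subintervals of the k-th power of the news count."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return sum(n ** k for n in counts)


# Hypothetical counts for a period split into M = 3 subintervals (six stories in total).
clustered = [6, 0, 0]  # all stories arrive in a single subinterval
spread = [2, 2, 2]     # the same number of stories arrives evenly

burst_clustered = news_burst_index(clustered, k=2)
burst_spread = news_burst_index(spread, k=2)
plain_count = news_burst_index(spread, k=1)  # with k = 1 the index is the story count
```

In this example the clustered arrival yields a larger index than the even arrival, although both periods contain six stories.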
$$SUE_t = \frac{EPS_t^{actual} - EPS_t^{forecast}}{\sigma\left(EPS_t^{actual} - EPS_t^{forecast}\right)} \tag{5}$$
where $\sigma(EPS_t^{actual} - EPS_t^{forecast})$ is the standard deviation of $(EPS_t^{actual} - EPS_t^{forecast})$.
For macro-announcements, we compute the standardized surprise, Std_Macro, from the actual values and
consensus forecasts of the indicators, as we did for earnings:
$$Std\_Macro_t = \frac{Macro_t^{actual} - Macro_t^{forecast}}{\sigma\left(Macro_t^{actual} - Macro_t^{forecast}\right)} \tag{6}$$
where Macro generically stands for any of the ten indicators listed in Table 1 and $\sigma(Macro_t^{actual} - Macro_t^{forecast})$
is the standard deviation of $(Macro_t^{actual} - Macro_t^{forecast})$.
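The standardized surprises in Equations (5) and (6) share the same structure, which can be sketched as follows. This is an illustration under stated assumptions: the text does not say whether the standard deviation is a sample or population estimate, nor over which window it is computed, so the sketch uses the sample standard deviation over the whole series. The function name and the numbers are hypothetical.

```python
from statistics import stdev


def standardized_surprises(actual, forecast):
    """Divide each surprise (actual minus forecast) by the standard deviation of all surprises."""
    surprises = [a - f for a, f in zip(actual, forecast)]
    sigma = stdev(surprises)  # assumption: sample standard deviation over the full series
    if sigma == 0:
        raise ValueError("surprises have zero dispersion")
    return [s / sigma for s in surprises]


# Hypothetical announcement values, for illustration only.
actual = [1.10, 0.95, 1.30, 1.00]
forecast = [1.00, 1.00, 1.20, 1.05]
sue = standardized_surprises(actual, forecast)
```

By construction, the resulting series has unit standard deviation, which makes surprises comparable across firms and indicators.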
• The set of daily series for each month GT_Dailyd,m (121 series, one for each month from February
2005 to February 2015, with the observations in each series equaling the number of days in the
month), where for each series the observations are relative to the maximum of 100 during the
month. Therefore, observations are not comparable across different months.
• The monthly-aggregated series for the whole sample GT_Monthlym (one series having 121
observations), where the observations are relative to the maximum of 100.
1. Compute the relative contribution of day d to the search volume of month m GT_DailyReld,m ,
by dividing the daily observation of day d (relative to the maximum of month m) GT_Dailyd,m by
the sum of all the daily observations of that month:11
$$GT\_DailyRel_{d,m} = \frac{GT\_Daily_{d,m}}{\sum_{d=1}^{M_m} GT\_Daily_{d,m}} \tag{7}$$
10 The term “Google Search Index” is consistent with the recent literature.
11 GT_DailyReld,m refers to the daily observations being divided by their sum over the month in such a way that their sum
over each month is equal to 1.
$$GSI_{d,m} = \frac{GT_{d,m}}{\max(GT_{d,m})} \cdot 100 \tag{9}$$
where $\max(GT_{d,m})$ is the maximum of $GT_{d,m}$ over the series.
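The reconstruction of a daily index that is comparable across months can be sketched as below. Equations (7) and (9) are taken from the text; the intermediate linking step, in which each daily share is weighted by the monthly index, is an assumption made for this illustration, since that step is not shown here. The function name and the data are hypothetical.

```python
def reconstruct_gsi(daily_by_month, monthly):
    """Chain within-month daily series into one daily index with a maximum of 100."""
    linked = []
    for daily, month_level in zip(daily_by_month, monthly):
        total = sum(daily)
        for value in daily:
            # Equation (7): share of the month's search volume attributed to this day.
            share = value / total if total else 0.0
            # Assumed linking step: weight the daily share by the monthly index.
            linked.append(share * month_level)
    peak = max(linked)
    if peak == 0:
        raise ValueError("no search volume in the sample")
    # Equation (9): rescale so that the largest observation equals 100.
    return [100.0 * value / peak for value in linked]


# Hypothetical two-month sample (three days, then two days).
daily_by_month = [[50, 100, 50], [100, 100]]
monthly = [40, 100]
gsi = reconstruct_gsi(daily_by_month, monthly)
```

In this toy sample the second month has the higher monthly index, so its days set the maximum of 100 and the first month's days are scaled down accordingly.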
• Daily: information from the market closing time of day t-1 to the market closing time of day t12
• Overnight: information from the market closing time of day t-1 to the market opening time of
day t
• Weekly: the most recent five trading days
• Monthly: the most recent twenty-two trading days
As a last step, to identify possible non-linearities in the relationship between market
activity and the indicators, we extend the variables through a series of monotonic transformations, which are
detailed in Table 5. The transformation flag if x ≠ 0 takes into account the possibility that investors react only to surprises relative to
expectations (for macro-announcements and EPS), to the release of news stories independently of
their number or to variations in the quantity of information (for news stories), and to variations
in the level of attention (for Google Trends); flag if x > 0 and flag if x < 0 capture
asymmetries in these relationships; (signed) square root(x) and log(x) allow market
activity to rise less than proportionally to the absolute value of a variable, while square(x)
does the opposite.
Based on the values the original measure x can assume, we apply either all transformations or
only a subset of them. For instance, if x can only be non-negative (e.g., n. news stories), flag if x > 0
and flag if x < 0 are not applied; if x is the sentiment, only x, flag if x > 0, and flag if x < 0 are applied;
if x is the day-to-day ∆ n. news stories, all transformations are applied.
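The seven transformations of Table 5 can be written compactly, as in the following sketch. The dictionary keys and helper names are hypothetical labels chosen for this illustration; the formulas follow the table.

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


TRANSFORMATIONS = {
    "original measure": lambda x: x,
    "flag if x != 0": lambda x: int(x != 0),
    "flag if x > 0": lambda x: int(x > 0),
    "flag if x < 0": lambda x: int(x < 0),
    "signed square root": lambda x: sign(x) * math.sqrt(abs(x)),
    "signed log": lambda x: sign(x) * math.log(1 + abs(x)),
    "signed square": lambda x: sign(x) * x ** 2,
}


def transform_all(x):
    """Apply every transformation to a single value of the original measure."""
    return {name: f(x) for name, f in TRANSFORMATIONS.items()}


# Example: a day-to-day change of -4 news stories, to which all transformations apply.
values = transform_all(-4)
```

For measures that cannot be negative, the two sign flags would simply be dropped from the dictionary before applying it.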
Tables 6–11 report the measures built from news stories. Tables 6–9 each correspond to one
time horizon, while Tables 10 and 11 list the measures based on the aggregation or comparison of
information across more than one time horizon. Tables 12 and 13 report the measures built from EPS
and macro-news. Information is aggregated over daily, overnight, weekly, and monthly time horizons.
EPS and macro-announcements are released with frequencies ranging from one week to several months,
so measures based on their comparison across periods would represent either lagged announcements
or zero. Therefore, unlike the news-story measures, EPS and macro-news measures are not compared
across periods. Table 14 reports the measures built from Google Trends. Summing the news-related
variables for StreetAccount news stories, Thomson Reuters news stories, earnings announcements,
macro-announcements, and Google Trends, we obtain 5159 news measures for each asset.
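The total of 5159 measures can be checked from the per-table counts reported in Tables 6–14. The short sketch below reproduces the arithmetic; the variable names are hypothetical, while the counts are those given in the tables.

```python
# Measures per topic for each group of news-story tables.
per_topic = {
    "daily": 28,
    "overnight": 16,
    "weekly": 19,
    "monthly": 19,
    "across horizons": 72,
}

# 7 StreetAccount topics plus 6 Thomson Reuters topics times 4 levels of importance.
n_series = 7 + 6 * 4

news_story_measures = sum(count * n_series for count in per_topic.values())
eps_measures = 32
macro_measures = 320
google_trends_measures = 33

total_measures = (
    news_story_measures + eps_measures + macro_measures + google_trends_measures
)
```

The news-story measures account for 4774 of the total, and the remaining 385 come from earnings, macro-announcements, and Google Trends.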
12 t, t-1, and so on refer to trading days only, so information released during holidays and weekends is considered part of the
daily information of the first following trading day, as well as part of its overnight information.
Transformation           Formula
original measure         x
flag if x ≠ 0            1 if x ≠ 0, 0 otherwise
flag if x > 0            1 if x > 0, 0 otherwise
flag if x < 0            1 if x < 0, 0 otherwise
signed square root(x)    sign(x) · √|x|
signed log(x)            sign(x) · log(1 + |x|)
signed square(x)         sign(x) · x²
Variable                                N. Transf.
STANDARD
n. news stories                         5 (a)
n. words                                4
sentiment                               3 (b)
ABNORMAL QUANTITY
n. news stories ≥ 2                     1 (c)
UNCERTAINTY
pos and neg news in same day            1
NEWS BURST INDEX
news burst index (n. news),
  (M = 78, 13, 3) × (k = 2, 4)          6 (d)
news burst index (n. words),
  (M = 78, 13, 3) × (k = 2, 4)          6
SENTIMENT COND. ON QUANTITY
pos sent & n. news stories ≥ 2          1
neg sent & n. news stories ≥ 2          1
total for each topic                    28
grand total (28 × 31 (e))               868
Notes: The first column shows the variables grouped by the concepts that originated them. The second
column shows the number of transformations, with the total number of measures obtained at the end of
the column. We obtain 868 measures. a: When the original measure can only be positive, such as the number of
news stories, the two transformations flag if x > 0 and flag if x < 0 are omitted, leaving five transformations.
The flag for number of words ≠ 0 is omitted because this measure corresponds to the flag for number of news stories ≠ 0.
b: The transformations applied are the original measure, flag if x > 0, and flag if x < 0. c: When the number of
transformations equals 1, the measure consists of a flag (1 for the occurrence of the event, and 0 otherwise).
d: For the news burst index, we report in the second column the number of combinations of the parameters M and k,
and do not apply transformations. e: There are seven topics for StreetAccount news stories and six topics and
four levels of importance for Thomson Reuters news stories; 31 is the sum of the number of topics of
StreetAccount (7) and the number of topics times four levels of importance of Thomson Reuters (24).
Variable N. Transf.
STANDARD
n. news stories 5
n. words 4
sentiment 3
ABNORMAL QUANTITY
n. news stories ≥ 2 1
UNCERTAINTY
pos and neg news in same day 1
SENTIMENT COND. ON QUANTITY
pos sent & n. news stories ≥ 2 1
neg sent & n. news stories ≥ 2 1
total for each topic 16
grand total (16 × 31) 496
Variable N. Transf.
STANDARD
av. n. news storiesa 5
av. n. words 4
sentimentb 3
ABNORMAL QUANTITY
av. n. news stories ≥ 1 1
NEWS BURST INDEX
news burst index (n. news) 2
( M = 5) × (k = 2, 4)
news burst index (n. words) 2
( M = 5) × (k = 2, 4)
SENTIMENT CONDITIONAL ON QUANTITY
pos sent & av. n. news stories ≥ 1 1
neg sent & av. n. news stories ≥ 1 1
total for each topic 19
grand total (19 × 31) 589
a: Quantities result from averaged daily quantities over the last five trading days. Av. refers to average.
b: Sentiment results from the sign of the averaged sentiment over the last five trading days.
Variable N. Transf.
STANDARD
av. n. news stories 5
av. n. words 4
sentiment 3
ABNORMAL QUANTITY
av. n. news stories ≥ 1 1
NEWS BURST INDEX
news burst index (n. news) 2
( M = 22) × (k = 2, 4)
news burst index (n. words) 2
( M = 22) × (k = 2, 4)
SENTIMENT CONDITIONAL ON QUANTITY
pos sent & av. n. news stories ≥ 1 1
neg sent & av. n. news stories ≥ 1 1
total for each topic 19
grand total (19 × 31) 589
Variable N. Transf.
QUANTITY VARIATIONa
day-to-day ∆ n. news stories 7
week-to-day ∆ n. news stories 7
month-to-day ∆ n. news stories 7
day-to-day ∆ n. words 7
week-to-day ∆ n. words 7
month-to-day ∆ n. words 7
NEWS PERSISTENCE/INTERACTIONb
n. news stories today ≥ 2 & n. news stories day before ≥ 2 1
n. news stories today ≥ 2 & av. n. news stories week before ≥ 1 1
n. news stories today ≥ 2 & av. n. news stories month before ≥ 1 1
SENTIMENT INVERSIONc
day-to-day sent inv 1
day-to-day sent inv, neg to pos 1
day-to-day sent inv, pos to neg 1
week-to-day sent inv 1
week-to-day sent inv, neg to pos 1
week-to-day sent inv, pos to neg 1
month-to-day sent inv 1
month-to-day sent inv, neg to pos 1
month-to-day sent inv, pos to neg 1
Notes: Measures created from the aggregation or comparison of information across different periods, which
can differ from one another. a : day-to-day ∆ n. news stories is equal to the number of news stories on day t
minus the number of news stories on day t-1; week-to-day ∆ n. news stories and month-to-day ∆ n. news stories
are equal to the number of news stories on day t minus the average number of news stories in the week
before (from t-5 to t-1) and in the month before (from t-22 to t-1), respectively. b : news persistence/interaction
describes the event in which the amount of news is above a certain threshold in each of two consecutive
periods. c : day-to-day sent inv describes the event in which sentiment on day t is the opposite of sentiment on
day t-1; day-to-day sent inv, neg to pos and day-to-day sent inv, pos to neg describe events in which sentiment is
negative on day t-1 and positive on day t and the reverse, respectively.
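The cross-period measures described in these notes can be sketched as follows. The function names and the sample series are hypothetical. The sketch also assumes that sentiment is coded as a signed number and that a neutral value of zero never counts as an inversion, which the text does not state explicitly.

```python
def delta_vs_past_average(series, t, window):
    """Value on day t minus the average over the `window` days before it (t-window to t-1)."""
    if t < window:
        raise ValueError("not enough history before day t")
    past = series[t - window:t]
    return series[t] - sum(past) / window


def sentiment_inversion(sent_today, sent_before):
    """Return flags for (any inversion, negative-to-positive, positive-to-negative)."""
    neg_to_pos = int(sent_before < 0 and sent_today > 0)
    pos_to_neg = int(sent_before > 0 and sent_today < 0)
    return int(neg_to_pos or pos_to_neg), neg_to_pos, pos_to_neg


# Hypothetical daily counts of news stories; the last element is day t.
n_news = [2, 0, 3, 1, 4, 5]
t = len(n_news) - 1

day_to_day = delta_vs_past_average(n_news, t, window=1)   # versus day t-1
week_to_day = delta_vs_past_average(n_news, t, window=5)  # versus average of t-5 to t-1
flags = sentiment_inversion(sent_today=1, sent_before=-1)
```

The month-to-day variation follows from the same function with a window of 22 trading days.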
Variable N. Transf.
QUANTITY VARIATION COND. ON SENTIMENTa
day-to-day ∆ n. news stories > 0 & pos sent today 1
day-to-day ∆ n. news stories > 0 & neg sent today 1
day-to-day ∆ n. news stories < 0 & pos sent day before 1
day-to-day ∆ n. news stories < 0 & neg sent day before 1
week-to-day ∆ n. news stories > 0 & pos sent today 1
week-to-day ∆ n. news stories > 0 & neg sent today 1
week-to-day ∆ n. news stories < 0 & pos sent week before 1
week-to-day ∆ n. news stories < 0 & neg sent week before 1
month-to-day ∆ n. news stories > 0 & pos sent today 1
month-to-day ∆ n. news stories > 0 & neg sent today 1
month-to-day ∆ n. news stories < 0 & pos sent month before 1
month-to-day ∆ n. news stories < 0 & neg sent month before 1
SENTIMENT COND. ON PAST QUANTITYb
pos sent today & n. news stories day before ≥ 2 1
neg sent today & n. news stories day before ≥ 2 1
pos sent today & av. n. news stories week before ≥ 1 1
neg sent today & av. n. news stories week before ≥ 1 1
pos sent today & av. n. news stories month before ≥ 1 1
neg sent today & av. n. news stories month before ≥ 1 1
total for each topic 72
grand total (72 × 31) 2232
a : day-to-day ∆ n. news stories > 0 & pos sent today describes the event in which the number of news stories on
day t is greater than the number of news stories on day t-1 and the sentiment on day t is positive; the remaining
variables in the group quantity variation conditional on sentiment are straightforward. The sentiment conditioning
the occurrence of the event is that of the period with a greater amount of news; therefore, we look at the
sentiment of the period before day t for negative variations. b : The variables that belong to the group sentiment
conditional on past quantity describe the events in which the sentiment on day t is positive or negative and
when, in the period before, the quantity of news is above a threshold that equals 2 for the number of news
stories on the day before and 1 for the average number of news stories in the week and the month before.
Variable N. Transf.
daily SUE 8a
overnight SUE 8
weekly SUE 8
monthly SUE 8
grand total 32
Notes: EPS measures result from the EPS released in the corresponding period. For example, weekly SUE is
equal to the SUE if there was an EPS release in the last week. a : In addition to the seven transformations of
Table 5, we add a flag variable for the occurrence of an EPS release (1 for occurrence, and 0 otherwise).
Variable N. Transf.
daily Std_CCONF 8a
daily Std_CPI 8
daily Std_FOMC 8
daily Std_GDP 8
daily Std_INDPROD 8
daily Std_BOP 8
daily Std_JOB 8
daily Std_NFP 8
daily Std_PPI 8
daily Std_RSALES 8
overnight Std_Macrob 8 × 10
weekly Std_Macro 8 × 10
monthly Std_Macro 8 × 10
grand total 320
Notes: Macro-measures result from the macro-announcement released in the corresponding period, as for
EPS measures. a : In addition to the seven transformations of Table 5, we add a flag variable for the occurrence
of a macro-release (1 for occurrence, and 0 otherwise), as for EPS measures. b : Std_Macro refers to the
standardized surprise of any of the macro-indicators, which are reported only in the daily group for reasons of
brevity. In the second column, the number of transformations is multiplied by the number of macro-indicators.
Variable N. Transf.
daily GSI 4a
weekly av. GSI 4
monthly av. GSI 4
day-to-day ∆ GSI 7
week-to-day ∆ GSI 7
month-to-day ∆ GSI 7
grand total 33
Notes: weekly av. GSI and monthly av. GSI correspond to the average of daily GSI over the last five and
twenty-two trading days, respectively. day-to-day ∆ GSI is equal to GSI on day t minus GSI on day t-1;
week-to-day ∆ GSI and month-to-day ∆ GSI are equal to GSI on day t minus the average GSI in the week before
(from t-5 to t-1) and in the month before (from t-22 to t-1), respectively. a : We use four transformations:
original measure, signed square root, signed log, and signed square.
13 The variables, as in Corsi and Renò (2012), are specified in logs. As Andersen et al. (2003) point out, while the distributions
of realized volatilities are clearly right-skewed, the distributions of the logarithms of realized volatilities are approximately
Gaussian. Andersen et al. (2003) also use the logarithmic transformation to model and forecast realized volatilities.
The model can also be specified directly for $RV_t$ and for $\sqrt{RV_t}$, as in Andersen et al. (2007) and Corsi et al. (2010).
where $\beta_{News}$ is the k × 1 vector of coefficients, T denotes transposition, and $News_{t-1}$ is the k × 1 vector
of news measures available before the market opens on day t.
Using all the measures we created in Section 5, k is equal to 5159. We face a dimensionality issue
since the number of regressors is higher than the number of observations, the latter being smaller
than 3000. We use LASSO to address the issue and to select the measures that are the most useful in
explaining volatility.
LASSO (Tibshirani 1996) is an estimation method for linear models that performs variable
selection and shrinks coefficients. By minimizing the residual sum of squares, subject to the sum of
the absolute value of the coefficients being less than a constant, LASSO shrinks some coefficients and
sets others to 0, thereby providing interpretable models. In addition, the coefficients it produces have
potentially lower predictive errors than ordinary least squares do. Audrino and Knaus (2016) also
use LASSO to model realized volatility, and find that the HAR model’s lags structure is not fully in
agreement with the one identified from a model-selection perspective using LASSO on real data.
Previous studies have used LASSO as a model selection tool, a use that is not immune to
criticism. In particular, empirical evidence shows that LASSO underperforms in the case of correlated
predictors, in the sense that it may select one at random from a group of highly correlated predictors,
which can affect the interpretability of the model and compromise its predictive accuracy (Zou and
Hastie 2005). In addition, coefficients estimated with LASSO are biased (James et al. 2013), and negative
dependence can be problematic: LASSO could miss relevant variables with negative dependencies,
depending on the order of inclusion (Castle et al. 2011). Many variants and alternatives exist,
and they offer solutions to these problems. Stepwise regression is popular, but it is path dependent and
does not have a high success rate in finding the correct model (Berk 1978). Ridge regression (Feig
1978) performs well in scenarios with correlated predictors, but it does not perform
variable selection and therefore does not help to make the model more interpretable; in addition, it is
problematic in the presence of noise predictors. Other variants are Elastic Net (Zou and Hastie 2005),
Smoothly Clipped Absolute Deviation (Fan and Li 2001), Least Angle Regression (Efron et al. 2004),
Fused LASSO (Tibshirani et al. 2005), Group LASSO (Yuan and Lin 2006), Adaptive LASSO (Zou 2006),
and the general-to-specific model selection procedure (Gets; Castle et al. 2011). The majority of
the classical penalized methods have a Bayesian analogue, often referred to as Bayesian regularization;
see Pavlou et al. (2016) for a review. Overall, no method outperforms the others in all scenarios, and
the choice of method should be based on the features of the particular data set at hand. We leave
the comparison of these methods to future work and employ the basic LASSO in this
illustrative application of the news variables we developed.
Suppose we have data $(x_i, y_i)$, $i = 1, \dots, N$, where $x_i = (x_{i1}, \dots, x_{ip})^T$ are the predictor variables,
$y_i$ is the response, and N is the number of observations. It is assumed that either the observations
are independent or that $y_i$ is conditionally independent given $x_{ij}$ and that $x_{ij}$ is standardized so that
$\sum_i x_{ij}/N = 0$, $\sum_i x_{ij}^2/N = 1$.
Letting $\hat{\beta} = (\hat{\beta}_1, \dots, \hat{\beta}_p)^T$, the LASSO estimate $(\hat{\alpha}, \hat{\beta})$ is:
$$(\hat{\alpha}, \hat{\beta}) = \arg\min \left\{ \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j} \beta_j x_{ij} \Big)^2 \right\} \quad \text{subject to} \quad \sum_{j} |\beta_j| \le t \tag{12}$$
where t ≥ 0 is a tuning parameter, which is selected using block cross-validation14. We use the glmnet
package for R, which computes the cross-validation error for each t among a set of values,
and we select from a grid of 100 values the t corresponding to the most regularized model such that
the error is within one standard error of the minimum.
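The selection rule can be sketched as follows, under stated assumptions. The sketch is written in terms of a penalty strength for which larger values mean more regularization (the parametrization used by glmnet, where cv.glmnet reports the analogous choice as lambda.1se), rather than in terms of the bound t. The function names and the error values are hypothetical, and the folds are contiguous blocks of time rather than random draws.

```python
from math import sqrt
from statistics import mean, stdev


def contiguous_folds(n_obs, n_folds=10):
    """Split the indices 0..n_obs-1 into n_folds contiguous blocks of time."""
    bounds = [i * n_obs // n_folds for i in range(n_folds + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(n_folds)]


def one_standard_error_choice(penalties, fold_errors):
    """Most regularized penalty whose mean CV error is within one SE of the minimum.

    fold_errors[i] holds the validation errors of penalties[i], one per fold.
    Larger penalty values are assumed to mean stronger regularization.
    """
    means = [mean(errors) for errors in fold_errors]
    std_errors = [stdev(errors) / sqrt(len(errors)) for errors in fold_errors]
    best = min(range(len(penalties)), key=means.__getitem__)
    threshold = means[best] + std_errors[best]
    eligible = [p for p, m in zip(penalties, means) if m <= threshold]
    return max(eligible)


folds = contiguous_folds(10, n_folds=3)

# Hypothetical validation errors for three penalty values and three folds.
penalties = [0.01, 0.1, 1.0]
fold_errors = [[1.0, 1.2, 1.1], [1.0, 1.1, 1.2], [2.0, 2.1, 2.2]]
chosen = one_standard_error_choice(penalties, fold_errors)
```

In this example the two smaller penalties have the same mean error, so the rule prefers the larger of the two, while the largest penalty is excluded because its error exceeds the threshold.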
We estimate the parameters of the LHAR-CJN model using LASSO and apply the restriction to all
coefficients except the constant β 0 . Therefore, the restriction is applied to β Cd , β Cw , β Cm , β Jd , β Jw , β Jm ,
β rd , β rw , β rm and β News , the latter consisting of a vector of 5159 coefficients.
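To make the estimation concrete, the sketch below solves the LASSO problem by cyclic coordinate descent with soft-thresholding, leaving the constant unpenalized. It is an illustration, not the glmnet implementation: it uses the equivalent penalized (Lagrangian) form, minimizing the residual sum of squares divided by 2N plus `lam` times the sum of absolute coefficients, and it assumes predictors standardized as in the text. The function names, the value of `lam`, and the data are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, and set it to zero if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """LASSO with an unpenalized intercept; columns of X must have mean 0 and mean square 1."""
    n, p = len(X), len(X[0])
    alpha = sum(y) / n                  # with centered predictors, the constant is the mean of y
    beta = [0.0] * p
    resid = [yi - alpha for yi in y]    # residuals for the current coefficients (all zero)
    for _ in range(n_sweeps):
        for j in range(p):
            # Covariance of column j with the partial residual that excludes its own effect.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + beta[j]
            updated = soft_threshold(rho, lam)
            change = updated - beta[j]
            if change != 0.0:
                for i in range(n):
                    resid[i] -= change * X[i][j]
                beta[j] = updated
    return alpha, beta


# Hypothetical data: y = 3 + 2*x1 + 0.1*x2 with two standardized, orthogonal predictors.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [5.1, 1.1, 4.9, 0.9]
alpha_hat, beta_hat = lasso_coordinate_descent(X, y, lam=0.5)
```

With this penalty the weak second predictor is set exactly to zero, while the coefficient of the first is shrunk from 2 toward zero, which illustrates both the selection and the shrinkage performed by LASSO.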
14 When validation data are randomly selected for cross-validation from the entire time domain, training and validation data
from nearby locations will be dependent. Consequently, if the objective is to project outside the structure of the training data,
error estimates from random cross-validations will be overly optimistic (overfitting). To address this, blocks of contiguous
time can be designed to better ensure independence between cross-validation folds and to achieve more reliable error
estimates and higher forecasting performance (Burman et al. 1994; Racine 2000; Bergmeir and Benitez 2012). We apply
10-fold cross-validation on data that is not partitioned randomly, but sequentially into ten sets. So, the problem of dependent
values is resolved (except for some values at the borders of the blocks).
15 See Section 3.1 for the selection criteria.
16 Hereafter, “surprise” refers to standardized surprise, and “log,” “square root,” and “square” refer to sign-preserving
transformations. In brackets, the percentage of assets for which the measure is selected.
for top appear among the most frequently selected measures, and the coefficient is positive in
all cases. Recall that a higher level of importance corresponds to a tighter filter and that,
as a consequence, news tagged with higher importance also appears among news tagged with lower
importance. The positive sign of the coefficients associated with all levels of importance (among news
stories on the earnings topic) therefore suggests that filtering by importance implies additional increasing
effects on volatility. Hence, classification by importance may correspond with the news stories’
relevance to explaining volatility. With regard to the time horizon, variables based on day-to-day
variations and daily aggregation of news stories dominate. Flags for variations of the quantity of
information, both when the daily aggregation is proxied by the number of news stories and when
the number of words is used, are all associated with a positive coefficient, suggesting that investors
can become accustomed to the rate of information arrival and perceive variations as informative.
Only variations that differ from zero and negative variations were selected, suggesting that, in many
cases, investors wait for the rate of information to decrease to make decisions. Measures based on daily
information levels play a minor role in the explanation of volatility, but they still count. Among them,
the number of news stories released in a day is the most important variable. The flag for the release
of at least two news stories in a day is also an important variable, appearing among the ten most
frequently selected news stories measures.
Among macro-indicators, only NFP (Non-Farm Payrolls) belongs to the thirty most frequently
selected measures. The monthly surprise (22%) and the monthly log surprise (7%) both have
a negative sign, suggesting that lower wages scare investors, who trade more actively as a consequence.
With regard to this indicator, investors look at the information released during the most recent month.
Finally, the weekly log Google Search Index (9%) has a positive sign, suggesting that retail
investors’ attention during the most recent week is positively related to market activity.
Summarizing the results, EPS and news stories are the most important drivers of volatility,
but macro-announcements and Google Trends also play a role. EPS announcements and surprises are
both important, and there is no evident asymmetric effect between positive and negative surprises.
Only EPS information released during the most recent day seems relevant to explaining volatility.
News stories from StreetAccount are slightly more useful than Thomson Reuters’ news stories are
in explaining market reactions, especially in the form of variables based on day-to-day variations
in the rate of information arrival and daily levels of information. In addition, earnings is the most
important news topic for volatility. Macro-announcements in the form of Non-Farm Payrolls
affect market reactions: markets tend to react more strongly to negative surprises relative to expectations,
and they consider the information released during the most recent month. Retail investors’ attention
during the most recent week, as revealed by Google Trends, is positively linked to volatility.
Subsample results are reported in Tables A4–A6 in Appendix C. Additional macro-indicators
(Jobless Claims, Retail Sales and Consumer Price Index) are selected; they are based on the information
released during several time horizons, from overnight to the most recent month. FOMC Rate Decisions,
an indicator with a well-documented market reaction, surprisingly does not appear. It is possible
that part of the effect of FOMC announcements is attributed to leverage (in relation to the negative
returns caused by FOMC news), and it is also possible that the market reaction is limited to the same
day of the FOMC announcement (or even that the announcement is anticipated). As a consequence,
the reaction cannot be identified by our set of news-based indicators that, in order to explain log RV
on day t, are built using the information available before the market opens on day t. In Table A5,
which shows subsample results relative to the Global Financial Crisis, variables belonging to the group
News Burst Index appear, highlighting the importance of this concept—an observation that, as far as
we know, other studies do not highlight. Sentiment-based measures also appear in Table A5 and the
sign of their coefficients supports the literature’s notion that negative news moves markets more than
positive news does. Sentiment-based measures are much less significant determinants of volatility than
quantity-based measures are, suggesting either that sentiment is less important than the amount of
information conveyed by our news-related variables, or that the sentiment detection procedure should
be improved. Finally, we note that the LASSO results are not very robust when LASSO is used as a variable
selection tool: some results change significantly if the reference model is (even slightly)
modified by including or excluding some lags or variables. In particular, when past negative returns are
omitted from the model, macro-announcements (especially FOMC) become a much more important
driver of volatility. We leave these open questions to future work.
The MDH states that the rate of information arrival explains “the GARCH effects in asset returns,”
so it also explains the autoregressive behavior of volatility. In order to test this idea, we perform two
OLS regressions with HAC standard errors—one for the LHAR-CJ model and one for the LHAR-CJN
model—employing as news variables only those that were selected, and comparing the estimated
autoregressive coefficients between the two models. Table 16 presents the estimation results for
the autoregressive coefficients and leverage (cross-sectional average) β 0 , β Cd , β Cw , β Cm , β Jd , β Jw , β Jm ,
β rd , β rw , and β rm for both models and their variation after the inclusion of news17 . Coefficients are
consistent with the literature, except for β 0 and β Jw , and their values do not vary markedly between the
LHAR-CJ and the LHAR-CJN models. We also performed an F-test for the joint significance of the news
variables’ coefficients in the LHAR-CJN model, and the F-test rejects (at the 5% level) for almost all
stocks the null hypothesis that the news regressors have no effect on realized volatility, highlighting
the relevance of news as a driver of additional information. This result is consistent with the MDH.
17 News’ estimated coefficients are not reported here, as the focus is on the variation of the estimated coefficients of volatility
and leverage after the inclusion of news, and on the significance of news coefficients. News coefficients estimated with
LASSO are described in Table 15.
Table 15. Most selected regressors in the LHAR-CJN model. Sample: 2005–2015.
Macro  Firm-Specific  News           Importance  Topic          Time Aggregation   Measure                          % Selected  % Pos  % Neg
-      -              Google Trends  -           -              week               log GSI                          8.99        8.99   0.00
-      X              TR news        high        earnings       flow: day-to-day   flag if ∆ n. news stories ≠ 0    8.99        8.99   0.00
-      X              TR news        medium      earnings       flow: day-to-day   flag if ∆ n. words < 0           8.99        8.99   0.00
-      X              TR news        medium      earnings       flow: day-to-day   flag if ∆ n. news stories < 0    8.99        8.99   0.00
-      X              TR news        low         earnings       flow: day-to-day   flag if ∆ n. words < 0           8.99        8.99   0.00
-      X              SA news        -           all            flow: day-to-day   flag if ∆ n. words < 0           8.99        8.99   0.00
-      X              TR news        high        earnings       flow: day-to-day   flag if ∆ n. news stories < 0    7.87        7.87   0.00
-      X              TR news        low         earnings       day                log n. words                     7.87        7.87   0.00
X      -              NFP            -           -              month              log surp                         6.74        0.00   6.74
-      X              TR news        low         all            flow: day-to-day   flag if ∆ n. words ≠ 0           6.74        6.74   0.00
-      X              SA news        -           up/downgrades  flow: day-to-day   flag if ∆ n. words ≠ 0           6.74        6.74   0.00
Notes: Ranking of regressors (past volatility components plus the thirty most frequently selected news measures) by percentage of stocks for which they are selected by LASSO in the
LHAR-CJN model, percentage of positive and percentage of negative coefficients. Sample: February 2005–February 2015.
$$MAE = \frac{1}{T} \sum_{t=1}^{T} \left| RV_t - \widehat{RV}_t \right| \tag{13}$$
18 In a few cases, the LHAR-CJN model provides extremely low or extremely high forecasts of realized volatility, which are not
reliable. We apply an adjustment procedure, detailed in Appendix D.
$$MSE = \frac{1}{T} \sum_{t=1}^{T} \left( RV_t - \widehat{RV}_t \right)^2 \tag{14}$$
3. The heteroskedasticity-adjusted root mean square error (HRMSE) suggested in Bollerslev and Ghysels
(1996):
$$HRMSE = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( \frac{RV_t}{\widehat{RV}_t} - 1 \right)^2} \tag{15}$$
$$QLIKE = \frac{1}{T} \sum_{t=1}^{T} \left( \log \widehat{RV}_t + \frac{RV_t}{\widehat{RV}_t} \right) \tag{16}$$
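The four loss functions in Equations (13)–(16) can be computed as in the following sketch, where `rv` holds the realized volatilities and `fc` the corresponding forecasts. The function names and the two-observation example are hypothetical. Forecasts must be strictly positive for HRMSE and QLIKE to be defined.

```python
import math


def mae(rv, fc):
    """Equation (13): mean absolute error."""
    return sum(abs(a - f) for a, f in zip(rv, fc)) / len(rv)


def mse(rv, fc):
    """Equation (14): mean squared error."""
    return sum((a - f) ** 2 for a, f in zip(rv, fc)) / len(rv)


def hrmse(rv, fc):
    """Equation (15): root mean squared relative error."""
    return math.sqrt(sum((a / f - 1) ** 2 for a, f in zip(rv, fc)) / len(rv))


def qlike(rv, fc):
    """Equation (16): average of log forecast plus the ratio of realization to forecast."""
    return sum(math.log(f) + a / f for a, f in zip(rv, fc)) / len(rv)


# Hypothetical realized volatilities and forecasts.
rv = [1.0, 4.0]
fc = [2.0, 2.0]
losses = {"MAE": mae(rv, fc), "MSE": mse(rv, fc), "HRMSE": hrmse(rv, fc), "QLIKE": qlike(rv, fc)}
```

MAE and MSE depend on the level of volatility, whereas HRMSE and QLIKE depend only on the ratio of realization to forecast and are therefore less dominated by high-volatility days.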
Results show that the inclusion of news-based measures substantially improves volatility forecasting.
Table 17 reports the cross-sectional mean over all assets of the metrics. It also includes in brackets
for all metrics except the R2 MZ the percentage of assets for which the Diebold and Mariano (1995)
test rejects at a 5% significance level the null hypothesis of equal predictive accuracy in favor of each
model19 , and in brackets for the R2 MZ, the percentage of assets for which the metric is higher (meaning
a superior predictive accuracy) for each model. The LHAR-CJN model yields, on average, lower MAE,
MSE, HRMSE, and QLIKE and a higher R2 MZ. The Diebold-Mariano test reveals that, in terms of HRMSE
and QLIKE, the LHAR-CJN model’s superior forecasting power is statistically significant for 48.31%
and 32.58% of stocks, respectively. In terms of MAE and MSE, instead, the Diebold-Mariano test signals
a superior forecasting performance of the LHAR-CJ model, but for a very limited percentage of stocks.
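A simplified version of the Diebold and Mariano (1995) test for one-step-ahead forecasts can be sketched as follows. The sketch makes assumptions that the text does not confirm: it omits autocovariance terms from the variance of the loss differential, uses the asymptotic standard normal distribution, and reports a two-sided p-value. The function name and the loss series are hypothetical.

```python
import math


def diebold_mariano(loss_a, loss_b):
    """Return the DM statistic and two-sided normal p-value for equal predictive accuracy.

    A positive statistic means that model A has the larger average loss.
    """
    diffs = [a - b for a, b in zip(loss_a, loss_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Assumption: for one-step-ahead forecasts, no autocovariance terms are included.
    variance = sum((d - mean_diff) ** 2 for d in diffs) / n
    if variance == 0:
        raise ValueError("loss differential has zero variance")
    statistic = mean_diff / math.sqrt(variance / n)
    p_value = math.erfc(abs(statistic) / math.sqrt(2))  # equals 2 * (1 - Phi(|statistic|))
    return statistic, p_value


# Hypothetical per-period losses of two models.
loss_a = [1.0, 1.2, 0.9, 1.1]
loss_b = [0.5, 0.6, 0.5, 0.4]
dm_stat, dm_p = diebold_mariano(loss_a, loss_b)
```

When the null hypothesis is rejected, the sign of the statistic indicates which model the rejection favors, which is how the percentages for each model are distinguished.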
          LHAR-CJ           LHAR-CJN
MAE       0.99 (8.99%)      0.97 (1.12%)
MSE       65.73 (3.37%)     37.11 (0.00%)
HRMSE     0.90 (0.00%)      0.83 (48.31%)
QLIKE     1.45 (1.12%)      1.44 (32.58%)
R2 MZ     0.49 (25.84%)     0.51 (74.16%)
Notes: One-step-ahead MAE, MSE, HRMSE, QLIKE, and R2 MZ of the LHAR-CJ and the LHAR-CJN models
(cross-sectional average). In brackets, for each model and for each metric except R2 MZ: percentage of assets
for which the Diebold-Mariano test rejects with a 5% level the null hypothesis of equal predictive accuracy in
favor of that model (p-values are corrected with the Holm–Bonferroni method to control the familywise error
rate in the multiple testing framework); for R2 MZ: percentage of assets for which the metric is higher for
that model.
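The Holm–Bonferroni correction mentioned in the notes can be expressed through adjusted p-values, as in the sketch below. Comparing each adjusted p-value with the significance level is equivalent to the sequential procedure, which stops at the first non-significant test. The function name and the p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm-adjusted p-values, returned in the original order of the tests."""
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)  # indices from smallest p-value
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Taking the running maximum enforces the stop at the first non-significant test.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for the four tests on a single asset.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = holm_adjust(p_values)
rejected = [p <= 0.05 for p in adjusted]
```

In this example only the two smallest p-values remain significant at the 5% level after the correction, although all four are below 0.05 before it.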
19 We have four null hypotheses to test and we want to control the familywise error rate, that is, the probability of identifying
at least one significant result when in fact all of the null hypotheses are true. We apply the sequential Bonferroni correction
proposed by Holm (1979) to each asset. In Holm’s sequential procedure, the tests are first performed in order to obtain their
p-values. The tests are then ordered from the one with the smallest p-value to the one with the largest p-value. The test with the
smallest p-value is tested first with a Bonferroni correction involving all tests (which consists in multiplying the p-value by the
total number of tests performed, in our case four). The second test is tested with a Bonferroni correction involving one less test
and so on for the remaining tests. The procedure stops when the first non-significant test is obtained or when all the tests have
been performed.
Figure 1 illustrates a dynamic analysis of the metrics. After obtaining the one-step-ahead
forecasts of the two models (using a rolling window of 1000 observations on the original sample
of 2531 observations, we are left with a series of 1531 forecasts), we apply a rolling window of 250
days to the one-step-ahead forecast series. The graphs report the percentage of assets for which
the Diebold-Mariano test rejects (at a 5% significance level) the null hypothesis of equal predictive
accuracy of the two models,20 distinguishing when the best model is the LHAR-CJN and when it is the
LHAR-CJ. The dynamic analysis shows that the gain in predictive accuracy obtained by including
the news-related variables is uncertain in the first half of the series and clear in the second half. In the
first half, the LHAR-CJ model is superior to the LHAR-CJN in terms of MAE, but in terms of HRMSE
the opposite is true; in terms of MSE, QLIKE, and R2 MZ, neither model is superior.
In the second half, except for MAE, which yields unclear results, the LHAR-CJ model is never
superior to the LHAR-CJN model in terms of MSE, HRMSE, QLIKE, and R2 MZ, while the
LHAR-CJN model obtains a volatility forecast that is statistically superior for a sizeable percentage
of assets.
Figure 1. Rolling analysis of the one-step-ahead MAE, MSE, HRMSE, QLIKE, and R2 MZ of the
LHAR-CJ and the LHAR-CJN models. Applying a window size of 250 observations on the series
of one-step-ahead forecasts, the graphs report for each metric the percentage of assets for which the
Diebold-Mariano test rejects at the 5% level the null hypothesis of equal predictive accuracy of the two
models (p-values are corrected with the Holm–Bonferroni method to control the familywise error rate
in the multiple testing framework), distinguishing when the best model is the LHAR-CJN (bars above
the horizontal axis), and when the best model is the LHAR-CJ (bars below the horizontal axis). For the
R2 MZ, the graphs report the percentage of assets for which the metric is higher for the LHAR-CJN
(bars above the horizontal axis) and for the LHAR-CJ (bars below the horizontal axis).
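The rolling comparison behind Figure 1 can be sketched in Python for a single asset and a single loss function. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the losses are squared errors simulated from random numbers, and the variance of the loss differential is estimated from its lag-0 sample variance only, which is a common simplification for one-step-ahead forecasts.

```python
import math
import numpy as np


def dm_test(loss_a, loss_b):
    """Diebold-Mariano statistic and two-sided normal p-value (horizon 1)."""
    d = np.asarray(loss_a, dtype=float) - np.asarray(loss_b, dtype=float)
    n = d.size
    # Assumption of this sketch: no autocovariance terms beyond lag 0.
    stat = d.mean() / math.sqrt(d.var() / n)
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value


def rolling_dm(loss_a, loss_b, window=250):
    """Apply the test to every window of `window` consecutive forecasts."""
    loss_a = np.asarray(loss_a, dtype=float)
    loss_b = np.asarray(loss_b, dtype=float)
    return [dm_test(loss_a[s:s + window], loss_b[s:s + window])
            for s in range(len(loss_a) - window + 1)]


# Simulated forecast errors (hypothetical, not the paper's data).
rng = np.random.default_rng(0)
err_a = 1.5 * rng.standard_normal(1531)
err_b = rng.standard_normal(1531)
results = rolling_dm(err_a ** 2, err_b ** 2)
print(len(results))  # 1531 - 250 + 1 = 1282 windows
```

A positive statistic indicates that the second model has the lower average loss in that window. Repeating the test across assets and metrics, and correcting the p-values for multiple testing, gives the percentages plotted in each panel.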
7. Concluding Remarks
We created an extensive and innovative database that contains macroeconomic announcements,
earnings announcements, firm-specific news stories from two professional news providers, and Google
Trends, all of which are useful in analyzing the asset price dynamics of the S&P 100 companies.
We applied a bag-of-words approach to detect the sentiment of news stories and introduced a set of
negations with the aim of generalizing the method in order to extract the sentiment of any type of
financial text. Then we built a set of news measures that provide natural proxies for the information
used by heterogeneous market players and for retail investors’ attention.
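As a rough illustration of a bag-of-words sentiment score with negations, consider the following Python sketch. The word lists, the three-token negation window, and the function name `sentiment_score` are hypothetical choices made for this example; they are not the dictionary or the negation rule used in the paper.

```python
import re

# Tiny hypothetical dictionaries, for illustration only.
POSITIVE = {"gain", "beat", "strong", "upgrade"}
NEGATIVE = {"loss", "miss", "weak", "downgrade"}
NEGATIONS = {"not", "no", "never", "without"}


def sentiment_score(text, window=3):
    """Sum of word polarities; a nearby preceding negation flips the sign."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            polarity = 1
        elif token in NEGATIVE:
            polarity = -1
        else:
            continue
        # Flip the polarity if a negation occurs in the preceding `window` tokens.
        if any(t in NEGATIONS for t in tokens[max(0, i - window):i]):
            polarity = -polarity
        score += polarity
    return score


print(sentiment_score("Earnings beat estimates"))   # 1
print(sentiment_score("Results were not strong"))   # -1
```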
Our empirical results validate the MDH, showing the relevance of news in explaining volatility.
EPS and news stories are the most important drivers of volatility, followed by macro-news and Google
Trends. The topics of news stories that are most relevant in affecting volatility are earnings and
upgrades/downgrades, but the rest of the news is also influential. Aggregating information over
various time horizons and looking at variations of the volume of information across time helps to
explain volatility. By including news-based information, we significantly improve volatility forecasting.
Future research should develop a more refined sentiment-detection technique and study the
relationship between news and intraday asset price dynamics.
Acknowledgments: The authors thank the Editor, three anonymous referees,
Paolo Giudici, Cecilia Mancini, Andrea Menini, Giancarlo Nicola, and Amedeo Pugliese for their comments,
as well as the participants in the Summer School in Economics and Finance-Canazei 2016: Networks and
Big Data Analysis in Economics, Finance, and Social Systems, Alba di Canazei, Trento; the 1st DEM Workshop
in Financial Econometrics, University of Verona-Department of Economics; the 10th International Conference
on Computational and Financial Econometrics (CFE 2016), Higher Technical School of Engineering, University
of Seville; the Seventh Italian Congress of Econometrics and Empirical Economics (ICEEE 2017), Messina;
and the Italian Statistical Society Conference SIS 2017 Statistics and Data Science: New Challenges,
New Generations, University of Florence.
The authors gratefully acknowledge financial support from Progetto di Ricerca di Ateneo (PRAT) Multi-jumps
in financial asset prices: detection of systemic events, relation with news, and implications for pricing CPDA143827. F.P.
worked on the empirical part of the paper while he was visiting the Goethe University of Frankfurt-SAFE center
whose hospitality is gratefully acknowledged.
Author Contributions: Both authors contributed equally to the paper.
Conflicts of Interest: The authors declare no conflict of interest.
where $\mu_t$ is predictable, $\sigma_t$ is càdlàg, $dJ_t = c_t\,dN_t$ where $N_t$ is a non-explosive Poisson process whose
intensity is an adapted stochastic process $\lambda_t$, the times of the jumps are $(\tau_j)_{j=1,\dots,N_t}$ and $c_j$ are i.i.d.
adapted random variables measuring the size, which is always positive, of the jump at time $\tau_j$.
Quadratic variation of the process over a time window T, e.g. one day, is defined as:
$$[X]_t^{t+T} = X_{t+T}^2 - X_t^2 - 2\int_t^{t+T} X_{s-}\,dX_s \tag{A2}$$
where t indexes the day. It can be decomposed into its continuous and discontinuous component:
The quadratic variation process and its separate components are, of course, not directly observable.
Instead, we resort to popular model-free non-parametric consistent measures, including the familiar
realized variance:
$$RV_\delta(X)_t = \sum_{j=1}^{n} (\Delta_j X)^2 \tag{A5}$$
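The realized variance in (A5) is simply the sum of squared intraday increments. A minimal Python sketch follows, assuming that $\Delta_j X$ denotes the $j$-th intraday increment of the logarithmic price; the function name and the toy price path are hypothetical.

```python
import numpy as np


def realized_variance(log_prices):
    """Sum of squared increments of an intraday log-price path."""
    increments = np.diff(np.asarray(log_prices, dtype=float))
    return float(np.sum(increments ** 2))


# Toy intraday log-price path with three increments: 0.01, -0.02, 0.03.
path = [0.00, 0.01, -0.01, 0.02]
rv = realized_variance(path)
print(rv)
```

For this path the result is 0.0001 + 0.0004 + 0.0009 = 0.0014, up to floating-point rounding.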
$$\text{C-TMPV}_\delta(X)_t^{[\gamma_1,\dots,\gamma_M]} = \delta^{1-\frac{1}{2}(\gamma_1+\dots+\gamma_M)} \sum_{j=M}^{[T/\delta]} \prod_{k=1}^{M} Z_{\gamma_k}(\Delta_{j-k+1}X, \vartheta_{j-k+1}) \tag{A7}$$
where $N(x)$ is the standard normal cumulative function, $\Gamma(\alpha, x)$ is the upper incomplete gamma
function, $\vartheta = c_\vartheta^2 \sigma^2$ and $\sigma^2$ is the variance of $\Delta_j X$ under the assumption that $\Delta_j X \sim N(0, \sigma^2)$.
Following Corsi et al. (2010), we set $c_\vartheta = 3$.
As $\delta \to 0$, C-TBPV converges to $\int_t^{t+T} \sigma^2(s)\,ds$.
The difference between the realized variance and the corrected threshold bipower variation
consistently estimates the part of the quadratic variation due to jumps:
$$RV_\delta(X)_T - \text{C-TBPV}_\delta(X)_T \xrightarrow[\delta \to 0]{P} [X^d]_t^{t+T} \tag{A9}$$
where $\text{C-TTriPV}_\delta(X)_T$ is a quarticity estimator (see Corsi et al. 2010), is asymptotically standard
normally distributed under the null hypothesis of no jumps.
Based on the above jump detection test statistic, the realized measure of the jump contribution to
the quadratic variation of the logarithmic price process is then measured by:
where $I(\cdot)$ denotes the indicator function and $\Phi_\alpha$ refers to the appropriate critical value from the
standard normal distribution.
Consequently, the realized measure for the integrated variance is:
$$\hat{C}_t = RV_t - \hat{J}_t \tag{A12}$$
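The split of realized variance into a jump and a continuous part can be sketched in Python. The sketch assumes the usual form of the jump measure, namely the positive part of RV minus C-TBPV on days where the test statistic exceeds the critical value, and zero otherwise. That form, the confidence level `alpha = 0.999`, and the function name are assumptions of this example, not values confirmed by the text above.

```python
from statistics import NormalDist


def split_jump_continuous(rv, ctbpv, z_stat, alpha=0.999):
    """Return (jump, continuous) components of realized variance for one day."""
    # Critical value of the standard normal distribution at level alpha.
    critical = NormalDist().inv_cdf(alpha)
    # Assumed jump measure: kept only when the test detects a jump,
    # and floored at zero so that it cannot be negative.
    jump = max(rv - ctbpv, 0.0) if z_stat > critical else 0.0
    # Continuous component as in (A12): realized variance minus the jump part.
    continuous = rv - jump
    return jump, continuous


print(split_jump_continuous(2.0, 1.5, 5.0))  # jump detected
print(split_jump_continuous(2.0, 1.5, 1.0))  # no jump detected
```

By construction the two components always add up to the realized variance of the day.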
Table A3. Basic summary statistics of assets’ percentage of days with at least one jump.
Figure A1. Distribution of assets’ percentage of days with at least one jump.
Table A4. Most selected regressors in the LHAR-CJN model. Sample: 2005–2007.
| Macro | Firm-Specific | News | Importance | Topic | Time Aggregation | Measure | % Selected | % Pos | % Neg |
|---|---|---|---|---|---|---|---|---|---|
| X | | JOBLESS | | | overnight | flag if surp ≠ 0 | 7.87 | 7.87 | 0.00 |
| | X | SA news | | earnings | flow: day-to-day | flag if ∆ n. words ≠ 0 | 7.87 | 7.87 | 0.00 |
| | X | SA news | | earnings | flow: day-to-day | flag if ∆ n. news stories < 0 | 7.87 | 7.87 | 0.00 |
| | X | SA news | | earnings | day | log n. words | 6.74 | 6.74 | 0.00 |
| | X | TR news | medium | earnings | flow: day-to-day | flag if ∆ n. news stories < 0 | 5.62 | 5.62 | 0.00 |
| | X | TR news | low | earnings | flow: day-to-day | flag if ∆ n. words < 0 | 5.62 | 5.62 | 0.00 |
| | X | TR news | low | earnings | flow: day-to-day | flag if ∆ n. news stories ≠ 0 | 5.62 | 5.62 | 0.00 |
| | X | SA news | | earnings | day | flag if n. news stories ≠ 0 | 5.62 | 5.62 | 0.00 |
| | X | TR news | low | earnings | day | sqrt n. words | 4.49 | 4.49 | 0.00 |
| | X | TR news | low | earnings | day | sqrt n. news stories | 4.49 | 4.49 | 0.00 |
| | X | TR news | low | earnings | day | flag if n. news stories ≠ 0 | 4.49 | 4.49 | 0.00 |
| | X | TR news | low | earnings | day | n. news stories | 4.49 | 4.49 | 0.00 |
| | X | SA news | | earnings | day | sqrt n. words | 4.49 | 4.49 | 0.00 |
| | X | TR news | medium | earnings | flow: day-to-day | flag if ∆ n. words ≠ 0 | 3.37 | 3.37 | 0.00 |
| | X | TR news | medium | earnings | flow: day-to-day | flag if ∆ n. news stories ≠ 0 | 3.37 | 3.37 | 0.00 |
| | X | TR news | medium | earnings | day | sqrt n. words | 3.37 | 3.37 | 0.00 |
| | X | TR news | medium | all | day | sqrt n. words | 3.37 | 3.37 | 0.00 |
| | X | TR news | low | earnings | flow: day-to-day | flag if ∆ n. words ≠ 0 | 3.37 | 3.37 | 0.00 |
| | X | TR news | low | earnings | day | log n. words | 3.37 | 3.37 | 0.00 |
| | X | TR news | low | earnings | day | log n. news stories | 3.37 | 3.37 | 0.00 |
| | X | SA news | | earnings | flow: day-to-day | flag if ∆ n. words < 0 | 3.37 | 3.37 | 0.00 |
| | X | SA news | | earnings | day | flag if n. news stories ≥ 2 | 3.37 | 3.37 | 0.00 |
| | X | SA news | | earnings | day | n. news stories | 3.37 | 3.37 | 0.00 |
| | X | SA news | | all | flow: day-to-day | ∆ n. news stories | 3.37 | 0.00 | 3.37 |
| | X | SA news | | all | day | flag if n. news stories ≥ 2 | 3.37 | 3.37 | 0.00 |
| | X | SA news | | all | day | square n. news stories | 3.37 | 3.37 | 0.00 |
Notes: Ranking of regressors (past volatility components plus the thirty most frequently selected news measures) by percentage of stocks for which they are selected by LASSO in the
LHAR-CJN model, percentage of positive and percentage of negative coefficients. Sample: Feb 2005 – Dec 2007 (expansion).
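The three percentage columns of such a table can be computed from a matrix of estimated LASSO coefficients with one row per stock and one column per regressor. The Python sketch below is only an illustration: the function name, the regressor labels, and the coefficient values are hypothetical. The choice of 89 stocks is an inference from the reported percentages (for example, 7/89 ≈ 7.87%), not a figure stated in the notes.

```python
import numpy as np


def selection_summary(coefs, names):
    """Rank regressors by the share of stocks with a non-zero coefficient."""
    coefs = np.asarray(coefs, dtype=float)
    n_stocks = coefs.shape[0]
    rows = []
    for j, name in enumerate(names):
        # Shares of stocks with a strictly positive / negative coefficient.
        pos = 100.0 * float(np.sum(coefs[:, j] > 0)) / n_stocks
        neg = 100.0 * float(np.sum(coefs[:, j] < 0)) / n_stocks
        rows.append((name, round(pos + neg, 2), round(pos, 2), round(neg, 2)))
    # Most frequently selected regressors first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hypothetical coefficients: 89 stocks, two regressors.
coefs = np.zeros((89, 2))
coefs[:3, 0] = -0.2   # regressor "b" selected with negative sign for 3 stocks
coefs[:7, 1] = 0.1    # regressor "a" selected with positive sign for 7 stocks
summary = selection_summary(coefs, ["b", "a"])
print(summary)
```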
Table A5. Most selected regressors in the LHAR-CJN model. Sample: 2007–2009.
| Macro | Firm-Specific | News | Importance | Topic | Time Aggregation | Measure | % Selected | % Pos | % Neg |
|---|---|---|---|---|---|---|---|---|---|
| | X | TR news | high | M&A | flow: day-to-day | flag if ∆ n. words ≠ 0 | 1.12 | 1.12 | 0.00 |
| | X | TR news | high | M&A | month | flag if sentiment < 0 | 1.12 | 1.12 | 0.00 |
| | X | TR news | high | earnings | flow: day-to-day | flag if ∆ n. news stories < 0 | 1.12 | 1.12 | 0.00 |
| | X | TR news | high | all | flow: day-to-day | flag if ∆ n. news stories < 0 | 1.12 | 1.12 | 0.00 |
| | X | TR news | high | all | month | news burst index (M = 78, k = 2) | 1.12 | 1.12 | 0.00 |
| | X | TR news | high | all | month | sentiment | 1.12 | 0.00 | 1.12 |
| | X | TR news | high | all | month | square n. words | 1.12 | 1.12 | 0.00 |
| | X | TR news | high | all | month | n. words | 1.12 | 1.12 | 0.00 |
| | X | TR news | high | all | week | n. news | 1.12 | 1.12 | 0.00 |
| | X | TR news | medium | earnings | day | square n. words | 1.12 | 1.12 | 0.00 |
| | X | TR news | medium | all | month | sentiment | 1.12 | 0.00 | 1.12 |
| | X | TR news | low | regulatory | flow: month-to-day | flag if ∆ n. words < 0 | 1.12 | 1.12 | 0.00 |
| | X | TR news | low | regulatory | flow: month-to-day | flag if ∆ n. news stories < 0 | 1.12 | 1.12 | 0.00 |
| | X | TR news | low | M&A | month | sentiment | 1.12 | 0.00 | 1.12 |
| | X | TR news | low | litigation | month | words burst index (M = 78, k = 4) | 1.12 | 1.12 | 0.00 |
| | X | TR news | low | financial | month | news burst index (M = 78, k = 4) | 1.12 | 1.12 | 0.00 |
| | X | TR news | low | earnings | flow: day-to-day | flag if ∆ n. words < 0 | 1.12 | 1.12 | 0.00 |
Notes: Ranking of regressors (past volatility components plus the thirty most frequently selected news measures) by percentage of stocks for which they are selected by LASSO in the
LHAR-CJN model, percentage of positive and percentage of negative coefficients. Sample: Dec 2007 – Jun 2009 (contraction).
Table A6. Most selected regressors in the LHAR-CJN model. Sample: 2009–2015.
| Macro | Firm-Specific | News | Importance | Topic | Time Aggregation | Measure | % Selected | % Pos | % Neg |
|---|---|---|---|---|---|---|---|---|---|
| | X | EPS | | | day | flag for announcement | 49.44 | 49.44 | 0.00 |
| | X | SA news | | all | day | n. news stories | 32.58 | 32.58 | 0.00 |
| | X | SA news | | earnings | flow: day-to-day | flag if ∆ n. news stories < 0 | 20.22 | 20.22 | 0.00 |
| | X | EPS | | | day | flag if surp ≠ 0 | 17.98 | 17.98 | 0.00 |
| | X | SA news | | earnings | flow: day-to-day | flag if ∆ n. words < 0 | 16.85 | 16.85 | 0.00 |
| | X | SA news | | earnings | day | flag if n. news stories ≥ 2 | 16.85 | 16.85 | 0.00 |
| | X | SA news | | earnings | day | log n. news stories | 15.73 | 15.73 | 0.00 |
| | X | TR news | low | earnings | flow: day-to-day | flag if ∆ n. news stories ≠ 0 | 14.61 | 14.61 | 0.00 |
| | X | TR news | low | earnings | flow: day-to-day | flag if ∆ n. news stories < 0 | 13.48 | 13.48 | 0.00 |
| | X | SA news | | earnings | flow: day-to-day | flag if ∆ n. words ≠ 0 | 12.36 | 12.36 | 0.00 |
| | X | SA news | | earnings | day | n. news stories | 12.36 | 12.36 | 0.00 |
| | X | TR news | low | earnings | flow: day-to-day | flag if ∆ n. words ≠ 0 | 11.24 | 11.24 | 0.00 |
| | X | TR news | low | earnings | day | n. news stories | 11.24 | 11.24 | 0.00 |
| | X | SA news | | earnings | day | sqrt n. words | 11.24 | 11.24 | 0.00 |
| | X | SA news | | earnings | day | flag if n. news stories ≠ 0 | 11.24 | 11.24 | 0.00 |
| | X | SA news | | all | flow: day-to-day | flag if ∆ n. words ≠ 0 | 11.24 | 11.24 | 0.00 |
| | X | SA news | | all | day | n. words | 11.24 | 11.24 | 0.00 |
| | X | SA news | | all | day | square n. news stories | 11.24 | 11.24 | 0.00 |
| | X | TR news | low | earnings | flow: day-to-day | flag if ∆ n. words < 0 | 10.11 | 10.11 | 0.00 |
| | X | SA news | | earnings | flow: day-to-day | flag if ∆ n. news stories ≠ 0 | 10.11 | 10.11 | 0.00 |
| | X | SA news | | earnings | day | log n. words | 10.11 | 10.11 | 0.00 |
| | | Google Trends | | | month | GSI | 8.99 | 5.62 | 3.37 |
| | X | TR news | low | earnings | day | flag if n. news stories ≠ 0 | 8.99 | 8.99 | 0.00 |
| | X | SA news | | earnings | day | sqrt n. news stories | 8.99 | 8.99 | 0.00 |
| | X | SA news | | all | day | flag if n. news stories ≥ 2 | 8.99 | 8.99 | 0.00 |
| | X | SA news | | all | day | sqrt n. words | 8.99 | 8.99 | 0.00 |
| | | Google Trends | | | week | log GSI | 7.87 | 7.87 | 0.00 |
| | | Google Trends | | | week | sqrt GSI | 7.87 | 7.87 | 0.00 |
| | | Google Trends | | | week | GSI | 7.87 | 6.74 | 1.12 |
| | X | TR news | medium | earnings | flow: day-to-day | flag if ∆ n. news stories < 0 | 7.87 | 7.87 | 0.00 |
Notes: Ranking of regressors (past volatility components plus the thirty most frequently selected news measures) by percentage of stocks for which they are selected by LASSO in the
LHAR-CJN model, percentage of positive and percentage of negative coefficients. Sample: Jun 2009 – Feb 2015 (expansion).
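| $Z_{LL,t} < \widehat{RV}_{\text{LHAR-CJN},t} < Z_{L,t}$ | $Z_{L,t} + (\widehat{RV}_{\text{LHAR-CJN},t} - Z_{L,t})/2$ |
| $Z_{L,t} \le \widehat{RV}_{\text{LHAR-CJN},t} \le Z_{H,t}$ | $\widehat{RV}_{\text{LHAR-CJN},t}$ |
| $Z_{H,t} < \widehat{RV}_{\text{LHAR-CJN},t} < Z_{HH,t}$ | $Z_{H,t} + (\widehat{RV}_{\text{LHAR-CJN},t} - Z_{H,t})/2$ |
| $Z_{HH,t} \le \widehat{RV}_{\text{LHAR-CJN},t}$ | $(Z_{H,t} + Z_{HH,t})/2$ |
In the Table, the following equalities hold: $Z_{LL,t} = \widehat{RV}_{\text{LHAR-CJ},t}/4$; $Z_{L,t} = \widehat{RV}_{\text{LHAR-CJ},t}/2$;
$Z_{H,t} = \widehat{RV}_{\text{LHAR-CJ},t} \cdot 2$; $Z_{HH,t} = \widehat{RV}_{\text{LHAR-CJ},t} \cdot 4$.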
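The piecewise rule in the table can be written as a short Python function. This is a sketch of the four rows shown above only: the function and argument names are hypothetical, the LHAR-CJ forecast is assumed to be strictly positive, and the case where the LHAR-CJN forecast is at or below $Z_{LL,t}$ is not covered by those rows, so the sketch raises an error instead of guessing a value.

```python
def bounded_forecast(rv_cjn, rv_cj):
    """Adjust the LHAR-CJN forecast using bounds built from the LHAR-CJ forecast."""
    # Bounds from the stated equalities: a quarter, half, double, and four
    # times the LHAR-CJ forecast.
    z_ll, z_l, z_h, z_hh = rv_cj / 4, rv_cj / 2, rv_cj * 2, rv_cj * 4
    if rv_cjn <= z_ll:
        # Not covered by the rows reproduced in the table.
        raise ValueError("forecast at or below Z_LL is not covered by this sketch")
    if rv_cjn < z_l:
        # Moderately low forecast: halve its distance from the lower bound Z_L.
        return z_l + (rv_cjn - z_l) / 2
    if rv_cjn <= z_h:
        # Forecast inside [Z_L, Z_H]: left unchanged.
        return rv_cjn
    if rv_cjn < z_hh:
        # Moderately high forecast: halve its distance from the upper bound Z_H.
        return z_h + (rv_cjn - z_h) / 2
    # Very high forecast: replaced by the midpoint of Z_H and Z_HH.
    return (z_h + z_hh) / 2


# With an LHAR-CJ forecast of 4, the bounds are 1, 2, 8, and 16.
print(bounded_forecast(1.5, 4.0))    # 1.75
print(bounded_forecast(5.0, 4.0))    # 5.0
print(bounded_forecast(12.0, 4.0))   # 10.0
print(bounded_forecast(100.0, 4.0))  # 12.0
```

The rule is continuous at $Z_{L,t}$, $Z_{H,t}$, and $Z_{HH,t}$, so small changes in the LHAR-CJN forecast never produce a discrete jump in the adjusted value.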
References
Allen, David E., Michael J. McAleer, and Abhay K. Singh. 2015a. Daily Market News Sentiment and Stock Prices.
Tinbergen Institute Discussion Papers 15-090/III, Tinbergen Institute, Amsterdam, The Netherlands.
Allen, David E., Michael J. McAleer, and Abhay K. Singh. 2015b. Machine News and Volatility: The Dow Jones
Industrial Average and the TRNA Real-Time Sentiment Series. In The Handbook of High Frequency Trading.
Edited by Greg N. Gregoriou. Amsterdam, The Netherlands: Elsevier, pp. 327–44.
Allen, David E., Michael J. McAleer, and Abhay K. Singh. 2017. An Entropy-Based Analysis of the Relationship
between the DOW JONES Index and the TRNA Sentiment Series. Applied Economics 49: 677–92.
Andersen, Torben G., Tim Bollerslev, Francis X. Diebold, and Paul Labys. 2003. Modeling and Forecasting Realized
Volatility. Econometrica 71: 579–625.
Andersen, Torben G., Tim Bollerslev, and Francis X. Diebold. 2007. Roughing It Up: Including Jump Components
in the Measurement, Modeling, and Forecasting of Return Volatility. The Review of Economics and Statistics
89: 701–20.
Andrei, Daniel, and Michael Hasler. 2015. Investor Attention and Stock Market Volatility. The Review of Financial
Studies 28: 33–72.
Antweiler, Werner, and Murray Z. Frank. 2004. Is All That Talk Just Noise? The Information Content of Internet
Stock Message Boards. The Journal of Finance 59: 1259–94.
Audrino, Francesco, and Simon D. Knaus. 2016. Lassoing the HAR Model: A Model Selection Perspective on
Realized Volatility Dynamics. Econometric Reviews 35: 1485–521.
Bajgrowicz, Pierre, Olivier Scaillet, and Adrien Treccani. 2016. Jumps in High-Frequency Data: Spurious
Detections, Dynamics, and News. Management Science 62: 2198–217.
Baker, Scott, and Andrey Fradkin. 2011. What Drives Job Search? Evidence from Google Search Data. Discussion
Papers, Stanford Institute for Economic Policy Research, Stanford, CA, USA.
Baklaci, Hasan F., Gokce Tunc, Berna Aydogan, and Gulin Vardar. 2011. The Impact of Firm-Specific Public News
on Intraday Market Dynamics: Evidence from the Turkish Stock Market. Emerging Markets Finance and Trade
47: 99–119.
Barndorff-Nielsen, Ole E., and Neil Shephard. 2004. Power and Bipower Variation with Stochastic Volatility and
Jumps. Journal of Financial Econometrics 2: 1–37.
Barndorff-Nielsen, Ole E., and Neil Shephard. 2006. Econometrics of Testing for Jumps in Financial Economics
Using Bipower Variation. Journal of Financial Econometrics 4: 1–30.
Berry, Thomas D., and Keith M. Howe. 1994. Public Information Arrival. The Journal of Finance 49: 1331–46.
Bergmeir, Christoph, and José M. Benitez. 2012. On the Use of Cross-validation for Time Series Predictor
Evaluation. Information Sciences 191: 192–213.
Berk, Kenneth N. 1978. Comparing Subset Regression Procedures. Technometrics 20: 1–6.
Birz, Gene, and John R. Lott. 2011. The Effect of Macroeconomic News on Stock Returns: New Evidence from
Newspaper Coverage. Journal of Banking & Finance 35: 2791–800.
Bollerslev, Tim. 1986. Generalized Autoregressive Conditional Heteroskedasticity. Journal of Econometrics
31: 307–27.
Bollerslev, Tim, and Eric Ghysels. 1996. Periodic Autoregressive Conditional Heteroscedasticity. Journal of Business
& Economic Statistics 14: 139–51.
Bomfim, Antulio N. 2003. Pre-announcement Effects, News Effects, and Volatility: Monetary Policy and the Stock
Market. Journal of Banking & Finance 27: 133–51.
Borovkova, Svetlana, and Diego Mahakena. 2015. News, Volatility and Jumps: the Case of Natural Gas Futures.
Quantitative Finance 15: 1217–42.
Brailsford, Timothy J. 1996. The Empirical Relationship between Trading Volume, Returns and Volatility.
Accounting & Finance 36: 89–111.
Brenner, Menachem, Paolo Pasquariello, and Marti Subrahmanyam. 2009. On the Volatility and Comovement of
U.S. Financial Markets around Macroeconomic News Announcements. Journal of Financial and Quantitative
Analysis 44: 1265–89.
Burman, Prabir, Edmond Chow, and Deborah Nolan. 1994. A Cross-validatory Method for Dependent Data.
Biometrika 81: 351–58.
Busse, Jeffrey A., and T. Clifton Green. 2002. Market Efficiency in Real Time. Journal of Financial Economics
65: 415–37.
Castle, Jennifer L., Jurgen A. Doornik, and David F. Hendry. 2011. Evaluating Automatic Model Selection. Journal
of Time Series Econometrics 3: 1–33.
Clark, Peter K. 1973. A Subordinated Stochastic Process Model with Finite Variance for Speculative Prices.
Econometrica 41: 135–55.
Corsi, Fulvio. 2009. A Simple Approximate Long-Memory Model of Realized Volatility. Journal of Financial
Econometrics 7: 174–96.
Corsi, Fulvio, Davide Pirino, and Roberto Renò. 2010. Threshold Bipower Variation and the Impact of Jumps on
Volatility Forecasting. Journal of Econometrics 159: 276–88.
Corsi, Fulvio, and Roberto Renò. 2012. Discrete-Time Volatility Forecasting With Persistent Leverage Effect and
the Link With Continuous-Time Volatility Modeling. Journal of Business & Economic Statistics 30: 368–80.
Cutler, David M., James M. Poterba, and Lawrence H. Summers. 1989. What Moves Stock Prices? The Journal of
Portfolio Management 15: 4–12.
Da, Zhi, Joseph Engelberg, and Pengjie Gao. 2011. In Search of Attention. The Journal of Finance 66: 1461–99.
Da, Zhi, Joseph Engelberg, and Pengjie Gao. 2015. The Sum of All FEARS: Investor Sentiment and Asset Prices.
The Review of Financial Studies 28: 1–32.
D’Amuri, Francesco, and Juri Marcucci. 2012. The Predictive Power of Google Searches in Forecasting
Unemployment. Banca D’Italia Working Papers n. 891, Bank of Italy, Roma, Italy.
Diebold, Francis X., and Robert S. Mariano. 1995. Comparing Predictive Accuracy. Journal of Business &
Economic Statistics 13: 253–63.
Dimpfl, Thomas, and Stephan Jank. 2016. Can Internet Search Queries Help to Predict Stock Market Volatility?
European Financial Management 22: 171–92.
Dougal, Casey, Joseph Engelberg, Diego García, and Christopher A. Parsons. 2012. Journalists and the Stock
Market. The Review of Financial Studies 25: 639–79.
Efron, Bradley, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. 2004. Least Angle Regression. The Annals of
Statistics 32: 407–51.
Engle, Robert F. 1982. Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United
Kingdom Inflation. Econometrica 50: 987–1007.
Engle, Robert F., and Jose Gonzalo Rangel. 2008. The Spline-GARCH Model for Low-Frequency Volatility and Its
Global Macroeconomic Causes. The Review of Financial Studies 21: 1187–222.
Epps, Thomas W., and Mary Lee Epps. 1976. The Stochastic Dependence of Security Price Changes and Transaction
Volumes: Implications for the Mixture-of-Distributions Hypothesis. Econometrica 44: 305–21.
Fan, Jianqing, and Runze Li. 2001. Variable Selection via Nonconcave Penalized Likelihood and its Oracle
Properties. Journal of the American Statistical Association 96: 1348–60.
Fang, Lily, and Joel Peress. 2009. Media Coverage and the Cross-section of Stock Returns. The Journal of Finance
64: 2023–52.
Feig, Douglas G. 1978. Ridge Regression: When Biased Estimation is Better. Social Science Quarterly 58: 708–16.
Flannery, Mark J., and Aris A. Protopapadakis. 2002. Macroeconomic Factors do Influence Aggregate Stock
Returns. The Review of Financial Studies 15: 751–82.
Gallo, Giampiero M., and Barbara Pacini. 2000. The Effects of Trading Activity on Market Volatility. The European
Journal of Finance 6: 163–75.
García, Diego. 2013. Sentiment during Recessions. The Journal of Finance 68: 1267–300.
Ginsberg, Jeremy, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski, and Larry Brilliant.
2009. Detecting Influenza Epidemics Using Search Engine Query Data. Nature 457: 1012–14.
Gloß-Klußmann, Axel, and Nikolaus Hautsch. 2011. When Machines Read the News: Using Automated
Text Analytics to Quantify High Frequency News-Implied Market Reactions. Journal of Empirical Finance
18: 321–40.
Goddard, Arben, and Qingwei Wang. 2015. Investor Attention and FX Market Volatility. Journal of International
Financial Markets, Institutions & Money 38: 79–96.
Guo, Jian-Feng, and Qiang Ji. 2013. How does Market Concern Derived from the Internet Affect Oil Prices?
Applied Energy 112: 1536–43.
Hamid, Alain, and Moritz Heiden. 2015. Forecasting Volatility with Empirical Similarity and Google Trends.
Journal of Economic Behavior & Organization 117: 62–81.
Hautsch, Nikolaus, Dieter Hess, and David Veredas. 2011. The Impact of Macroeconomic News on Quote
Adjustments, Noise, and Informational Volatility. Journal of Banking & Finance 35: 2733–46.
Ho, Kin-Yip, Yanlin Shi, and Zhaoyong Zhang. 2013. How Does News Sentiment Impact Asset Volatility?
Evidence from Long Memory and Regime-Switching Approaches. The North American Journal of Economics
and Finance 26: 436–56.
Holm, Sture. 1979. A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics 6:
65–70.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning.
New York: Springer-Verlag.
Janssen, Gust. 2004. Public Information Arrival and Volatility Persistence in Financial Markets. The European
Journal of Finance 10: 177–97.
Jones, Charles M., Owen Lamont, and Robin L. Lumsdaine. 1998. Macroeconomic News and Bond Market
Volatility. Journal of Financial Economics 47: 315–37.
Kalev, Petko S., Wai-Man Liu, Peter K. Pham, and Elvis Jarnecic. 2004. Public Information Arrival and Volatility of
Intraday Stock Returns. Journal of Banking & Finance 28: 1441–67.
Kim, Dongcheol, and Stanley J. Kon. 1994. Alternative Models for the Conditional Heteroscedasticity of Stock
Returns. Journal of Business 67: 563–98.
Kraussl, Roman, and Elizaveta Mirgorodskaya. 2016. Media, Sentiment and Market Performance in the Long Run.
The European Journal of Finance 22: 1–24.
Lamoureux, Christopher G., and William D. Lastrapes. 1990. Heteroskedasticity in Stock Return Data: Volume
versus GARCH Effects. The Journal of Finance 45: 221–29.
Li, Li, and R.F. Engle. 1998. Macroeconomic Announcements and Volatility of Treasury Futures. Discussion Paper,
University of California, San Diego, CA, USA.
Loughran, Tim, and Bill McDonald. 2011. When is a Liability Not a Liability? Textual Analysis, Dictionaries, and
10-Ks. The Journal of Finance 66: 35–65.
Martens, Martin, Dick van Dijk, and Michiel de Pooter. 2009. Forecasting S&P 500 Volatility: Long Memory, Level
Shifts, Leverage Effects, Day-of-the-Week Seasonality, and Macroeconomic Announcements. International
Journal of Forecasting 25: 282–303.
McMillan, David G., and Raquel Quiroga García. 2013. Does Information Help Intra-Day Volatility Forecasts?
Journal of Forecasting 32: 1–9.
Mitchell, Mark L., and J. Harold Mulherin. 1994. The Impact of Public Information on the Stock Market. The Journal of
Finance 49: 923–50.
Mitra, Gautam, and Leela Mitra. 2011. The Handbook of News Analytics in Finance. Hoboken: John Wiley and Sons.
Mitra, Gautam, and Xiang Yu. 2016. The Handbook of Sentiment Analysis in Finance. New York: Albury Books.
Omran, M.F., and E. McKenzie. 2000. Heteroscedasticity in Stock Returns Data Revisited: Volume versus GARCH
Effects. Applied Financial Economics 10: 553–60.
Pavlou, Menelaos, Gareth Ambler, Shaun Seaman, Maria De Iorio, and Rumana Z. Omar. 2016. Review and
Evaluation of Penalised Regression Methods for Risk Prediction in Low-Dimensional Data with Few Events.
Statistics in Medicine 35: 1159–77.
Preis, Tobias, Daniel Reith, and H. Eugene Stanley. 2010. Complex Dynamics of our Economic Life on Different
Scales: Insights from Search Engine Query Data. Philosophical Transactions of the Royal Society A 368: 5707–19.
Racine, Jeff. 2000. Consistent Cross-validatory Model-selection for Dependent Data: hv-block Cross-validation.
Journal of Econometrics 99: 39–61.
Rangel, José Gonzalo. 2011. Macroeconomic News, Announcements, and Stock Market Jump Intensity Dynamics.
Journal of Banking & Finance 35: 1263–76.
Riordan, Ryan, Andreas Storkenmaier, Martin Wagener, and S. Sarah Zhang. 2013. Public Information Arrival:
Price Discovery and Liquidity in Electronic Limit Order Markets. Journal of Banking & Finance 37: 1148–59.
Roll, Richard. 1988. R2. The Journal of Finance 43: 541–66.
Savor, Pavel, and Mungo Wilson. 2013. How Much Do Investors Care About Macroeconomic Risk? Evidence
from Scheduled Economic Announcements. Journal of Financial and Quantitative Analysis 48: 343–75.
Schwert, G. William. 1989. Why Does Stock Market Volatility Change over Time? The Journal of Finance 44: 1115–53.
Smales, Lee A. 2015. Time-Variation in the Impact of News Sentiment. International Review of Financial Analysis
37: 40–50.
Smith, Geoffrey Peter. 2012. Google Internet Search Activity and Volatility Prediction in the Market for Foreign
Currency. Finance Research Letters 9: 103–10.
Solomon, David H., Eugene Soltes, and Denis Sosyura. 2014. Winners in the Spotlight: Media Coverage of Fund
Holdings as a Driver of Flows. Journal of Financial Economics 113: 53–72.
Tauchen, George E., and Mark Pitts. 1983. The Price Variability-Volume Relationship on Speculative Markets.
Econometrica 51: 485–505.
Tetlock, Paul C. 2007. Giving Content to Investor Sentiment: The Role of Media in the Stock Market. The Journal of
Finance 62: 1139–68.
Tetlock, Paul C., Maytal Saar-Tsechansky, and Sofus Macskassy. 2008. More than Words: Quantifying Language to
Measure Firms’ Fundamentals. The Journal of Finance 63: 1437–67.
Tibshirani, Robert. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B
58: 267–88.
Tibshirani, Robert, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. 2005. Sparsity and Smoothness
via the Fused Lasso. Journal of the Royal Statistical Society B 67: 91–108.
Vlastakis, Nikolaos, and Raphael N. Markellos. 2012. Information Demand and Stock Market Volatility. Journal of
Banking & Finance 36: 1808–21.
Vozlyublennaia, Nadia. 2014. Investor Attention, Index Performance, and Return Predictability. Journal of Banking
& Finance 41: 17–35.
Vrugt, Evert B. 2009. U.S. and Japanese Macroeconomic News and Stock Market Volatility in Asia-Pacific.
Pacific-Basin Finance Journal 17: 611–27.
Yuan, Ming, and Yi Lin. 2006. Model Selection and Estimation in Regression with Grouped Variables. Journal of
the Royal Statistical Society B 68: 49–67.
Zhang, Ying, Peggy E. Swanson, and Wikrom Prombutr. 2012. Measuring Effects on Stock Returns of Sentiment
Indexes Created from Stock Message Boards. Journal of Financial Research 35: 79–114.
Zhang, Yongjie, Lina Feng, Xi Jin, Dehua Shen, Xiong Xiong, and Wei Zhang. 2014. Internet Information Arrival
and Volatility of SME PRICE INDEX. Physica A 399: 70–74.
Zou, Hui. 2006. The Adaptive Lasso and its Oracle Properties. Journal of the American Statistical Association
101: 1418–29.
Zou, Hui, and Trevor Hastie. 2005. Regularization and Variable Selection via the Elastic Net. Journal of the Royal
Statistical Society B 67: 301–20.
© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).