Table 1
Sample of failed and non-failed firms.
(CR = current ratio, QR = quick ratio, IR = income ratio; firms numbered 1-k and 0-k are matched failed and non-failed pairs.)

No.   Failed firm                CR    QR     IR        No.   Non-failed firm       CR    QR     IR
1-1   Westates Petroleum         1.39  0.67   0.34      0-1   Universal             6.43  4.49   0.19
1-2   Cott Corp.                 2.00  0.27   0.02      0-2   MEI Corp.             1.29  0.33   0.98
1-3   American Mfg. Co.          3.20  0.73   0.93      0-3   Gaynor-Stafford       5.20  0.78  -0.20
1-4   Scottex Corp.              1.59  0.09  -0.33      0-4   Compo Ind.            3.77  0.32  -0.09
1-5   Lynnwear                   1.70  0.24   0.09      0-5   Movie Star Inc.       2.80  0.23   0.13
1-6   Nelly Don, Inc.            1.70  0.15   0.13      0-6   Decorator Ind.        3.18  0.57   0.02
1-7   Mansfield Tire & Rubber    2.50  0.09   0.06      0-7   Pope & Talbot         1.90  0.15   0.09
1-8   Brody Seating Co.          2.70  0.14   0.01      0-8   Ohio-Sealy            2.60  0.97   0.35
1-9   Paterson Parchment Paper   2.41  0.09  -0.04      0-9   Clevepak Corp.        3.28  0.30   0.42
1-10  Rowland Inc.               1.73  0.08   0.02      0-10  Park Chemical         4.91  1.15   0.15
1-11  Pasco Inc.                 1.51  0.12   0.16      0-11  Holly Corp.           1.55  0.11   0.37
1-12  RAI Inc.                   1.16  0.01   0.09      0-12  Barry (R.G.)          3.00  0.38   0.15
1-13  Gray Mfg. Co.              3.31  0.49   0.11      0-13  Struthers Wells       1.61  0.21   0.39
1-14  Gladding Corp.             2.08  0.04   0.07      0-14  Watkins-Johnson       4.01  0.09   0.19
1-15  Merchants, Inc.            2.73  0.35   0.23      0-15  Banner Ind.           2.30  0.32   0.19
1-16  Shulman Transport          1.13  0.13   0.22      0-16  WTC Inc.              1.17  0.33   0.85
1-17  Reeves Telecom Corp.       3.20  0.63   0.20      0-17  Gross Telecasting     8.80  6.91   0.25
1-18  Plaza Group Inc.           0.91  0.03  -1.09      0-18  Total Petroleum       2.15  0.25   0.33
for new data. Typically, a risk cutoff value is selected by the analyst and for categorization purposes is normally between 0.5 and 1. All observations with predicted risks equal to or above this value are categorized as 1 (failed), and those below it as 0 (non-failed). The choice of the cutoff value depends on the relative cost of incorrectly categorizing an observation as a 0 when it is not, versus the converse.

This binary decision approach is also used to develop an estimator to forecast bankruptcy. In this application, the independent variables are financial ratios as described below. The dependent variable is determined to be 0 for a non-failed firm and 1 for a failed firm. If the information contained within the independent variables is sufficient, the estimated dependent variable ŷ then represents an empirical probability, in the range of 0 to 1, of the event occurring. Furthermore, since determining a firm's trend toward bankruptcy is of more interest than categorizing bankruptcy ex post, we are concerned with the accuracy of the model in risk categories less than 0.5.
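As a concrete illustration of the cutoff rule, the short sketch below (not taken from the study) categorizes a vector of predicted risks at a chosen cutoff; the function name and the example risk values are hypothetical.

```python
def categorize(predicted_risks, cutoff=0.5):
    """Label predicted bankruptcy risks: 1 (failed) if risk >= cutoff, else 0.

    The cutoff would be chosen from the relative cost of calling a firm
    sound when it is about to fail versus the converse.
    """
    return [1 if risk >= cutoff else 0 for risk in predicted_risks]

# Hypothetical predicted risks for five firms
print(categorize([0.91, 0.48, 0.12, 0.73, 0.50]))  # -> [1, 0, 0, 1, 1]
```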
Data for the empirical tests were drawn from an earlier study by Gentry et al. [8]. In their sample selection, failed companies were matched with a sample of non-failed companies that were in the same industries and approximately the same size in terms of total assets. Additionally, in order to control for general economic conditions, the time frames for the failed and non-failed firms were matched [12]. After the deletion of firms with incomplete data, there were 18 bankrupt firms. Each of the 18 failed companies was then matched with a non-failed firm based on asset size and sales for the fiscal year previous to their bankruptcy.

A listing of the 36 companies used for the empirical analysis is presented in Table 1. The current ratio (CR), quick ratio (QR) and income ratio (IR) of each firm are also presented in the table and defined as follows:

Dependent Variable (Y):
1 for those companies that failed and 0 for those that did not.

Independent (Explanatory) Variables:
(1) CR: current ratio [(current assets)/(current liabilities)].
(2) QR: quick ratio [(cash + other near-cash assets)/(current liabilities)].
(3) IR: income ratio [(net income)/(working capital)].

Model training and selection methodology

Due to the small sample size after matching, direct training-to-test set validation is not advisable. A variation of the cross-validation method known as v-fold cross-validation (CV_v) was selected to conduct analyses of the models. This method was introduced by Geisser [7] and Wahba et al. [18] and is illustrated in a case study of corporate bond rating predictions by Utans and Moody [17]. The selection of the optimal model(λ) architecture is based on an estimator CV_v of the prediction risk P_λ, which is a measure of the generalization ability of model(λ). Using this approach, the data is divided into v = 18 subsets of six observations for each test set, with three rotationally selected from the failed group and three from the non-failed group, for a total of 108 observations in the test sets. This also provides 18 training sets of 30 observations. Each observation is represented three times in the test sets. The cross-validation mean square error of each subset j is defined by

\mathrm{CV}_j(\lambda) = \frac{1}{N_j} \sum_{k \in j} \bigl( t_k - \hat{y}_{\lambda(j)}(x_k) \bigr)^2,   (1)

where N_j is the number of observations in subset j (here six), t_k is the actual targeted dependent variable for a given observation in the subset, and \hat{y}_{\lambda(j)}(x_k) is the expected value generated by an approximation function. The prediction risk is then defined for each model(λ) by

\mathrm{CV}_v(\lambda) = \frac{1}{v} \sum_{j} \mathrm{CV}_j(\lambda).   (2)
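A short sketch may help make the rotation scheme and equations (1) and (2) concrete. It is not the authors' code: `train` and `predict` are stand-ins for fitting and evaluating a candidate model(λ), and the fold construction simply takes three consecutive firms from each group in rotation, which is one reading of "rotationally selected".

```python
import numpy as np

def v_fold_prediction_risk(X_failed, X_nonfailed, train, predict, v=18):
    """Estimate the prediction risk CV_v for one candidate model (lambda).

    X_failed, X_nonfailed: arrays of shape (18, n_features), targets 1 and 0.
    train(X, y) -> model;  predict(model, X) -> estimated risks in [0, 1].
    Each of the v test sets holds three failed and three non-failed firms,
    chosen in rotation, so every firm appears in three test sets (6 * 18 = 108
    test observations) and every training set holds 30 observations.
    """
    n = len(X_failed)                                  # 18 matched pairs
    X = np.vstack([X_failed, X_nonfailed])
    y = np.concatenate([np.ones(n), np.zeros(n)])

    cv_subsets = []
    for j in range(v):
        failed_idx = [(j + i) % n for i in range(3)]          # rotating failed firms
        nonfailed_idx = [n + (j + i) % n for i in range(3)]   # rotating non-failed firms
        test_idx = np.array(failed_idx + nonfailed_idx)
        train_idx = np.setdiff1d(np.arange(2 * n), test_idx)

        model = train(X[train_idx], y[train_idx])
        residuals = y[test_idx] - predict(model, X[test_idx])
        cv_subsets.append(np.mean(residuals ** 2))            # CV_j, equation (1)

    return np.mean(cv_subsets)                                # CV_v, equation (2)
```

Calling this routine once per candidate architecture and keeping the model with the smallest returned CV_v reproduces the selection rule described above.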
As with the LR model, the output (or predicted dependent variable ŷ) of the BPNN can be constrained to values from 0 to 1. The network is composed of an input layer, a hidden layer and an output layer of nodes (also known as neurons due to biological similarities). Information processing is performed through modification of connection weights (W_λ) as normalized observation patterns are passed along connections from the input through the hidden to the output layer.
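The "normalized observation patterns" are the financial ratios rescaled into the unit interval before reaching the input layer. The exact scaling used by NeuroShell is not stated here, so the min-max form below is only an assumption consistent with the Appendix's statement that inputs are normalized between zero and one.

```python
import numpy as np

def minmax_normalize(X):
    """Rescale each column (CR, QR, IR) of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

# Example with three firms' (CR, QR, IR) values from Table 1
patterns = minmax_normalize([[1.39, 0.67, 0.34],
                             [2.00, 0.27, 0.02],
                             [3.20, 0.73, 0.93]])
```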
[Figure 1: network diagram with input nodes C-RATIO, I-RATIO and Q-RATIO plus a BIAS node, a layer of hidden nodes, and a single output node.]
Fig. 1. Diagram of empirical neural network model.
This distinction between layers can be traced to Rosenblatt's [15] early work, which divided networks into sensory, associative and response units. In the BPNNs, the input nodes process the independent variables while the output node processes the dependent variable. The input layer distributes the patterns throughout the network and the output layer generates an appropriate response. The middle layer of nodes acts as a collection of associative feature detectors and is termed 'hidden' because it does not directly process information to or from the user. The state of each node is determined by signals sent to it from all connected nodes. These signals are biased by the value of the connection weights W_λ between nodes. Appendix 1 contains a summary of the mathematical derivation of the BPNN signals between layers as developed in 1986 by Rumelhart et al. [16].

Figure 1 displays the configuration of the optimal BPNN(λ), developed using the commercial package NeuroShell 4.1. The BPNN model is a single hidden layer feed-forward network which implements an error back-propagation methodology. Back-propagation permits connection weights W_λ between nodes to be modified in a supervised fashion using gradient descent to minimize the error function. In supervised learning models, known pattern pairs of target outputs and neural network outputs are repeatedly presented to the network to adjust the network W_λ.

Kolmogorov's Mapping Neural Network Existence Theorem [9] states that any continuous function can be implemented with the network structure described above using 2n + 1 hidden nodes, where n represents the number of input nodes. This is also the recommendation of Caudill [4]. However, in some cases this may lead to fitting (or memorizing) the training set too well, resulting in poor generalization capabilities. In practice, the number of hidden nodes for optimal generalization should be tested in a range from approximately 2√n + m to the value 2n + 1, where m represents the number of output nodes. Therefore, for the empirical estimations, five BPNN models were developed, each using three input nodes (one for each independent variable), one output node (representing the bankruptcy risk index), and hidden nodes ranging from three to seven, respectively. Although neural network research is still in its infancy, this approach appears to provide a principled mechanism for determining the optimal network architecture, as the following results will verify.
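Taking the range above as 2√n + m to 2n + 1 (reconstructed here from a garbled expression, so treat the lower bound as an assumption), the arithmetic for this study's three inputs and one output is:

```python
import math

n_inputs, n_outputs = 3, 1                    # CR, QR, IR -> bankruptcy risk index
lower = 2 * math.sqrt(n_inputs) + n_outputs   # roughly 4.5 hidden nodes
upper = 2 * n_inputs + 1                      # Kolmogorov-based bound: 7 hidden nodes
print(f"search roughly {lower:.1f} to {upper} hidden nodes")
# The study itself evaluated five candidate models with 3, 4, 5, 6 and 7 hidden nodes.
```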
Each of the BPNN(λ) and LR(λ) were then trained with α = 0 and η = 0.01. In NeuroShell, a real-time comparison is maintained between the minimum errors of the training sets and hold-out test sets. Figure 2 illustrates a segment of the BPNN learning process as the test set classification error decreases to a minimum error value for one of the test sets at approximately 737,000 learning events. After this optimal point, the classification errors of the test set increase as memorization of the training set begins, generalizing capability declines, and the BPNN is less able to estimate the dependent variable from previously "unseen" data. NeuroShell automatically retains the optimal network W_λ when the minimum mean square error in the test set has been reached.

[Figure 2: test-set classification error versus training events (thousands).]
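NeuroShell's behaviour of retaining the network at the minimum test-set error amounts to early stopping. The loop below sketches that idea in generic terms; `train_one_event`, `test_error` and the patience threshold are illustrative assumptions rather than NeuroShell internals.

```python
import copy

def train_with_early_stopping(net, train_one_event, test_error, max_events, patience):
    """Keep the weights that gave the lowest hold-out test-set error so far.

    train_one_event(net): presents one training pattern and updates the weights.
    test_error(net): mean square error on the hold-out test set.
    Training stops once no improvement is seen for `patience` events.
    """
    best_error = float("inf")
    best_weights = copy.deepcopy(net)
    since_best = 0
    for event in range(max_events):
        train_one_event(net)
        err = test_error(net)
        if err < best_error:                  # new minimum on the test set
            best_error, best_weights = err, copy.deepcopy(net)
            since_best = 0
        else:                                 # memorization of the training set setting in
            since_best += 1
            if since_best > patience:
                break
    return best_weights, best_error
```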
Empirical results

The five BPNN(λ) were then compared to the output of the LR(λ) for each of the 108 test set observations for training efficiency, percent correct per risk category, and model efficiency as determined by estimated prediction risk (CV_v) and variance of the errors. These results are presented in Table 2. Results for all of the test sets were obtained using an 80486 microprocessor running at approximately 25 megahertz. Mean training times and number of learning events completed for the BPNN(λ) above 4HN generally increased with model complexity W_λ, and training efficiency decreased.

Table 2
Model performances.
In all cases, training the BPNN(λ) represents a larger development cost than with the logit model. However, as indicated by the neural network training efficiencies (which are ratios of mean training time to the average number of learning events per minute), the 3HN and 4HN models are the most efficient at approximately 3.70. In terms of neural network development cost, the BPNN(4HN) is the most desirable with the least training time.

The BPNN(4HN) model also most accurately predicted previously unseen observations from the test sets. The total is 82.4 percent at a risk cutoff value of 0.5. All of the models, including the LR(λ), predicted approximately 77 percent of the 54 unseen test set observations targeted as 1. However, of the 54 unseen test set observations targeted as 0, the BPNN(4HN) surpasses all other models with 89 percent accurately predicted. The bar chart in Figure 3 illustrates the comparison of the percentages accurately predicted by BPNN(4HN) versus the LR(λ) in a range of risk categories with cutoff values from 0.25 to 0.75. The BPNN(4HN) also predicts a higher percentage of non-failed firms at risk virtually independent of risk category.

The most statistically efficient model is determined from the variance of errors and the prediction risk CV_v. As can be seen from Table 2, the variance of the errors for all of the BPNN(λ) is less than for the LR(λ). This indicates that the logit model is a less efficient estimator. Furthermore, of the BPNN(λ), the model with 4HN has the least value for CV_v. As the number of hidden nodes increases above 4, abstract mapping capabilities appear to decrease and the CV_v for the BPNN(λ) approaches that of the LR(λ). The neural network model with 3HN does not appear to be able to extract as much information from the independent variables. In terms of prediction risk, model efficiency, and total percent correct, the BPNN(4HN) performed better than all other models and is selected as the optimal BPNN. Also, with respect to development cost, slow learning rates are critical to minimize the variance of the errors and therefore increase the efficiency.

A family of risk curves (at the 0.5 cutoff value) was then generated from the optimal BPNN(4HN) to illustrate the potential use of neural networks in forecasting bankruptcy and graphically assess the impact of each independent variable on ŷ.
[Figure 3: bar chart of percentages correctly predicted by risk category, BPNN(4HN) versus LOGIT.]
NeuroShell provides a function which calculates the contributions each of the independent variables makes to the modification of W_λ. The CR, IR, and QR contributed 43.3, 45.4, and 12.3 percent, respectively. This relationship can be readily seen in Figure 4. For a range of CR (1.61 to 3.21), the curves represent the estimated 0.5 cutoff value with respect to the IR and QR. Firms with low current ratios are less tolerant to changes in QR, and there is a significant change in the y-intercept as IR increases.
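One way such a family of curves can be traced from a trained model is to fix CR, sweep IR, and solve for the QR at which the predicted risk crosses the 0.5 cutoff. The sketch below is not from the study: `risk_curve`, the stand-in predictor and the grid limits are all illustrative assumptions.

```python
import numpy as np

def risk_curve(predict_risk, cr, ir_grid, qr_lo=0.0, qr_hi=7.0, tol=1e-3):
    """For a fixed current ratio CR, trace the (IR, QR) points where the
    predicted bankruptcy risk crosses the 0.5 cutoff (bisection on QR),
    assuming risk falls monotonically as the quick ratio improves."""
    curve = []
    for ir in ir_grid:
        lo, hi = qr_lo, qr_hi
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if predict_risk(cr, mid, ir) >= 0.5:
                lo = mid          # still on the "failed" side of the cutoff
            else:
                hi = mid          # on the "non-failed" side
        curve.append((ir, 0.5 * (lo + hi)))
    return curve

# Stand-in predictor purely for demonstration (the real input would be the
# trained BPNN); it just squashes a linear combination of the three ratios.
demo_predict = lambda cr, qr, ir: 1.0 / (1.0 + np.exp(2.0 * (cr + 3 * qr + 2 * ir) - 6.0))

curves = {cr: risk_curve(demo_predict, cr, np.linspace(-0.4, 1.0, 30))
          for cr in (1.6, 2.4, 3.2)}
```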
Summary

These results correspond closely to other studies which have likewise found neural networks better at extracting information from attributes for forecasting purposes. In a neural network application examining bond rating, Dutta and Shekhar [6] show how a neural network is able to forecast more accurately than standard regression. However, their approach did not use a comparison with logit regression as in this research.

As presented in Figure 3 and Table 2, the neural network outperforms the logit function, with the BPNN(4HN) selected as the most efficient predictor. In terms of forecasting capability, not only does this model more accurately predict a higher percentage of firms in the test sets, BPNN(4HN) has less variance in the errors and lower prediction risk as determined by CV_v. This means that BPNN(4HN) is more statistically efficient and will provide more accurate forecasts in the population. Also, the results indicate that 2√n + m hidden nodes is the most appropriate for this application.

Out of the thirty-six firms tested, four were not correctly predicted by any model. This suggests missing explanatory variables for the models(λ). For this reason the BPNN(4HN) is not regarded as a completed production model. Yet, this research begins to bridge the gap between pure statistical methods and neural networking. It shows neural networks to be a viable alternative to more traditional methods of estimating causal relationships in data and offers a new paradigm of computational capabilities to the business
practitioner. The pattern recognition and generalization capabilities of neural networks can enhance decision making in cases where the dependent variable is binary and available data is limited. With over 16,000 neural network systems purchased to date and growth expected to exceed 20% per year [2], business practitioners can expect to see intensified research efforts and benefit from the implementation of such applications as are presented in this paper.

References

[1] E. Altman, "Financial Ratios, Discriminant Analysis, and the Prediction of Corporate Bankruptcy," Journal of Finance, September 1968, pp. 589-609.
[2] D. Bailey and D. Thompson, "How to Develop Neural-Network Applications," AI Expert, June 1990, pp. 38-47.
[3] R. BarNiv and R. Hershbarger, "Classifying Financial Distress in the Life Insurance Industry," The Journal of Risk and Insurance, Spring 1990, pp. 110-135.
[4] M. Caudill, "Neural Network Training Tips and Techniques," AI Expert, January 1991.
[5] E. Collins, S. Ghosh and C. Scofield, "An Application of a Multiple Neural Network Learning System to Emulation of Mortgage Underwriting Judgements," Working Paper, Nestor, Inc., 1 Richmond Square, Providence, RI, 1989.
[6] S. Dutta and S. Shekhar, "Bond Rating: A Non-Conservative Application of Neural Networks," Proceedings of the IEEE International Conference on Neural Networks, Vol. II, San Diego, CA, 1988, pp. 443-450.
[7] S. Geisser, "The Predictive Sample Reuse Method with Applications," Journal of the American Statistical Association, 70(350), June 1975.
[8] J. Gentry, A. Newbold and D. Whitford, "Classifying Bankrupt Firms with Funds Flow Components," Journal of Accounting Research, Vol. 23(1), Spring 1985, pp. 146-160.
[9] R. Hecht-Nielsen, Neurocomputing, Addison-Wesley, New York, 1989.
[10] D. Hillman, "Integrating Neural Nets and Expert Systems," AI Expert, June 1990, pp. 54-59.
[11] D. Hillman, "AUBREY: A Custom Expert System Environment in LISP," AI Expert, January 1990, pp. 34-39.
[12] H.G. Hunt and J.K. Ord, "Matched Pair Discrimination: Methodology and an Investigation of Corporate Accounting Policies," Decision Sciences, Vol. 19(2), Spring 1988, pp. 373-382.
[13] D. Levine, "The Third Wave in Neural Networks," AI Expert, December 1990, pp. 27-31.
[14] R. Marose, "A Financial Neural-Network Application," AI Expert, May 1990, pp. 50-53.
[15] F. Rosenblatt, Principles of Neurodynamics, Spartan Books, Washington, DC, 1962.
[16] D.E. Rumelhart, G.E. Hinton and R.J. Williams, "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D.E. Rumelhart and J.L. McClelland (eds.), MIT Press, Cambridge, MA, 1986, pp. 318-362.
[17] J. Utans and J. Moody, "Selecting Neural Network Architectures via the Prediction Risk: Application to Corporate Bond Rating Prediction," Proceedings of the First International Conference on Artificial Intelligence Applications on Wall Street, IEEE Computer Society Press, Los Alamitos, CA, 1991.
[18] G. Wahba and S. Wold, "A Completely Automatic French Curve: Fitting Spline Functions by Cross-Validation," Communications in Statistics, 4(1), 1975, pp. 1-17.
[19] H. White, "Neural Network Learning and Statistics," AI Expert, December 1989, pp. 48-52.
[20] B. Widrow and M.E. Hoff, "Adaptive Switching Circuits," IRE WESCON Convention Record, New York, 1960, pp. 96-104.

Appendix 1

The feed-forward process

In a back-propagation neural network, as developed by Rumelhart et al. [16], independent variable patterns, or values, are normalized between zero and one at the input layer to produce the signal O_i prior to presentation to the hidden node layer. Each connection between the input layer and a hidden node has an associated weight W_ij. The net signal Z_j to an individual hidden node is expressed as the sum of all connections between the input layer nodes and that particular hidden node, plus the connection value W_Bj from a bias node. This relationship may be expressed as:

Z_j = \sum_i W_{ij} O_i + W_{Bj}.   (3)

The signal from the hidden layer is then processed with a sigmoid function which again normalizes the values between 0 and 1 to produce O_j prior to being sent to the output layer. The normalization procedure is performed according to:

O_j = \frac{1}{1 + \exp(-Z_j)}.   (4)

The net signal to an output node Z_k is the sum of all connections between the hidden layer nodes and the respective output node, expressed as:

Z_k = \sum_j W_{jk} O_j + W_{Bk},   (5)

where W_Bk represents a single connection weight from a bias node with a value of 1 to the output layer.
The net signal is again normalized with the sigmoid function to produce the final output value O_k, where

O_k = \frac{1}{1 + \exp(-Z_k)}.   (6)

In terms of bankruptcy modeling, O_k represents the risk of bankruptcy of the individual firm.
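For readers who prefer code to notation, the feed-forward pass of equations (3) through (6) can be sketched as follows; the NumPy arrays and the toy weights are illustrative stand-ins, not NeuroShell's internal representation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(o_i, W_hidden, b_hidden, W_output, b_output):
    """Equations (3)-(6): normalized inputs O_i -> hidden signals O_j -> risk O_k."""
    z_j = W_hidden @ o_i + b_hidden      # eq. (3): net signal to each hidden node
    o_j = sigmoid(z_j)                   # eq. (4): hidden activations
    z_k = W_output @ o_j + b_output      # eq. (5): net signal to the output node
    return sigmoid(z_k)                  # eq. (6): bankruptcy risk O_k

# Toy network: 3 inputs (CR, QR, IR), 4 hidden nodes, 1 output node
rng = np.random.default_rng(0)
risk = feed_forward(np.array([0.2, 0.1, 0.6]),
                    rng.normal(size=(4, 3)), rng.normal(size=4),
                    rng.normal(size=(1, 4)), rng.normal(size=1))
```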
The process of error back-propagation

At the output layer the net signal O_k (the estimated dependent variable) is compared to the actual value of the dependent variable, T_k, to produce an error signal which is propagated back through the network. NeuroShell 4.1 implements a variant of the Widrow/Hoff [20] or "least mean square" learning rule known as the Generalized Delta Rule, in which output layer error signals are propagated back through the network to perform the appropriate weight adjustments after each pattern presentation [9]. Rumelhart et al. [16] describe the process of weight adjustment by:

\Delta W_{ji}(n+1) = \eta \, \delta_{pj} O_{pi} + \alpha \, \Delta W_{ji}(n),   (7)

where η is a learning coefficient and α is a "momentum" factor. The momentum factor determines the effect of past weight changes on the current direction of movement in weight space and proportions the amount of the last weight change to be added into the new weight change.

The error signal δ back-propagated to the connection weights between the hidden and output layers is defined from the difference between the target value T_pk for a particular input pattern p and the neural network's feed-forward calculation of the signal from the output layer, O_pk:

\delta_{pk} = (T_{pk} - O_{pk}) \, O_{pk} (1 - O_{pk}).   (8)

The corresponding error signal for the connection weights between the input and hidden layers is then:

\delta_{pj} = O_{pj} (1 - O_{pj}) \sum_k \delta_{pk} W_{kj}.   (9)

The data feed-forward and error back-propagation process is continued until a "stopping" point is reached. This point is determined by comparing the errors in the training set and the test set. This methodology prevents "overlearning", or fitting the training set too closely, with consequent large errors in the test set.
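Combining equations (3) through (9), a single training "event" for this one-output network might be sketched as below. It is a plain-NumPy rendering of the Generalized Delta Rule with momentum under the notation of this appendix, not NeuroShell's implementation; the default η = 0.01 and α = 0 mirror the values used in the study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_pattern(o_i, target, W1, b1, W2, b2, state, eta=0.01, alpha=0.0):
    """One feed-forward / back-propagation step (equations (3)-(9)).

    state holds the previous weight changes so the momentum term alpha can
    proportion them into the new update, as in equation (7).
    """
    # feed-forward pass
    o_j = sigmoid(W1 @ o_i + b1)                 # hidden activations, eqs. (3)-(4)
    o_k = sigmoid(W2 @ o_j + b2)                 # predicted risk, eqs. (5)-(6)

    # error signals
    delta_k = (target - o_k) * o_k * (1 - o_k)   # eq. (8)
    delta_j = o_j * (1 - o_j) * (W2.T @ delta_k) # eq. (9)

    # weight changes with momentum, eq. (7)
    dW2 = eta * np.outer(delta_k, o_j) + alpha * state["dW2"]
    dW1 = eta * np.outer(delta_j, o_i) + alpha * state["dW1"]
    W2 += dW2; b2 += eta * delta_k
    W1 += dW1; b1 += eta * delta_j
    state["dW2"], state["dW1"] = dW2, dW1
    return o_k

# Usage with a toy 3-4-1 network (one failed firm, target = 1.0)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
state = {"dW1": np.zeros_like(W1), "dW2": np.zeros_like(W2)}
train_one_pattern(np.array([0.2, 0.1, 0.6]), 1.0, W1, b1, W2, b2, state)
```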