


Proceedings of the 2001 Winter Simulation Conference
B. A. Peters, J. S. Smith, D. J. Medeiros, and M. W. Rohrer, eds.

CASE STUDY IN MODELING AND SIMULATION VALIDATION METHODOLOGY

Scott D. Simpkins
Program Analysis and Evaluation Directorate
U.S. Army Recruiting Command
Fort Knox, KY 40121, U.S.A.

Eugene P. Paulo
Director, TRADOC Analysis Center
Monterey, CA 93943, U.S.A.

Lyn R. Whitaker
Operations Research Department
Naval Postgraduate School
Monterey, CA 93943-5221, U.S.A.

ABSTRACT

The military develops simulations to analyze nearly every aspect of defense. How accurate are these simulations, and to what extent do they produce dependable results? Most guidance available to DoD analysts provides broad recommendations geared towards management and coordination of the validation process. Here, we focus on practical validation from the analyst's perspective in the form of a case study. The platform used is the theater missile defense (TMD) aspects of the Extended Air Defense Simulation (EADSIM) and a new simulation called Wargame 2000. The focus is not to validate Wargame 2000 but to develop real, usable tools for analysis. Measures of effectiveness include defense battery search, engagement, and intercept times against threat missiles. Insight is provided into developmental and data production issues that make the validation process more effective and meaningful.

1 INTRODUCTION

Simulation, effectively analyzed and supported, can save money on acquisition and reduce more costly live-fire testing to verify results. Indeed, there are military applications where simulation may be the only method of quantitative analysis, as when enemy equipment or technology specifications are required for real-world tests. Validation of a simulation makes its results more acceptable to analysts and decision makers. However, the process can be very intricate and extremely cumbersome, depending on the level of accuracy and detail of expected results. Additionally, while numerous publications address the need for validation, we know of none that deal directly with analysis of model output in specific formats, or that even present methodologies, metrics, or programs for validating simulation output.

This study addresses the simulation of defending against a ballistic missile attack. National missile defense (NMD) and theater missile defense are the pillars of the United States' ballistic missile defense program. NMD was intended to provide a shield of protection across the territory of the United States to intercept long-range missiles. TMD has grown out of the NMD effort. Theoretically, TMD establishes an umbrella of protection for a theater of operations much smaller than the United States. Each of these defenses relies on the integration of detection and intercept systems to engage and destroy inbound ballistic missiles.

Presented here is a case study intended to develop and illustrate a bottom-up approach to simulation validation using specific measures of effectiveness (MOEs) and fairly simple graphical and statistical methods. A secondary objective is to begin a body of practical case studies that can be used to support a move toward validation commonality among Department of Defense modeling and simulation.

2 SIMULATION OF THEATER BALLISTIC MISSILE DEFENSE

This study centered on the analysis of two combat simulations. EADSIM was considered the baseline system, while a new simulation called Wargame 2000 (WG2K) was evaluated and its output compared to EADSIM output.

2.1 EADSIM

EADSIM is a workstation-hosted, system-level simulation used to assess the effectiveness of theater missile defense and air defense systems against the full spectrum of extended air threats. EADSIM provides a many-on-many, theater-level simulation of air and missile warfare; an integrated analysis tool to support joint and combined force operations; and a tool to provide realistic air defense training to maneuver force exercises. EADSIM models fixed- and rotary-wing aircraft, tactical ballistic missiles, cruise missiles, infrared and radar sensors, satellites, command and control structures, and fire support in a dynamic environment that includes the effects of terrain and attrition on the outcome of the battle.


2.2 Wargame 2000

The Department of Defense is developing a software simulation for ballistic missile defense that can be used for command and control analysis, provide insight into technology development, and provide a training platform for system operators and users. Wargame 2000 is a virtual, real-time, discrete-event, command and control missile defense simulation used to investigate human interactions. It is the successor to the Advanced Real-time Gaming Universal Simulation (ARGUS) that has been used for years. WG2K is intended to provide a simulated combat environment that allows war-fighting commanders, their staffs, and the acquisition community to examine missile and air defense concepts of operation. This is accomplished through the use of human-in-the-loop experiments and other events.

WG2K has been under development since 1997 and is halfway through its developmental lifecycle. It is prudent at this point in development to assess the simulation's accuracy and ability to perform required tasks. Primary attention has been paid to NMD in the past, and development is now shifting focus to include TMD (Deis, 2000).

3 MODEL VALIDATION

3.1 Purpose and Necessity

The policy of the Department of Defense is that all its components establish VV&A policies and procedures for the modeling and simulation projects they develop and/or manage. Thus, the DoD stresses an effort to establish standards and guidelines promoting VV&A procedural commonality. Furthermore, VV&A is required to be part of all modeling and simulation developments (DoD 5000.61 2000).

3.2 Selection of Measures

The obvious MOE to use in a timeline comparison is time. Clearly, there are multiple physics-based characteristics of missile performance not associated with time (e.g., thrust or radar sensitivity). However, only critical times and associated ranges were produced as output during this early stage of development. Typically, time-to-go until some critical event, such as impact, is useful for comparison, as opposed to time after some critical event, such as launch. The percentage difference between the two simulations is more meaningful, and the accuracy demand increases as the missile progresses. For instance, 10 percent of the time-to-go-to-impact gets smaller as impact approaches, demanding a better match, while 10 percent of time-after-launch becomes large. This is to say that accuracy near impact is more important than accuracy near launch if the objective is to avoid impact. However, impact time is not one of the reported fields for the baseline EADSIM data, and time since launch was used under the assumption that detection and intercept occur early enough in flight for time-since-launch to be equally significant as time-until-impact.

Six MOEs for three batteries against six threats were adopted: detection time, detection range, 1st launch time, 1st intercept time, last launch time, and last intercept time (see Figure 1). This leads to potentially 3·6·6 = 108 comparisons of output variables between EADSIM and WG2K. There are cases among the batteries and threats where no engagement occurs because the threat was beyond the capability or range of the battery, and some cases where no values were reported; this reduced the data set and analysis. Ultimately only 76 output variables were collected.

[Figure 1: Input Parameter and Output MOE Combinations. Flight geometry: front, rear, side, direct. Defense systems: Battery #1, #2, #3. Threats: #1 through #6. MOPs: detection time, detection range, first launch, last launch, first intercept, last intercept.]

3.3 Planning Data Collection

Threat data is based on current intelligence estimates of ballistic missiles and is considered to be valid for the intended use. However, few of these missiles have ever been engaged, monitored, or possibly even launched in any setting other than fielding tests. Thus, the baseline data for combat is, in most cases, extrapolated from small samples collected under artificial conditions.

National and theater missile defense have very different parameters and expected engagement scenarios. BMDO must ensure the simulation accurately portrays a United States defensive response to ballistic missiles within theater by using output from EADSIM. If there is no significant difference between WG2K and EADSIM, WG2K may be validated with respect to TMD. Results are taken from each model and compared using statistical analysis. Also, this work provides insight into developmental and data production issues which make the validation process more effective and meaningful.


4 DATA

4.1 Summary of Data

Data was collected in two phases. The Applied Physics Laboratory at The Johns Hopkins University collected baseline data from EADSIM. For each of six different scenarios (with different input parameters), one hundred replications were recorded in which detection time varied randomly. The Joint National Test Facility ran Wargame 2000 stochastically but collected only one set of output data for each scenario due to fiscal constraints.

The primary focus for initial simulation runs is high-fidelity range and time performance behavior for the sensor and interceptor. This is to imply that interceptor probability of kill (Pk) is not important in the consideration of physics-based flight. The data is structured so each battery will engage each threat missile separately. Additionally, each threat could assume any of the different geometries: front, back, side, and direct. That would be a maximum of 6·4·3 = 72 total runs of the system, each producing multiple output values, to get all possible combinations for the three batteries. The objective is to demonstrate single-missile, detailed timeline performance and behaviors in an uncluttered situation (Deis, 2000). All data collected was for one-on-one tests where a single threat missile was detected, tracked, and engaged by a single battery. Each of the interceptor systems was positioned in a location determined to represent a valid real-world location. From this location, the defensive radar scanned for threat missiles. In some cases the defense system was placed such that it would not engage threats in order to examine detection only.

Data was collected and recorded for flight times in seconds. This data, representing flight time of both missiles and interceptors, reflects the actual performance characteristics of the modeled entities. As such, the recorded times are classified and are not available in this document. Normalized data, representing percentages of the total threat missile flight time, is presented here to mask identifiable characteristics of defense batteries.

4.2 Derivation of the Database

EADSIM and WG2K were loaded with identical threat characteristics, threat positions, interceptor performance parameters, and defensive battery locations. Thus, the radar cross-section (RCS), trajectory, and geographic reference remain constant for each threat or battery across all runs involving that system. For these runs, detection and engagement (if any) occur at the same location. Detection data is reported for missiles not engaged.

The development of baseline or control data must not be taken for granted. This data set is considered ground truth for the systems, and attention must be given to its verification. The principal concern with baseline data is that it accurately reflects the input parameters that will be provided to the prospective new simulation. Initially, a face validation or common-sense check must be conducted of the baseline output to ensure the performance characteristics represent the expected performance of the actual system and that critical functions of the real system are modeled and reported by the simulation. One of the first baseline runs produced the detection and engagement output shown in Figure 2.

[Figure 2: Initial Output from System #2. Detect, 1st launch, and 1st intercept times against Threat 5, as % of threat flight time, over 100 baseline runs.]

The few instances where launch and intercept are the same in Figure 2 stand out as incorrect. But it is not clear from an initial look what the problem is or how to correct it. This example is easily identified by inspection; most are not this obvious. Subtle discrepancies can pass undetected through the control data set and derail an otherwise solid validation effort.

Defensive batteries analyzed are identified as systems #1, #2, and #3 for easy reference. The actual identification of defense and threat systems has been classified by the source and has no bearing on the comparative statistics used or methodology recommended here. Defending batteries were taken from a set of allied systems, and six threat missiles were selected from a data set of all known threat ballistic missiles developed by the special projects center at the JNTF and approved by the Defense Intelligence Agency (DIA). Thus, no discussion or explanation of threat characteristics such as RCS, velocity, or range is included here. The modeling of threat systems was done by the JNTF and provided as input for both simulations.

4.3 Description of EADSIM Output

Each system was placed in a location near an area of interest designated as the impact point. Batteries were assigned the mission of searching a sector of sky from which a known threat would appear. The radar's measure of effectiveness is not whether it found the threat but when and with how much delay it identified and began tracking.

The defense was established as one-on-one; even though a threat was detected by radar and tracked, it is possible the threat was not engaged if the flight path toward the impact point was beyond the defended area of the battery. An interceptor missile was launched when a detected threat entered the defended area, or when a threat already in the defended area was detected, based on the limitations of the individual system.

Characteristic output between systems is shown below. Figure 3 shows invariability of interceptor launch and engagement times that implies the interceptor is limiting the system (i.e., the interceptor's region of coverage is smaller than the radar's area of coverage). Figure 4 demonstrates high variance in both launch and engagement that indicates the system is limited by radar performance.

[Figure 3: Interceptor Limitations Lead to Small Variance in Launch and Intercept. System #1 EADSIM baseline against Threat 1: detect, 1st/last launch, and 1st/last intercept times as % of threat flight time over 100 runs.]

[Figure 4: Radar Limitations Produce Immediate Launch and High Variance. System #2 EADSIM baseline against Threat 4: detect, 1st/last launch, and 1st/last intercept times as % of threat flight time over 100 runs.]

4.4 Scope and Limitations

Only one-on-one scenarios are considered at this stage of WG2K development. There is no hand-off between radar systems as would happen if threats were detected early by a system and then 'passed' to another system with higher Pk. Therefore, no interaction between threats or interceptors is found. It is anticipated that there will be dependence on threat and interceptor type when there are multiple systems.

WG2K essentially has one random variable: stochastic radar detection using a Normal distribution with almost no variability. EADSIM uses two stochastic variables, radar detection, which is Normal, and sensor frame time, which is Uniform, producing detection times distributed as a convolution of the two input distributions. In other cases only one stochastic variable is used, sensor frame time, which is Uniform. Interceptor flyout is always deterministic in both simulations.
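
As a concrete illustration of this structure, the sketch below (Python, with placeholder numbers; the actual radar parameters are classified and not given here) draws detection times the two ways just described: an EADSIM-style Normal-plus-Uniform convolution and a WG2K-style Normal draw with almost no variability.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical values expressed as fractions of threat flight time; the
    # actual radar and frame-time parameters are classified and not reported.
    MEAN_DETECT, SD_DETECT = 0.78, 0.005   # Normal radar-detection component
    FRAME_TIME = 0.02                      # sensor frame interval (Uniform delay)

    # EADSIM-style detection times: Normal delay plus Uniform frame-time delay,
    # so the observed times follow a convolution of the two input distributions.
    eadsim_detect = (rng.normal(MEAN_DETECT, SD_DETECT, size=100)
                     + rng.uniform(0.0, FRAME_TIME, size=100))

    # WG2K-style detection time: a single Normal draw with almost no variability.
    wg2k_detect = rng.normal(MEAN_DETECT + FRAME_TIME / 2, 1e-4)

    print(eadsim_detect.mean(), eadsim_detect.std(), wg2k_detect)
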
5 RESULTS

5.1 Exploratory Data Analysis

We assumed that the 100 runs for each scenario are in fact independent. As a quick check for independence of EADSIM output, a test was used to see if MOE times from EADSIM were truly independent. Here, a 'runs test' for above and below the median is used to test the null hypothesis that the sequence of detection times for the 100 iterations of the simulation is indeed independent. Specifically, a sequence of 100 binary variables is constructed where the ith variable takes value 1 if the MOE time of the corresponding ith simulation run is above the median, and 0 otherwise. The number of runs above the median m, below the median n, and the total number of runs R = m + n are computed. In the example of Figure 5 there are three runs of ones (of lengths 2, 3, 3) and two runs of zeros.

[Figure 5: Example of MOE Times Indicating Distribution Above and Below the Median. Sequence 1 1 0 0 0 1 1 1 0 0 0 0 1 1 1, giving m = 3, n = 2, R = 5.]

For large samples, the test statistic

    Z = ( R - 2m/(1 + γ) ) / sqrt( 4γm/(1 + γ)^3 ),

where γ = m/n, has a standard Normal null distribution (Lehmann and D'Abrera, 1975). At the 5% level of significance, none of the 76 sets of simulation runs failed the test for randomness.
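A minimal sketch of this check is shown below. It assumes the 100 normalized MOE times for one scenario are available as an array, reads m and n as the counts of observations above and below the median (which is what the large-sample statistic requires), and uses hypothetical data; it is not the study's actual code.

    import numpy as np
    from scipy import stats

    def runs_test_p_value(times):
        """Two-sided runs test (above/below the median) for independence.

        Uses the large-sample statistic quoted above,
            Z = (R - 2m/(1+g)) / sqrt(4*g*m / (1+g)**3),   g = m/n,
        with m and n the counts of observations above and below the median
        and R the total number of runs.
        """
        x = np.asarray(times, dtype=float)
        med = np.median(x)
        above = x[x != med] > med            # drop ties with the median
        m = int(above.sum())                 # observations above the median
        n = int((~above).sum())              # observations below the median
        R = 1 + int((above[1:] != above[:-1]).sum())   # total number of runs
        g = m / n
        z = (R - 2 * m / (1 + g)) / np.sqrt(4 * g * m / (1 + g) ** 3)
        return 2 * stats.norm.sf(abs(z))     # two-sided p-value

    # Hypothetical detection times (fractions of threat flight time) for
    # 100 independent EADSIM replications of one scenario.
    rng = np.random.default_rng(1)
    detect = rng.normal(0.78, 0.01, size=100)
    print(runs_test_p_value(detect))         # large p-value: no evidence of dependence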

761
Simpkins, Paulo, and Whitaker

As illustrated in Figures 6 and 7, the mean and variance of detection times differed significantly between systems. This may be attributed to specific limitations of the batteries. The distribution of detection times also differed from system to system.

[Figure 6: Detection Times for System #1 (EADSIM baseline). Detection times for Threats 1-5 as % of threat flight time over 100 runs, roughly 0.3-0.9.]

[Figure 7: Detection Times for System #2 (EADSIM baseline). Detection times for Threats 3-6 as % of threat flight time over 100 runs, roughly 0.30-0.50.]

There is a clear difference between these two systems depicted in Figures 6 and 7. Both batteries are positioned in the same place and threats follow the same flight path against them. Figure 6 shows an average near 80% of threat flight time for three missiles, while Figure 7 shows none of the detection times above 50%. Variance for Threat 4 is three times higher in System #2 than in System #1. Variance in detection times is acceptable. However, critical to analysis by comparison is that WG2K demonstrate similar behavior when modeling the same combinations. Graphic analysis can quickly identify areas of interest or anomalies within the data. But when provided with a single output value, graphic analysis is limited in comparing the two simulations. An interesting observation taken from Figures 6 and 7 is that for most threats the variance in detection is the same for each system even though System #1 takes twice as long to detect on average.

Stochastic models can be viewed in two distinct classes. The first class involves sampling from a probability distribution of inputs such that, once a sample of inputs is generated, the model is deterministic. In this situation, no random events occur within the simulation. The second class contains those models in which events during the simulated course of battle are affected by the results of dynamically generated random numbers. Of course, models can contain both a random sample of inputs and internal random events (Lucas, 2000).

Simulations modeling the TMD environment contain both stochastic classes. The outcome of many simulation runs will vary across a spectrum of values, even when the model is provided deterministic input. It is expected that a valid model will produce output MOEs that tend to cover the entire range of the model's capabilities but are concentrated near the expected value of the functions producing them. Knowing the distribution functions of the model is important when determining how close the simulation has come to producing the expected outcome.

[Figure 8: Distribution of Random Detection Times for System #1 Against Threat 5. Histogram of detection times, roughly 0.72-0.84 of threat flight time.]


[Figure 9: Distribution of Random Detection Times for System #1 Against Threat 1. Histogram of detection times, roughly 0.34-0.40 of threat flight time.]

In the case of WG2K, only one run of the simulation is provided, so a complete analysis of the output cannot be compared to a distribution function. An analysis of the distributions resulting from EADSIM provides some insight into the expected behavior of WG2K. Histograms in Figures 8 and 9 contrast detection time distributions from one system against two separate threats.

However, comparing Figures 8 and 9, it is clear detection times do not come from the same distribution. What can be causing such a dramatic shift in detection time distribution for the same system against similar threats? The battery modeled here has a search pattern that makes its detection time highly dependent on threat flight trajectory. Threat missiles launched from far away are detected normally as they enter the top of the radar coverage (Figure 8), while threats launched from close range are detected uniformly as they enter the bottom of the radar coverage (Figure 9). It is sufficient to say that the distribution of detection times cannot be assumed normal and that no regularly used parametric distribution captures all of the detection time distributions. This type of disparity among time parameter distributions relating MOEs is common to many system/threat combinations.

5.2 Comparing Wargame 2000 with EADSIM

5.2.1 Graphic Analysis

The spread of times can be further broken down into quartiles, separating the 100 observations into groups of 25 separated in sequence by the 1st, 2nd (the median), and 3rd quartiles. One expects the result provided by WG2K to fall between the 1st and 3rd quartile of EADSIM, implying the value is relatively close to that expected for validation.
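This interquartile screen is easy to automate. The sketch below, using purely hypothetical numbers, reports the EADSIM quartiles for one scenario and whether the single WG2K value falls inside the middle half of the baseline.

    import numpy as np

    def inside_iqr(eadsim_times, wg2k_value):
        """Return the EADSIM quartiles and whether the single WG2K value
        falls between the 1st and 3rd quartiles of the baseline replications."""
        q1, q2, q3 = np.percentile(eadsim_times, [25, 50, 75])
        return (q1, q2, q3), bool(q1 <= wg2k_value <= q3)

    # Hypothetical normalized detection times (fractions of threat flight time).
    rng = np.random.default_rng(2)
    eadsim = rng.normal(0.78, 0.01, size=100)   # 100 EADSIM replications, one scenario
    wg2k = 0.785                                # single WG2K realization, same scenario

    (q1, q2, q3), ok = inside_iqr(eadsim, wg2k)
    print(f"EADSIM quartiles: {q1:.3f} / {q2:.3f} / {q3:.3f}; WG2K inside box: {ok}")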


A box plot places a box around the middle 50% of the data, with the upper edge at the 3rd quartile and the lower edge at the 1st quartile (Devore, 1995). The whiskers in the box plots for all MOEs extend from the box up to the largest observation and down to the smallest observation. In general, extreme observations are reported as points beyond the whiskers; no such extreme values were observed in this study. The most visual feature is the box that shows the limits of the middle half of the data. Box plots not only show the location and spread of data but indicate skewness as well. Box plots for EADSIM parameters were compared to WG2K values as shown in Figure 10. Each parameter that had output was compared by plotting the WG2K result on the horizontal axis against the EADSIM spread on the vertical axis. Ideally, a line with 45° slope will intersect the box plot at the mean, indicating a one-to-one match between the WG2K result and the average EADSIM result. Figure 10 displays several comparisons where EADSIM and WG2K outputs agree and are statistically similar. Even System #1's first launch times compare nicely, although EADSIM exhibits zero variance, as represented by the flat line in lieu of a box plot. Small deviation, as shown in System #3's first intercept times by the intersection of the box and line, is acceptable. In general, Figure 10 is good news for the new simulation. A preliminary look at these comparisons indicates WG2K is producing detection, launch, and intercept times very close to EADSIM.

[Figure 10: Box Plots Reveal Trends when Compared. Panels: System #1 first launch, System #2 detection, and System #3 first intercept; EADSIM spread (vertical axis) vs. WG2K data by missile (horizontal axis), labeled by threat.]

Further inspection of the data reveals system/threat combinations with larger variability, however. Figure 11 indicates two detection times far outside the baseline distribution for two of the threats. One of the threats was not detected by WG2K, leaving only five; this further confounds the results and implies simulation issues larger than interceptor flight time, such as sensor detection modeling or sensitivity parameters.


[Figure 11: Box Plot Showing Detection Delays in WG2K for a Specific System. System #1 detection: EADSIM spread (vertical axis) vs. WG2K data by missile (horizontal axis).]

[Figure 12: Box Plot Showing System Delays in WG2K for a Specific Threat. System #2, all MOEs by event: EADSIM spread (vertical axis) vs. WG2K data (horizontal axis).]

It should be noted that poor performance was not consistent with any system or threat combination. The overriding consistency was in WG2K's inability to identically engage threats EADSIM engaged. Furthermore, there were two cases where WG2K engaged threats that EADSIM never engaged.

Times for single systems can be compared as in Figure 12, which shows results across all measures of effectiveness for System #2. The boxes represent the individual MOEs, beginning with detection in the lower left corner and moving forward in time through last intercept in the upper right. Each of these measures exceeds the baseline. All WG2K events occur much later than expected for this system/threat combination.

5.2.2 Inference

Although each scenario was replicated 100 times for EADSIM, because WG2K is run in real time, only one realization of WG2K is available for each scenario.

Often, there is the temptation to treat the output of such a run as an expected value (i.e., to treat a detection time as the expected detection time). This seems reasonable because WG2K has little to no variability among input variables, as discussed. However, for each run the output MOEs of WG2K are certainly non-linear functions of the input. In general, the expected value of a non-linear function is not equal to the function of the expected value of those random variables. In practical terms this means that the average detection time over many replications of WG2K will not be the same as one detection time with average input. Thus, the output of one run of WG2K, and indeed one run of any model run with small input variance, should not be considered to be the expected output.

Another approach, and the one adopted here, is to treat the WG2K output as one realization of WG2K had it been run in true stochastic mode. This is not an entirely appropriate model since some inputs are constant rather than random. However, it will provide a more realistic comparison.
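
The gap between the expectation of a non-linear function and the function of the expected input can be seen with a toy calculation; the function and numbers below are purely illustrative and are not drawn from either simulation.

    import numpy as np

    rng = np.random.default_rng(3)

    # Toy non-linear "model": an output time that grows with the square of a
    # random input.  Nothing here comes from EADSIM or WG2K.
    def toy_moe(x):
        return 0.5 + 0.3 * x ** 2

    x = rng.normal(1.0, 0.5, size=100_000)   # random input, mean 1.0, sd 0.5

    print(toy_moe(x).mean())   # E[f(X)] ~ 0.5 + 0.3*(1.0**2 + 0.5**2) = 0.875
    print(toy_moe(x.mean()))   # f(E[X]) ~ 0.5 + 0.3*1.0**2          = 0.800
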
5.2.3 Re-Sampling from the Baseline

Wargame 2000 is striving to match the fidelity of EADSIM. Thus, if WG2K is in fact reproducing EADSIM, the distributions of output parameters (under the same input conditions) should be the same. This implies that the mean and variance of the output should also be the same for both simulations. The difficulty with comparing the mean and variance is that WG2K was only run one time in each scenario. A bootstrapping approach is used to test the null hypothesis that WG2K has the same output distribution as EADSIM. The empirical distribution of the 100 values from EADSIM is used as an estimate of the null distribution for both EADSIM and WG2K. With bootstrapping we can approximate the null distribution of a test statistic using Monte Carlo simulation. In this case, the simulation involves repeatedly "drawing" a sample of 101 from EADSIM. These repeated draws of 101 observations do not require re-running EADSIM or WG2K. They are 101 independent pseudo-random detection times generated from the empirical distribution of the 100 actual EADSIM detection times. Note that generating the WG2K pseudo-detection time from the empirical distribution of EADSIM detection times is consistent with the null hypothesis that WG2K and EADSIM have the same output distributions.


The percentage difference in mean detection times for each scenario for both models can be computed to compare WG2K with EADSIM. In particular, for a scenario, let ȲE and YW represent the average detection time from 100 runs of EADSIM and the detection time from one run of WG2K, respectively. They define the test statistic:

    Tstat = (ȲE - YW) / ȲE

The test statistic Tstat represents the percent difference between EADSIM's percent of total flight time for each MOE and WG2K's.

Bootstrapping is used to estimate the sampling distribution of Tstat under the null hypothesis. Sampling from the empirical distribution of EADSIM detection times is equivalent to draws with replacement from the 100 actual EADSIM values. In total, 1000·(100 + 1) draws with replacement were made from the 100 EADSIM values. The results are depicted in Figure 13; the X̄Ei, i = 1, 2, ..., 1000, are the 1000 averages of 100 draws each, while the XWi consist of 1000 individual draws.

From the X̄Ei and XWi, the bootstrapped value of the test statistic Ti is computed as in Figure 13. Figure 14 depicts this process; the shaded areas near the tails represent the proportion of observations more extreme than Tstat. This area is one half of the p-value for a two-sided test.

[Figure 13: Re-Sampling for P-Values. Table of bootstrap iterations i = 1, 2, ..., 1000 with columns X̄Ei, XWi, and Ti = (X̄Ei - XWi) / X̄Ei, the hypothesized difference between EADSIM and Wargame 2000 under the null hypothesis; Tstat = (ȲE - YW) / ȲE is the actual difference between EADSIM and Wargame 2000.]
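
A compact sketch of this re-sampling scheme is given below, assuming the 100 EADSIM replications and the single WG2K value for one scenario are held in hypothetical arrays eadsim and wg2k; the two-sided p-value is approximated by comparing absolute values of the bootstrapped and observed statistics.

    import numpy as np

    def bootstrap_p_value(eadsim, wg2k, n_boot=1000, seed=0):
        """Bootstrap test of the null hypothesis that the single WG2K output
        comes from the same distribution as the 100 EADSIM outputs."""
        rng = np.random.default_rng(seed)
        eadsim = np.asarray(eadsim, dtype=float)

        # Observed statistic: Tstat = (Ybar_E - Y_W) / Ybar_E.
        t_stat = (eadsim.mean() - wg2k) / eadsim.mean()

        # Each bootstrap iteration draws 101 values with replacement from the
        # 100 EADSIM values: 100 stand in for EADSIM, the 101st stands in for WG2K.
        t_boot = np.empty(n_boot)
        for i in range(n_boot):
            sample = rng.choice(eadsim, size=101, replace=True)
            x_e, x_w = sample[:100].mean(), sample[100]
            t_boot[i] = (x_e - x_w) / x_e

        # Two-sided p-value: proportion of bootstrapped statistics at least as
        # extreme as the observed one.
        return float(np.mean(np.abs(t_boot) >= abs(t_stat)))

    # Hypothetical data for one system/threat/MOE combination.
    rng = np.random.default_rng(4)
    eadsim = rng.normal(0.78, 0.01, size=100)
    wg2k = 0.80
    print(bootstrap_p_value(eadsim, wg2k))
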
The 1000 bootstrapped values of Ti, i = 1, 2, ..., 1000, and the actual value of the test statistic Tstat are used to approximate a p-value for the test, where FW and FE are the MOE distributions for WG2K and EADSIM, respectively. The p-value is approximated as the proportion of bootstrapped Ti that are more extreme than the actual observed Tstat. These p-values indicate the strength of evidence against the null hypothesis that WG2K's result represents a possible result from EADSIM. There are examples supporting both acceptance and rejection of the null hypothesis. High p-values for System #3 accept the null hypothesis in all cases where data is recorded for both simulations. However, there are very low p-values for System #1, indicating rejection for nearly every MOE, clearly based on the invariability of EADSIM. System #2 showed mixed results depending on the MOE.

[Figure 14: Using P-Values and Test Statistics to Identify Extreme Points. Histogram of the bootstrapped statistic ("Typical Time Distribution for EADSIM") with shaded extreme regions beyond Tstat.]

There are several cases where EADSIM exhibits zero variance. This is expected, as most of the many random variables available in EADSIM are set for zero variance. In these cases it is difficult to determine whether WG2K is accurately reflecting EADSIM or not. The simulation outputs differ very little, and it may be that the developer considers a zero variance in WG2K at its current value to be within an acceptable range.

It is important to note that, in general, the test statistic chosen here, the fact that only one realization of WG2K is used, and the non-parametric nature of the bootstrapping procedure all contribute to a testing procedure that is not powerful against all alternatives to the null hypothesis. Thus, this procedure will not be able to detect all types of differences between WG2K and EADSIM. In particular, because only one realization of WG2K is available, there is no way to tell if the variability in times simulated by WG2K will be the same as the variability of those simulated by EADSIM.

6 CONCLUSIONS

WG2K has many capabilities and accepts diverse input from which it builds a TMD scenario. The challenge of validating a simulation model early in its development comes from ensuring that a small set of capabilities and data produced fit into the big picture of the model using all capabilities; this is not unique to WG2K. Very little is available, published in the open literature or from DoD, that gives specific guidance for how to validate a simulation model based on the limited data usually available in such attempts. Formulating such an approach applicable to the majority of validation attempts would be extremely difficult. Thus, although case studies are situation dependent, they are vital for providing practical guidance for validation.


6.1 Summary of Results

Presented here is a straightforward approach to validation that uses various methods to examine simulation output. This begins by choosing a small number of appropriate MOEs. Limiting the MOEs examined can ignore significant ranges of output, but it allows the analyst to focus on those most relevant to simulation performance. A graphic, exploratory analysis is used, which is supported by subsequent inferential statistics. The challenge in this particular case study is that WG2K scenarios are replicated only one time.

The importance of graphic analysis should not be overlooked. Graphics allow quick, accurate analysis as long as appropriate comparisons are depicted. The box plots used in this study clearly show differences between EADSIM and WG2K. The informed eye can discriminate very small differences. These must be confirmed, but a clearly thought-out, accurate graphic representation of the data can narrow one's focus by identifying those areas requiring effort.

A non-parametric bootstrap is used to test the null hypothesis that the distributions of specific MOEs are the same for both models. The results of this inference confirm the graphical analysis. Note, though, that this procedure is conservative in that rejecting the null hypothesis (i.e., finding differences between the models) does provide evidence that the models differ, while failing to reject does not provide evidence that they are the same. With only one run per scenario from WG2K it is not possible to draw the conclusion that the distributions of MOEs are the same for WG2K and EADSIM.

6.2 Recommendations

The most important recommendation is that a series of case studies, such as this one, needs to be compiled and made available to those analysts actually doing validation. These case studies need to show by example practical but simple approaches to sorting through and making sense of complex simulation output. It is clear that a single document that tries to take a top-down approach and that encompasses all types of validation and possible output is impractical. This approach leads to volumes of general guidance but nothing specific enough to be useful to the analyst in practice. A case study approach tackles the problem of practical analytical guidance from the bottom up. With a number of such case studies available in one place, patterns of what approaches prove most useful should emerge rather quickly.

REFERENCES

Deis, F. 2000. Wargame 2000 White Paper vs. EADSIM Anchoring Effort Planning Proposal, 2 November 2000.
Department of Defense. 2000. DoD 5000.61: DoD Modeling and Simulation (M&S) Verification, Validation, and Accreditation (VV&A). Available: http://www.ailtso.com/simval/Documents/5000.61/dod5000.61.htm
Department of Defense Research and Engineering. 1996. Verification, Validation and Accreditation (VV&A) Recommended Practices Guide.
Devore, J. 1995. Probability and Statistics for Engineering and the Sciences. Pacific Grove: Duxbury Press.
Hodges, J., and J. Dewar. 1992. Is It You or Your Model Talking? A Framework for Model Validation. Santa Monica: RAND.
Lehmann, E., and H. D'Abrera. 1975. Nonparametrics: Statistical Methods Based on Ranks. San Francisco: Holden-Day, Inc.
Lucas, T. W. 2000. "The Stochastic Versus Deterministic Argument for Combat Simulations: Tales of When the Average Won't Do." Military Operations Research 5 (3): 9-28.
Mitchell, B. 2000. "Underwater Launch Technology Sustainment." Briefing on Trident Missile Validation, 3 October 2000.

AUTHOR BIOGRAPHIES

SCOTT D. SIMPKINS is a Captain in the United States Army. He is an analyst in the Program Analysis and Evaluation Directorate of the US Army Recruiting Command. He received his Master's degree from the Naval Postgraduate School in 2001. His research interests are in entity-level simulation and data analysis. His email address is [email protected].

EUGENE P. PAULO is a Lieutenant Colonel in the United States Army. He is the Director of TRAC-Monterey and an Assistant Professor in the Department of Operations Research at the Naval Postgraduate School. His email address is [email protected].

LYN R. WHITAKER is an Associate Professor in the Department of Operations Research at the Naval Postgraduate School. Her research interests are in reliability, statistical analysis of simulation and combat simulation, and categorical data analysis. Her email address is [email protected].
