Distributed Computing
VERSION 2016
by
Gamma Technologies

GT SUPPORT
• TELEPHONE: (630) 325-5848
• E-MAIL: [email protected]
Table of Contents
CHAPTER 1: Overview
CHAPTER 2: Setup, Update and Hardware Recommendations
CHAPTER 3: Evolving Distributed Computing Services from Previous Versions
3.1.1 Linux
3.1.2 PC
CHAPTER 4: Setting up Distributed Computing
4.1 License Usage
4.2 Installation Configuration
4.3 Configuration of Solver Nodes
4.3.1 Modify the Configuration
4.3.2 Install the Service (PC only)
4.3.3 Start the Service/Daemon
4.4 Configuration of the Scheduler
4.4.1 Modify the Configuration
4.4.2 Limiting the Number of Processes Running on the Scheduler
4.4.3 Install the Service (PC only)
4.4.4 Start the Service/Daemon
4.5 Running a Customized Script After Job Completion
4.6 Automatically Export Data to Text File
CHAPTER 5: Running a Distributed Simulation
5.1 File ⇒ Options ⇒ Run Distributed Menu
5.2 Run Simulation Wizard
5.3 Queued Simulation Information Window
5.4 Running Remote/Distributed from Command Line
CHAPTER 6: Using 3rd Party Distribution Software (LSF, SGE, PBS, etc.)
6.1 Overview
6.2 Installation Instructions to Link with 3rd Party Software
6.3 Scripts to Link with 3rd Party Software
6.3.1 Submit script
6.3.2 Script to cancel a running packet
6.3.3 Download script
6.3.4 Upload script
6.3.5 Post-script
6.4 Configuration Considerations
6.5 Changes Required to Linking Scripts from Previous Versions
CHAPTER 7: Using the Node Administrator
CHAPTER 8: Running a Simulation with User Models
CHAPTER 9: Troubleshooting Distributed Runs
9.1 Manual Recovery of Results if the Scheduler Hard Drive is Full
CHAPTER 1: Overview
Distributed computing allows a model with multiple cases to be divided among the processors of all
computers in the distributed network. For example, a simulation with 50 cases may be run on 5 different
processors simultaneously with each processor running 10 cases. At the end of the run, results from all of
the cases are combined and sent back to the computer that originally submitted the job, the client
computer. Details of this process are discussed below and shown schematically in the figure below.
Distributed computing can dramatically decrease total simulation time for a single simulation with
multiple cases. It also provides a convenient way to efficiently use all available GT-SUITE licenses and
processors by using a distribution server, which acts as a job scheduler. This server will queue jobs until
licenses become available. Each individual user can submit jobs to a centralized distribution server, which
has knowledge of all available processors and licenses, and can efficiently distribute jobs among the
different machines. Distributed computing can also be configured to work with 3rd party distribution
software as discussed in a later chapter.
All of the files necessary for distributed computing are installed in a standard GT-SUITE installation.
There are some steps necessary to configure distributed computing prior to use. Configuration is required
for the distribution server and the solver nodes, since distributed computing involves communication
between a server machine and one or more solver node machines. Note that a single machine can act as
both a server and a solver node if desired.
Note: The client computer, distribution server, and solver node can all be the same computer if desired. In
this case, the setup discussed in this document is not required. The "local distributed" capability, which
has automatic setup, can be used instead.
The distributed computing process works in the following manner as illustrated above:
1) Within the pre-processor, GT-ISE, the user clicks the run simulation button. In the Simulation Wizard,
the user chooses Distributed Cluster (further details about how to run distributed from the client computer
are discussed in a later chapter). Note that it is also possible to submit simulations from the command
line. At this point the simulation is submitted to the distribution server (or "scheduler"), which must be
running the gtsched service/daemon. The scheduler divides all the cases of the simulation into "packets"
(groups of cases) and determines how to distribute these packets to the nodes based on the nodes' number
of cores, their relative performance, total number of licenses, etc.
2) The distribution server sends packets to selected solver nodes, which must be running the gtexecd
service/daemon. The solver nodes then run the solver to perform the calculations.
3) When the solver has completed the calculations for a given packet, its results and other simulation
output files are sent back to the distribution server.
4) Once all packets from a simulation are returned to the scheduler, they are combined into a single set of
result files. They will remain on the distribution server until they are fetched by the user.
CHAPTER 2: Setup, Update and Hardware Recommendations
Important Setup/Update Considerations
1) Use one version of services/daemons to run simulations from multiple versions - The
distributed computing services (gtsched and gtexecd) for a given version are designed to be able to run
simulations from that version as well as all older versions. In other words, the v2016 distributed services
can run version 2016, 7.5, and earlier simulations. For this reason, GT recommends always updating to
the latest version of the services and running only one version of the services on your cluster. In order for
this recommended setup to work, all versions should be installed into the single GTIHOME used by the
cluster. GT-SUITE is specifically designed to support multiple versions in the same installation directory.
It is also possible to have multiple builds in the same installation directory. This allows each user to
choose which build of the solver will be run for each model at the time it is submitted. For further details
on installation, please see the installation notes.
2) Updating the installation without shutting down the scheduler - In some cases, it may be
desirable to update the installation, without shutting down the scheduler. This would allow users to still
submit jobs or fetch results while the installation is being updated. In that case, the easiest solution is to
do the following:
a. In the scheduler configuration, disable all nodes so that the scheduler no longer finds any
valid nodes to receive jobs. You will need to add a "dummy" node to the list, because the
UI logic will not allow you to save changes if there are no nodes listed. Just enable a
node with a hostname that does not exist. Once this is done, the scheduler will not be able
to submit any new packets. They will remain queued until active nodes are enabled again.
b. Wait until all currently running packets have completed.
c. Update the installation
d. After the update is complete, re-enable the nodes that were disabled in step a. You can
also remove the dummy node that was added in step a.
Important! The installation update program will update the distributed services/daemons in
GTIHOME/<version>/services/bin. If you are running the services from this location, you must shut
down the daemons before running the update; the daemons must be shut down whenever they are to be
updated. The alternative is to run the services from another location outside the GTIHOME. To do
that, simply copy the entire services directory to a location outside of the GTIHOME.
3) Services (gtsched/gtexecd) must be the same version - only services from the same version can
communicate with each other; a v2016 scheduler can only submit jobs to a v2016 execution node.
4) All solver nodes can use a common GTIHOME - In this configuration, each solver node points
to a network installation, rather than having a full installation on each solver node. This reduces both
drive storage space and maintenance. The administrator only has to update one installation when a new
build is released. To use the same GTIHOME for all solver nodes, the value of the GTIHOME parameter
in the gtexecd.cfg file on each solver node should be set to the same value (the location of the network
installation). Note this will result in increased network traffic. It is recommended to use a Gigabit
network. It is also recommended that the network installation is located on a file server. If the shared
installation is located on a Windows machine, please make sure to use the full UNC path (i.e.
\\remote-host\GTI) in the service configuration. The service will fail to start on Windows if the mapped drive path
is used (i.e. N:\GTI). Note: Further requirements when using a common GTIHOME are discussed in
"Setting up Distributed Computing".
5) Use dedicated workstations for solver nodes on the distributed computing cluster - Using
everyday computers such as employee desktops/laptops is not suggested since people may turn off
desktops when they leave. Desktops are also normally restarted more often than isolated workstations.
When a solver node machine is restarted or turned off, any packet running on that node will be terminated
and will need to be resubmitted to another node, resulting in lost computational time.
6) gtsched/gtexecd executable update procedure - To update the distribution server or solver node
service/daemon executables (gtsched and gtexecd) the services/daemons must first be stopped. The files
can then be updated. Finally, the services/daemons must be restarted.
7) Solver node hardware recommendations - for hardware recommendations for the machines
executing the simulations, please see GTIHOME\<version>\documents\InstallationNotes.pdf.
8) Scheduler node hardware recommendations - the scheduler's main task is to split the single job
into packets, communicate with the nodes, and then recombine results from all packets. The
recombination is by far the most intensive task it performs. We have found that the time to recombine
results depends mostly on processor speed and, to a lesser extent, on I/O performance. In some rare situations, such
as use of the Advanced Direct Optimizer (ADO), it may be necessary to limit the number of merging (or
multi-collector) operations that run at a single time. The ADO is unique in that it submits many models at
nearly the same time, and often these models all start merging their results at around the same time. This
can result in a high load on the scheduler depending on hardware capabilities. If you wish to limit the
number of merging operations that can occur at a single time, please see the section on scheduler
configuration later in the document.
Important! Distributed computing services from a given version can be used to run models from that
same version as well as all older versions as long as the installation used by the distributed computing
cluster contains those versions. In other words, distributed computing services from the
$GTIHOME/v2016/services directory can run models from version 2016, 7.5, 7.4, etc.
CHAPTER 3: Evolving Distributed Computing Services from Previous Versions
3.1.1 Linux
3rd party queuing software (LSF, PBS, SGE) - If you link GT-SUITE distributed software to a 3rd party
queuing software, it may be necessary to update the scripts that link gtexecd to the 3rd party queue. For
more details see the section "Changes Required to Linking Scripts from Previous Versions".
1. After you ensure that there are no distributed jobs running, stop the scheduler (gtsched) and solver
node (gtexecd) services from the old version. This can be done from the "Start/Stop/Reload" tab of
the configuration editors which are launched by the following commands
a. gtexecd: $GTIHOME/<old_version>/services/bin/gtexecdconf
b. gtsched: $GTIHOME/<old_version>/services/bin/gtschedconf
2. Copy the old cfg files from $GTIHOME/<old_version>/services/config/*.cfg to the new location
$GTIHOME/<new_version>/services/config
3. Evolve the configuration files to the new version by opening them in the configuration editor for the
new version.
a. gtexecd: $GTIHOME/<new_version>/services/bin/gtexecdconf
b. gtsched: $GTIHOME/<new_version>/services/bin/gtschedconf
4. Start the services from the new version using the configuration editor opened in step 3. The Start button
can be found in the Start/Stop/Reload tab.
Note that it is possible to stop and start the services from command line using the following commands,
where <version> would be replaced with the appropriate version folder:
gtexecd: "$GTIHOME/<version>/services/bin/gtexecd start"
gtsched: "$GTIHOME/<version>/services/bin/gtsched start"
3.1.2 PC
1. After you ensure that there are no distributed jobs running, stop the scheduler (gtsched) and solver
node (gtexecd) services from the old version. This can be done from the "Start/Stop/Reload" tab of the
configuration editors which are launched by the following commands
a. gtexecd: "%GTIHOME%\<old_version>\services\bin\gtexecdconf.bat"
b. gtsched: "%GTIHOME%\<old_version>\services\bin\gtschedconf.bat"
2. Copy the old cfg file from %GTIHOME%\<old_version>\services\config\*.cfg to the new location
%GTIHOME%\<new_version>\services\config
3. Evolve the configuration files to the new version by opening them in the configuration editor of the
new version.
a. gtexecd: "%GTIHOME%\<new_version>\services\bin\gtexecdconf.bat"
b. gtsched: "%GTIHOME%\<new_version>\services\bin\gtschedconf.bat"
4. Now the new services must be re-"installed" so that the Windows service control manager knows to
use the new services (located at GTIHOME\<new_version>\services\bin). To install the new services
click the "Install" button on the Start/Stop/Reload tab of the dialogs opened in step 3.
5. Start the new services. This can also be done from the Start/Stop/Reload tab.
Note that it is possible to stop and start the services from command line using the following commands,
where <version> would be replaced with the appropriate version folder:
gtexecd: "%GTIHOME%\<version>\services\bin\gtexecd start"
gtsched: "%GTIHOME%\<version>\services\bin\gtsched start"
CHAPTER 4: Setting up Distributed Computing
This chapter contains detailed step-by-step instructions for a system administrator setting up a distributed
computing cluster to use multiple machines on a network. For users who simply want to use the multiple
cores of their local machine, the "local distributed" feature is available and requires almost no setup or
configuration. It can be accessed through the run simulation wizard in GT-ISE through the radio button
option "Local Distributed/Batch".
If you have not already done so, please read the previous section entitled "Important Setup/Update
Considerations" before continuing.
Services directory
Note: For clusters containing all Linux machines using a common gtexecd.cfg
file, it is not necessary to have a services directory on each node.
This directory contains the gtexecd/gtsched executables and configuration files.
The directory %GTIHOME%\<version>\services can be copied to all solver
nodes and the distribution server (GTIHOME refers to the common installation).
Note that the location of this directory on the distribution server/solver node is
arbitrary and user-chosen. It does not require a full installation to exist on the
local computer, and only relies on the relative paths between the configuration file
and the executable.
Database directory
The directory for temporary database files, specified during software installation,
must exist locally on all solver nodes and have read and write permission for all
GT-SUITE users. To determine the location of the temporary database directory,
run the command "$GTIHOME/bin/gtcollect dbmode". To change the directory,
run "$GTIHOME/bin/gtcollect dbconf".
Working Directory
The temporary working directory for data storage on the distribution server and
solver nodes, defined in the gtsched.cfg and gtexecd.cfg configuration files. The
directories should be local to each machine. It is important that a different
working directory is used for gtsched and gtexecd if they are running on the same
machine.
Suggested Configurations
[Figure: suggested configuration using a shared GTIHOME - GTIHOME is specified in gtsched.cfg and
gtexecd.cfg on the distribution server and all solver nodes.]
4.3 Configuration of Solver Nodes
4.3.1 Modify the Configuration
The configuration can be modified through a user interface launched from within GT-ISE (under File >
Advanced). It can also be done through a batch file (PC) or shell script (Linux).
PC - %GTIHOME%\<version>\services\bin\gtexecdconf.bat
Linux - execute $GTIHOME/<version>/services/bin/gtexecdconf
The dialog launched by any of the above methods will look like this:
The attributes which are most commonly modified are mentioned below. Further details on these
attributes can be viewed from the online help available from the editor by clicking the button with the "?"
in the top left hand corner of the dialog.
The following steps are required for the execution node service if BOTH of the following are true:
- The execution node service will run v7.1 or earlier simulations (note that the distributed services
are able to run simulations from all solver versions that are equal to or less than the version of the
service. In other words, a v2016 service can run v2016, 7.5, 7.4, etc. simulations. For this to
work, all relevant versions must be installed in the same GTIHOME)
- The operating system is Windows Vista or Windows 7
If either of those is not true, this step can be skipped. If both are true, then the execution node service
must be set to run as a specific user after it has been installed. Open the Control Panel > System >
Maintenance > Administrative Tools > Services. (Services can also be found by performing a search in
the control panel or running "services.msc" from the command prompt). Here you will see the installed
"GTISOFT Solver Service". Double click on it and select the Log on tab. Change "Log on as" to "This
account" and enter a username and password of the person using the computer.
To ensure that a solver node is working correctly, one can run a model remotely to the solver node. To do
this, open a model in GT-ISE on a computer that has access to the solver node (as defined in the
configuration Access Control panel). Then go to File > Options > Run Remote. A dialog box similar to
the one below will appear. The "Host Name" is the name of the solver node on which the remote run will
occur (the machine running the gtexecd service/daemon). The port number should match the setting in the
gtexecd.cfg file. Port 3490 is usually acceptable unless that port is reserved for another application. If the
model runs on the specified "Host Name", this will indicate that the service is working correctly.
IMPORTANT: If the configuration is modified, the file must be reloaded for the changes to take effect.
To reload the file, click the "Reload" button in the Start/Stop/Reload tab.
4.4 Configuration of the Scheduler
4.4.1 Modify the Configuration
Before starting the service for the first time, some changes to the default configuration set during
installation may be necessary. Open the user interface to modify the configuration. This can be done
from within GT-ISE, through File > Advanced > Local Job Scheduler Configuration.
It can also be done through a batch file (PC) or shell script (Linux).
PC - %GTIHOME%\<version>\services\bin\gtschedconf.bat
Linux - execute $GTIHOME/<version>/services/bin/gtschedconf
The attributes which are most commonly altered are mentioned below. Further details on these attributes
can be viewed from the online help available from the editor by clicking the help button in the top left
hand corner of the dialog.
4.4.2 Limiting the Number of Processes Running on the Scheduler
The ADO is unique in that it submits many models at nearly the same time, and often these models all
start merging their results at around the same time. This can result in a high load on the scheduler
depending on the hardware and resources. To limit the number of merging operations that can occur at a
single time, manually edit the gtsched.cfg file and add the appropriate line in the <preferences> section of
the file. In the example sketched below, the scheduler will only run 10 merging processes at a time. If the attribute
is missing from the file or is set to 0, there is no limit to the number of merging processes running at the
same time.
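A sketch of such an entry (the attribute name shown here is an assumption; the exact name is listed in the
configuration editor's online help):

<preferences>
    <max_merge_processes>10</max_merge_processes>
</preferences>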
To ensure that the scheduler is working properly, submit a simulation to the scheduler as detailed in the
following chapter.
IMPORTANT: If gtsched.cfg is modified, the file must be reloaded for the changes to take effect. To
reload the file, click the "Reload" button in the Start/Stop/Reload tab.
4.5 Running a Customized Script After Job Completion
The scheduler can optionally execute a customized script, written by the administrator, when all packets
of a simulation have completed. Some scenarios where this is useful:
1) The simulation model contains a user subroutine, which creates output files that should be
copied back to the client. By default, these output files will not be copied back to the client
machine. This is because the GT-SUITE scheduler does not know what to do with these files.
After all packets have completed, the scheduler will run a process (called the multi-collector)
which combines the results from all packets in to a single set of results files. The files which are
combined are the only ones which are brought back to the client machine. If the output files from
the user routine should be copied back to the client machine, the customized script may be used to
do this. More details are discussed below.
2) An e-mail should be sent to the user when the job completes. This option would typically be
used only with 3rd party queuing systems, such as LSF, SGE, etc. Even with a 3rd party queuing
software, this may not normally be necessary, because GT-ISE has an option (in File-Options) to
automatically retrieve files when the job is complete. The only disadvantage of the automatic
retrieval is that it only works if GT-ISE is open.
3) Many other possibilities. There are probably many other scenarios in which the customized
script can be used. The ability to call these scripts should give great flexibility to users and their
administrators.
Below are details about the scripts. There are two optional scripts, called the pre and post scripts. To
explain the differences between the two scripts, the order of operations on the scheduler will be described:
Gamma Technologies anticipates that the post script will be the most commonly used script, but to
provide the most flexibility, the pre script option has also been added.
Both the pre and post script will be called with the current working directory set to
WORK_DIR/SIMULATION_ID. In this directory, there will be nodeX directories
WORK_DIR/SIMULATION_ID/node1
WORK_DIR/SIMULATION_ID/node2
…
WORK_DIR/SIMULATION_ID/nodeN
which contain the results from each packet. If you wish to have files created, for example by a user
subroutine, copied back to the client, the script should copy files from the nodeX directory up one level to
the SIMULATION_ID directory. The scheduler will copy all files in the SIMULATION_ID directory
back to the client machine, but it will not copy anything from the node* directories.
If the script returns no error code or an error code of 0, it will be considered to be successful. If it returns
a non-zero error code, this will be interpreted as an error, the remaining processes will be skipped, and the
status of the simulation will be set to "Post-Processing Error".
If default is set to "false" this means that the script will be used. If it is set to "true" it means the script is
ignored. The value field specifies the location and name of the pre and/or post script.
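For illustration, a minimal post script for the user-subroutine scenario above, written as a shell script
(the *.usr extension is an assumed example of user-routine output; adjust it to the actual files produced):

#!/bin/sh
# The scheduler calls this script with the current working directory set to
# WORK_DIR/SIMULATION_ID. Copy user-routine output files from each packet
# directory (node1 ... nodeN) up one level so that the scheduler sends them
# back to the client machine.
for dir in node*; do
    cp "$dir"/*.usr . 2>/dev/null
done
# A zero exit code tells the scheduler the script completed successfully.
exit 0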
4.6 Automatically Export Data to Text File
After creating the *.exp file (see the reference at the end of this section), enable the exporting on the
distribution server by adding a post-process call to the file
%GTIHOME%\<version>\services\bin\post\call_data_export.bat (or .sh). This is done by adding a line
similar to that sketched below to the preferences section of the file gtsched.cfg, typically found in
%GTIHOME%\<version>\services\config, and then restarting the scheduler.
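A sketch of such a line, using the "default" and "value" fields described in the previous section (the
element name is an assumption; check the gtsched online help for the exact syntax):

<post_script default="false" value="%GTIHOME%\<version>\services\bin\post\call_data_export.bat"/>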
Then open the call_data_export.bat/sh and make sure the correct version of GT-SUITE is in the path,
similar to below:
%GTIHOME%\bin\gtperl.bat "%GTIHOME%\<version>\services\bin\post\data_export.pl"
In the same directory, open data_export.pl and check that the $gdtfile variable line is using .gdx instead of
.gdt, and also check that the call to gtexport in the following lines is declared for the proper version:
if ($^O eq "MSWin32"){
    system "$GTIHOME\\bin\\gtexport -v2016 \"$expfile\" \"$gdtfile\"";
} else {
    system "$GTIHOME/bin/gtexport -v2016 \"$expfile\" \"$gdtfile\"";
}
For details regarding creating the *.exp file or regarding exporting data in general, please see the
"Exporting Plot or Table Data" section in the GT-POST user’s manual.
CHAPTER 5: Running a Distributed Simulation
5.1 File ⇒ Options ⇒ Run Distributed Menu
Username
This is the user name with which you log in to the machine. This user name must have an account on the
domain where the scheduler is running.
Note that it is not necessary to have a separate scheduler for each version of GT-ISE. This is not the
intended use of this feature. A given version's distributed services can run simulations from that version
and all earlier versions. For example a v2016 scheduler can accept simulations submitted to be run in
version 2016, 7.5 or earlier. For more details see the earlier sections in this document on configuration
and setup.
The "default" column selects the scheduler that will be pre-selected in the run simulation wizard (see next
section) when a new simulation is started. The user can change the scheduler which should receive the job
in the run simulation wizard. If the run simulation wizard is not turned on, distributed runs will be
submitted to the "default" scheduler.
Case Weighting
The Case Weighting menu items can be used to specify a linear relationship between a user-chosen
parameter and the expected simulation time for a case. This information will be used when splitting a
simulation into packets. Take, for example, a speed sweep, where everything else being the same, the
simulation time decreases with increasing engine speed. In this case, the Case Weight Parameter would be
set to the engine speed parameter (typically [RPM] or [N]). For the case weighting option the user would
choose "Larger Weight –> Faster Case", because a higher RPM corresponds to a faster running simulation
(i.e. less computational time). If all computers on the cluster were identical (same performance) then the
packets which contained higher RPM cases would have more cases compared to the packets which
contained lower RPM cases. The intention is that all packets finish as closely to the same time as
possible.
Initial Priority
The initial priority defines whether the job submitted by the user will be set to "Normal" or "Low". A
packet from a simulation with normal priority will always run before a packet from a simulation with low
priority regardless of the order of the simulations in the queue. For example, suppose User A has a job in
the queue at low priority and some of the packets are running. Later on User B submits a job at normal
priority. The packets currently running for User A will complete, but then User B's job will take priority
and User A's packets will wait for User B's job to finish.
5.2 Run Simulation Wizard
The Run Simulation Wizard allows the user to choose how a simulation should be run
and where it should be run (local machine, distributed cluster, or remote machine). When a user hits the
Run Simulation button in the toolbar of GT-ISE, the following dialog is displayed:
The "Distributed Cluster" option will submit the job to the distributed computing cluster. Note that the
user can choose the build number of the solver to be run on the remote execution nodes. The *.dat file
sent to the remote nodes will be created with the build number of GT-ISE on the machine submitting the
job (where GT-ISE is running), but the solver build used to run the job will be that specified in the
wizard. Note that if "Standard Installation" is selected, this means that the latest officially released build
available on the remote execution nodes will be used. This may be different from the latest build on the
local installation from which GT-ISE was launched.
When a user chooses the "Distributed Cluster" option and hits "Next", another window pops up as shown
below. This window allows the user to specify a Case Weighting parameter if it wasn't specified earlier in
the File ⇒ Options ⇒ Run Distributed folder. The user can also set the Initial Priority of the simulation
as well as select the Run All Cases on a Single Core (Do Not Distribute) option to run all cases as a single
packet (no distribution of cases). Hitting the Finish button will start the distributed simulation. The user
may optionally choose to hit "Next" again to see advanced options.
The advanced options for submitting a distributed computing simulation are shown below. These allow
the user to override some of the settings on the scheduler. For example, assume the administrator has
configured the scheduler to allow each simulation to use 6 licenses. However, the user is running a large
DOE and does not want to use all 6 licenses that would be allotted to his job. In the advanced options he
can limit the number of licenses used by this simulation. The user can also specify limits on the minimum
and maximum number of cases that will be included in a packet.
5.3 Queued Simulation Information Window
The Queued Simulation Information window lists the simulations in the queue and provides several
options for managing them, as well as a Detailed Info dialog which is useful in determining the status of
particular packets in a run. These dialogs are described in further detail below.
1) The "Toggle Priority" option may be used to prioritize simulations in the queue if there are multiple
simulations. Currently there are two priority levels available (Regular and Low).
2) The "Move Up" and "Move Down" options may also be used to prioritize the simulations. If the
simulations have the same priority setting then they will run in the order that they are in the queue.
3) The "Remove" option allows the user to remove any undesired simulations. Users can only remove
simulations that they have submitted and will not be able to remove jobs submitted by other users.
4) The "Detailed Info" option allows the user to view or modify the packets in a given simulation.
This window gives information for all packets in a given simulation. There are several columns shown in
the Detailed Info dialog, and they are explained below.
Column - Description
Packet - The specific packet of a simulation. The largest number in this column denotes the
number of packets a given simulation has been divided into.
Cases - The range of cases a given packet contains.
Status - This column will display the status of a given packet. The different values for the
status are discussed below.
Elapsed Time - Displays the amount of time that has passed since the packet was originally
submitted from gtsched to gtexecd. The time is in hours:minutes:seconds.
Node Name - Displays the host name of the processor running the GT-SUITE solver. When 3rd
party software is used, the Node Name will be the same for all packets and denotes
the name of the node where the single gtexecd daemon is running.
Solver Build - Displays the build number and version of the GT-SUITE solver.
For a simulation, each of its packets will have its own status. For details on the different statuses and their
descriptions, please see the online help in GT-ISE available from the queued simulation information
window.
The Detailed Info dialog also provides several options for the user including:
1) The "Refresh" option may be used to update the information in the Detailed Info dialog.
2) The "Resubmit" option allows the user to resubmit a given packet. This is typically used to force a
packet to run again if there was previously an error when running the packet.
3) The "View .msg" option will display the message file (<model-name>.msg) of a given packet, which
contains brief information regarding the running cases. This is generally the recommended option to view
progression of the given packet. The file size is small, which makes viewing the file much quicker than
the screen.output or .out file.
4) The "View Screen" option will display the screen.output file of a given packet, which contains more
detailed information than the .msg file regarding the running cases. The screen.output file displays
information very similar to the DOS / term window contents when running a local run. It is most useful
when a packet has an error status.
5) The "View Browse" option will display all information files available for a given packet as shown
below (additional files may be available when using 3rd party distribution software). This option is
particularly useful for users to diagnose a problem when a particular packet fails. Please note this option
is only available when a packet has completed or if there was an error with a packet.
6) The "Halt" option allows the current packet to be halted. There are multiple options once the halt
button is selected. A dialog will open explaining the halt options to the user.
If the simulation ends without an error, the Status will be displayed as "Completed" and the Simulation
will be moved from the Queue tab to the Processed tab as in the following figure:
If the simulation has ended with a status of Packet Error or Post-Processing Error, the user will not be
able to use the "Fetch Data" button to retrieve results. To determine how to proceed in the event of either
of these errors, see the online help available by clicking the "Help" button in the dialog shown below.
The Help will explain how and when to use the "Recombine All Cases" and the "Combine Good Cases"
buttons on the left side of the dialog below.
Once the simulation has completed the user may use the "Fetch Data" option to bring the simulation data
from the Distribution Server to the directory on the client machine where the user started the model.
Users can only fetch the data belonging to the jobs that they have submitted. The output files will be
brought automatically if the 'Automatically retrieve files every …… minutes' option was checked in the
Run Distributed folder of File > Options in GT-ISE.
5.4 Running Remote/Distributed from Command Line
The general command for running remote is "gtsuite -remote -h:host model.gtm". This command will run
the simulation and fetch the results back to the client when the simulation is complete. There are other
options and details explained in the command line help.
It is also possible from command line to submit simulations and fetch results for distributed computing.
Both commands will return exit codes depending on their success (0) or failure (any non-zero value). The
submit command will submit the job and return control to the command prompt. The fetch command will
wait until the job has reached a final state until it fetches the results. It will then return control to the
command prompt. Below is sample pseudo-code for one method of using these commands:
# Submit simulation
gtsuite -distributed:submit model.gtm
# Check the exit code. In Windows batch files you may use the %ERRORLEVEL% environment variable.
# Please read the comments below on the option "-forcemerge" available with the submit command.
# If the exit code is not equal to 0, there was an error submitting the model. Use some method to
# notify the end user of the problem, perhaps an automated e-mail.
Important! For automated runs which submit and fetch in a loop (i.e. in-house optimization tools), it is
recommended to use the "-forcemerge" option with the submit command. If this option is not used, results
will not be merged by the scheduler if one or more packets fails. In order for the results to be fetched in
this case, the user would need to open the UI and click a button to force the merging of results for only
the successful cases. At that point the results could be fetched from command line or through the GUI. If
the -forcemerge option is added with the submit command, the results will be merged even if there are
failed packets, thus allowing the fetch operation to complete without manual intervention.
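A minimal shell sketch of such a submit-and-fetch sequence. The "-distributed:fetch" syntax and the
notification address are assumptions for illustration; the exact flags are documented in the command line
help:

#!/bin/sh
# Submit with -forcemerge so results are merged even if some packets fail.
gtsuite -distributed:submit -forcemerge model.gtm
if [ $? -ne 0 ]; then
    # Notify the end user of the submit failure, e.g. by automated e-mail.
    echo "GT-SUITE submit failed for model.gtm" | mail -s "submit error" user@example.com
    exit 1
fi
# Assumed fetch counterpart: waits for the job to reach a final state, then
# retrieves the results back to the client directory.
gtsuite -distributed:fetch model.gtm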
CHAPTER 6: Using 3rd Party Distribution Software (LSF, SGE, PBS, etc.)
Note that this solution requires daemons (gtsched and gtexecd) to be running. By default, all jobs
submitted to the external queuing system by gtexecd will be submitted as the user who started the gtexecd
daemon - not the user who is actually submitting the simulation. Note that it is possible to use the "sudo"
command in your scripts to run the job as the actual user; the name of the user submitting the job is
available to the scripts as described later.
6.1 Overview
When using 3rd party distribution software, gtsched and gtexecd behave in much the same manner. In fact,
the behavior of gtsched is exactly the same whether or not 3rd party distribution software is used. In both
cases gtsched will break up the simulation into packets, send those packets to the available nodes, and
then re-combine the results from each packet when all packets from a job are completed. Note that when
only the GT distributed computing services are used (no 3rd party software) a gtexecd daemon must be
running on each node which will run the solver.
However, when using 3rd party distribution software, it is only required to have one computer running
gtexecd. The reason is that gtexecd will mainly be used to call a script which will submit the job to the 3rd
party software. In other words, gtsched will send a certain number of packets to the gtexecd node based
on settings in the scheduler configuration. Then gtexecd will call a customer-written script for each
packet. This script links gtexecd to the 3rd party software. This script and all other scripts required are
discussed in detail later in this section.
The scripts linking gtexecd and 3rd party software should be created by someone knowledgeable in shell
scripting and in operating the 3rd party software. Sample scripts for LSF and SGE are provided in
%GTIHOME%\<version>\services\scripts\. They typically require only minor modifications to suit your
environment.
IMPORTANT! Each packet that gtexecd submits creates a thread. For 32 bit systems, typically the
maximum number of threads that a process (i.e. gtexecd) can create at a given time will be ~400. This
means that a 32 bit gtexecd will fail to submit additional packets once there are already ~400 submitted to
the 3rd party queue. It is therefore strongly recommended to run gtexecd on a 64 bit machine when
linking to a 3rd party queue. This greatly increases the stack size available to gtexecd, allowing it to
create many more threads ( > 30,000).
6.2 Installation Instructions to Link with 3rd Party Software
Step 2) Create the necessary scripts to link GT-supplied software to the 3rd party distribution software.
Please see the "Defining the Scripts" section below for more information.
Step 3) Modify the gtsched and gtexecd configuration using the configuration editors. They can be
launched by running GTIHOME/<version>/services/bin/gtexecdconf and
GTIHOME/<version>/services/bin/gtschedconf or through GT-ISE File > Advanced > Local *
Configuration.
The attributes below are the most important and most commonly modified when integrating with 3rd party
software:
Step 4) Start gtexecd by clicking the "start" button in the configuration editor launched in the previous
step. Alternatively, it can be started from command line via: $GTIHOME/<version>/services/bin/gtexecd
start
Note: To tell if the gtexecd daemon started properly, go to the Start/Stop/Reload tab of the configuration
editor and click the View Log button. If the daemon started correctly, it should look similar to:
07/14/05 08:36:29 Log file was created in: <WORK_DIR>
07/14/05 08:36:29 Start DB command executed
07/14/05 08:36:41 Environment variable GTIHOME: ($GTIHOME)
07/14/05 08:36:41 Environment variable LCTYPE: [No Value]
07/14/05 08:36:41 Environment variable GTISOFT_LICENSE_FILE: (27005@<license-server>)
07/14/05 08:36:41 Server starting on port: 3490
Step 5) After the gtexecd daemon has been started, verify the gtexecd daemon, script and 3rd party
software are all working properly together by submitting a remote job from GT-ISE to the hostname
running gtexecd. If the remote run is successful, then the gtexecd daemon, script linking to 3rd party
software, and the 3rd party software itself are all running properly. If it is not successful, look in the
various .log, .gdxlog, and .out files under the Working Directory of gtexecd, as well as the log files of the
3rd party software to troubleshoot the problem.
Note: When submitting a remote run with 3rd party software, the Java window will not display any
information until the simulation has been fully completed and all files have been passed back to the
WORK_DIR/<temp-packet-id> directory of gtexecd. The only information displayed in the Java window
is information echoed inside the scripts linking gtexecd and 3rd party software to be output to the
screen.output file.
Step 6) Start gtsched by clicking the "start" button in the configuration editor launched in step 3.
Alternatively, it can be started from command line via: $GTIHOME/<version>/services/bin/gtsched
start
Step 7) Verify the gtsched daemon was started correctly by starting the Node Administrator (command
shown below) and logging in to the host running gtsched. Note that only one gtexecd daemon should
appear under the Nodes tab when linking with 3rd party software. The Node Administrator can be
launched by typing the following in a shell window:
Step 8) Submit a model to run distributed in GT-ISE and verify the distributed run is successful.
6.3 Scripts to Link with 3rd Party Software
Important! All scripts are always called by gtexecd with the current working directory set to the working
directory on the gtexecd node where the files needed to run the simulation exist (i.e.
WORK_DIR/<packet-id>)
6.3.1 Submit script
The submit script is invoked by gtexecd with a set of arguments,
where VERSION, PRODUCT, and the values for the other arguments come from a file named job.info.
See the sample scripts for the exact command including the other arguments. The job.info file is created
by gtexecd for each packet. It is used to pass additional information from gtexecd to the linking scripts. A
listing and description of all variables contained in the job.info file is shown below:
BITS - Specifies whether the 32 or 64 bit solver should be launched. The value will be
either 32 or 64.
CASES - The range of cases for the given packet.
CASE_COUNT - Number of cases for the given packet.
GUI_BUILD - The build number of the GT-SUITE pre-processor, GT-ISE, used to create the
<model>.dat file. This value is very rarely used by the linking scripts.
LAST_PACKET - Specifies whether this packet is the last one created by the scheduler. Will
have a value of "true" or "false". It will be "true" for only 1 packet in a given
simulation, and "false" for all other packets. Note that when this value is "true", the
value of PACKET_ID gives the total number of packets for the simulation.
MAX_CASES_PER_PACKET - An (optional) override value entered from the run simulation wizard
indicating the maximum number of cases in a packet. This value will override the
setting on the scheduler.
MAX_LICENSES_TO_USE - An (optional) override value entered from the run simulation wizard
indicating the maximum number of licenses this simulation can use. This value will
override the setting on the scheduler.
MIN_CASES_PER_PACKET - An (optional) override value entered from the run simulation wizard
indicating the minimum number of cases in a packet. This value will override the
setting on the scheduler.
MODELNAME - The prefix of the model name (i.e. "4cyl" for a model named "4cyl.gtm").
PACKET_ID - A unique id for each packet of a simulation. The values start at 1 and
increase sequentially such that the PACKET_ID of the last packet equals the total
number of packets.
PRECISION - Precision of the solver to be used, as specified in the user's setting in
File-Options-Run in GT-ISE. A value of sp or dp will be written, for single and
double precision respectively. This is typically used in running the solver in the
run script.
PRIORITY - Simulation priority assigned during job submission. As far as GT-SUITE is
concerned, this only affects the priority that the job has in the gtsched queue (i.e.
the order in which the submit script is called for individual packets).
PRODUCT - The product (i.e. GT-POWER, GT-SUITEmp, etc.) of the model being run. This is
typically used in running the solver in the run script (i.e. $GTIHOME/bin/$PRODUCT …).
SIMULATION_ID - An id unique to each simulation. Each packet for a given simulation will
have the same SIMULATION_ID.
SOLVER_BUILD_NO - The build number of the solver which should be run. This build number is
chosen by the user from the GT-ISE user interface. This is typically used in running
the solver in the run script.
USERNAME - Username of the user who submitted the job. By default all jobs started by
gtexecd will be owned by the user who owns the gtexecd process. If this is not
desirable, the run script can use the "sudo" command to change to the user who
actually submitted the job.
VERSION - Version of the solver which should be used to run the given packet. This is
typically used in running the solver in the run script.
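For illustration, a fragment showing how a linking script might read these values, assuming job.info
contains one NAME=value pair per line (the sample scripts in $GTIHOME/<version>/services/scripts
show the actual format used):

#!/bin/sh
# Read selected values out of job.info (assumed NAME=value format).
VERSION=`grep '^VERSION=' job.info | cut -d'=' -f2`
PRODUCT=`grep '^PRODUCT=' job.info | cut -d'=' -f2`
MODELNAME=`grep '^MODELNAME=' job.info | cut -d'=' -f2`
echo "Submitting packet for $MODELNAME ($PRODUCT, version $VERSION)"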
2) The files necessary to run the solver will be placed in the gtexecd packet directory by gtexecd. There
are two options with regards to the handling of these files:
• Shared directory - the files are left exactly where gtexecd creates them. This means that the working
directory for gtexecd must be a shared location so that all 3rd party execution nodes have access to the
files. The advantage of this solution is that the following scripts are not necessary: upload script,
download script, script to cancel a running packet. The disadvantage of this option is most likely
increased network traffic. This option is the one shown in the sample scripts.
• Local copy to 3rd party node - the files are copied from the gtexecd working directory to the local
directory where the solver is executed on the 3rd party execution node. The advantage of this solution
is perhaps decreased network traffic and file access times since the files are local to the node. The
disadvantage is that there is more responsibility placed on the administrator-written scripts to transfer
files between the gtexecd working directory and the local directory on the execution node. Gtexecd
does not have any knowledge of the directory on the execution node. It only knows about the working
directory it created. Therefore all results files must be copied back to the gtexecd working directory.
The run script will be responsible for this. In addition, the following scripts must be developed: upload
script, download script, script to cancel a running packet. These scripts effectively link the gtexecd
working directory to the local directory on the execution node so that files can be transferred back and
forth. More details about these scripts are shown below.
3) GT-SUITE uses a database to store temporary results during a simulation run. A special mode of the
database, called "ds" mode, should be used when linking with 3rd party software. It is an alternative to the
conventional mode used for non-3rd party simulations.
If the database is run in conventional mode, when a simulation starts a database will be created if one is
not already running. After the simulation has completed the database will stay running. This will cause a
problem with some external queuing systems, because they require that all processes launched within a
task are finished before the task can be considered complete. If using third party distribution software,
each GT-SUITE solver process should create its own database that can be started and stopped on demand
without conflicting with other processes running on the same node. This is the purpose of the "ds" mode.
To run the database in "ds" mode, include the following as part of the run script:
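A sketch of the relevant line, based on the command form shown in item (e) below (see the sample scripts
for the exact usage in context):

${GTIHOME}/bin/gtcollect -V ${VERSION} dbstart ds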
There are some limitations of the "ds" mode, which are listed below. Normally, these do not present any
problems when linking to 3rd party software, but they are mentioned here for completeness.
a) Only one simulation with the ds flag can be run in a given directory at the same time.
b) The simulation must be started in the directory where the ds database exists. In other words, do not
specify a path (relative or absolute) to the model file (.dat or .gtm) when invoking the solver; it will
not connect to the "ds" database.
c) The "ds" mode is only suitable for non-interactive mode. Neither GT-POST nor GT-ISE should be
launched from the model working directory while the "ds" mode is active.
d) The "ds" flag is only available on Linux
e) The following capability is very rarely needed. It is not recommended to use this feature unless
specifically directed to do so by GT support. To ensure that the parent process ID of the database is
not reverted to 1, add the file lock.db in the folder before starting the database. Then launch the
database in the background (i.e. "${GTIHOME}/bin/gtcollect -V ${VERSION} dbstart ds &").
4) In the gtexecd configuration, there is an option to "Use .status file?" In nearly all cases, this should be
set to true (check box on). The .status file is the mechanism to pass status information between the 3rd
party system and the GT-SUITE distributed computing services.
If the .status file is used, gtexecd will check for a .status file in the gtexecd working directory at an
interval specified in the gtexecd configuration. The status shown in the GT-ISE user interface will be
updated based on the text written in the .status file. If the .status file contains the text "RUNNING" the
packet status will change to "Running" in GT-ISE. If the .status file contains the text "FINISHED", this
will inform gtexecd that the 3rd party execution is complete. At that point, gtexecd will determine the
packet's status based on the files produced by the solver. Typically the .status file should be created with
the text "RUNNING" in the beginning of the run script. At the end of the run script, the .status file should
be updated with the text "FINISHED". If the .status file does not exist (i.e. submit script has been called
and the run script has not started running because it is queued on the 3rd party queue) the status in GT-ISE
will show as "Queued Externally".
If the .status file is not used, then gtexecd will set the packet status to "Running" as soon as the submit
script is called. The status will stay as running until the submit script exits. At that point, gtexecd will
assume the 3rd party execution is complete and will determine the packet's status based on the files
produced by the solver. Please note that the sample scripts provided do not conform to this use. This
option is generally not recommended, and has been left as an option only for legacy reasons.
5) The standard error and standard out from the submit script are written to a file screen.output which the
user can view from the interface by clicking the "View Screen" button in the detailed simulation
information window. When 3rd party queuing software is not used, the screen.output file will show the
standard error and standard out from the GT-SUITE solver. This information can be very useful to the
user.
When linking with 3rd party software, it is recommended that the run script appends its output to the
screen.output file in the gtexecd working directory. This way the user will be able to see the output of the
submit script as well as the output from the solver.
6) All packets will be run on 3rd party software as the user who owns the gtexecd process. In some
environments, it is desirable to run the run script as the actual user who submitted the job. In this case,
the "sudo" command can be used, and the name of the user can be obtained from the job.info file, as
sketched below.
6.3.2 Script to cancel a running packet
When the user chooses to remove a simulation or resubmit a packet while it is running, the currently
running solver process for that packet must be stopped. To accomplish this, gtexecd will create a
<model>.hlt file containing text informing the solver to stop immediately, and do a proper clean up of
temporary files. The script should simply copy the .hlt file created by gtexecd to the local directory where
the solver is running.
6.3.3 Download script
In the Detailed Info dialog of the Queue window (as shown below), the user can view solver output for
each packet as the simulation is running through the "View .msg" and "View Screen" buttons. These
buttons will allow the user to view the <model>.msg and screen.output files contained in the gtexecd
working directory. If the solver is not running in the gtexecd working directory, the files in this location
do not have up to date information. When the user clicks either of the buttons, the download script will be
called. The argument passed to the script will be the name of the file requested. The download script
should copy the requested file from the directory where the solver is running to the gtexecd working
directory.
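A minimal sketch of a download script for the "local copy" setup (SOLVER_RUN_DIR is an assumed
variable naming the directory on the execution node where the solver runs):

#!/bin/sh
# $1 is the name of the requested file (e.g. <model>.msg or screen.output).
# Copy it from the solver's local directory back to the gtexecd working
# directory, which is the current directory when gtexecd calls the script.
cp "${SOLVER_RUN_DIR}/$1" .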
6.3.4 Upload script
This script is used to transfer a specific file from the gtexecd working directory to the directory where the
solver is running. The only currently known use of this script is to transfer a <model>.hlt file to the
directory where the solver is running in order to tell the solver to halt a currently running packet and save
the results. When the user clicks the halt button, the upload script will be called and the file name to
upload will be passed as the argument to the script.
6.3.5 Post-script
This script is not specific to 3rd party queuing, but is mentioned here, because the most common
application occurs with 3rd party queuing software. There is an option to have the scheduler execute a
customized script, written by the administrator when all packets of a simulation have completed. One
example of an application could be to send an e-mail to the user when the job is complete. For more
details please see the section titled "Running a Customized Script After Job Completion".
6.4 Configuration Considerations
It is recommended that the user refer to the section "Setting up Distributed Computing" for general setup
instructions. This section will discuss issues specific to use with 3rd party distribution software. A
graphical representation of the distributed computing system and the processes which run on each
computer is shown below.
[Diagram: Computer A is the client computer where GT-ISE runs; Computer B runs the GTI distribution
process (gtsched) and the GTI node process (gtexecd) together with the submit script; Computer C is the
head node of the 3rd party distribution software; Computers D are the 3rd party execution nodes where
the GTI solvers run.]
The requirements for an environment using a 3rd party queuing software are:
1) Computer A should be able to connect to Computer B through TCP/IP (the communication method
between GT-ISE and gtsched).
2) Computer B should have a different working directory for gtsched and gtexecd. The working
directories are specified during configuration.
3) gtsched and gtexecd do not need to run on the same computer, but in most cases this is the most
convenient. If they run on different computers, they must be able to communicate through TCP/IP.
4) Computer B must be able to run the scripts which link to the 3rd party queuing software.
5) Computers A and B can be any platform supported by GT-SUITE
(http://www.gtisoft.com/platform.html).
6) Computers B and C may be the same computer, but do not need to be.
7) Computer D must have access to the GT-SUITE installation to run the solver.
Note: GT-SUITE needs to be installed on computers A, B, and D or they may point to a network GT-
SUITE installation.
The run_gtsuite.sh script needs to be updated. If it is not, V2016 will be interpreted as "less than" 7.X,
because the script's string comparisons effectively compare only the first digit of the version string, and
the wrong section of the script would be executed. This would ultimately cause the values of "Solver
build number" and "bits" (32 or 64) chosen by the user in GT-ISE to be ignored by the scripts for jobs
run from V2016 or later. A sketch of the pitfall is shown below.
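A minimal sketch of the pitfall, with illustrative variable names (this is not the actual run_gtsuite.sh
logic):

    #!/bin/bash
    VERSION="2016"
    # Lexical comparison: '2' sorts before '7', so "2016" compares as
    # "less than" "7.0" and V2016 is wrongly routed to the pre-7.X
    # branch of the script.
    if [[ "$VERSION" < "7.0" ]]; then
        echo "wrongly treated as older than 7.X"
    fi
    # Comparing the leading integer numerically behaves as intended:
    if [[ "${VERSION%%.*}" -ge 2016 ]]; then
        echo "correctly treated as V2016 or later"
    fi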
There are two main tabs of the Node Administrator: Distribution Node and Solver Node.
Distribution Node
The Distribution Node tab displays information about the Distribution Server, including a list of available
nodes, the performance and availability of each node, and log information showing the past distributed
events that were submitted. Additionally, users have the same options in the Node Administrator as in the
"View Queue..." window of GT-ISE, where detailed information for a given simulation can be viewed
and the data of a completed simulation can be fetched.
To connect to the distribution server, specify the Host Name (or IP address) of the Distribution Server
(machine running gtsched), the Username and the Port number (typically 3491), then hit "Connect".
The "Nodes" folder displays the list of the nodes connected to the distribution server, as well as the IP
address, performance and availability of each node. The performance indexes can be set inside either the
gtsched.cfg or the gtexecd.cfg configuration files. These indices identify the relative speed of the
computers and they are used when splitting a simulation into packets.
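A hedged illustration of setting a performance index (the attribute name and syntax here are assumptions;
check the comments inside your gtsched.cfg/gtexecd.cfg for the exact form):

    # In the node's configuration file (illustrative syntax): a node
    # with index 2.0 is treated as roughly twice as fast as a node
    # with index 1.0 when a simulation is split into packets.
    performance = 2.0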
Solver Node
The Solver Node tab is used to monitor the system information of a given solver node, including the
number of processors used by the solver, processor speed, available memory, GTIHOME, and other
useful information. To connect to a solver node, specify the Host Name (machine running gtexecd),
Username, and the Port number (typically 3490), and hit "Connect".
- If a Fortran user routine is used (i.e. using 'UserCodeFReference'), the user .dll or .so on the client
machine will NOT be transported to the solver nodes by default. In order to make use of the user
library with distributed computing, the user has two options, shown below. Note that for C code
(i.e. using 'UserCodeCControls'), this step is not necessary.
1) There is a MiscFiles tab under the Run->Advanced Setup menu. This allows the user to specify
additional files that should be transferred to the solver node for distributed computing, such as
GTIusrXX.dll. Note that the library will only work on the platform on which it was compiled, so user
libraries should be compiled on the same platform as the solver nodes (if different from the client
computer platform).
If your solver nodes contain a heterogeneous mix of PC and Linux machines, it will be necessary to
compile the user routine for both platforms. The user library extensions are as follows:
- PC (Windows): .dll
- Linux: .so
IMPORTANT: Although the user library is transferred to the solver node, it will not be used unless its
location on the solver node is specified in the library search path. This can be done by adding {JOBDIR}
to the "Prepend Library Path" attribute in the solver node configuration. {JOBDIR} will be resolved to
the directory on the solver node where the model will be run.
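As a hedged illustration (the attribute name is taken from the text above, but the exact configuration
syntax may differ in your installation):

    # Solver node configuration (illustrative syntax): put the per-job
    # run directory on the library search path so the transferred
    # GTIusrXX.dll/.so can be loaded by the solver.
    Prepend Library Path = {JOBDIR}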
computing cluster. For the recommended installation in which all solver nodes point to a common shared
installation, it is only necessary to paste the file once.
On the scheduler
- When the simulation is started, it is assigned an ID number. In the example below, the model
Virtual_Vehicle_System_Demo has an ID 1586. This will be referred to as SIMULATION_ID.
- The directory <sched_wd>\<SIMULATION_ID> is created.
- When packet X (X = 1, 2, 3…) completes on the solver node, those results are copied back to
<sched_wd>\<SIMULATION_ID>\nodeX (node1, node2, etc. in the example) if everything is working
correctly and there is disk space available on the scheduler. At that time the directory
<execd_wd>\<REMOTE_ID> will be deleted on the solver node, since it is no longer needed.
Figure 9.1 Scheduler job queue dialog showing SIMULATION_ID value in "ID" column
- If there is no disk space left on the scheduler, the copy back to the scheduler will fail, and the
results will be left on the solver node in <execd_wd>\<REMOTE_ID>.
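For example, after a three-packet job with SIMULATION_ID 1586 completes successfully, the scheduler
working directory would contain a layout like the following (illustrative):

    <sched_wd>\1586\
        node1\   (packet 1 results: model.gdx and other output files)
        node2\
        node3\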
Process to manually recover results that were not copied back to the scheduler
1) Find which packets were not successfully copied back to the scheduler by viewing the directories
located at <sched_wd>\<SIMULATION_ID>. For each packet that was copied back successfully,
there will be a nodeX directory, with a model.gdx (.gdt in 7.3 and earlier) and other output files.
2) For any nodeX directories that are completely missing or are missing a .gdx (.gdt in 7.3 and
earlier) file, you will need to manually copy the entire <execd_wd>\<REMOTE_ID> directory from
the corresponding solver node and move it to
<sched_wd>\<SIMULATION_ID>\node<PACKET_ID> on the scheduler machine. The easiest
way to get the solver node name and REMOTE_ID is from the "Detailed Simulation Information"
dialog shown in Figure 9.2.
3) Once all the node* directories exist in <sched_wd>\<SIMULATION_ID> the results need to be
merged together. This can be done in one of two ways:
a. First try this approach: from the "Processed" tab in the distributed queue, click the button
"Recombine All Cases" or "Recombine Good Cases". Only one of the two options will be
available, so use that one. If it completes successfully, the results can be fetched back to the
client as usual.
b. If that does not work, the results can be combined by running gtcollect at the command line.
For example:
   C:\GTI\bin\gtcollect.bat -multi -c FULL -d C:\temp\gtidata\serverdir\1586 -m
   Virtual_Vehicle_System_Demo -v ALL
The fetch operation from the UI probably will not work, so the files will need to be
manually copied back to the client machine. Copy all files in
<sched_wd>\<SIMULATION_ID>, but not the node* directories; those are not needed.
You can then remove the job from the scheduler through the distributed queue interface in
GT-ISE, and it will delete <sched_wd>\<SIMULATION_ID>.
For example, consider that packets 1 and 2 of the job with SIMULATION_ID = 1586 shown in Figures
9.1 and 9.2 were successfully copied back to the scheduler, but packet 3 was not. To manually recover the
data, these steps would be followed:
- From Figure 9.2, determine that packet 3 ran on the node Cerberus.
- Go to the <execd_wd> on Cerberus and copy the files in <execd_wd>\379.
- On the scheduler, create a directory named "node3" inside <sched_wd>\1586.
- Paste the files from <execd_wd>\379 into <sched_wd>\1586\node3.
- Complete step 3 above to merge the results of all packets into a single result file and fetch the results.
The steps for local distributed/batch are identical, except that the working directory will be determined by
the settings in File > Options > Run Distributed > Local Distributed. There is a "Working directory for
temporary files"; under this directory there will be a serverdir (the scheduler working directory) and a
simdir (the solver node working directory).
2. If gtexecd is running on a solver node but cannot be seen by gtsched, check the following:
If it is a firewall issue, then add two ports on the solver node machine as exceptions to the
firewall as shown below, by going to Start>Settings>Control Panel>Windows
Firewall>Exceptions>Add Port...
3.