Merged
88 commits
7fd34fa
first iteration of multidb tests
Aug 10, 2023
b8ba2e6
Merge branch 'develop' of github.com:CrayLabs/SmartSim into multdata
Aug 15, 2023
acf92f8
tests and implementation of db_identifier in colo and standard db
Aug 15, 2023
40cc023
linting
Aug 15, 2023
eeee266
rework uniqueness testing and lint
Aug 15, 2023
9fc0913
changes to db_ident initialization and env variable settings
Aug 21, 2023
aa2b4d6
fixing case for no db_id in orchestrator
Aug 21, 2023
2537b0a
typo
Aug 21, 2023
75f21d2
env var and client tests
Aug 23, 2023
be3c298
linting and big fixes
Aug 23, 2023
6ff15af
change how im getting num_shards
Aug 23, 2023
5565084
after rework of tests
juliaputko Aug 25, 2023
78e6dcc
all test run - working
juliaputko Sep 1, 2023
5e58db1
resolve merge and synch with develop
Sep 1, 2023
7171f3a
Bill PR comments and pylint fixes
juliaputko Sep 5, 2023
10e0dca
comment out mypy for CI test
Sep 6, 2023
17dfd96
fixing keyerror
Sep 6, 2023
07b78ae
addressing Andrew pr comment and typehint fix
juliaputko Sep 8, 2023
fa15e43
andrew pr review and type hint
Sep 8, 2023
d93d12b
added multi node test with db id, change for no db-id in naming
juliaputko Sep 8, 2023
2210616
Merge branch 'multdb' of github.com:juliaputko/SmartSim into multdata
Sep 8, 2023
7359f70
type change
juliaputko Sep 8, 2023
97b3e64
Merge branch 'multdb' of github.com:juliaputko/SmartSim into multdata
Sep 8, 2023
c4bf7ed
fix type error
juliaputko Sep 8, 2023
d27d5b9
Merge branch 'multdb' of github.com:juliaputko/SmartSim into multdata
Sep 8, 2023
a5b606a
Merge branch 'develop' of github.com:CrayLabs/SmartSim into multdata
Sep 11, 2023
43ddefc
type check fix
juliaputko Sep 11, 2023
21b4ba5
Merge branch 'multdb' of github.com:juliaputko/SmartSim into multdata
Sep 11, 2023
37d99b9
point to bill's branch
juliaputko Sep 11, 2023
ca56297
Merge branch 'multdb' of github.com:juliaputko/SmartSim into multdata
Sep 11, 2023
3558d54
fix on pointing to bill's branch
juliaputko Sep 11, 2023
32c5ee2
change so point to bill's branch
Sep 11, 2023
3bcb48b
type change, bug fix, clean note
Sep 12, 2023
b10c612
mypy and linting error fix
Sep 12, 2023
86eb426
allow for multiple start, remove error check if multiple orchestrators
juliaputko Sep 13, 2023
511c741
github CI error fix
juliaputko Sep 13, 2023
f4af032
merge conflicts
Sep 13, 2023
eb87ff3
change redis version
Sep 14, 2023
4e372c2
added test_interface and passed dbid to Dbnode
juliaputko Sep 15, 2023
c366be3
Merge branch 'multdb' of github.com:juliaputko/SmartSim into multdata
Sep 15, 2023
e090d86
run-tests conflict resolve
Sep 15, 2023
60dd9fb
Merge branch 'develop' into multdata
juliaputko Sep 15, 2023
9dfb8f3
run tests change
Sep 15, 2023
58a3067
run tests change
Sep 15, 2023
2b51361
Merge branch 'multdata' of github.com:juliaputko/SmartSim into multdata
Sep 15, 2023
0afd27c
test with adjusted helped and adjusted tests
Sep 15, 2023
775c71c
test with adjusted helped and adjusted tests
Sep 15, 2023
d834f8d
Merge branch 'multdata' of github.com:juliaputko/SmartSim into multdata
Sep 15, 2023
27b7314
Merge branch 'multdata' of github.com:juliaputko/SmartSim into multdata
Sep 15, 2023
6193c44
Merge branch 'multdata' of github.com:juliaputko/SmartSim into multdata
Sep 15, 2023
85472b3
whitespace
Sep 15, 2023
91a0194
upload artifact on test failure
ashao Sep 18, 2023
5a843fb
comment upload
ashao Sep 18, 2023
ff4dc34
Merge branch 'develop' of github.com:CrayLabs/SmartSim into multdata
ashao Sep 18, 2023
091a720
uncomment upload test artifacts
ashao Sep 18, 2023
c17a733
Iterate over dbs in experiment.stop
ashao Sep 18, 2023
96a61a1
Change order of arguments in Client constructor
ashao Sep 18, 2023
dc793d3
Reorder arguments in SmartRedis client for compatability with new
ashao Sep 18, 2023
f5cbe5c
Remove artifact upload
ashao Sep 18, 2023
12acfd9
tests in new file, remove Client(None), rework multiple start() for m…
Sep 19, 2023
7f424f5
Merge branch 'develop' of github.com:CrayLabs/SmartSim into multdata
Sep 20, 2023
cc468f0
restart of dbs,debugged tests,renamed helper function
Sep 21, 2023
327cd72
linting and stop_db change
Sep 21, 2023
a7452a3
linting
Sep 21, 2023
1f6fbdc
key error bug squash
Sep 21, 2023
6fa2587
key error bug squash
Sep 21, 2023
81adeaf
Merge branch 'multdata' of github.com:juliaputko/SmartSim into multdata
Sep 21, 2023
a51d77c
key error and test bug fix
Sep 21, 2023
bff1633
lint
Sep 21, 2023
9f4c590
refactor of uniqueness checking - in controller
juliaputko Sep 26, 2023
17991fa
Merge branch 'develop' of github.com:CrayLabs/SmartSim into multdata
Sep 26, 2023
f31b319
fixing orchestrator no db_id error,adjusting test ports for colo then…
Sep 27, 2023
72b935b
empty db-id if orchestrator not named
juliaputko Sep 27, 2023
6c4f648
fix orchestrator failures
juliaputko Sep 27, 2023
6031d56
database name changed to orchestrator
Sep 27, 2023
804bbcb
Al PR comment fixes
Oct 2, 2023
1d7689d
sync with develop
Oct 2, 2023
a7b3399
pylint fix
Oct 3, 2023
821d8a0
Merge branch 'develop' of github.com:CrayLabs/SmartSim into multdata
Oct 3, 2023
c7fe5b4
whitespace fix
Oct 3, 2023
5a1bbd2
Andrew Pr review fix
Oct 4, 2023
680d4f2
synch with develop
Oct 4, 2023
0e90072
added missing arg from setup_test_colo call
Oct 5, 2023
4e46d86
sync with develop - resolve merge conflicts
Oct 10, 2023
47edef4
pylint and mypy error fixes
Oct 10, 2023
4c9810f
mypy bug revert
Oct 11, 2023
9f9e6e6
orchestrator as default name to db_identifier
juliaputko Oct 11, 2023
5bdf6eb
pointing to develop smartredis and changelog
Oct 11, 2023
2 changes: 2 additions & 0 deletions .github/workflows/run_tests.yml
@@ -100,9 +100,11 @@ jobs:
       # on developments of the client are brought in.
       - name: Install SmartSim (with ML backends)
         run: |
+
           python -m pip install git+https://github.com/CrayLabs/SmartRedis.git@develop#egg=smartredis
           python -m pip install .[dev,ml]
+
 
       - name: Install ML Runtimes with Smart (with pt, tf, and onnx support)
         if: (matrix.py_v != '3.10')
         run: smart build --device cpu --onnx -v
2 changes: 1 addition & 1 deletion .pylintrc
@@ -325,7 +325,7 @@ valid-metaclass-classmethod-first-arg=mcs
 max-args=9
 
 # Maximum number of locals for function / method body
-max-locals=19
+max-locals=20
 
 # Maximum number of return / yield for function / method body
 max-returns=11
5 changes: 4 additions & 1 deletion conftest.py
@@ -675,6 +675,7 @@ def setup_test_colo(
         fileutils: t.Type[FileUtils],
         db_type: str,
         exp: Experiment,
+        application_file: str,
         db_args: t.Dict[str, t.Any],
         colo_settings: t.Optional[t.Dict[str, t.Any]] = None,
         colo_model_name: t.Optional[str] = None,
@@ -683,7 +684,8 @@ def setup_test_colo(
         """Setup things needed for setting up the colo pinning tests"""
         # get test setup
         test_dir = fileutils.make_test_dir(level=2)
-        sr_test_script = fileutils.get_test_conf_path("send_data_local_smartredis.py")
+
+        sr_test_script = fileutils.get_test_conf_path(application_file)
 
         # Create an app with a colo_db which uses 1 db_cpu
         if colo_settings is None:
@@ -705,6 +707,7 @@ def setup_test_colo(
             "deprecated": colo_model.colocate_db,
             "uds": colo_model.colocate_db_uds,
         }
+
         colocate_fun[db_type](**db_args)
         # assert model will launch with colocated db
         assert colo_model.colocated
3 changes: 3 additions & 0 deletions doc/changelog.rst
@@ -44,6 +44,7 @@ Released on 14 September, 2023
 
 Description
 
+- Add support for multiple databases
 - Add typehints throughout the SmartSim codebase
 - Provide support for Slurm heterogeneous jobs
 - Provide better support for `PalsMpiexecSettings`
@@ -61,6 +62,7 @@ Description
 
 Detailed Notes
 
+- Add support for creation of multiple databases with unique identifiers. (PR342_)
 - Add methods to allow users to inspect files attached to models and ensembles. (PR352_)
 - Add a `smart info` target to provide rudimentary information about the SmartSim installation. (PR350_)
 - Remove unnecessary generation producing unexpected directories in the test suite. (PR349_)
@@ -84,6 +86,7 @@ Detailed Notes
 - Update pylint dependency, update .pylintrc, mitigate non-breaking issues, suppress api breaks. (PR311_)
 - Refactor the `smart` CLI to use subparsers for better documentation and extension. (PR308_)
 
+.. _PR342: https://github.com/CrayLabs/SmartSim/pull/342
 .. _PR352: https://github.com/CrayLabs/SmartSim/pull/352
 .. _PR351: https://github.com/CrayLabs/SmartSim/pull/351
 .. _PR350: https://github.com/CrayLabs/SmartSim/pull/350
2 changes: 1 addition & 1 deletion smartsim/_core/_cli/utils.py
@@ -24,7 +24,7 @@
 # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
-import importlib
+import importlib.util
 import shutil
 import subprocess as sp
 import sys
4 changes: 2 additions & 2 deletions smartsim/_core/_cli/validate.py
@@ -164,7 +164,7 @@ def _make_managed_local_orc(
     exp.start(orc)
     try:
         (client_addr,) = orc.get_address()
-        yield Client(address=client_addr, cluster=False)
+        yield Client(False, address=client_addr)
     finally:
         exp.stop(orc)
 
@@ -243,7 +243,7 @@ def forward(self, x: torch.Tensor) -> torch.Tensor:
     forward_input = torch.rand(1, 1, 3, 3)
     traced = torch.jit.trace(net, forward_input) # type: ignore[no-untyped-call]
     buffer = io.BytesIO()
-    torch.jit.save(traced, buffer) # type: ignore[no-untyped-call]
+    torch.jit.save(traced, buffer) # type: ignore[no-untyped-call]
     model = buffer.getvalue()
 
     client.set_model("torch-nn", model, backend="TORCH", device=device)
160 changes: 106 additions & 54 deletions smartsim/_core/control/controller.py
@@ -27,22 +27,34 @@
 from __future__ import annotations
 
 import os.path as osp
+from os import environ
 import pickle
 import signal
 import threading
 import time
 import typing as t
 
-from smartredis import Client
+from smartredis import Client, ConfigOptions
 
 from ..._core.launcher.step import Step
 from ..._core.utils.redis import db_is_active, set_ml_model, set_script, shutdown_db
+from ..._core.utils.helpers import (
+    unpack_db_identifier,
+    unpack_colo_db_identifier,
+)
 from ...database import Orchestrator
 from ...entity import Ensemble, EntityList, EntitySequence, Model, SmartSimEntity
-from ...error import LauncherError, SmartSimError, SSInternalError, SSUnsupportedError
+from ...error import (
+    LauncherError,
+    SmartSimError,
+    SSInternalError,
+    SSUnsupportedError,
+    SSDBIDConflictError,
+)
 from ...log import get_logger
 from ...settings.base import BatchSettings
 from ...status import STATUS_CANCELLED, STATUS_RUNNING, TERMINAL_STATUSES
+from ...servertype import STANDALONE, CLUSTERED
 from ..config import CONFIG
 from ..launcher import (
     CobaltLauncher,
@@ -213,6 +225,7 @@ def stop_entity_list(self, entity_list: EntitySequence[SmartSimEntity]) -> None:
         :param entity_list: entity list to be stopped
         :type entity_list: EntitySequence
         """
+
         if entity_list.batch:
             self.stop_entity(entity_list)
         else:
@@ -308,18 +321,24 @@ def _launch(self, manifest: Manifest) -> None:
         :param manifest: Manifest of deployables to launch
         :type manifest: Manifest
         """
-        orchestrator = manifest.db
-        if orchestrator:
+
+        # Loop over deployables to launch and launch multiple orchestrators
+        for orchestrator in manifest.dbs:
+            for key in self._jobs.get_db_host_addresses():
+                _, db_id = unpack_db_identifier(key, "_")
+                if orchestrator.name == db_id:
+                    raise SSDBIDConflictError(
+                        f"Database identifier {orchestrator.name}"
+                        " has already been used. Pass in a unique"
+                        " name for db_identifier"
+                    )
+
             if orchestrator.num_shards > 1 and isinstance(
                 self._launcher, LocalLauncher
             ):
                 raise SmartSimError(
                     "Local launcher does not support multi-host orchestrators"
                 )
-            if self.orchestrator_active:
-                msg = "Attempted to launch a second Orchestrator instance. "
-                msg += "Only 1 Orchestrator can be active at a time"
-                raise SmartSimError(msg)
             self._launch_orchestrator(orchestrator)
 
         if self.orchestrator_active:
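The uniqueness check that this hunk adds to `_launch` can be sketched in isolation. This is a minimal stand-in rather than SmartSim's code: `DBIDConflictError` and `check_unique` are hypothetical names, and the sketch assumes the identifiers of already-registered databases are available as plain strings, whereas the diff recovers them from job-manager keys via `unpack_db_identifier`.

```python
from typing import Iterable


class DBIDConflictError(Exception):
    """Hypothetical stand-in for SSDBIDConflictError."""


def check_unique(new_id: str, active_ids: Iterable[str]) -> None:
    """Raise if new_id matches the identifier of an already-registered database."""
    for db_id in active_ids:
        if new_id == db_id:
            raise DBIDConflictError(
                f"Database identifier {new_id} has already been used. "
                "Pass in a unique name for db_identifier"
            )


# Illustrative identifiers only; none of these come from the PR.
active = ["orchestrator", "db_a"]
check_unique("db_b", active)  # no conflict, returns None

try:
    check_unique("db_a", active)
    conflict = False
except DBIDConflictError:
    conflict = True
```

Because the check compares against every registered identifier before launch, a second database can start as long as its name differs, which is what replaces the removed "Only 1 Orchestrator can be active at a time" error.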
@@ -335,10 +354,8 @@ def _launch(self, manifest: Manifest) -> None:
                 batch_step = self._create_batch_job_step(elist)
                 steps.append((batch_step, elist))
             else:
-                # if ensemble is to be run as separate job steps, aka not in a batch
                 job_steps = [(self._create_job_step(e), e) for e in elist.entities]
                 steps.extend(job_steps)
-
         # models themselves cannot be batch steps. If batch settings are
         # attached, wrap them in an anonymous batch job step
         for model in manifest.models:
@@ -368,7 +385,6 @@ def _launch_orchestrator(self, orchestrator: Orchestrator) -> None:
         :type orchestrator: Orchestrator
         """
         orchestrator.remove_stale_files()
-
         # if the orchestrator was launched as a batch workload
         if orchestrator.batch:
             orc_batch_step = self._create_batch_job_step(orchestrator)
@@ -491,23 +507,45 @@ def _prep_entity_client_env(self, entity: Model) -> None:
         :param entity: The entity to retrieve connections from
         :type entity: Model
         """
+
         client_env: t.Dict[str, t.Union[str, int, float, bool]] = {}
-        addresses = self._jobs.get_db_host_addresses()
-        if addresses:
-            if len(addresses) <= 128:
-                client_env["SSDB"] = ",".join(addresses)
-            else:
-                # Cap max length of SSDB
-                client_env["SSDB"] = ",".join(addresses[:128])
-            if entity.incoming_entities:
-                client_env["SSKEYIN"] = ",".join(
-                    [in_entity.name for in_entity in entity.incoming_entities]
-                )
-            if entity.query_key_prefixing():
-                client_env["SSKEYOUT"] = entity.name
+        address_dict = self._jobs.get_db_host_addresses()
+
+        for db_id, addresses in address_dict.items():
+            db_name, _ = unpack_db_identifier(db_id, "_")
+
+            if addresses:
+                if len(addresses) <= 128:
+                    client_env[f"SSDB{db_name}"] = ",".join(addresses)
+                else:
+                    # Cap max length of SSDB
+                    client_env[f"SSDB{db_name}"] = ",".join(addresses[:128])
+                if entity.incoming_entities:
+                    client_env[f"SSKEYIN{db_name}"] = ",".join(
+                        [in_entity.name for in_entity in entity.incoming_entities]
+                    )
+                if entity.query_key_prefixing():
+                    client_env[f"SSKEYOUT{db_name}"] = entity.name
+
+                # Retrieve num_shards to append to client env
+                client_env[f"SR_DB_TYPE{db_name}"] = (
+                    CLUSTERED if len(addresses) > 1 else STANDALONE
+                )
 
         # Set address to local if it's a colocated model
-        if entity.colocated:
+        if entity.colocated and entity.run_settings.colocated_db_settings is not None:
+            db_name_colo = entity.run_settings.colocated_db_settings["db_identifier"]
+
+            for key in self._jobs.get_db_host_addresses():
+                _, db_id = unpack_db_identifier(key, "_")
+                if db_name_colo == db_id:
+                    raise SSDBIDConflictError(
+                        f"Database identifier {db_name_colo}"
+                        " has already been used. Pass in a unique"
+                        " name for db_identifier"
+                    )
+
+            db_name_colo = unpack_colo_db_identifier(db_name_colo)
             if colo_cfg := entity.run_settings.colocated_db_settings:
                 port = colo_cfg.get("port", None)
                 socket = colo_cfg.get("unix_socket", None)
@@ -516,13 +554,15 @@ def _prep_entity_client_env(self, entity: Model) -> None:
                         "Co-located was configured for both TCP/IP and UDS"
                     )
                 if port:
-                    client_env["SSDB"] = f"127.0.0.1:{str(port)}"
+                    client_env[f"SSDB{db_name_colo}"] = f"127.0.0.1:{str(port)}"
                 elif socket:
-                    client_env["SSDB"] = f"unix://{socket}"
+                    client_env[f"SSDB{db_name_colo}"] = f"unix://{socket}"
                 else:
                     raise SSInternalError(
                         "Colocated database was not configured for either TCP or UDS"
                     )
+                client_env[f"SR_DB_TYPE{db_name_colo}"] = STANDALONE
+
         entity.run_settings.update_env(client_env)
 
     def _save_orchestrator(self, orchestrator: Orchestrator) -> None:
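The per-database environment naming introduced in `_prep_entity_client_env` can be illustrated with a small self-contained sketch. Several details here are assumptions, not taken from the diff: the string values of `CLUSTERED` and `STANDALONE`, the helper `env_suffix` (a hypothetical stand-in for the suffix returned by `unpack_db_identifier`), and the rule that the default identifier `orchestrator` maps to an empty suffix. The slice `addresses[:128]` is equivalent to the `if len(addresses) <= 128 ... else` branch in the diff.

```python
from typing import Dict, List

# Assumed values: the diff imports these names from smartsim.servertype
# but does not show what they are set to.
CLUSTERED = "Clustered"
STANDALONE = "Standalone"
MAX_ADDRESSES = 128  # cap applied to SSDB in the diff


def env_suffix(db_identifier: str, token: str = "_") -> str:
    """Hypothetical suffix rule: empty for the default database,
    otherwise token + identifier."""
    return "" if db_identifier == "orchestrator" else f"{token}{db_identifier}"


def client_env(address_dict: Dict[str, List[str]]) -> Dict[str, str]:
    """Build SSDB<suffix> and SR_DB_TYPE<suffix> entries for each database."""
    env: Dict[str, str] = {}
    for db_id, addresses in address_dict.items():
        if not addresses:
            continue
        suffix = env_suffix(db_id)
        env[f"SSDB{suffix}"] = ",".join(addresses[:MAX_ADDRESSES])
        env[f"SR_DB_TYPE{suffix}"] = CLUSTERED if len(addresses) > 1 else STANDALONE
    return env


# Illustrative addresses only.
env = client_env(
    {
        "orchestrator": ["10.0.0.1:6379"],
        "db_a": ["10.0.0.2:6380", "10.0.0.3:6380"],
    }
)
```

Under these assumptions, a single-address database is reported as standalone and a multi-address one as clustered, and each database gets its own pair of variables so a client can select one by suffix.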
@@ -653,39 +693,51 @@ def _set_dbobjects(self, manifest: Manifest) -> None:
         if not manifest.has_db_objects:
             return
 
-        db_addresses = self._jobs.get_db_host_addresses()
+        address_dict = self._jobs.get_db_host_addresses()
+        for (
+            db_id,
+            db_addresses,
+        ) in address_dict.items():
+            db_name, name = unpack_db_identifier(db_id, "_")
 
-        hosts = list({address.split(":")[0] for address in db_addresses})
-        ports = list({int(address.split(":")[-1]) for address in db_addresses})
+            hosts = list({address.split(":")[0] for address in db_addresses})
+            ports = list({int(address.split(":")[-1]) for address in db_addresses})
 
-        if not db_is_active(hosts=hosts, ports=ports, num_shards=len(db_addresses)):
-            raise SSInternalError("Cannot set DB Objects, DB is not running")
+            if not db_is_active(hosts=hosts, ports=ports, num_shards=len(db_addresses)):
+                raise SSInternalError("Cannot set DB Objects, DB is not running")
 
-        client = Client(address=db_addresses[0], cluster=len(db_addresses) > 1)
+            environ[f"SSDB{db_name}"] = db_addresses[0]
 
-        for model in manifest.models:
-            if not model.colocated:
-                for db_model in model.db_models:
+            environ[f"SR_DB_TYPE{db_name}"] = (
+                CLUSTERED if len(db_addresses) > 1 else STANDALONE
+            )
+
+            options = ConfigOptions.create_from_environment(name)
+            client = Client(options, logger_name="SmartSim")
+
+            for model in manifest.models:
+                if not model.colocated:
+                    for db_model in model.db_models:
+                        set_ml_model(db_model, client)
+                    for db_script in model.db_scripts:
+                        set_script(db_script, client)
+
+            for ensemble in manifest.ensembles:
+                for db_model in ensemble.db_models:
                     set_ml_model(db_model, client)
-                for db_script in model.db_scripts:
+                for db_script in ensemble.db_scripts:
                     set_script(db_script, client)
-
-        for ensemble in manifest.ensembles:
-            for db_model in ensemble.db_models:
-                set_ml_model(db_model, client)
-            for db_script in ensemble.db_scripts:
-                set_script(db_script, client)
-            for entity in ensemble.models:
-                if not entity.colocated:
-                    # Set models which could belong only
-                    # to the entities and not to the ensemble
-                    # but avoid duplicates
-                    for db_model in entity.db_models:
-                        if db_model not in ensemble.db_models:
-                            set_ml_model(db_model, client)
-                    for db_script in entity.db_scripts:
-                        if db_script not in ensemble.db_scripts:
-                            set_script(db_script, client)
+                for entity in ensemble.models:
+                    if not entity.colocated:
+                        # Set models which could belong only
+                        # to the entities and not to the ensemble
+                        # but avoid duplicates
+                        for db_model in entity.db_models:
+                            if db_model not in ensemble.db_models:
+                                set_ml_model(db_model, client)
+                        for db_script in entity.db_scripts:
+                            if db_script not in ensemble.db_scripts:
+                                set_script(db_script, client)
 
 
 class _AnonymousBatchJob(EntityList[Model]):