Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Releases: NVIDIA/OSMO

6.0.0

20 Nov 18:25
b8ba4ff

Choose a tag to compare

Major features

Workflow Management

OSMO provides a sophisticated workflow orchestration system that allows users to define, submit, and monitor complex AI workflows through both a web UI and CLI:

  • Multi-Task Orchestration: Define complex workflows with serial and parallel task execution patterns, with automatic dependency management and synchronization through barriers
  • Priority-Based Scheduling: Support for HIGH, NORMAL, and LOW priority levels with intelligent preemption and GPU borrowing across pools to maximize utilization
  • Interactive Development: Exec into running containers, port-forward services, and rsync files between local workstations and remote tasks for seamless debugging
  • Resource Management: Flexible resource specification with support for GPUs, CPUs, memory, and storage across multiple platforms and node types
  • Automatic Rescheduling: Handle transient failures gracefully with configurable retry policies and exit code handling
  • Template Support: Create reusable, parameterized workflow specifications with Jinja templating for automation and scaling

Data Management

OSMO's data layer provides efficient storage and access to datasets with versioning and metadata support:

  • Dataset Versioning: Track dataset evolution with automatic versioning and deduplication to optimize storage
  • Multiple Storage Backends: Support for AWS S3, Azure Blob Storage, and Google Storage with seamless integration
  • Efficient Data Transfer: Multi-threaded, multi-process uploads and downloads with automatic resume capabilities
  • Collections: Group related datasets together for easier organization and management
  • Metadata and Labels: Tag datasets with custom metadata and labels for powerful querying and discovery
  • Regex-Based Selection: Upload or download partial datasets using regex patterns for fine-grained control

Applications

The Apps feature allows users to create reusable applications from workflow specifications:

  • Parameterized Applications: Define apps with customizable parameters that users can adjust at launch time
  • Easy Sharing: Package complex workflows as simple-to-launch applications for team collaboration
  • Workflow Abstraction: Hide complexity behind user-friendly interfaces while maintaining full workflow power

Pools and Resource Management

OSMO introduces a sophisticated resource management system based on pools and platforms:

  • Pool-Based Access Control: Teams are granted access to specific resource pools with RBAC for secure multi-tenancy within an organization
  • Dynamic Pool Sizing: Pool sizes can be adjusted dynamically to respond to changing workload priorities
  • Platform Support: Each pool supports one or more platform types (GPU models, architectures) with automatic resource validation
  • Resource Sharing: Resources can be allocated to multiple pools simultaneously for maximum utilization
  • Quota Management: View and track resource quotas and usage across pools
  • Maintenance Mode: Admins can mark pools for maintenance to prevent new submissions during updates

Compute Backend Integration

OSMO seamlessly integrates with Kubernetes clusters and various compute backends:

  • Multi-Cluster Support: Manage workflows across multiple Kubernetes clusters (AWS EKS, Azure AKS, GCP GKE, on-premise)
  • KAI Scheduler Support: Support for advanced workflow scheduling with NVIDIA KAI Scheduler
  • Customizable Pod Templates: Flexible pod template configurations allowing administrators to customize resource requests, limits, tolerations, and node affinities per backend

Web User Interface

A modern, responsive web interface provides comprehensive workflow and data management:

  • Workflow Dashboard: View, filter, and manage workflows with real-time status updates
  • Interactive Task Graphs: Visualize workflow structure and task dependencies
  • Live Log Streaming: Stream logs in real-time from running workflows with syntax highlighting
  • Resource Visualization: Monitor cluster resources, pool quotas, and node utilization
  • Dataset Browser: Browse, visualize, and manage datasets with metadata editing
  • Shell Access: Browser-native terminal for executing commands in running tasks
  • Pool Management: View pool information, supported platforms, and available quotas

Command Line Interface

A powerful CLI provides full access to OSMO capabilities with scripting and automation support:

  • Intuitive Commands: Organized command structure for workflows, datasets, resources, pools, and configuration
  • Multi-Platform Support: Native support for Mac (Apple Silicon), as well as both x86-64 and ARM 64 architectures on Linux
  • Auto-Completion: Tab completion support on Linux and macOS for faster command entry
  • Multiple Output Formats: JSON and human-readable text output formats for easy integration with scripts
  • Profile Management: Configure default settings for backend, bucket, and notification preferences
  • Automatic Reconnection: Port-forward and exec commands automatically reconnect on disconnection

Security and Authentication

Enterprise-grade security features protect workflows and data:

  • OIDC Integration: OAuth2.0-based authentication via Keycloak, which can be configured to connect to other OAuth 2.0 or SAML authentication providers
  • RBAC: Role-based access control for pools, backends, and resources
  • Token Scoping: Limited-scope JWT tokens with appropriate time-to-live durations
  • Limited Scope Access Tokens: Users can create access tokens with restricted scopes, enabling secure and granular control over permissions. These tokens can be used for login and automated access, ensuring users and services only have the access they require.

Framework Integration

OSMO integrates seamlessly with popular AI/ML frameworks and tools, and comes with tutorials to demonstrate their use:

  • Distributed Training: TorchRun, DeepSpeed, and elastic training support for multi-node DNN training
  • Reinforcement Learning: Isaac Lab integration for RL training workflows
  • Simulation: Isaac Sim integration for synthetic data generation (SDG) and simulation workflows
  • ROS/ROS2: Support for robotics workflows with multi-node communication and hardware-in-the-loop testing
  • Development Tools: Jupyter Notebook, VSCode, and File Browser integration for interactive development
  • ML Tools: Weights & Biases (wandb) integration for experiment tracking
  • NVIDIA GR00T: Sample workflows for Gr00T finetuning, GR00T mimic, and GR00T interactive notebook

Getting OSMO

Helm Charts and Containers

Helm charts and docker containers are available in NGC

CLI Client

The installers for the CLI client for MacOS (Apple Silicon), x86-64 Linux, and ARM64 Linux are attached as assests to this release.