
Conversation

@VedantMahabaleshwarkar (Contributor)

What this PR does / why we need it: Local Model Cache Enhancements

Change 1: Optional Storage Spec for LocalModelCache

Proposal

The goal of this change is to simplify the process of caching models that require authentication (e.g., private HuggingFace models or S3 buckets) by adding an optional storage specification directly to the LocalModelCache CRD.

Previously, to cache a model with credentials, users were required to:

  1. Create a ClusterStorageContainer resource with workloadType: localModelDownloadJob (sketched after this list).
  2. Manually embed credentials (e.g., environment variables) into this container specification.
  3. Manage secrets in a dedicated namespace (kserve-localmodel-jobs).
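
For reference, the previous approach required a cluster-level resource roughly like the sketch below. This only illustrates the pattern described above; the image, secret name, and env wiring are hypothetical, not taken from this PR.

  apiVersion: serving.kserve.io/v1alpha1
  kind: ClusterStorageContainer
  metadata:
    name: hf-download-credentials
  spec:
    workloadType: localModelDownloadJob
    container:
      name: storage-initializer
      image: kserve/storage-initializer:latest
      env:
        - name: HF_TOKEN                # credential embedded directly in the container spec
          valueFrom:
            secretKeyRef:
              name: hf-secret           # Secret that must live in kserve-localmodel-jobs
              key: HF_TOKEN
    supportedUriFormats:
      - prefix: hf://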

Pros of the new approach:

  • Eliminates the need for ClusterStorageContainer: Users no longer need to define a separate cluster-level resource just to handle download credentials.
  • Consistency: The configuration now mirrors the InferenceService storage configuration, providing a unified experience for defining storage access.
  • Simplified Operations: Credentials can be managed via ServiceAccountName or direct Storage parameters within the cache definition itself, reducing operational overhead.

Explanation of Changes

  • Updated LocalModelCache CRD: Added serviceAccountName and storage fields to the LocalModelCacheSpec (see the sketch after this list).
    • serviceAccountName: Allows referencing a ServiceAccount (and its attached secrets) for credential lookup.
    • storage: Allows specifying a storage key and parameters (e.g., endpoint, region) to override default configurations.
  • Dynamic Download Job: The model download job now dynamically configures its container based on the storageInitializer configuration found in the inferenceservice-config ConfigMap. It injects the necessary environment variables and volume mounts derived from the specified credentials.
  • Backward Compatibility: Existing ClusterStorageContainer configurations continue to function and take precedence if they exist.
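
A minimal sketch of a LocalModelCache using the new fields, based on the description above (the storage key/parameter names mirror the InferenceService storage spec; all concrete values are placeholders):

  apiVersion: serving.kserve.io/v1alpha1
  kind: LocalModelCache
  metadata:
    name: my-private-model
  spec:
    sourceModelUri: hf://my-org/my-private-model
    modelSize: 10Gi
    nodeGroups:
      - gpu-nodes
    # Option 1: look up credentials from the secrets attached to a ServiceAccount
    serviceAccountName: model-download-sa
    # Option 2: reference a storage key and override parameters directly
    storage:
      key: my-s3-credentials
      parameters:
        region: us-east-1
        endpoint: s3.example.com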

Change 2: LocalModelNamespaceCache CRD

Proposal

The LocalModelCache resource is cluster-scoped, making cached models available globally across the cluster. To support multi-tenancy and better isolation, we propose the LocalModelNamespaceCache CRD.

Key Benefits:

  • Isolation: Allows defining model caches that are restricted to a specific namespace.
  • Security: Ensures that sensitive models cached in a namespace are only usable by InferenceServices within that same namespace.
  • Multi-tenancy: Different teams (namespaces) can manage their own cache policies and models without affecting the cluster-wide state.

Explanation of Changes

  • New CRD LocalModelNamespaceCache: Introduced a new namespace-scoped Custom Resource Definition (see the example after this list).
    • It shares the same specification structure as LocalModelCache (SourceModelUri, ModelSize, NodeGroups, ServiceAccountName, Storage).
  • Controller Implementation: Added a new reconciler to handle the lifecycle of LocalModelNamespaceCache resources. The controller manages the underlying LocalModelNode resources to ensure models are downloaded to the specified node groups.
  • InferenceService Integration: Updated the InferenceService logic to support resolving models from LocalModelNamespaceCache. When an InferenceService requests a model, the system checks for a matching cache in the local namespace.
  • API Types: Added corresponding Go types (LocalModelNamespaceCache, LocalModelNamespaceCacheList) and generated client code.
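
As a rough illustration, a namespace-scoped cache could look like the following (assuming the same API group/version as LocalModelCache; names are placeholders):

  apiVersion: serving.kserve.io/v1alpha1
  kind: LocalModelNamespaceCache
  metadata:
    name: team-a-model
    namespace: team-a                        # namespaced, unlike the cluster-scoped LocalModelCache
  spec:
    sourceModelUri: hf://my-org/team-a-private-model
    modelSize: 5Gi
    nodeGroups:
      - gpu-nodes
    serviceAccountName: team-a-download-sa   # resolved within team-a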

Workflow Differences: Cluster vs. Namespace Cache

The workflow for LocalModelNamespaceCache differs from LocalModelCache in several key ways to support isolation:

  1. Job Execution Context:

    • LocalModelCache: Download jobs run in a centralized namespace (default kserve-localmodel-jobs).
    • LocalModelNamespaceCache: Download jobs run in the same namespace where the CR is defined. This ensures that the job uses the namespace's quota and service accounts.
  2. Resolution Scope:

    • LocalModelCache: Available to any InferenceService in the cluster.
    • LocalModelNamespaceCache: Only available to InferenceServices within the same namespace. The webhook logic prioritizes checking the local namespace cache before falling back to the cluster cache.
  3. Credential Access:

    • LocalModelCache: Credentials (Secrets/ServiceAccounts) must exist in the centralized job namespace.
    • LocalModelNamespaceCache: Credentials must exist in the user's namespace, allowing teams to manage their own access tokens securely (see the sketch after this list).
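
For example, under the namespaced model the credentials referenced by serviceAccountName live alongside the cache in the team's own namespace; a hypothetical setup might look like this:

  apiVersion: v1
  kind: Secret
  metadata:
    name: team-a-hf-token
    namespace: team-a                    # credentials stay in the user's namespace
  type: Opaque
  stringData:
    HF_TOKEN: <token>
  ---
  apiVersion: v1
  kind: ServiceAccount
  metadata:
    name: team-a-download-sa
    namespace: team-a
  secrets:
    - name: team-a-hf-token              # attached Secret used for credential lookup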

LocalModelNode CR and Status Updates

The LocalModelNode CR, which tracks the models present on a specific node, has been updated to handle both cluster-scoped and namespace-scoped models.

  1. Spec Updates:

    • The LocalModelInfo struct in LocalModelNodeSpec now includes a Namespace field.
    • If Namespace is empty, it represents a cluster-scoped LocalModelCache.
    • If Namespace is set, it represents a LocalModelNamespaceCache.
    • Credential fields (ServiceAccountName, Storage) are also passed down to the LocalModelNode to ensure the node agent can launch the download job with correct permissions.
  2. Status Key Changes:

    • To prevent name collisions between models with the same name in different namespaces, the status map keys have been updated:
      • Cluster-scoped models: Key remains modelName.
      • Namespace-scoped models: Key is now namespace/modelName.
    • This allows the node agent to track the status of my-model (cluster) and my-ns/my-model (namespace) independently (see the sketch after this list).
  3. Storage Deduplication & Status:

    • The node agent uses a hash of the SourceModelUri to determine the physical storage folder on the node.
    • Deduplication: If multiple caches (e.g., in different namespaces) point to the exact same SourceModelUri, they will share the same physical folder on the disk.
    • Smart Status Updates: The agent tracks which storage keys have been processed. If a download for a specific URI is already in progress or completed (initiated by another CR), the agent reuses that status for the current model. This prevents redundant download jobs while correctly updating the status of all referencing LocalModelNode entries.
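
A sketch of how a LocalModelNode might track both kinds of models under these changes (field names follow the description above; the exact schema may differ):

  apiVersion: serving.kserve.io/v1alpha1
  kind: LocalModelNode
  metadata:
    name: gpu-node-1
  spec:
    localModels:
      - modelName: my-model                      # cluster-scoped: Namespace is empty
        sourceModelUri: hf://my-org/my-model
      - modelName: my-model
        namespace: my-ns                         # namespace-scoped LocalModelNamespaceCache
        sourceModelUri: hf://my-org/my-model
        serviceAccountName: team-download-sa     # credentials passed down for the download job
  status:
    modelStatus:
      my-model: ModelDownloaded                  # cluster-scoped key: modelName
      my-ns/my-model: ModelDownloaded            # namespace-scoped key: namespace/modelName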

Type of changes

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing:

Please describe the tests that you ran to verify your changes and summarize the relevant results. Provide instructions so the tests can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A

  • Test B

  • Logs

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Release note:

Add a namespace-scoped LocalModelNamespaceCache CRD for caching models within a single namespace

Re-running failed tests

  • /rerun-all - rerun all failed workflows.
  • /rerun-workflow <workflow name> - rerun a specific failed workflow. Only one workflow name can be specified. Multiple /rerun-workflow commands are allowed per comment.

@terrytangyuan (Member) commented Dec 10, 2025

I wonder if we should use these names instead:

  • ModelCache (namespaced)
  • ClusterModelCache (cluster scoped)

Similar to what we are doing in Argo Workflows, e.g. WorkflowTemplate vs ClusterWorkflowTemplate.

Although the renaming might be a breaking change.

If not, LocalModelNamespacedCache might be better than LocalModelNamespaceCache.

@terrytangyuan (Member)

cc @yuzisun @greenmoon55 @johnugeorge @cjohannsen-cloudera Could you take a look at this proposal?

@yuzisun (Member) commented Dec 12, 2025

@VedantMahabaleshwarkar If the intent is to cache the model on each node in the cluster, how do you control access to the model cache so that only a specific namespace is allowed to use it?

@VedantMahabaleshwarkar (Contributor, Author) commented Dec 12, 2025

@yuzisun

If the intent is to cache the model on each node in the cluster, how do you control access to the model cache so that only a specific namespace is allowed to use it?

The existing LocalModelCache logic checks for a StorageUri match against all existing LocalModelCache CRs (which are cluster-wide) and, if a match is found, adds a ModelCache label (internal.serving.kserve.io/localmodel: my-model), which triggers the logic to inject the cache PV/PVC into the deployment.

For the LocalModelNamespaceCache, the StorageUri matching logic is modified as follows:
The ModelCache labels are added only if a matching StorageUri is found in either:

  • A LocalModelNamespaceCache CR in the same namespace as the ISVC (first precedence: check for a namespaced cache).
    • This adds an additional internal.serving.kserve.io/localmodel-namespace: <namespace> label, which is checked by the LocalModelNamespaceCache reconciler while injecting the PV and PVC.
  • A LocalModelCache CR (second precedence: fall back to the cluster-wide cache if no namespaced cache is found). See the label sketch after this list.
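
For illustration, an ISVC matched against a namespaced cache would end up with labels along these lines (model and namespace names are placeholders):

  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: my-isvc
    namespace: my-ns
    labels:
      internal.serving.kserve.io/localmodel: my-model
      # set only when the match came from a LocalModelNamespaceCache in the same namespace
      internal.serving.kserve.io/localmodel-namespace: my-ns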

storageClassName: local-storage
persistentVolumeSpec:
  accessModes:
    - ReadWriteOnce

Contributor:

Shouldn't it be RWX?

@spolti (Contributor) commented Dec 12, 2025

Wouldn't it be better to move the discussion to a GitHub issue and attach the architecture proposal?

@greenmoon55 (Contributor)

Looks great! We thought about something similar. One follow-up could be validating disk quotas to make sure disk space is not used up by some namespaces.

@VedantMahabaleshwarkar marked this pull request as ready for review on January 19, 2026 at 19:02.