feat: Add Revamped Artifact V2 Adapter with Storage Backend #74
Conversation
Force-pushed d3b7da7 to a61bde7
SafeDep Report Summary: No dependency changes detected. Nothing to scan. This report is generated by the SafeDep GitHub App.
Pull Request Overview
This PR introduces a next-generation artifact adapter system (v2) for the package registry with improved storage abstraction, caching, and content-addressable design. The system provides a unified interface for fetching, storing, and managing package artifacts across different ecosystems.
Key Changes:
- Extended storage interface with new methods (`Exists`, `GetMetadata`, `List`, `Delete`)
- Implemented storage manager with multiple artifact ID strategies (convention, content-hash, hybrid)
- Created NPM adapter v2 with HTTP mirror support and intelligent retry logic
- Added archive utilities for tar.gz file operations with index caching
- Implemented in-memory metadata store for artifact tracking
Reviewed Changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| storage/storage.go | Extended Storage interface with context-aware metadata and list operations |
| storage/gcs.go | Implemented new storage interface methods for Google Cloud Storage |
| storage/fs.go | Implemented new storage interface methods for filesystem storage |
| packageregistry/artifactv2/types.go | Core type definitions for artifact adapter v2 system |
| packageregistry/artifactv2/storage.go | Storage manager implementation with artifact ID strategies |
| packageregistry/artifactv2/npm_adapter.go | NPM-specific artifact adapter with mirror support |
| packageregistry/artifactv2/metadata.go | In-memory metadata store implementation |
| packageregistry/artifactv2/config.go | Configuration system with functional options pattern |
| packageregistry/artifactv2/archive_utils.go | Archive reading utilities with index caching |
| packageregistry/artifactv2/adapter_utils.go | HTTP fetching utilities with retry and mirror logic |
| .tool-versions | Go version update to 1.25.1 |
| .github/workflows/go.yml | Workflow file formatting improvements |
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.
* feat: Add struct validation utils * Apply suggestion from @Copilot Co-authored-by: Copilot <[email protected]> Signed-off-by: Abhisek Datta <[email protected]> --------- Signed-off-by: Abhisek Datta <[email protected]> Co-authored-by: Copilot <[email protected]>
* feat: Add support for container exec IO capture * fix: Linter issues * refactor: Test to use Go TDD
* feat: support token limit error handling * Apply suggestions from code review Co-authored-by: Copilot <[email protected]> Signed-off-by: Abhisek Datta <[email protected]> --------- Signed-off-by: Abhisek Datta <[email protected]> Co-authored-by: Abhisek Datta <[email protected]> Co-authored-by: Copilot <[email protected]>
vet Summary Report: This report is generated by vet.
Policy Checks
Malicious Package Analysis: Malicious package analysis was performed using the SafeDep Cloud API (Malicious Package Analysis Report).
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
```go
reqCtx, cancel := context.WithTimeout(ctx, config.Timeout)
defer cancel()
```
🔴 Single timeout context is shared across all retry attempts, making retries ineffective
The Timeout field is documented as "Timeout for each fetch attempt" (adapter_utils.go:24), but fetchHTTPWithMirrors creates a single context.WithTimeout at line 202 that spans ALL retry attempts including sleep delays between them.
Root Cause
At packageregistry/artifactv2/adapter_utils.go:202:
```go
reqCtx, cancel := context.WithTimeout(ctx, config.Timeout)
defer cancel()
```

This creates one context for the entire retry loop. With default settings (Timeout: 30s, RetryAttempts: 3, RetryDelay: 1s with linear backoff), the first attempt uses most of the timeout budget. Subsequent attempts reuse the same reqCtx, which may already be expired or nearly expired, especially after the time.Sleep(delay) calls at line 217. The retry loop at lines 205-290 creates HTTP requests with this same reqCtx, so later retries will immediately fail with context.DeadlineExceeded.
Impact: Retries after the first attempt may be ineffective or fail immediately because the shared context has expired. For example, if the first attempt takes 25 seconds to fail and there's a 1-second retry delay, the second attempt only has ~4 seconds instead of the configured 30 seconds.
Prompt for agents
In packageregistry/artifactv2/adapter_utils.go, move the context.WithTimeout call inside the retry loop so each attempt gets its own fresh timeout. Replace the single reqCtx at line 202-203 with a per-attempt context created inside the for loop (after the sleep delay). Each iteration should create its own context: reqCtx, cancel := context.WithTimeout(ctx, config.Timeout), and cancel should be deferred or called at the end of each iteration. Alternatively, rename the Timeout field documentation from "Timeout for each fetch attempt" to "Timeout for the entire fetch operation including retries" if the current behavior is intended.
```go
checksum := ""
if attrs.MD5 != nil {
	checksum = string(attrs.MD5)
}
```
🔴 GCS GetMetadata produces garbled checksum by converting raw MD5 bytes to string instead of hex-encoding
The GCS GetMetadata at storage/gcs.go:170 uses string(attrs.MD5) to convert the raw MD5 hash bytes to a string, producing non-printable binary characters instead of a hex-encoded checksum.
Root Cause
At storage/gcs.go:169-170:
```go
if attrs.MD5 != nil {
	checksum = string(attrs.MD5)
}
```

GCS ObjectAttrs.MD5 is []byte containing the raw 16-byte MD5 hash. Using string() converts these raw bytes directly to a string with non-printable characters. The filesystem driver at storage/fs.go:129 correctly uses hex.EncodeToString(hash.Sum(nil)) to produce a human-readable hex string.
Impact: Any code comparing checksums across storage backends (or expecting hex-encoded checksums from ObjectMetadata.Checksum) will get garbled binary data from GCS instead of the expected hex string. Checksum comparisons will silently fail.
Suggested change:

```diff
-checksum = string(attrs.MD5)
+checksum = hex.EncodeToString(attrs.MD5)
```
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.
```go
		return err
	}

	keys = append(keys, relPath)
```
List returns relPath using OS-specific separators. Other parts of the codebase build storage keys with path.Join (forward slashes), so on Windows this will return keys with \ and break prefix-based callers. Consider normalizing via filepath.ToSlash(relPath) before appending.
Suggested change:

```diff
-keys = append(keys, relPath)
+keys = append(keys, filepath.ToSlash(relPath))
```
```go
func (sm *storageManager) Store(ctx context.Context, info ArtifactInfo, reader io.Reader) (string, error) {
	var artifactID string
	var buf bytes.Buffer
	var contentHash string

	needsContentHash := sm.config.ArtifactIDStrategy == ArtifactIDStrategyContentHash ||
		sm.config.ArtifactIDStrategy == ArtifactIDStrategyHybrid ||
		sm.config.IncludeContentHash

	if needsContentHash {
		hash := sha256.New()
		tee := io.TeeReader(reader, &buf)

		if _, err := io.Copy(hash, tee); err != nil {
			return "", fmt.Errorf("failed to compute hash: %w", err)
		}

		hashBytes := hash.Sum(nil)
		contentHash = hex.EncodeToString(hashBytes[:8])
	} else {
		if _, err := io.Copy(&buf, reader); err != nil {
			return "", fmt.Errorf("failed to read content: %w", err)
		}
	}

	artifactID = generateArtifactID(info, sm.config.ArtifactIDStrategy, contentHash)

	if sm.config.CacheEnabled {
		exists, err := sm.Exists(ctx, artifactID)
		if err == nil && exists {
			return artifactID, nil
		}
	}
```
StorageManager.Store fully buffers (and sometimes hashes) the artifact before checking the cache. For ArtifactIDStrategyConvention the ID can be generated without reading the content, so you can check Exists first and skip the expensive read when the artifact is already present.
```go
reqCtx, cancel := context.WithTimeout(ctx, config.Timeout)
defer cancel()

for attempt := 0; attempt <= config.RetryAttempts; attempt++ {
```
fetchHTTPWithMirrors creates a single reqCtx with config.Timeout for the whole retry loop, but the struct comment says the timeout is per fetch attempt. As written, earlier delays/retries consume the same deadline and later attempts can fail immediately. Create a fresh per-attempt context (and make the retry sleep respect ctx.Done()), or update the comment if total-timeout is intended.
```go
type Storage interface {
	Put(key string, reader io.Reader) error
	Get(key string) (io.ReadCloser, error)

	// Exists checks if a key exists in storage
	Exists(ctx context.Context, key string) (bool, error)

	// GetMetadata retrieves metadata for a stored object
	GetMetadata(ctx context.Context, key string) (*ObjectMetadata, error)

	// List returns keys matching a prefix
	List(ctx context.Context, prefix string) ([]string, error)

	// Delete removes an object from storage
	Delete(ctx context.Context, key string) error
}
```
The Storage interface mixes context-less methods (Put/Get) with context-aware methods (Exists/List/Delete/GetMetadata). This makes it impossible to propagate cancellation/timeouts for the most expensive operations on backends like GCS. Consider adding context.Context to Put/Get (or adding new PutCtx/GetCtx methods) for a consistent contract.
```go
checksum := ""
if attrs.MD5 != nil {
	checksum = string(attrs.MD5)
}
```
In GetMetadata, attrs.MD5 is raw bytes; converting it with string(attrs.MD5) will produce non-printable data and is not a stable textual checksum. Encode it (e.g., hex/base64) or use GCS-provided hash fields consistently with other backends (filesystem uses SHA256 hex).
```go
func (a *npmAdapterV2) Exists(ctx context.Context, info ArtifactInfo) (bool, string, error) {
	// Try to find by metadata first (more efficient)
	if a.config.metadataEnabled && a.config.storageManager != nil {
		// For Convention strategy, we can predict the artifact ID using common function
		if a.config.artifactIDStrategy == ArtifactIDStrategyConvention {
			// Use common ID generation function (single source of truth)
			predictedID := generateArtifactID(info, ArtifactIDStrategyConvention, "")

			exists, err := a.storage.Exists(ctx, predictedID)
			if err == nil && exists {
				return true, predictedID, nil
			}
		}

		// For other strategies, we need to query metadata
		// This is not implemented in the current MetadataStore interface
		// but could be added via GetByPackage/GetByArtifact
	}

	return false, "", nil
}
```
Exists currently only attempts a predicted-ID check for Convention strategy and only when metadataEnabled is true; otherwise it returns (false, "", nil) even if the artifact is already in storage. This breaks adapter-level caching and also ignores the MetadataStore.GetByArtifact capability already present in the interface for looking up existing artifacts by (ecosystem,name,version).
```go
fileKey := path.Join(baseKey, fileInfo.Path)

// Stream file content directly to storage using LimitReader to avoid memory buffering
// LimitReader ensures we only read fileInfo.Size bytes from the tar stream
limitedReader := io.LimitReader(fileInfo.Reader, fileInfo.Size)

if err := store.Put(fileKey, limitedReader); err != nil {
```
Path traversal risk during extraction: fileInfo.Path comes directly from the tar header and is joined into fileKey without validation. A malicious archive can use paths like ../artifact or absolute paths to escape baseKey and overwrite sibling keys. Sanitize/validate entry paths (reject absolute paths and any cleaned path starting with ..), and ensure the final key remains within baseKey.
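The sanitization suggested above can be sketched with a small validation helper (the baseKey layout and function name are illustrative, not from the PR):

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// safeJoin validates a tar entry path before joining it under baseKey.
// It rejects absolute paths and any path that escapes baseKey after cleaning,
// so entries like "../artifact" or "a/../../b" cannot overwrite sibling keys.
func safeJoin(baseKey, entry string) (string, error) {
	if strings.HasPrefix(entry, "/") {
		return "", fmt.Errorf("absolute path in archive: %q", entry)
	}
	cleaned := path.Clean(entry)
	if cleaned == ".." || strings.HasPrefix(cleaned, "../") {
		return "", fmt.Errorf("path escapes base: %q", entry)
	}
	return path.Join(baseKey, cleaned), nil
}

func main() {
	key, err := safeJoin("artifacts/pkg", "package/index.js")
	fmt.Println(key, err) // → artifacts/pkg/package/index.js <nil>

	_, err = safeJoin("artifacts/pkg", "../evil")
	fmt.Println(err != nil) // → true
}
```

Cleaning first matters: a nested entry such as "a/../../b" only reveals its traversal after path.Clean collapses it to "../b".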
```go
func applyFetchConfigDefaults(config *fetchConfig) {
	if config.RetryAttempts == 0 {
		config.RetryAttempts = defaultRetryAttempts
	}
	if config.RetryDelay == 0 {
		config.RetryDelay = defaultRetryDelay
```
fetchConfig doc says RetryAttempts of 0 means "no retries" (single attempt), but applyFetchConfigDefaults overwrites RetryAttempts==0 with the default (3). This makes it impossible to disable retries and contradicts the contract; treat 0 as a valid value and only default when the caller truly left it unset (e.g., use a pointer/optional or a separate boolean).
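One way to make "unset" distinguishable from an explicit 0 is a pointer field, as sketched below (a minimal illustration; the PR's fetchConfig has more fields):

```go
package main

import "fmt"

const defaultRetryAttempts = 3

// fetchConfig uses a pointer so nil means "caller left it unset", while an
// explicit 0 is honored as "no retries" per the documented contract.
type fetchConfig struct {
	RetryAttempts *int
}

func applyFetchConfigDefaults(config *fetchConfig) {
	if config.RetryAttempts == nil {
		n := defaultRetryAttempts
		config.RetryAttempts = &n
	}
}

func main() {
	var unset fetchConfig
	applyFetchConfigDefaults(&unset)

	zero := 0
	disabled := fetchConfig{RetryAttempts: &zero}
	applyFetchConfigDefaults(&disabled)

	fmt.Println(*unset.RetryAttempts, *disabled.RetryAttempts) // → 3 0
}
```

A functional option (e.g. WithRetryAttempts) recording a "was set" flag would achieve the same without exposing a pointer in the public struct.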
```go
SHA256:     sha256Hash,
Size:       int64(len(content)),
FetchedAt:  time.Now(),
StorageKey: computeStorageKeyFromID(artifactID, ""),
```
storeArtifactWithMetadata records StorageKey using computeStorageKeyFromID(artifactID, ""), which ignores any configured storage prefix (WithStoragePrefix / StorageConfig.KeyPrefix). This will store incorrect metadata when a prefix is used; compute the key with the active prefix (or have StorageManager expose a helper for the effective key).
Suggested change:

```diff
-StorageKey: computeStorageKeyFromID(artifactID, ""),
+StorageKey: "",
```
```go
if sm.config.MetadataEnabled && sm.metadata != nil {
	_, err := sm.metadata.Get(ctx, artifactID)
	if err == nil {
		return true, nil
	}
```
In StorageManager.Exists, returning true purely because metadata exists can lead to false positives (e.g., metadata present but artifact missing/corrupted in the underlying storage). This can cause cache hits where Get later fails. Consider checking both metadata and the storage backend (or making metadata a hint and still verifying storage existence).
Suggested change:

```diff
-if sm.config.MetadataEnabled && sm.metadata != nil {
-	_, err := sm.metadata.Get(ctx, artifactID)
-	if err == nil {
-		return true, nil
-	}
+// Metadata can be consulted as a hint, but must not be treated as authoritative
+// for existence; we always verify against the underlying storage backend.
+if sm.config.MetadataEnabled && sm.metadata != nil {
+	_, _ = sm.metadata.Get(ctx, artifactID)
```