fix(template-commit): enforce requestID uniqueness and add idempotent commit reuse #336

Open
fslongjin wants to merge 3 commits into TencentCloud:master from fslongjin:fix-template-commit-duplicate-key

@fslongjin
Member

Close: #332

  • Validate requestID is non-empty in HTTP handler and service layer
  • Add unique (request_id, operation) index handling for COMMIT jobs
  • Reuse existing job on duplicate request payload; reject on payload drift
  • Introduce commit-specific job record builder and duplicate-key detection

… commit

reuse

- Validate requestID is non-empty in HTTP handler and service layer
- Add unique (request_id, operation) index handling for COMMIT jobs
- Reuse existing job on duplicate request payload; reject on payload drift
- Introduce commit-specific job record builder and duplicate-key detection

Signed-off-by: jinlong <[email protected]>
@cubesandboxbot

cubesandboxbot Bot commented May 21, 2026

Claude finished @fslongjin's task in 1m 10s —— View job


Code Review: enforce requestID uniqueness and add idempotent commit reuse

Overview

This PR closes #332 by enforcing requestID uniqueness for COMMIT operations and providing idempotent job reuse when the same (request_id, operation) pair is submitted with the same payload. The implementation spans the HTTP handler, service layer, database migration, and tests.


Review Findings

1. ‼️ Missing ErrTemplateAttemptInProgress error mapping in HTTP handler

template_commit.go:130-136 maps errors from SubmitTemplateCommit to HTTP status codes, but omits ErrTemplateAttemptInProgress. This error is returned when:

  • A duplicate requestID arrives with a different commit payload (line 62, 120)
  • A template has an active job already running (line 84)

Without this mapping, callers get ErrorCode_MasterInternalError (500) instead of ErrorCode_MasterParamsError (400). The create-from-image and delete handlers already handle this error correctly (template.go:107):

// Missing in template_commit.go — should be added:
case errors.Is(err, templatecenter.ErrTemplateAttemptInProgress):
    code = int(errorcode.ErrorCode_MasterParamsError)

2. Error swallowing in buildCommitTemplateSpecFingerprint

template_image.go:1965-1969: the marshal error from marshalTemplateCommitJobRequest is silently discarded with _. If marshaling fails (extremely rare but theoretically possible), the fingerprint would be the SHA-256 of an empty string, so every request that fails to marshal would share one fingerprint and could be mistaken for a duplicate of another.

func buildCommitTemplateSpecFingerprint(req *types.CreateCubeSandboxReq) string {
    payload, _ := marshalTemplateCommitJobRequest(req)  // error discarded
    sum := sha256.Sum256([]byte(payload))
    return hex.EncodeToString(sum[:])
}

Recommendation: Consider logging the error or propagating it to the caller, though I recognize the pre-existing pattern in buildTemplateSpecFingerprint (template_image.go:1912) does the same thing.
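
One way to act on this recommendation is to return the marshal error alongside the fingerprint. The following is a minimal sketch, not the PR's code: `fingerprintRequest` and `commitRequest` are hypothetical names, and `json.Marshal` is assumed to stand in for marshalTemplateCommitJobRequest.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// commitRequest is a hypothetical stand-in for types.CreateCubeSandboxReq.
type commitRequest struct {
	RequestID  string `json:"request_id"`
	TemplateID string `json:"template_id"`
}

// fingerprintRequest returns the hex-encoded SHA-256 of the marshaled request.
// It propagates the marshal error instead of hashing an empty payload.
func fingerprintRequest(req any) (string, error) {
	payload, err := json.Marshal(req)
	if err != nil {
		return "", fmt.Errorf("marshal commit request: %w", err)
	}
	sum := sha256.Sum256(payload)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	fp, err := fingerprintRequest(&commitRequest{RequestID: "r1", TemplateID: "t1"})
	fmt.Println(len(fp), err)

	// A channel cannot be marshaled to JSON, so the error reaches the caller.
	_, err = fingerprintRequest(make(chan int))
	fmt.Println(err != nil)
}
```

Changing the signature forces every caller to decide what a failed fingerprint means, which is the trade-off against the log-only option.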

3. Test coverage gaps

The existing tests cover the static paths well (reject empty requestID, persist identity fields, classify duplicate key errors, infer legacy operations), but these scenarios are not tested:

  • Idempotent reuse: Submitting the same (requestID, COMMIT) with the same payload should return the existing job ID
  • Payload drift rejection: Submitting the same requestID with a different payload should error with ErrTemplateAttemptInProgress
  • Active job conflict: Submitting a commit for a template that already has a pending/running job
  • Duplicate key race fallback: The concurrent-writer path in template_commit.go:112-122
  • Migration disambiguation: Legacy data with duplicate (request_id, operation) pairs being correctly deduplicated

These are likely difficult to test without a real database, but unit tests at the handler level (mocking the service layer) or integration tests would strengthen coverage.

4. Concurrent migration safety

migrateTemplateImageJobTable in template_image.go uses HasIndex/HasColumn guards but runs outside a lock or DDL transaction. If multiple master instances run the migration concurrently, one could fail on the ALTER TABLE ... ADD UNIQUE KEY when the index was already created by another instance. The HasIndex check mitigates this in most cases, but the race window exists.

This is a pre-existing pattern in the codebase and the risk is low, but worth noting for production deployments.

5. Well-designed concurrent idempotency

The "check-then-act with duplicate-key fallback" pattern in SubmitTemplateCommit (template_commit.go:53-126) is the right approach for MySQL-level idempotency:

  1. First, check by (request_id, operation) → reuse/reject
  2. Create the record
  3. On ErrDuplicatedKey → re-check by (request_id, operation) → reuse/reject on drift

This correctly handles the window between the SELECT and INSERT.

6. isDuplicateKeyError is robust

template_commit.go:316-325 correctly handles both gorm.ErrDuplicatedKey and raw MySQL error strings ("Duplicate entry", "1062"). This is important because some GORM database drivers return raw driver errors rather than the sentinel value.

7. Legacy data migration is well-handled

normalizeTemplateImageJobRequestIDs (template_image.go:283-328) and inferLegacyJobOperation (template_image.go:337-351) carefully classify and disambiguate existing rows to avoid unique index violations. The #JobID suffix approach for collisions is a pragmatic solution.
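
The #JobID suffix idea can be illustrated as follows. This is a sketch only: `legacyRow` and `disambiguate` are hypothetical, and it assumes job IDs are unique and that no original request ID already has the suffixed form.

```go
package main

import "fmt"

// legacyRow is a hypothetical, reduced view of a job row.
type legacyRow struct {
	JobID     string
	RequestID string
	Operation string
}

// disambiguate keeps the first row for each (request_id, operation) pair
// and appends "#<JobID>" to the request ID of every later collision, so a
// unique index on the pair can be created afterwards.
func disambiguate(rows []legacyRow) []legacyRow {
	seen := make(map[string]bool, len(rows))
	out := make([]legacyRow, len(rows))
	for i, r := range rows {
		key := r.RequestID + "\x00" + r.Operation
		if seen[key] {
			r.RequestID = r.RequestID + "#" + r.JobID
			key = r.RequestID + "\x00" + r.Operation
		}
		seen[key] = true
		out[i] = r
	}
	return out
}

func main() {
	rows := []legacyRow{
		{JobID: "1", RequestID: "r", Operation: "COMMIT"},
		{JobID: "2", RequestID: "r", Operation: "COMMIT"},
		{JobID: "3", RequestID: "r", Operation: "CREATE"},
	}
	for _, r := range disambiguate(rows) {
		fmt.Printf("%s %s %s\n", r.JobID, r.RequestID, r.Operation)
	}
}
```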


Summary

| Category | Count | Details |
|---|---|---|
| ‼️ Bugs | 1 | Missing ErrTemplateAttemptInProgress error mapping (should be 400, currently 500) |
| ⚠️ Warnings | 2 | Error swallowing in fingerprint, test coverage gaps |
| ✅ Strengths | 4 | Concurrent idempotency design, duplicate key detection, legacy migration, clean separation |

The PR is well-structured and the core idempotency design is sound. The main actionable finding is the missing error mapping in the HTTP handler (#1).
Branch: fix-template-commit-duplicate-key

Treat cubelet's success code as ErrorCode_Success, run commit jobs asynchronously, and preserve actionable failure messages so successful snapshots are not recorded as failed jobs.

Signed-off-by: jinlong <[email protected]>
@cubesandboxbot

cubesandboxbot Bot commented May 22, 2026

Claude finished @fslongjin's task in 1m 19s —— View job


Re-reviewing after synchronize event (commit d4be464)

Progress

  • Gather context — read diff, previous review, new commit
  • Analyze new commit d4be464 — handle cubelet success correctly
  • Re-check previous review findings against current state
  • Re-review complete — post updated findings

Incremental Review: Commit d4be464

This commit fixes a critical bug in runTemplateCommitJob: the response code check used != 0 instead of != int(errorcode.ErrorCode_Success). The Cube API returns 200 for success, not 0, which means every successful commit was silently flipped to FAILED with an empty error_message. This is the root cause of #332 and is correctly resolved.
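
The difference between the two checks can be shown in a few lines. This is a sketch with hypothetical names (`ret`, `commitFailedBuggy`, `commitFailedFixed`); the constant value 200 follows the statement above that success is reported as 200 rather than 0.

```go
package main

import "fmt"

// errorCodeSuccess mirrors the success value described in the review (200).
const errorCodeSuccess = 200

// ret is a hypothetical, reduced view of the RPC return status.
type ret struct {
	RetCode int
	RetMsg  string
}

// commitFailedBuggy treats every non-zero code as failure, so a
// successful response with code 200 is reported as failed.
func commitFailedBuggy(r ret) bool { return r.RetCode != 0 }

// commitFailedFixed compares against the success constant.
func commitFailedFixed(r ret) bool { return r.RetCode != errorCodeSuccess }

func main() {
	ok := ret{RetCode: errorCodeSuccess}
	fmt.Println(commitFailedBuggy(ok), commitFailedFixed(ok))
}
```

Because a successful response usually carries no message, the buggy check also explains the empty error_message on the FAILED rows.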

Additional improvements in this commit

| Change | File:Line | Assessment |
|---|---|---|
| Asynchronous job dispatch (go runTemplateCommitJob(...)) | template_commit.go:155 | ✅ Matches SubmitTemplateImage() pattern; prevents HTTP handler from blocking |
| RPC timeout for CommitSandbox (5 min) | template_commit.go:33 | ✅ Prevents hung cubelet from leaving job stuck in SNAPSHOTTING forever |
| Panic recovery in job goroutine | template_commit.go:177-188 | ✅ Logs stack trace and marks job FAILED |
| buildCommitFailureMessage with detailed error messages | template_commit.go:326-345 | ✅ Never returns empty string; handles nil response, nil Ret, whitespace-only RetMsg |
| Cleanup RPC timeout (1 min) | template_commit.go:37,245-251 | ✅ Prevents cleanup path from hanging |
| Test for buildCommitFailureMessage | template_commit_test.go:109-162 | ✅ Covers nil, non-success codes, whitespace-only messages |

Previously flagged issues — status update

1. ‼️ Missing ErrTemplateAttemptInProgress error mapping — STILL UNFIXED

template_commit.go:129-136 still only maps ErrTemplateIDRequired, ErrDuplicateTemplate, and ErrTemplateStoreNotInitialized. The ErrTemplateAttemptInProgress error (returned from SubmitTemplateCommit lines 77, 99, 135) falls through to ErrorCode_MasterInternalError (500) instead of ErrorCode_MasterParamsError (400).

The sibling handler in template.go:107-108 already handles this correctly:

case errors.Is(err, templatecenter.ErrTemplateAttemptInProgress):
    code = int(errorcode.ErrorCode_MasterParamsError)

2. ⚠️ Error swallowing in buildCommitTemplateSpecFingerprint: STILL PRESENT

template_image.go:1965-1969:

func buildCommitTemplateSpecFingerprint(req *types.CreateCubeSandboxReq) string {
    payload, _ := marshalTemplateCommitJobRequest(req)  // error discarded
    sum := sha256.Sum256([]byte(payload))
    return hex.EncodeToString(sum[:])
}

If marshaling fails, the fingerprint becomes SHA256 of empty string, which could cause false-positive collisions. A log line on error would mitigate this without changing the signature.

3. Test coverage gaps — PARTIALLY ADDRESSED

The new TestBuildCommitFailureMessageNeverEmpty is good. However, the following scenarios remain untested:

  • Idempotent reuse (same requestID + same payload → returns existing job)
  • Payload drift rejection (same requestID + different payload → ErrTemplateAttemptInProgress)
  • Active job conflict (pending/running job for template)
  • Duplicate key race fallback (concurrent writer path at template_commit.go:126-139)

These require integration-level testing with a database and are likely tracked separately.


New findings in this commit

4. ✅ Clean separation of concerns

The error message construction in buildCommitFailureMessage and the new buildCommitFailureMessage test are well-structured. The function correctly handles all edge cases (nil response, nil Ret, empty RetMsg, whitespace-only RetMsg).

5. ✅ Idempotent path returns full job info

When an existing job is reused (lines 73-76, 95-98, 131-134), the function returns GetTemplateImageJobInfo(ctx, jobID) on line 145, which provides the caller with the current job status — not just a confirmation of reuse. This is correct behavior.

6. ⚠️ t.Fatalf in TestBuildCommitFailureMessageNeverEmpty hides remaining sub-case failures

The test function TestBuildCommitFailureMessageNeverEmpty at template_commit_test.go:113 has a minor inconsistency: wantParts uses partial string matching (strings.Contains), but the test failure message uses t.Fatalf which aborts on the first failing sub-case — hiding results of remaining cases. Preferring t.Errorf or subtests (t.Run) would surface all failures in one run. This is a minor testing style point.


Summary

| Category | Count | Details |
|---|---|---|
| ‼️ Previously unfixed | 2 | Missing error mapping, error swallowing in fingerprint |
| ✅ Fixed in this commit | 1 | Cubelet success code check (!= 0 → != ErrorCode_Success) |
| ✅ Strengths added | 5 | Async dispatch, timeouts, panic recovery, error message quality, tests |
| ⚠️ Minor testing style | 1 | t.Fatalf hides remaining sub-case failures |

The new commit addresses the root cause of #332 and adds several important hardening measures (timeouts, panic recovery, async dispatch). The two remaining items from the previous review (error mapping, fingerprint error swallowing) are pre-existing and should be addressed in a follow-up.

…commit

- Extract error code mapping into `commitTemplateErrorCode` function, adding support for `ErrTemplateAttemptInProgress`
- Refactor fingerprint generation to use request snapshot instead of stored request, reducing data dependency
- Update corresponding tests to cover new error cases and fingerprint logic

Signed-off-by: jinlong <[email protected]>
@cubesandboxbot

cubesandboxbot Bot commented May 22, 2026

Claude finished @fslongjin's task in 1m 27s —— View job


Review: enforce requestID uniqueness and add idempotent commit reuse

Progress

  • Gather context — read diff, previous reviews, current code state
  • Analyze all changed files
  • Verify previous review findings against current state
  • Check for new issues
  • Post updated review

Previous review findings — status update

| Finding | Status | Details |
|---|---|---|
| Missing ErrTemplateAttemptInProgress mapping (→ 500 instead of 400) | Fixed | commitTemplateErrorCode function at template_commit.go:152-163 now correctly maps ErrTemplateAttemptInProgress to ErrorCode_MasterParamsError. Tested in template_commit_test.go:79-82. |
| Error swallowing in buildCommitTemplateSpecFingerprint | Fixed | Refactored from buildCommitTemplateSpecFingerprint(req) string to buildCommitTemplateSpecFingerprintFromSnapshot(requestSnapshot string) string at template_image.go:1966. By receiving a pre-marshaled string, the marshal error path is eliminated entirely. Clean fix. |
| Test coverage gaps (idempotent reuse, payload drift, active job conflict) | ⚠️ Partially addressed | New tests cover identity fields, duplicate key classification, legacy operation inference, failure message construction, and empty requestID. Integration-level scenarios (idempotent reuse, payload drift) require a real database and are understandably absent. |
| t.Fatalf in table-driven tests hides sub-case failures | ⚠️ Still present | TestIsDuplicateKeyErrorClassifiesMySQLAndGormErrors (template_commit_test.go:75-93) and TestInferLegacyJobOperationCoversAllShapes (:95-113) use t.Fatalf which aborts on the first failure. Using t.Errorf + t.Run subtests would surface all failures per run. |

New review findings

✅ Robust concurrent idempotency design

The "check-then-act with duplicate-key fallback" pattern in SubmitTemplateCommit (template_commit.go:68-141) is well-designed:

  1. Check by (request_id, COMMIT) → reuse/reject
  2. Check active jobs by template_id → reuse/reject
  3. INSERT with unique index → on ErrDuplicatedKey, re-check and reuse/reject

The withTemplateWriteLock (cache.go:76-84) provides in-process mutual exclusion, while the MySQL unique index and duplicate-key fallback handle cross-instance races correctly.

✅ isDuplicateKeyError handles both GORM and raw driver errors

The function at template_commit.go:384-393 correctly checks gorm.ErrDuplicatedKey, raw MySQL "Duplicate entry", and error code "1062". Test coverage at template_commit_test.go:75-93 exercises all paths.

✅ Migration handles legacy data disambiguation

normalizeTemplateImageJobRequestIDs (template_image.go:284-329) and inferLegacyJobOperation (template_image.go:341-358) carefully classify and disambiguate legacy rows before creating the unique index. The \x00 separator in requestBindingKey (template_image.go:359-361) prevents collisions between (requestID, operation) pairs.

✅ Async dispatch with proper safeguards

runTemplateCommitJob at template_commit.go:168-317 correctly:

  • Detaches from the HTTP context to survive client disconnects (line 155)
  • Has a 5-minute RPC timeout for CommitSandbox (line 33, 195)
  • Has a 1-minute RPC timeout for cleanup (line 37, 245)
  • Recovers from panics with stack trace logging (lines 177-188)

⚠️ getLatestTemplateImageJobByTemplateID ordering dependency

At template_commit.go:104-109, the code calls getLatestTemplateImageJobByTemplateID to get the most recent job for attempt number calculation. The correctness of nextAttemptNoFromLatest(attemptNo) depends on consistent ordering from this query. If two COMMIT jobs for the same template complete simultaneously (under the write lock this shouldn't happen), the ordering by attempt_no desc, id desc ensures deterministic behavior. Low risk.

⚠️ isDuplicateKeyError "1062" substring match is broad

template_commit.go:392 checks strings.Contains(msg, "1062"). Any MySQL error message incidentally containing "1062" (unlikely but possible with certain versions/drivers) would be misclassified. The primary check against gorm.ErrDuplicatedKey handles the well-traveled path; the raw string matching is a secondary fallback. Very low practical risk.

ℹ️ buildCommitTemplateSpecFingerprintFromSnapshot not directly unit-tested

The fingerprint function at template_image.go:1966-1969 is tested indirectly through TestNewCommitTemplateImageJobRecordPersistsIdentityFields (template_commit_test.go:64-69), but has no dedicated unit test. Given the trivial one-liner nature of the function, this is acceptable.
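
For reference, a snapshot-based fingerprint of this shape and the properties a dedicated test would assert might look like the sketch below. `fingerprintFromSnapshot` is a hypothetical name that assumes the function is a hex-encoded SHA-256 of the snapshot string, as the earlier reviews describe.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprintFromSnapshot hashes an already-marshaled request snapshot,
// so there is no marshal step and no error to discard.
func fingerprintFromSnapshot(snapshot string) string {
	sum := sha256.Sum256([]byte(snapshot))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := fingerprintFromSnapshot(`{"template_id":"t1"}`)
	b := fingerprintFromSnapshot(`{"template_id":"t2"}`)
	fmt.Println(len(a), a == fingerprintFromSnapshot(`{"template_id":"t1"}`), a != b)
}
```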


Summary

| Category | Count | Details |
|---|---|---|
| ✅ Previously fixed | 2 | Error mapping (ErrTemplateAttemptInProgress → 400), fingerprint error swallowing |
| ⚠️ Previously unfixed (minor) | 1 | t.Fatalf style in table-driven tests |
| ✅ Strengths | 5 | Concurrent idempotency, duplicate key detection, legacy migration, async dispatch with timeouts/panic recovery, error message quality |
| ⚠️ New minor findings | 2 | Ordering dependency for attempt calculation, broad "1062" substring match |

The PR is well-structured and addresses the root cause of issue #332 (the != 0 vs != ErrorCode_Success bug) along with robust idempotent job reuse. The previous review's two actionable issues have both been resolved. No blocking issues remain.


@fslongjin
Member Author

Thanks for the review. The actionable items from the previous AI review have been addressed in commit 3ee9ac10.

What changed:

  • Fixed the missing ErrTemplateAttemptInProgress mapping in handleSandboxCommitAction.

    • Extracted commitTemplateErrorCode(err error) int.
    • ErrTemplateAttemptInProgress now maps to ErrorCode_MasterParamsError instead of falling through to ErrorCode_MasterInternalError.
    • The handler now also sets rt.RetCode on this error path, matching the delete-template handler behavior.
  • Removed the buildCommitTemplateSpecFingerprint marshal-error swallow path.

    • Commit fingerprint generation now uses the already-validated requestSnapshot generated by marshalTemplateCommitJobRequest in SubmitTemplateCommit.
    • This avoids re-marshalling the stored request and eliminates the previous payload, _ := ... behavior.
  • Improved tests around the review findings.

    • Added error-code mapping coverage for commitTemplateErrorCode, including wrapped ErrTemplateAttemptInProgress.
    • Updated commit fingerprint tests to verify fingerprinting from request snapshots.
    • Updated the commit job record test to assert the snapshot-derived fingerprint.
    • Converted TestBuildCommitFailureMessageNeverEmpty to subtests so all cases are reported independently.

Validation performed:

  • go test ./pkg/service/httpservice/cube -run 'Commit|TemplateErrorCode' -count=1
  • go test ./pkg/templatecenter -run 'Commit|Fingerprint|FailureMessage' -count=1
  • go test ./pkg/templatecenter/... -count=1
  • go test ./pkg/service/httpservice/cube -run 'HandleSandboxCommit|CommitTemplateErrorCode|DeleteTemplate' -count=1

I also deployed the current branch to the target test machine and verified the user-facing behavior:

  • First POST /cube/sandbox/commit returned ret_code=200 with a build_id.
  • Reusing the same requestID with a different commit payload now returns ret_code=130400 (MasterParamsError) instead of an internal error.
  • The persisted job reached READY, had a non-empty snapshot path, and template_spec_fingerprint length was 64.

The migration concurrency concern and deeper DB-backed scenarios (idempotent reuse, duplicate-key race fallback, legacy migration disambiguation) are valid broader coverage items, but they require integration-level database tests and are intentionally not folded into this focused PR fix.

Development

Successfully merging this pull request may close these issues.

CubeMaster sandbox commit jobs insert empty request_id/operation and break subsequent commits

1 participant