
jmsperu · 2026-04-27T15:46:59Z

Summary

Implements incremental backup support for the NAS backup provider on KVM, using QEMU dirty bitmaps and libvirt's backup-begin API. RFC: #12899.

For large VMs this reduces daily backup storage by 80–95% and shortens backup windows from hours to minutes (e.g. a 500 GB VM with moderate writes goes from ~500 GB/day to ~5–15 GB/day after the initial full backup).
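As a rough illustration of how the per-incremental figure relates to the overall saving, the sketch below averages one full backup plus the following incrementals over a single cycle. It assumes one backup per day, the default cadence of 10, and a constant incremental size; the function names are hypothetical and not part of the PR.

```python
def average_gb_per_backup(full_gb: float, inc_gb: float, full_every: int) -> float:
    """Average data written per backup over one cycle of `full_every` backups."""
    if full_every < 1:
        raise ValueError("full_every must be >= 1")
    # One full backup followed by (full_every - 1) incrementals.
    return (full_gb + (full_every - 1) * inc_gb) / full_every


def reduction_pct(full_gb: float, inc_gb: float, full_every: int) -> float:
    """Percentage saved compared with taking a full backup every time."""
    return 100.0 * (1.0 - average_gb_per_backup(full_gb, inc_gb, full_every) / full_gb)


if __name__ == "__main__":
    for inc_gb in (5, 15):
        avg = average_gb_per_backup(500, inc_gb, 10)
        print(f"inc={inc_gb} GB: avg {avg:.1f} GB/backup, saving {reduction_pct(500, inc_gb, 10):.1f}%")
```

Under these assumptions the cycle average lands between the per-incremental size and the full size, which is why the overall reduction is lower than the day-to-day drop alone would suggest.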

What's in the PR

| Commit | What |
|---|---|
| `f2a9202d74` | RFC document, now posted as a comment on issue #12899 |
| `1981469099` | `NASBackupChainKeys` constants + zone-scoped `nas.backup.full.every` ConfigKey (default 10) |
| `fbb916b254` | `nasbackup.sh` mode-aware: full+checkpoint or incremental+rebase via `backup-begin` |
| `1f2aebca36` | Java orchestration: full-vs-incremental decision in provider, chain metadata in `backup_details` |
| `43e2f7504a` | On-demand bitmap recreation after CloudStack rebuilds the domain XML on VM restart |
| `39303fbf88` | Restore path: relative-path rebase + `qemu-img convert` flatten for file-based primary |
| `b8d069e127` | Cascade delete: `RebaseBackupCommand`, chain repair for delete-middle, refuse-delete-full-with-children |
| `49edc7f22c` | Five new smoke tests in `test/integration/smoke/test_backup_recovery_nas.py` |

Full diff: 11 files, +1617 / −30.

Review feedback addressed (all from #12899 thread)

| # | Reviewer | Concern | Resolution |
|---|---|---|---|
| 1 | @JoaoJandre | No new columns on `backups` | Chain metadata stored in existing `backup_details` kv table via `NASBackupChainKeys` |
| 2 | @abh1sar | `nas.backup.full.interval` (days) doesn't fit hourly/ad-hoc | Replaced with count-based `nas.backup.full.every` (default 10) |
| 3 | @abh1sar | Use `backup-begin` for full backups too | Done: both modes use `backup-begin`; full omits `<incremental>` |
| 4 | @abh1sar | Timestamp-based bitmap names | `backup-<epoch>` (`System.currentTimeMillis()/1000`) |
| 5 | @abh1sar | No explicit `block-dirty-bitmap-add` | libvirt manages bitmaps via `--checkpointxml`; manual bitmap commands removed |
| 6 | @abh1sar | `qemu-img rebase` after each incremental | Done in `nasbackup.sh`, with relative backing path so chain survives mount-point churn |
| 7 | @abh1sar | Stopped VMs | Stopped VMs always full; agent emits `INCREMENTAL_FALLBACK=` if cadence asked for inc |
| 8 | @abh1sar | Cascade delete behaviour | Implemented: middle-inc rebases child onto grandparent; full-with-children refuses unless `forced=true` |
| 9 | @abh1sar | Bitmap recreation on VM restart | Lazy recreation at next backup attempt: agent checks `virsh checkpoint-list`, recreates if missing, emits `BITMAP_RECREATED=` |
| 10 | @abh1sar | Smoke tests | 5 new cases in `test_backup_recovery_nas.py` |
| 11 | @abh1sar | Single PR for 4.23 | This PR |

Backwards compatibility

- The new `-M` / `--bitmap-*` flags on `nasbackup.sh` are optional. Without them, the script preserves the legacy full-only behaviour exactly (no checkpoint creation, same XML).
- `TakeBackupCommand` new fields default to null; `LibvirtTakeBackupCommandWrapper` only emits the new flags when set, so a 4.22 management server talking to a 4.23 agent still works.
- Existing backups (no `chain_id` in `backup_details`) are treated as standalone fulls by the cascade-delete logic, so no migration is needed.
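The "only emits the new flags when set" rule can be sketched as follows. This is an illustration in Python, not the Java wrapper itself; the function name is hypothetical, while the flag names are the ones listed for `nasbackup.sh` in this PR.

```python
from typing import List, Optional


def incremental_flags(mode: Optional[str] = None,
                      bitmap_new: Optional[str] = None,
                      bitmap_parent: Optional[str] = None,
                      parent_path: Optional[str] = None) -> List[str]:
    """Build the optional part of the argv; unset (None) fields add nothing."""
    flags: List[str] = []
    pairs = (
        ("-M", mode),
        ("--bitmap-new", bitmap_new),
        ("--bitmap-parent", bitmap_parent),
        ("--parent-path", parent_path),
    )
    for flag, value in pairs:
        if value is not None:
            flags += [flag, value]
    return flags


if __name__ == "__main__":
    print(incremental_flags())  # legacy caller: no extra flags at all
    print(incremental_flags(mode="full", bitmap_new="backup-1714230000"))
```

A command whose new fields are all null therefore produces the same invocation as before the change, which is the property the compatibility notes rely on.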

Test plan

Environment

- Branch `feature/nas-backup-incremental` against `main` (4.23-SNAPSHOT)
- KVM on OL8 (Trillian ol8 mgmt + kvm-ol8 profile)
- File-based primary storage (qcow2 on NFS); NAS repo on a separate NFS share
- libvirt 9.x + qemu 7.x+ (dirty bitmaps + `backup-begin --checkpointxml`)

Automated coverage

| Layer | Suite | Cases |
|---|---|---|
| Unit | `NASBackupProviderTest` | 15 total, 5 new: chain decision under master switch / no-active-checkpoint, restore-clears-checkpoint, delete-with-live-child marks pending-delete, leaf-delete sweeps up pending parent |
| Unit | `LibvirtRestoreBackupCommandWrapperTest` | stubs added for incremental restore path |
| Smoke (Trillian) | `test/integration/smoke/test_backup_recovery_nas.py` | 5 new cases, all `required_hardware="true"` |

Smoke scenarios

| Case | What it asserts |
|---|---|
| `test_incremental_chain_cadence` | With `nas.backup.full.every=3` and 5 backups, observed type sequence is `['FULL','INCREMENTAL','INCREMENTAL','FULL','INCREMENTAL']` |
| `test_restore_from_incremental` | Marker files written between each backup are all present after restoring from the tail INC |
| `test_delete_middle_incremental_repairs_chain` | After deleting a middle INC, child's `parent_id` is repointed to the surviving ancestor, backing file is rebased, downstream restore still correct |
| `test_refuse_delete_full_with_children` | Deleting a FULL that has descendants → `CloudRuntimeException`; `forced=true` cascades |
| `test_stopped_vm_falls_back_to_full` | Stopped VM → next backup is FULL, no checkpoint XML in agent command |

Manual scenarios (outside smoke scope)

| # | Scenario | Method | Expected |
|---|---|---|---|
| A | Long-run cadence stability | `full.every=10`; take 25 backups across 5 days | FULLs at positions 1, 11, 21; INCs at all others; no chain drift |
| B | 4.22 agent ↔ 4.23 mgmt | Run a 4.22 agent against the 4.23 mgmt; take backup of a VM on that host | FULL succeeds; no new flags emitted in agent command; `backup_details` carries no chain keys |
| C | Master-switch flip mid-chain | After a chain has formed, set `nas.backup.incremental.enabled=false` zone-scoped | Next backup is FULL regardless of cadence; new chain anchored |
| D | Bitmap recreation after VM stop/start | Take FULL+INC, stop and start the VM, take next INC | Agent recreates checkpoint via `virsh checkpoint-create`; INC succeeds; restore from this INC is correct |
| E | Relative-path rebase survives mount churn | Unmount/remount the NAS at a different mount point between backup and restore | Relative backing paths keep the chain valid |
| F | `nasbackup.sh` legacy invocation | Invoke without `-M` / `--bitmap-*` | Behaves byte-for-byte as 4.22; no checkpoint side-effects |

Backwards-compat checks

- `TakeBackupCommand` new fields default null → 4.22 agents ignore them (covered by Scenario B).
- Pre-PR backups with no `chain_id` in `backup_details` are treated as standalone FULLs; cascade-delete short-circuits without touching them.
- `RebaseBackupCommand` is only sent when chain metadata is present, so a downgraded agent never receives it.

Results

Test results from running this plan will be posted as a follow-up comment after execution.

Refs

Issue: [RFC] Incremental NAS Backup Support for KVM Hypervisor #12899

Adds the design document for incremental NAS backups using QEMU dirty bitmaps and libvirt's backup-begin API. Reduces daily backup storage 80-95% for large VMs. Refs: apache#12899

NASBackupChainKeys defines the keys this provider stores under the existing backup_details kv table (parent_backup_id, bitmap_name, chain_id, chain_position, type). This keeps the backups table provider-agnostic per the RFC review. nas.backup.full.every is a zone-scoped ConfigKey that controls how often a full backup is taken; the remaining backups in the cycle are incremental. Counts backups (not days), so it works for hourly, daily, and ad-hoc schedules. Default 10. Set to 1 to disable incrementals (every backup is full). Refs: apache#12899

Adds four new optional CLI flags to nasbackup.sh:

  -M|--mode <full|incremental>
  --bitmap-new <name>      (checkpoint to create with this backup)
  --bitmap-parent <name>   (incremental: parent bitmap to read changes since)
  --parent-path <path>     (incremental: parent backup file for rebase)

Behavior:
- When -M is omitted, behavior is unchanged (legacy full-only, no checkpoint created), so existing callers are not affected.
- With -M full + --bitmap-new, a full backup is taken AND a libvirt checkpoint of that name is registered atomically (via backup-begin's --checkpointxml), giving the next incremental its starting bitmap.
- With -M incremental, libvirt's <incremental> element references the parent bitmap; only changed blocks are written. After completion, qemu-img rebase wires the new file to its parent so the chain on the NAS is self-describing for restore.
- Stopped VMs cannot use backup-begin; if -M incremental is requested while the VM is stopped, the script falls back to a full and emits INCREMENTAL_FALLBACK= on stdout so the orchestrator can record it correctly in the chain.
- The script echoes BITMAP_CREATED=<name> on success so the Java caller can store it under backup_details (NASBackupChainKeys.BITMAP_NAME).

Works across local file, NFS-file, and LINSTOR primary storage. Ceph RBD running-VM support is a pre-existing limitation of this script, not affected by this change.

Refs: apache#12899

Adds the Java side of the incremental NAS backup feature:

TakeBackupCommand
  + mode, bitmapNew, bitmapParent, parentPath fields (null for legacy callers; the script preserves its existing behaviour when these are omitted).

BackupAnswer
  + bitmapCreated (echoed by the agent on success)
  + incrementalFallback (true when an incremental was requested but the agent had to fall back to full because the VM was stopped).

LibvirtTakeBackupCommandWrapper
  - Forwards the new fields to nasbackup.sh.
  - Strips the new BITMAP_CREATED= / INCREMENTAL_FALLBACK= marker lines out of stdout before the existing numeric-suffix size parser runs, so the script can keep the same "size as last line(s)" contract.
  - Surfaces both markers on the BackupAnswer.

NASBackupProvider
  - decideChain(vm) walks backup_details (chain_id, chain_position, bitmap_name) for the latest BackedUp backup of the VM and decides:
      * Stopped VM -> full (libvirt backup-begin needs running QEMU)
      * No prior chain -> full (chain_position=0)
      * chain_position+1 >= nas.backup.full.every -> new full
      * otherwise -> incremental, parent=last bitmap
  - Generates timestamp-based bitmap names ("backup-<epoch>") matching what the script then registers as the libvirt checkpoint name.
  - persistChainMetadata() writes parent_backup_id, bitmap_name, chain_id, chain_position, type into the existing backup_details key/value table (per the RFC review: no new columns on backups).
  - Honours the agent's INCREMENTAL_FALLBACK= signal: re-records the backup as a full and starts a fresh chain.
  - createBackupObject() now takes a type argument so the BackupVO reflects the actual decision instead of always being "FULL".

Refs: apache#12899
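The decision rules listed for decideChain(vm) can be sketched in a few lines. This is a Python illustration of the four rules as stated in the commit message, not the Java code; `Decision`, `decide_chain`, and `simulate` are hypothetical names, and the bitmap names used in the simulation are placeholders rather than real `backup-<epoch>` values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Decision:
    mode: str                            # "full" or "incremental"
    chain_position: int                  # 0 for the full that anchors a chain
    bitmap_parent: Optional[str] = None  # only set for incrementals


def decide_chain(vm_running: bool, last_position: Optional[int],
                 last_bitmap: Optional[str], full_every: int) -> Decision:
    if not vm_running:
        return Decision("full", 0)            # backup-begin needs a running QEMU
    if last_position is None or last_bitmap is None:
        return Decision("full", 0)            # no prior chain
    if last_position + 1 >= full_every:
        return Decision("full", 0)            # cadence reached: start a new chain
    return Decision("incremental", last_position + 1, last_bitmap)


def simulate(count: int, full_every: int) -> List[str]:
    """Take `count` backups of a running VM and return the observed types."""
    position, bitmap, types = None, None, []
    for i in range(count):
        d = decide_chain(True, position, bitmap, full_every)
        types.append("FULL" if d.mode == "full" else "INCREMENTAL")
        position, bitmap = d.chain_position, f"backup-{i}"  # placeholder bitmap name
    return types


if __name__ == "__main__":
    print(simulate(5, 3))
```

With `full_every=3` the simulation yields the same sequence the cadence smoke test expects, and `full_every=1` makes every backup a full.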

CloudStack rebuilds the libvirt domain XML on every VM start, which means persistent QEMU dirty bitmaps don't survive a stop/start cycle. Rather than hooking into the VM start lifecycle (intrusive across the orchestration layer), this commit handles the missing bitmap *lazily* at the next backup attempt:

nasbackup.sh
  - When -M incremental is requested, the script first checks `virsh checkpoint-list` for the parent bitmap. If absent, it recreates the checkpoint on the running domain so libvirt accepts the <incremental> reference. The next incremental will be larger than usual (it captures all writes since recreate, not since the previous incremental) but is correct; subsequent ones return to normal size.
  - On recreation, emits BITMAP_RECREATED=<name> on stdout for the orchestrator to record.

BackupAnswer
  + bitmapRecreated field surfaced from the agent.

LibvirtTakeBackupCommandWrapper
  - Strips BITMAP_RECREATED= line from stdout before size parsing.
  - Sets answer.setBitmapRecreated(...).

NASBackupChainKeys
  + BITMAP_RECREATED key for backup_details.

NASBackupProvider
  - When the agent reports a recreated bitmap, persists it under backup_details and logs an info-level message so operators can correlate larger-than-usual incrementals with VM restarts.

This satisfies the bitmap-loss-on-VM-restart concern from the RFC review without touching VirtualMachineManager / StartCommand / agent lifecycle.

Refs: apache#12899

Two changes that together let an incremental NAS backup be restored without manual chain assembly:

scripts/vm/hypervisor/kvm/nasbackup.sh
  - qemu-img rebase now writes a backing-file path that is RELATIVE to the new qcow2's directory (e.g. ../<parent-ts>/root.<uuid>.qcow2) rather than the absolute path on the current mount point. NAS mount points are ephemeral (mktemp -d), so an absolute reference would not resolve when the backup is re-mounted at restore time. Relative references are resolved by qemu-img against the file's own directory, so the chain stays valid no matter where the NAS is mounted next.
  - Verifies the parent file exists on the NAS before rebasing.

LibvirtRestoreBackupCommandWrapper
  - For file-based primary storage (local, NFS-file), the existing code rsync'd the source qcow2 to the volume. That copies only the differential blocks of an incremental, leaving a volume whose backing-file reference points at a path the primary storage host doesn't have. Now: detect a backing-chain via qemu-img info JSON and flatten via 'qemu-img convert -O qcow2', which follows the chain and produces a self-contained qcow2. Full backups continue to use rsync (faster, no chain to flatten).
  - The block-storage path (RBD/Linstor) already used qemu-img convert via the QemuImg helper, which auto-flattens chains, so that path needed no change.

Refs: apache#12899
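The relative-path idea above can be checked with a small path computation. The sketch below is a Python stand-in for what `realpath --relative-to` does in the script; the function name, the mount points, and the second timestamp directory are made up for illustration.

```python
import posixpath


def relative_backing_path(mount_point: str, parent_rel: str, new_dir_rel: str) -> str:
    """Path of the parent qcow2 as seen from the new backup's directory.

    parent_rel and new_dir_rel are both relative to the NAS mount root.
    """
    parent_abs = posixpath.join(mount_point, parent_rel)
    new_dir_abs = posixpath.join(mount_point, new_dir_rel)
    return posixpath.relpath(parent_abs, start=new_dir_abs)


if __name__ == "__main__":
    parent = "i-2-X/2026.04.27.12.00.00/root.UUID.qcow2"
    new_dir = "i-2-X/2026.04.28.12.00.00"  # hypothetical directory of the new incremental
    # Two different (hypothetical) mount points give the same backing reference.
    print(relative_backing_path("/tmp/tmp.aaaa", parent, new_dir))
    print(relative_backing_path("/mnt/restore", parent, new_dir))
```

Because the mount point cancels out of the result, the backing reference stored in the qcow2 header does not depend on where the NAS happens to be mounted.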

@abh1sar

Adds the delete-with-chain-repair semantics agreed in the RFC review:

scripts/vm/hypervisor/kvm/nasbackup.sh
  - New '-o rebase' operation: rebases an existing on-NAS qcow2 onto a new backing parent. Uses a SAFE rebase (no -u) so the target absorbs blocks of the about-to-be-deleted parent before the backing pointer is moved up to the grandparent. Writes the new backing reference relative to the target's directory so it survives mount-point changes.
  - New CLI flags --rebase-target, --rebase-new-backing (both passed mount-relative).

RebaseBackupCommand + LibvirtRebaseBackupCommandWrapper
  - New agent command that wraps the script's rebase operation. The provider sends one of these per child that needs re-pointing.

NASBackupProvider.deleteBackup
  - Now plans the chain repair before touching files via computeChainRepair():
      * No chain metadata -> single-file delete (legacy behaviour)
      * Tail incremental -> single delete, no rebase
      * Middle incremental -> rebase immediate child onto our parent, then delete; shift chain_position of all later descendants by -1
      * Full with descendants -> refuse unless forced=true; with forced=true delete full + every descendant newest-first
  - Updates parent_backup_id, chain_position metadata in backup_details after each rebase so the model in the DB matches the on-disk chain.

This implements the cascade-delete behaviour requested in @abh1sar's review point #7.

Refs: apache#12899
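The four planning cases can be written down as a small pure function. This is a Python sketch of the rules as listed in the commit message, under the assumption that a chain is a simple root-first list of backup ids; `plan_delete` and the returned dictionary shape are hypothetical, not the signature of computeChainRepair().

```python
from typing import Dict, List


def plan_delete(chain: List[str], target: str, forced: bool = False) -> Dict[str, object]:
    """chain is ordered full backup first, newest incremental last."""
    if target not in chain:
        # No chain metadata: legacy single-file delete.
        return {"delete": [target], "rebase": None}
    idx = chain.index(target)
    is_tail = idx == len(chain) - 1
    if idx == 0 and not is_tail:
        # Full with descendants: refuse, or cascade newest-first when forced.
        if not forced:
            raise RuntimeError("full backup has descendants; use forced=True to cascade")
        return {"delete": list(reversed(chain)), "rebase": None}
    if is_tail:
        return {"delete": [target], "rebase": None}
    # Middle incremental: rebase the immediate child onto the target's parent first.
    return {"delete": [target], "rebase": (chain[idx + 1], chain[idx - 1])}


if __name__ == "__main__":
    chain = ["FULL", "INC1", "INC2"]
    print(plan_delete(chain, "INC1"))
    print(plan_delete(chain, "FULL", forced=True))
```

Planning before touching any file keeps the refusal case free of side effects: the exception is raised before a single rebase or delete is issued.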

Adds five new test cases to test_backup_recovery_nas.py covering the end-to-end behaviour of the incremental NAS backup feature:

* test_incremental_chain_cadence
  - Sets nas.backup.full.every=3, takes 5 backups, verifies the type pattern is FULL, INC, INC, FULL, INC.
* test_restore_from_incremental
  - FULL + 2 INCs, each with a marker file. Restores from the latest INC and verifies all three markers are present (i.e. qemu-img convert flattened the chain correctly).
* test_delete_middle_incremental_repairs_chain
  - Builds FULL, INC1, INC2; deletes INC1 (no force needed); restores from the surviving INC2 and verifies that markers from FULL, INC1 (which was deleted), and INC2 are all present, proving the rebase merged INC1's blocks into INC2.
* test_refuse_delete_full_with_children
  - Verifies plain delete of a FULL that has children fails, and delete with forced=true succeeds and removes the whole chain.
* test_stopped_vm_falls_back_to_full
  - Sets cadence to 2, takes one backup (FULL), stops the VM, triggers another (cadence would say INC). Verifies the second backup is recorded as FULL because the agent fell back when backup-begin couldn't run on a stopped VM.

All tests restore nas.backup.full.every to 10 in finally blocks.

Refs: apache#12899

codecov · 2026-04-27T17:19:46Z

Codecov Report

❌ Patch coverage is 32.54157% with 284 lines in your changes missing coverage. Please review.
✅ Project coverage is 18.92%. Comparing base (6f4445c) to head (a51f335).
⚠️ Report is 79 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rg/apache/cloudstack/backup/NASBackupProvider.java | 41.11% | 141 Missing and 28 partials ⚠️ |
| ...ource/wrapper/LibvirtTakeBackupCommandWrapper.java | 0.00% | 77 Missing ⚠️ |
| ...rg/apache/cloudstack/backup/TakeBackupCommand.java | 33.33% | 16 Missing ⚠️ |
| ...ava/org/apache/cloudstack/backup/BackupAnswer.java | 0.00% | 12 Missing ⚠️ |
| ...a/org/apache/cloudstack/backup/BackupProvider.java | 0.00% | 3 Missing ⚠️ |
| ...g/apache/cloudstack/backup/NASBackupChainKeys.java | 0.00% | 3 Missing ⚠️ |
| ...rg/apache/cloudstack/backup/BackupManagerImpl.java | 57.14% | 1 Missing and 2 partials ⚠️ |
| ...ce/wrapper/LibvirtRestoreBackupCommandWrapper.java | 87.50% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #13074      +/-   ##
============================================
+ Coverage     18.02%   18.92%   +0.90%     
- Complexity    16621    18261    +1640     
============================================
  Files          6029     6175     +146     
  Lines        542184   555617   +13433     
  Branches      66451    67853    +1402     
============================================
+ Hits          97740   105177    +7437     
- Misses       433428   438875    +5447     
- Partials      11016    11565     +549

| Flag | Coverage Δ | |
|---|---|---|
| uitests | `3.53% <ø> (+<0.01%)` | ⬆️ |
| unittests | `20.13% <32.54%> (+0.94%)` | ⬆️ |


sureshanaparti · 2026-04-28T05:41:32Z

@jmsperu can you check the build failure. thanks.

weizhouapache · 2026-04-28T07:12:11Z

@jmsperu
is this ready for review ?

Phase 6 added a hasBackingChain() check before rsync that uses qemu-img info to detect chained incrementals. The existing testExecuteWithRsyncFailure test mocks Script.runSimpleBashScriptForExitValue to return 0 for any command, so the new qemu-img info check incorrectly evaluates as "has backing chain" and routes the test through the chain-flatten path instead of rsync — the test then asserts a failure that never occurs. Add a clause to the mock that returns 1 (no backing chain) for the qemu-img info backing-filename probe, so the test continues to exercise the rsync path it was designed for.
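For illustration, a backing-chain probe can also be expressed by parsing the JSON that `qemu-img info --output=json` prints, which carries a `backing-filename` field when the image has a backing file. The sketch below is Python and is not the PR's `hasBackingChain()`; the function name and the sample JSON documents are hand-written assumptions.

```python
import json


def has_backing_chain(qemu_img_info_json: str) -> bool:
    """True when the image description names a backing file."""
    info = json.loads(qemu_img_info_json)
    return bool(info.get("backing-filename"))


if __name__ == "__main__":
    # Hand-written samples in the general shape of `qemu-img info --output=json`.
    full = '{"filename": "root.UUID.qcow2", "format": "qcow2", "virtual-size": 10737418240}'
    inc = ('{"filename": "root.UUID.qcow2", "format": "qcow2", "virtual-size": 10737418240, '
           '"backing-filename": "../2026.04.27.12.00.00/root.UUID.qcow2"}')
    print(has_backing_chain(full), has_backing_chain(inc))
```

A probe that inspects parsed output rather than a bare exit status is less sensitive to a blanket mock that returns 0 for every command, which is the failure mode described above.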

jmsperu · 2026-04-28T08:37:13Z

@weizhouapache yes — ready for review.

@sureshanaparti — apologies, I missed your earlier ping. The build failure was a unit test in LibvirtRestoreBackupCommandWrapperTest.testExecuteWithRsyncFailure (NPE on currentDevice after my new chain-flatten check incorrectly routed the test through the qemu-img convert path).

Fixed in d80ed16: the test's Script.runSimpleBashScriptForExitValue mock now returns 1 (no backing chain) for the new qemu-img info | grep "backing-filename" probe, so the test continues to exercise the rsync path it was designed for.

CI should be green on the next run. Cc @abh1sar @JoaoJandre @harikrishna-patnala in case you also want to take a look.

Copilot

Pull request overview

Adds incremental backup-chain support to the NAS backup provider for KVM by leveraging libvirt backup-begin with checkpoints/dirty-bitmaps, plus restore/flatten and chain-aware delete/repair semantics.

Changes:

- Introduces backup-chain metadata keys (NASBackupChainKeys) and zone-scoped cadence config nas.backup.full.every, with orchestration logic to choose full vs incremental and persist chain details in backup_details.
- Extends the KVM agent + nasbackup.sh to support full-with-checkpoint and incremental-with-rebase, plus a new “rebase” operation used for chain repair during delete.
- Updates restore logic to detect qcow2 backing chains and flatten via qemu-img convert, and adds new integration smoke tests for incremental-chain behavior.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.


| File | Description |
|---|---|
| test/integration/smoke/test_backup_recovery_nas.py | Adds incremental-chain smoke tests (cadence, restore, delete-middle repair, forced delete behavior, stopped-VM fallback). |
| scripts/vm/hypervisor/kvm/nasbackup.sh | Adds mode-aware backup (`full`/`incremental`), checkpoint creation, incremental rebase, and a new `rebase` operation for delete-middle chain repair. |
| plugins/hypervisors/kvm/src/test/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtRestoreBackupCommandWrapperTest.java | Extends restore wrapper tests to exercise the “no backing chain => rsync” path. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtTakeBackupCommandWrapper.java | Passes incremental args to `nasbackup.sh` and parses bitmap/fallback markers from script output. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtRestoreBackupCommandWrapper.java | Detects qcow2 backing chains and flattens incrementals during restore using `qemu-img convert`. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtRebaseBackupCommandWrapper.java | New wrapper to run `nasbackup.sh -o rebase` for chain repair. |
| plugins/backup/nas/src/main/java/org/apache/cloudstack/backup/NASBackupProvider.java | Implements full-vs-incremental decisions, stores chain metadata in `backup_details`, and adds chain-aware delete/repair logic. |
| plugins/backup/nas/src/main/java/org/apache/cloudstack/backup/NASBackupChainKeys.java | Defines `backup_details` keys for chain id/position/type/bitmap/parent linkage. |
| docs/rfcs/incremental-nas-backup.md | Adds an RFC document describing incremental NAS backup approach (needs alignment with final implementation). |
| core/src/main/java/org/apache/cloudstack/backup/TakeBackupCommand.java | Adds optional incremental-mode fields (mode/bitmap names/parent path). |
| core/src/main/java/org/apache/cloudstack/backup/RebaseBackupCommand.java | New agent command to rebase a backup qcow2 onto a new backing file for chain repair. |
| core/src/main/java/org/apache/cloudstack/backup/BackupAnswer.java | Adds fields to return bitmap creation/recreation and incremental-fallback markers back to orchestration. |


Copilot · 2026-04-28T09:21:11Z

+    # All tests set nas.backup.full.every to a small value (3) so a chain
+    # forms quickly without needing many backup iterations. They restore
+    # the original value at teardown.
+
+    def _set_full_every(self, value):
+        Configurations.update(self.apiclient, name='nas.backup.full.every',
+                              value=str(value), zoneid=self.zone.id)
+


The incremental tests always reset nas.backup.full.every to 10 in finally, but they never read/preserve the original zone value. If a test environment has a non-default value configured, these tests will leave the zone config changed. Consider capturing the current value once (e.g., via Configurations.list(...) in setup) and restoring that exact value in each finally (or in tearDown).
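One way to capture and restore the exact original value is a small context manager. The sketch below is Python and deliberately takes the getter and setter as plain callables, so it does not assume any particular Marvin call; `temporary_setting` and the in-memory `store` are hypothetical stand-ins for the zone configuration.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def temporary_setting(get_value: Callable[[], str],
                      set_value: Callable[[str], None],
                      new_value: str) -> Iterator[None]:
    """Apply new_value for the duration of the block, then restore what was there."""
    original = get_value()  # read the live value instead of assuming the default
    set_value(new_value)
    try:
        yield
    finally:
        set_value(original)


if __name__ == "__main__":
    key = "nas.backup.full.every"
    store = {key: "7"}  # stands in for a zone with a non-default value
    with temporary_setting(lambda: store[key], lambda v: store.__setitem__(key, v), "3"):
        print("inside:", store[key])
    print("after:", store[key])
```

In a real test the two callables would wrap whatever list and update calls the suite already uses, which keeps the restore logic in one place instead of in every `finally` block.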

Copilot · 2026-04-28T09:21:11Z

+        command.setBitmapParent(decision.bitmapParent);
+        command.setParentPath(decision.parentPath);

        if (VirtualMachine.State.Stopped.equals(vm.getState())) {


For stopped VMs decideChain returns fullStart(newBitmap) and takeBackup always sets command.mode="full" / bitmapNew. But nasbackup.sh's stopped-VM path doesn’t create a checkpoint/bitmap (no BITMAP_CREATED=), so persisting nas.bitmap_name from the requested bitmap can cause the next backup to attempt an incremental against a bitmap that was never created. Consider clearing mode/bitmapNew (legacy full) for stopped VMs and/or only persisting nas.bitmap_name when the agent confirms it via BITMAP_CREATED=.

Suggested change

-        if (VirtualMachine.State.Stopped.equals(vm.getState())) {
+        if (VirtualMachine.State.Stopped.equals(vm.getState())) {
+            // Stopped-VM backups use the offline path and do not create checkpoints/bitmaps.
+            // Clear chain metadata so a full backup does not imply a bitmap was created.
+            command.setMode(null);
+            command.setBitmapNew(null);
+            command.setBitmapParent(null);
+            command.setParentPath(null);

Copilot · 2026-04-28T09:21:12Z

+    if [[ "$effective_mode" == "incremental" ]]; then
+      volUuid="${fullpath##*/}"
+      if [[ "$fullpath" == /dev/drbd/by-res/* ]]; then
+        volUuid=$(get_linstor_uuid_from_path "$fullpath")
+      fi
+      # PARENT_PATH from the orchestrator is the parent backup's path relative to the
+      # NAS mount root (e.g. "i-2-X/2026.04.27.12.00.00/root.UUID.qcow2"). Convert it to
+      # a path relative to THIS new qcow2's directory so the backing reference resolves
+      # correctly the next time the NAS is mounted (mount points are ephemeral).
+      local parent_abs="$mount_point/$PARENT_PATH"
+      if [[ ! -f "$parent_abs" ]]; then
+        echo "Parent backup file does not exist on NAS: $parent_abs"
+        cleanup
+        exit 1
+      fi
+      local parent_rel
+      parent_rel=$(realpath --relative-to="$dest" "$parent_abs")
+      if ! qemu-img rebase -u -b "$parent_rel" -F qcow2 "$dest/$name.$volUuid.qcow2" >> "$logFile" 2> >(cat >&2); then
+        echo "qemu-img rebase failed for $dest/$name.$volUuid.qcow2 onto $parent_rel"


In incremental mode, each exported qcow2 (root.* and datadisk.*) is rebased onto the same PARENT_PATH. For VMs with multiple volumes this will rebase data-disk incrementals onto the root-disk parent file, corrupting the chain for non-root volumes. PARENT_PATH needs to be per-disk (or pass the parent backup directory and derive the correct parent filename for each volUuid/disk role).
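The "pass the parent backup directory and derive the filename per disk" option could look roughly like this. It is a Python sketch that assumes the parent backup used the same `<name>.<volUuid>.qcow2` naming seen in the script and that each disk keeps its role between backups; the function name and the UUID values are hypothetical.

```python
import posixpath
from typing import Dict, Iterable, Tuple


def per_disk_parent_paths(parent_dir: str,
                          disks: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each volume UUID to its own parent file under parent_dir.

    disks is an iterable of (role, vol_uuid), e.g. ("root", ...) or ("datadisk", ...).
    """
    return {
        vol_uuid: posixpath.join(parent_dir, f"{role}.{vol_uuid}.qcow2")
        for role, vol_uuid in disks
    }


if __name__ == "__main__":
    parents = per_disk_parent_paths(
        "i-2-X/2026.04.27.12.00.00",
        [("root", "uuid-root"), ("datadisk", "uuid-data")],  # placeholder UUIDs
    )
    for vol_uuid, path in parents.items():
        print(vol_uuid, "->", path)
```

Each incremental is then rebased onto the parent file of the same volume, so a data disk never ends up backed by the root disk's image.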

Copilot · 2026-04-28T09:21:12Z

+  echo "Incremental backup options (running VMs only; requires QEMU >= 4.2 and libvirt >= 7.2):"
+  echo "  -M|--mode full          Take a full backup AND create a checkpoint (--bitmap-new required) for future incrementals."
+  echo "  -M|--mode incremental   Take an incremental backup since --bitmap-parent and create new checkpoint --bitmap-new."
+  echo "                          Requires --bitmap-parent, --bitmap-new, and --parent-path (parent backup file for rebase)."
+  echo "  Without -M, behaves as legacy full-only backup with no checkpoint creation."


The usage text says “Without -M, behaves as legacy full-only backup with no checkpoint creation”, but the script still runs sanity_checks unconditionally (QEMU>=4.2/libvirt>=7.2) even for legacy callers and non-backup ops (delete/stats/rebase). To preserve the documented legacy behavior/backward compatibility, gate the version checks to only the code paths that actually require backup-begin/incremental features (or only when MODE is set).

@bernardodemarco

@bernardodemarco pointed out that design docs / RFCs go in the project wiki or as a separate issue rather than into the source tree. The RFC content has been posted as a comment on the existing tracking issue apache#12899 (which is where the design discussion already lives), and the docs/rfcs/ directory is removed from this PR.

jmsperu · 2026-04-28T21:04:40Z

@bernardodemarco thanks — good point. Done in 9764025:

- Removed docs/rfcs/incremental-nas-backup.md from this PR
- Posted the full RFC text as a comment on the existing tracking issue, [RFC] Incremental NAS Backup Support for KVM Hypervisor #12899, which is also where all the prior design discussion lives, so it stays together: [RFC] Incremental NAS Backup Support for KVM Hypervisor #12899 (comment)

PR is now purely the implementation. Updated PR description to drop the doc reference.

blueorangutan · 2026-06-13T14:08:04Z

@abh1sar a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2026-06-13T14:53:56Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18250

abh1sar · 2026-06-13T14:56:02Z

@blueorangutan test

blueorangutan · 2026-06-13T14:58:03Z

@abh1sar a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

…ancestors Address abh1sar review on PR apache#13074 (NASBackupProvider.java:995). The previous sweep/cascade path called findChainParent in a loop, each one issuing a fresh listByVmId — O(N) DB calls per chain walk. Add getChainOrderedLeafToRoot(member) which materialises the full chain (every backup row sharing CHAIN_ID) via a single listByVmId, sorted leaf-first by CHAIN_POSITION. Rewrite deleteLeafBackupAndSweepPendingAncestors to snapshot that chain BEFORE the leaf delete (so the in-memory list stays resolvable after the row is gone), then iterate ancestors from the snapshot. Rewrite cascadeDeleteSubtree as a plain leaf-first walk of the ordered chain — NAS backups are a linear chain, no tree-walking needed. findChainParent is kept (the parent-row lookup is still a useful primitive) with a Javadoc note recommending the new method when iterating.
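The single-query idea can be sketched as follows: fetch the VM's backup rows once, keep those sharing the member's chain id, and sort them leaf-first. This is a Python illustration, not the Java method; the row shape (dicts with `id`, `chain_id`, `chain_position`) is an assumption made for the example.

```python
from typing import Dict, List


def chain_leaf_to_root(rows: List[Dict[str, object]], member_id: str) -> List[Dict[str, object]]:
    """rows is the result of ONE listing of the VM's backups, details included."""
    member = next((r for r in rows if r["id"] == member_id), None)
    if member is None or member.get("chain_id") is None:
        return []  # standalone backup: no chain to walk
    chain = [r for r in rows if r.get("chain_id") == member["chain_id"]]
    # Highest chain_position first, i.e. leaf before its ancestors.
    return sorted(chain, key=lambda r: int(r["chain_position"]), reverse=True)


if __name__ == "__main__":
    rows = [
        {"id": "b1", "chain_id": "c1", "chain_position": 0},
        {"id": "b3", "chain_id": "c1", "chain_position": 2},
        {"id": "b2", "chain_id": "c1", "chain_position": 1},
        {"id": "old", "chain_id": None, "chain_position": None},
    ]
    print([r["id"] for r in chain_leaf_to_root(rows, "b2")])
```

Because the ordered list is built in memory, it remains usable after the leaf row has been deleted, which is the reason the commit snapshots the chain before the delete.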

…wrapper

Address abh1sar review on PR apache#13074 (nasbackup.sh:155, :193, :358; LibvirtTakeBackupCommandWrapper.java:124).

The script was carrying caller-side policy: arg validation, fallback decisions, and stdout markers that the wrapper had to parse out before the size-parsing logic could run. Move that policy into Java and use dedicated exit codes for the signals the wrapper needs.

Script (scripts/vm/hypervisor/kvm/nasbackup.sh):
* Drop the per-mode required-args checks (the wrapper now pre-validates).
* Replace the INCREMENTAL_FALLBACK stdout marker with exit code 21 (EXIT_INCREMENTAL_UNSUPPORTED): emitted when the running-VM path can't re-register the parent checkpoint, and when the stopped-VM path was asked for incremental. The wrapper retries the script as a full backup and sets incrementalFallback on the BackupAnswer.
* Replace the BITMAP_CREATED stdout marker with exit code 22 (EXIT_BITMAP_NOT_SEEDED), emitted only by the stopped-VM path when qemu-img bitmap --add failed for every source disk. Backup file is valid but no usable bitmap exists on the host; wrapper records bitmapCreated=null so NASBackupProvider clears active_checkpoint_id and the next backup starts a fresh full chain. Running-VM success path no longer needs a marker: libvirt's backup-begin atomically creates the checkpoint.

LibvirtTakeBackupCommandWrapper.java:
* Pre-validate incremental args (mode-vs-bitmapNew/Parent/parentPaths) before invoking the script. Returns a failed BackupAnswer on missing args, keeping the script agnostic to caller policy.
* Extract runBackupScript() so the same code can fire the retry-as-full after EXIT_INCREMENTAL_UNSUPPORTED without duplicating arg assembly.
* On EXIT_INCREMENTAL_UNSUPPORTED + requestedMode==incremental, re-invoke with mode=full and only --bitmap-new (drop --bitmap-parent/--parent-paths); set incrementalFallback=true on the eventual answer.
* On EXIT_BITMAP_NOT_SEEDED, treat as success but set bitmapCreated=null.
* Drop the stdout-marker stripping loop (markers no longer emitted), and the separate BITMAP_CREATED parsing; bitmapCreated mirrors command.getBitmapNew() unless the not-seeded exit code says otherwise.

NASBackupProvider.java:
* Refresh the two comment blocks that referenced the old BITMAP_CREATED= stdout signal to describe the new exit-code path. No behaviour change in this file.
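The exit-code handling described in this commit can be sketched with the script runner injected as a callable, so the control flow is testable without a hypervisor. This is a Python illustration of the retry policy, not the Java wrapper; `take_backup`, the result dictionary, and the fake runner are hypothetical, while the two exit codes and flag names come from the commit message.

```python
from typing import Callable, Dict, List, Optional

EXIT_INCREMENTAL_UNSUPPORTED = 21
EXIT_BITMAP_NOT_SEEDED = 22


def take_backup(run_script: Callable[[List[str]], int], mode: str, bitmap_new: str,
                bitmap_parent: Optional[str] = None) -> Dict[str, object]:
    def argv(m: str) -> List[str]:
        args = ["--mode", m, "--bitmap-new", bitmap_new]
        if m == "incremental" and bitmap_parent:
            args += ["--bitmap-parent", bitmap_parent]
        return args

    fallback = False
    code = run_script(argv(mode))
    if code == EXIT_INCREMENTAL_UNSUPPORTED and mode == "incremental":
        fallback = True
        code = run_script(argv("full"))  # retry as full, parent bitmap dropped
    if code == EXIT_BITMAP_NOT_SEEDED:
        # Backup file is valid, but no usable bitmap exists for the next incremental.
        return {"success": True, "bitmap_created": None, "incremental_fallback": fallback}
    if code != 0:
        return {"success": False, "bitmap_created": None, "incremental_fallback": fallback}
    return {"success": True, "bitmap_created": bitmap_new, "incremental_fallback": fallback}


if __name__ == "__main__":
    # Fake runner: pretends incrementals are unsupported, fulls succeed.
    fake = lambda args: EXIT_INCREMENTAL_UNSUPPORTED if "incremental" in args else 0
    print(take_backup(fake, "incremental", "backup-2", "backup-1"))
```

Keeping the retry decision next to the argv assembly is what lets the second invocation reuse the same builder, which mirrors the stated reason for extracting runBackupScript().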

jmsperu · 2026-06-13T21:41:37Z

@abh1sar, I handled the whole 2026-06-13 review batch under a single author, for style consistency with the earlier rounds. Two commits pushed:

1. 096bef1292 — backup(nas): collapse N+1 chain queries when sweeping delete-pending ancestors
Addresses your comment on NASBackupProvider.java:995. Added getChainOrderedLeafToRoot(member) which materialises the chain via a single listByVmId call ordered leaf-first by CHAIN_POSITION. deleteLeafBackupAndSweepPendingAncestors now snapshots that chain before the leaf delete (so the in-memory list stays resolvable after the row is gone), then iterates ancestors from the snapshot. cascadeDeleteSubtree is now a plain leaf-first walk — NAS backups are a linear chain so no tree traversal is needed. findChainParent is kept (still the right primitive for single parent-row lookups) with a Javadoc note recommending the new method when looping.

2. 73c4206c21 — backup(nas): move backup-mode policy + stdout markers from script to wrapper
Addresses your comments on nasbackup.sh:155, nasbackup.sh:193, nasbackup.sh:358, and LibvirtTakeBackupCommandWrapper.java:124.

The script was carrying caller-side policy (arg validation, fallback decisions) and emitting stdout markers the wrapper had to parse around. Both have moved into Java; the script now uses dedicated exit codes for the signals the wrapper actually needs:

- EXIT_INCREMENTAL_UNSUPPORTED=21 replaces the INCREMENTAL_FALLBACK= stdout marker. Emitted when (a) the running-VM path can't re-register the parent checkpoint, or (b) the stopped-VM path was asked for incremental. Java owns the retry policy: the wrapper sees the exit code and re-invokes the script with --mode=full + the same --bitmap-new, then sets incrementalFallback=true on the answer.
- EXIT_BITMAP_NOT_SEEDED=22 replaces the BITMAP_CREATED= stdout marker. Emitted only by the stopped-VM path when qemu-img bitmap --add failed on every source disk. The backup file is valid; the wrapper records bitmapCreated=null so NASBackupProvider clears active_checkpoint_id and the next backup starts a fresh chain. The running-VM success path no longer needs a marker because backup-begin is atomic.
- validateBackupArgs(command) in the wrapper pre-validates the mode + bitmap args before invoking the script. The script's per-mode required-args block is gone; the agnostic case statement remains as defensive cover for direct invocations.
- runBackupScript() is extracted so the EXIT_INCREMENTAL_UNSUPPORTED retry doesn't duplicate the argv-assembly logic.
- The wrapper's stdout-marker stripping loop is removed, and the BITMAP_CREATED re-parse is gone (bitmapCreated now mirrors command.getBitmapNew() directly, gated on the not-seeded exit code).

The only changes in NASBackupProvider.java are two comment refreshes that describe the new exit-code path instead of the old marker.
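The exit-code handling described in the list above can be sketched roughly as follows. This is a minimal illustration, not the wrapper's actual code: the class `BackupRetrySketch`, its `Result` type, and the `ToIntFunction` stand-in for running the script are hypothetical, and only the flags named in this comment (`--mode`, `--bitmap-new`, `--bitmap-parent`, `--parent-paths`) are modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

/** Hypothetical sketch of retry-as-full driven by script exit codes. */
final class BackupRetrySketch {
    static final int EXIT_OK = 0;
    static final int EXIT_INCREMENTAL_UNSUPPORTED = 21;
    static final int EXIT_BITMAP_NOT_SEEDED = 22;

    /** Fields named after the BackupAnswer properties mentioned in the comment. */
    static final class Result {
        final boolean success;
        final boolean incrementalFallback;
        final String bitmapCreated;

        Result(boolean success, boolean incrementalFallback, String bitmapCreated) {
            this.success = success;
            this.incrementalFallback = incrementalFallback;
            this.bitmapCreated = bitmapCreated;
        }
    }

    /** Assembles only the mode and bitmap arguments; null values are left out. */
    static List<String> buildArgs(String mode, String bitmapNew, String bitmapParent, String parentPaths) {
        List<String> args = new ArrayList<>();
        args.add("--mode");
        args.add(mode);
        if (bitmapNew != null) { args.add("--bitmap-new"); args.add(bitmapNew); }
        if (bitmapParent != null) { args.add("--bitmap-parent"); args.add(bitmapParent); }
        if (parentPaths != null) { args.add("--parent-paths"); args.add(parentPaths); }
        return args;
    }

    /** The script is modelled as a function from argv to exit code (an assumption). */
    static Result takeBackup(String requestedMode, String bitmapNew, String bitmapParent,
                             String parentPaths, ToIntFunction<List<String>> script) {
        int rc = script.applyAsInt(buildArgs(requestedMode, bitmapNew, bitmapParent, parentPaths));
        boolean fallback = false;
        if (rc == EXIT_INCREMENTAL_UNSUPPORTED && "incremental".equals(requestedMode)) {
            // Retry as full: keep --bitmap-new, drop the parent arguments.
            rc = script.applyAsInt(buildArgs("full", bitmapNew, null, null));
            fallback = true;
        }
        if (rc == EXIT_BITMAP_NOT_SEEDED) {
            // Backup is valid but no bitmap exists, so report none as created.
            return new Result(true, fallback, null);
        }
        if (rc == EXIT_OK) {
            return new Result(true, fallback, bitmapNew);
        }
        return new Result(false, fallback, null);
    }
}
```

Passing the script as a function keeps the sketch testable without a hypervisor; a real wrapper would presumably run the process and read its exit status instead.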

Note on the parent-checkpoint redefine itself: I kept the actual virsh checkpoint-create --redefine call in the script because it already has the NAS mounted at that point (the parent's .checkpoint.xml lives next to the parent backup file, mount-relative). Moving the redefine into Java would mean duplicating the mount logic in the wrapper, which felt worse than the current shape. What did move out is the decision that comes after the redefine fails — which is what carried the stdout-marker complexity. Happy to revisit if you'd rather see the redefine itself in Java too.

Tested: chain N+1 fix is straightforward refactor against the existing unit tests; for the script + exit codes I'd appreciate a fresh @blueorangutan test run since I don't have access to a libvirt-10/qemu-8.2 host on my side. Thanks for the patience on this batch.

…ures deletingLeafSweepsUpDeletePendingParent previously omitted CHAIN_POSITION because the old PARENT_BACKUP_ID-walking sweep didn't depend on it. The new getChainOrderedLeafToRoot helper (096bef1) sorts the chain by CHAIN_POSITION desc; without those mocks both backups returned MAX_VALUE, the stable sort left the leaf at index 0 and the parent never got swept. Real backups always carry CHAIN_POSITION (set in persistChainMetadata), so this aligns the fixtures with production data rather than papering over the new sort assumption.
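The ordering behaviour this commit message relies on can be illustrated with a small sketch. It assumes, hypothetically, that a missing CHAIN_POSITION is treated as `Integer.MAX_VALUE` and that the chain is sorted with a stable sort; `ChainOrderSketch` and `BackupRow` are illustrative names, not classes from the PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of leaf-first chain ordering by CHAIN_POSITION. */
final class ChainOrderSketch {
    /** Stand-in for a backup row; chainPosition is null when the detail is missing. */
    static final class BackupRow {
        final String uuid;
        final Integer chainPosition;

        BackupRow(String uuid, Integer chainPosition) {
            this.uuid = uuid;
            this.chainPosition = chainPosition;
        }
    }

    /** Missing positions collapse to the same sentinel, so they compare as equal. */
    static int positionOrMax(BackupRow row) {
        return row.chainPosition == null ? Integer.MAX_VALUE : row.chainPosition;
    }

    /**
     * Returns a copy ordered by position descending (leaf first, root last).
     * List.sort is stable, so rows with equal positions keep their input order.
     */
    static List<BackupRow> orderLeafToRoot(List<BackupRow> chain) {
        List<BackupRow> ordered = new ArrayList<>(chain);
        ordered.sort((a, b) -> Integer.compare(positionOrMax(b), positionOrMax(a)));
        return ordered;
    }
}
```

With real positions the result is leaf, middle, root regardless of input order. When every row lacks a position, the sort cannot distinguish them and simply returns the input order, which is why fixtures without CHAIN_POSITION do not exercise the sweep the way production data does.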

blueorangutan · 2026-06-14T06:30:11Z

[SF] Trillian test result (tid-16304)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 53051 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr13074-t16304-kvm-ol8.zip
Smoke tests completed. 142 look OK, 9 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_DeployVmAntiAffinityGroup_in_project	`Error`	63.89	test_affinity_groups_projects.py
test_DeployVmAntiAffinityGroup	`Error`	7.90	test_affinity_groups.py
ContextSuite context=TestNASBackupAndRecovery>:setup	`Error`	0.00	test_backup_recovery_nas.py
test_03_deploy_and_scale_kubernetes_cluster	`Failure`	29.11	test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster	`Failure`	0.11	test_kubernetes_clusters.py
test_12_test_deploy_cluster_different_offerings_per_node_type	`Failure`	77.67	test_kubernetes_clusters.py
test_05_list_volumes_isrecursive	`Failure`	0.05	test_list_volumes.py
test_07_list_volumes_listall	`Failure`	0.04	test_list_volumes.py
test_01_non_strict_host_anti_affinity	`Failure`	78.27	test_nonstrict_affinity_group.py
test_02_non_strict_host_affinity	`Error`	28.54	test_nonstrict_affinity_group.py
test_01_vpn_usage	`Error`	1.13	test_usage.py
ContextSuite context=TestMigrateVMStrictTags>:setup	`Error`	0.00	test_vm_strict_host_tags.py
test_hostha_enable_ha_when_host_in_maintenance	`Error`	302.30	test_hostha_kvm.py

harikrishna-patnala · 2026-06-16T11:03:23Z

@jmsperu can you please fix the failing test

Error:  Errors: 
Error:    NASBackupProviderTest.unnecessary Mockito stubbings » UnnecessaryStubbing

…-parent test

deletingLeafSweepsUpDeletePendingParent stubbed the leaf's PARENT_BACKUP_ID but production never reads it on this path:
- findLiveChildren(leaf) iterates the sibling list and reads each *other* backup's PARENT_BACKUP_ID against leaf.getUuid(), never the leaf's own.
- getChainOrderedLeafToRoot (introduced in 096bef1) walks the chain by CHAIN_ID + CHAIN_POSITION; the legacy findChainParent → PARENT_BACKUP_ID walk is bypassed for the sweep.

Same UnnecessaryStubbingException pattern as 9f4d61f; the parent's PARENT_BACKUP_ID stub IS still used (findLiveChildren reads it) so it stays. Unblocks CI for apache#13074.

jmsperu · 2026-06-20T08:05:16Z

@harikrishna-patnala Done — fixed in f574628e (dropped the unnecessary PARENT_BACKUP_ID stub in the sweep-pending test). The build/unit-test jobs are green now.

The only remaining red is the smoke batch (test_list_accounts, test_list_disk_offerings, test_list_domains…), which is unrelated to the NAS backup changes and looks like a flaky/infra run — could a committer kick a re-run when convenient? Thanks!

DaanHoogland · 2026-06-21T14:19:18Z

@blueorangutan package

blueorangutan · 2026-06-21T14:20:04Z

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2026-06-21T15:27:04Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18327

abh1sar

> Re-register the parent with checkpoint-create --redefine using the full checkpoint-dumpxml output (a minimal/synthesized XML is rejected by libvirt's checkpoint RNG schema). So: persist <bitmap>.checkpoint.xml next to each backup on the NAS, and on recreate --redefine from the parent backup's saved XML

@jmsperu In my testing, I found that a full checkpoint xml was not required. Just the checkpoint name and the created tag is enough for redefine. We don't have to store created as it doesn't have to be accurate. Checkpoints are ephemeral anyway.
Can you please check https://github.com/shapeblue/cloudstack/blob/integration-veeam-kvm/plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtStartBackupCommandWrapper.java#L127?
This way we don't have to persist the full checkpoint xml.

> On the larger "move the checks into Java" suggestion: I started it, but testing showed the recreate needs checkpoint-dumpxml + --redefine against NAS-side XML, which is cohesive in the script; I've kept it there for now and can revisit the Java move as a follow-up.

With the above change, the virsh checkpoint redefine logic can be moved to Java, as we don't need to read the checkpoint.xml file. If the checkpoint redefine fails, the Java code can fall back to full and call nasbackup.sh without the -M flag.

It is possible that the checkpoint stored in the DB as the VM's active checkpoint is not present in the qcow2 file (after migration etc.). In that case the backup currently fails with an error. We should try to fall back to a full backup in such cases. I handled a similar thing recently here https://github.com/shapeblue/cloudstack/blob/integration-veeam-kvm/plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtStartBackupCommandWrapper.java#L127.
You can take some ideas from that.

~~restoreBackedUpVolume() restores a single volume, we should refer vm's active checkpoint to null in this case also.~~ - Disregard this as restore and attach volume to an instance which is assigned a backup offering is not allowed

We need to fix the backup delete logic, especially around hidden deletes, cascade deletes, and how resource limits and usage are decremented in BackupManagerImpl(). Can you explore that?

Currently BackupManagerImpl tries again to delete from the DB a backup that the NAS provider has already deleted. Also, the NAS provider might have deleted the full chain, but resource limits and usage are not updated for it.
Can you check how snapshot delete does that? We need a simple fix with rigid expectations and rules.

I have started testing the PR. But these comments need to be addressed as well.

abh1sar

During testing I observed that the backup_details and vm_instance_details tables had chain- and checkpoint-related fields persisted even though the incremental feature was not enabled.
We should make sure that existing functionality is unchanged when the feature is not enabled.

abh1sar · 2026-06-22T06:33:12Z

+rebase_backup() {
+  mount_operation
+
+  if [[ -z "$REBASE_TARGET" || -z "$REBASE_NEW_BACKING" ]]; then
+    echo "rebase requires --rebase-target and --rebase-new-backing"
+    cleanup
+    exit 1
+  fi
+
+  local target_abs="$mount_point/$REBASE_TARGET"
+  local backing_abs="$mount_point/$REBASE_NEW_BACKING"
+  if [[ ! -f "$target_abs" ]]; then
+    echo "Rebase target file does not exist: $target_abs"
+    cleanup
+    exit 1
+  fi
+  if [[ ! -f "$backing_abs" ]]; then
+    echo "New backing file does not exist: $backing_abs"
+    cleanup
+    exit 1
+  fi
+  local target_dir
+  target_dir=$(dirname "$target_abs")
+  local backing_rel
+  backing_rel=$(realpath --relative-to="$target_dir" "$backing_abs")
+
+  # SAFE rebase (no -u): qemu-img reads blocks from the old chain and writes them into
+  # the target where the new chain doesn't cover them. This is the "merge into" semantic
+  # required when we're about to delete the old immediate parent — the target needs to
+  # absorb the to-be-deleted parent's blocks so the chain remains consistent against the
+  # new (further-back) backing.
+  if ! qemu-img rebase -b "$backing_rel" -F qcow2 "$target_abs" >> "$logFile" 2> >(cat >&2); then
+    echo "qemu-img rebase failed for $target_abs onto $backing_rel"
+    cleanup
+    exit 1
+  fi
+  sync
+  umount $mount_point
+  rmdir $mount_point


rebase_backup() is not being called now. Remove all occurrences from script.

abh1sar · 2026-06-22T06:34:17Z

+# QEMU >= 4.2 and libvirt >= 7.2 are only required for backup-begin (incremental
+# checkpoints and per-bitmap exports). Legacy full-only backups, plus delete /
+# stats / rebase operations, run on older versions just fine. Gate the version
+# check to the paths that actually need it to preserve backward compatibility.
+if [ "$OP" = "backup" ] && [ -n "$MODE" ]; then
+  sanity_checks
+fi


This doesn't look right.
sanity_checks was being called unconditionally before. It has nothing to do with incremental backups.
This change should be removed.

abh1sar · 2026-06-22T06:34:58Z

+    --rebase-target)
+      REBASE_TARGET="$2"
+      shift
+      shift
+      ;;
+    --rebase-new-backing)
+      REBASE_NEW_BACKING="$2"
+      shift
+      shift
+      ;;


remove these as well

abh1sar · 2026-06-22T06:49:25Z

+        # No saved checkpoint XML (e.g. a backup taken before this fix) or redefine failed.
+        # Signal the Java wrapper to retry as a full backup so the chain restarts cleanly
+        # instead of failing the backup. The wrapper is responsible for the retry and for
+        # recording incrementalFallback=true on the resulting BackupAnswer.
+        log -e "incremental: parent checkpoint $BITMAP_PARENT could not be re-registered — exiting $EXIT_INCREMENTAL_UNSUPPORTED for caller-driven fallback"
+        cleanup
+        exit $EXIT_INCREMENTAL_UNSUPPORTED
+      fi


Can't we just set $effective_mode="full" here instead of returning? I think that will make the code much simpler.
We can remove the below check as well since backup of stopped VM should not be sent using incremental mode anyway.

backup_stopped_vm() {
  # Stopped VMs cannot use libvirt's backup-begin (no QEMU process). Take a full
  # backup via qemu-img convert. If the caller asked for incremental, signal the
  # Java wrapper to retry as full and record the fallback on the BackupAnswer.
  if [[ "$MODE" == "incremental" ]]; then
    log -e "incremental: VM stopped — exiting $EXIT_INCREMENTAL_UNSUPPORTED for caller-driven fallback to full"
    exit $EXIT_INCREMENTAL_UNSUPPORTED
  fi

With these two changes we can remove EXIT_INCREMENTAL_UNSUPPORTED completely from the script and the wrapper

abh1sar · 2026-06-22T08:22:09Z

+        // bitmapCreated mirrors what we asked the script to create — except when the
+        // script exited EXIT_BITMAP_NOT_SEEDED, in which case the host has no bitmap
+        // and the orchestrator must clear active_checkpoint_id.
+        answer.setBitmapCreated(bitmapSeeded ? command.getBitmapNew() : null);


bitmapSeeded is true by default, even if the incremental backup feature is not enabled and bitmaps are not actually created.

This causes nas.active_checkpoint_id to be set in vm_details without the feature being enabled.

I suggest returning an error if bitmaps cannot be created instead of EXIT_BITMAP_NOT_SEEDED.

It is only being used in case of stopped VMs. bitmap --add operation should not fail ideally and it should be ok to fail the backup if bitmap --add fails for any disk, so that the underlying problem is dealt with instead of forcing full backups.

And for the cases where the incremental feature is not enabled, we should not be sending bitmap_new in the TakeBackupCommand

abh1sar · 2026-06-22T08:32:20Z

+        // of the live config). The next backup with this flag back on starts a new chain.
+        Boolean incrementalEnabled = NASBackupIncrementalEnabled.valueIn(vm.getDataCenterId());
+        if (incrementalEnabled == null || !incrementalEnabled) {
+            return ChainDecision.fullStart(newBitmap);


Don't set newBitmap if incremental backups are not enabled.
Otherwise the bitmap information is persisted in backup_details and vm_instance_details even though the bitmap was never actually persisted on disk.

Also return mode as legacy-full as suggested in the other comment

abh1sar · 2026-06-22T08:46:44Z

+        command.setBitmapNew(decision.bitmapNew);
+        command.setBitmapParent(decision.bitmapParent);
+        command.setParentPaths(decision.parentPaths);


Let's have 3 modes here as well: incremental | full | legacy_full
And do not set any bitmap fields or parent paths if the mode is legacy_full.
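A rough sketch of the three-mode decision suggested here, under stated assumptions: `ChainDecisionSketch` and its `decide` helper are hypothetical names, and the count-based `nas.backup.full.every` rule that also forces a periodic full backup is deliberately left out. The point illustrated is only that legacy_full carries no bitmap or parent fields at all.

```java
/** Hypothetical sketch of a three-mode backup decision. */
final class ChainDecisionSketch {
    enum Mode { INCREMENTAL, FULL, LEGACY_FULL }

    final Mode mode;
    final String bitmapNew;
    final String bitmapParent;
    final String parentPaths;

    private ChainDecisionSketch(Mode mode, String bitmapNew, String bitmapParent, String parentPaths) {
        this.mode = mode;
        this.bitmapNew = bitmapNew;
        this.bitmapParent = bitmapParent;
        this.parentPaths = parentPaths;
    }

    /** Feature off: no bitmap, no parent, nothing to persist as chain metadata. */
    static ChainDecisionSketch legacyFull() {
        return new ChainDecisionSketch(Mode.LEGACY_FULL, null, null, null);
    }

    /** Feature on, new chain: a bitmap is created but there is no parent. */
    static ChainDecisionSketch fullStart(String newBitmap) {
        return new ChainDecisionSketch(Mode.FULL, newBitmap, null, null);
    }

    /** Feature on, existing chain: both the new and the parent bitmap are set. */
    static ChainDecisionSketch incremental(String newBitmap, String parentBitmap, String parentPaths) {
        return new ChainDecisionSketch(Mode.INCREMENTAL, newBitmap, parentBitmap, parentPaths);
    }

    /** Simplified decision; a null flag is treated as disabled, as in the reviewed hunk. */
    static ChainDecisionSketch decide(Boolean incrementalEnabled, String newBitmap,
                                      String activeCheckpoint, String parentPaths) {
        if (incrementalEnabled == null || !incrementalEnabled) {
            return legacyFull();
        }
        if (activeCheckpoint == null) {
            return fullStart(newBitmap);
        }
        return incremental(newBitmap, activeCheckpoint, parentPaths);
    }
}
```

A caller could then copy the fields onto the command only when the mode is not `LEGACY_FULL`, which keeps the feature-off path free of chain metadata.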

…che#13074) restoreBackedUpVolume() attaches a restored volume whose image carries no QEMU bitmap, so the target VM's nas.active_checkpoint_id becomes stale. Clear it (mirroring the full-restore paths restoreVMFromBackup/restoreBackupToVM) so the next backup of that VM is a fresh full. Adds NASBackupProviderTest.restoreBackedUpVolumeClearsTargetVmActiveCheckpoint, which fails against the pre-fix code (removeDetail never invoked).

…bled (apache#13074) Addresses abh1sar review: with the incremental feature off, chain/checkpoint metadata was still written to backup_details/vm_instance_details. Introduce a distinct legacy-full mode so the feature-off path stays byte-for-byte legacy: - decideChain returns legacyFull() when nas.backup.incremental.enabled is off: no bitmap generated, no chain id, no parent paths. - takeBackup persists no chain metadata and does not touch active_checkpoint_id for legacy-full backups. - TakeBackupCommand mode carries "legacy-full"; the KVM wrapper accepts it (no bitmap/chain args) and forwards -M legacy-full to nasbackup.sh, which already maps it to make_checkpoint=0. Updates the disabled-switch unit test to assert legacy-full + null bitmap/chain.

…cks (apache#13074) Per abh1sar review: - Remove the unused rebase_backup() operation (-o rebase), its --rebase-target/ --rebase-new-backing arguments and REBASE_* variables — no longer called from the orchestrator. - Restore unconditional sanity_checks(): the QEMU/libvirt version check ran for every operation before; gating it to backup+MODE was incorrect and unrelated to incremental backups.

…ted dump (apache#13074) Per abh1sar review: libvirt's --redefine only needs the checkpoint name and a creationTime (the value need not be accurate — checkpoints are ephemeral), so synthesize a minimal <domaincheckpoint> on the fly instead of persisting the full checkpoint-dumpxml next to each backup. Removes the per-backup <bitmap>.checkpoint.xml file. Verified on libvirt 10: create checkpoint -> delete its metadata (simulating a VM restart that wipes the registry while the bitmap persists on the qcow2) -> redefine from the minimal XML succeeds and re-registers the checkpoint. (The EXIT_INCREMENTAL_UNSUPPORTED removal / inline-fallback simplification is a separate follow-up pending the fallback-signal design.)
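As an illustration of the minimal XML this commit message describes, the sketch below builds a `<domaincheckpoint>` document containing only a name and a creationTime. It is an assumption-laden example rather than the PR's code: `CheckpointXmlSketch` is a hypothetical name, the script in the PR may build the XML differently, and whether libvirt accepts such a document on redefine is taken from the discussion above, not verified here.

```java
/** Hypothetical sketch: synthesize a minimal checkpoint XML for --redefine. */
final class CheckpointXmlSketch {
    /** Escapes the characters that are special in XML text content. */
    static String escapeXml(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
    }

    /**
     * Builds a document with only the checkpoint name and a creation time
     * in seconds since the epoch; per the review, the time need not be accurate.
     */
    static String minimalCheckpointXml(String checkpointName, long creationTimeEpochSeconds) {
        StringBuilder xml = new StringBuilder();
        xml.append("<domaincheckpoint>\n");
        xml.append("  <name>").append(escapeXml(checkpointName)).append("</name>\n");
        xml.append("  <creationTime>").append(creationTimeEpochSeconds).append("</creationTime>\n");
        xml.append("</domaincheckpoint>\n");
        return xml.toString();
    }
}
```

The resulting text would be written to a temporary file and handed to `virsh checkpoint-create` with `--redefine`; that invocation is not shown because it depends on the host setup.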

…lling (apache#13074) Per abh1sar round-2 review: - When the parent checkpoint can't be re-registered, fall back to a full backup in place (effective_mode=full) and emit an INCREMENTAL_FALLBACK marker on stdout, instead of exiting EXIT_INCREMENTAL_UNSUPPORTED for a caller-driven retry. - Remove the stopped-VM incremental guard — the orchestrator never sends incremental mode for a stopped VM. - bitmap --add failure now fails the backup (instead of EXIT_BITMAP_NOT_SEEDED silently degrading future backups to full), surfacing the underlying problem. - Drop EXIT_INCREMENTAL_UNSUPPORTED / EXIT_BITMAP_NOT_SEEDED from the script. LibvirtTakeBackupCommandWrapper: remove the re-invoke-as-full retry and the two exit-code constants; detect the INCREMENTAL_FALLBACK stdout marker (recording incrementalFallback) and strip it before parsing the backup size.
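The marker handling described for the wrapper might look roughly like the sketch below. The stdout layout is an assumption made for illustration: a line beginning with `INCREMENTAL_FALLBACK` is taken to appear only when the script fell back, and the last remaining line is taken to be the backup size in bytes. `MarkerParseSketch` and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: detect and strip a stdout marker before parsing the size. */
final class MarkerParseSketch {
    static final String MARKER = "INCREMENTAL_FALLBACK";

    static final class Parsed {
        final boolean incrementalFallback;
        final List<String> remainingLines;

        Parsed(boolean incrementalFallback, List<String> remainingLines) {
            this.incrementalFallback = incrementalFallback;
            this.remainingLines = remainingLines;
        }
    }

    /** Removes marker lines and blank lines, remembering whether a marker was seen. */
    static Parsed stripMarker(String stdout) {
        boolean fallback = false;
        List<String> remaining = new ArrayList<>();
        for (String line : stdout.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith(MARKER)) {
                fallback = true;
                continue;
            }
            if (!trimmed.isEmpty()) {
                remaining.add(trimmed);
            }
        }
        return new Parsed(fallback, remaining);
    }

    /** Assumed format: the last non-marker line holds the size in bytes. */
    static long parseSize(Parsed parsed) {
        if (parsed.remainingLines.isEmpty()) {
            throw new IllegalStateException("no size line in script output");
        }
        String last = parsed.remainingLines.get(parsed.remainingLines.size() - 1);
        return Long.parseLong(last);
    }
}
```

Stripping first means the size parser never sees the marker, so a fallback run and a normal run go through the same parsing path.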

… the qcow2 (apache#13074) Per abh1sar review (item 3): the VM's active checkpoint/bitmap can be absent from the qcow2 after a migration even though the orchestrator says it should be there, which previously made the incremental backup-begin fail hard. Before building the incremental, probe the running disk via QMP query-block; if the parent bitmap is not present, fall back to a full backup in place (emit INCREMENTAL_FALLBACK so the wrapper records it as full) instead of failing. Verified on a real CirrOS VM (libvirt 10): bitmap present -> true incremental (marker=0); bitmap absent -> full backup (marker=1, full-size output, rc=0).

jmsperu · 2026-06-22T12:56:13Z

@abh1sar — on the delete-logic / resource-accounting point: I dug into how deleteCheckedBackup interacts with the NAS chain delete and confirmed two concrete problems:

- Tombstone case (live children): the NAS provider keeps the row as delete-pending for chain tracking, but deleteCheckedBackup then calls backupDao.remove(target), which destroys the tombstone the later sweep relies on.
- Leaf + sweep case: the provider physically removes the leaf row plus the swept delete-pending ancestors (N rows), but deleteCheckedBackup re-removes the leaf (already gone → false → spurious failure) and decrements backup count / backup_storage usage for only one backup, leaking the other N−1.

Comparing with SnapshotManagerImpl.deleteSnapshot: the strategy owns storage + chain/state, and the manager decrements based on the post-delete state of the entity rather than blindly removing the row.

Proposed rule (rigid, exactly-once): a backup's backup count + backup_storage usage is decremented once, at the single point its row+file are physically removed — i.e. inside deleteBackupFileAndRow, which runs for the leaf and every swept ancestor. deleteCheckedBackup keeps only the CheckedReservation limit guard and stops calling decrementResourceCount/backupDao.remove for NAS chain backups; a tombstone (delete-pending) is not decremented until the sweep finally removes it.

One open question before I implement — the manager currently decrements for all providers (Veeam/Networker too). To keep their accounting intact while the NAS provider owns chain accounting, do you prefer:
(a) a BackupProvider capability flag (e.g. handlesChainDeleteAccounting(), default false; NAS returns true) so the manager skips its own decrement/remove only for such providers, or
(b) deleteBackup returning the set/count of backups actually removed, so the manager decrements per-removed?

I'm leaning (a) as the smaller, more rigid change — happy to implement whichever you prefer.

…al (apache#13074) Per abh1sar review (delete-logic / resource accounting): deleteCheckedBackup decremented backup count / backup_storage and removed the DB row for ONE backup, but the NAS provider deletes whole chains per call (leaf + swept delete-pending ancestors) and removes those rows itself — double-handling the row (destroying delete-pending tombstones) and under-counting resources for swept ancestors. Fix (exactly-once, owned by the chain provider): - BackupProvider.handlesChainDeleteResourceAccounting() (default false). - NASBackupProvider overrides it true and decrements backup + backup_storage at the single physical-removal choke-point (deleteBackupFileAndRow), which runs for the leaf and every swept ancestor — once per actually-removed backup. A tombstoned (delete-pending) backup is not decremented until it is swept. - deleteCheckedBackup skips its own decrement + row removal for such providers. NASBackupProviderTest: leaf+sweep decrements both removed backups; live-children (tombstone) path decrements none.

jmsperu · 2026-06-22T13:35:31Z

Implemented Option A in a51f335:

- BackupProvider.handlesChainDeleteResourceAccounting() (default false).
- NASBackupProvider overrides it true and decrements backup + backup_storage at the single physical-removal choke-point (deleteBackupFileAndRow), which runs for the leaf and every swept delete-pending ancestor, so accounting is exactly-once per actually-removed backup. A tombstoned (delete-pending) backup is not decremented until it is swept.
- deleteCheckedBackup skips its own decrement + backupDao.remove for such providers (keeps the CheckedReservation limit guard).

Unit tests (NASBackupProviderTest): the leaf+sweep path decrements both removed backups; the live-children (tombstone) path decrements none.

This also fixes the two concrete bugs noted above — the manager no longer destroys the delete-pending tombstone, and swept ancestors are no longer leaked. Happy to switch to option (b) (have deleteBackup return the removed set and decrement in the manager) if you'd prefer that split.
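For readers following the accounting split, here is a simplified sketch of the capability-flag approach. It is not the code in a51f335: the nested `Provider` interface, `Counter`, and both provider classes are hypothetical stand-ins, and only a backup count is tracked (no backup_storage, no reservation guard).

```java
import java.util.List;

/** Hypothetical sketch of exactly-once accounting via a provider capability flag. */
final class ChainDeleteAccountingSketch {
    /** Stand-in for the resource count of backups. */
    static final class Counter {
        int backups;
        Counter(int backups) { this.backups = backups; }
        void decrement() { backups--; }
    }

    interface Provider {
        /** Default false: the manager keeps doing the accounting. */
        default boolean handlesChainDeleteResourceAccounting() { return false; }
        void deleteBackup(String uuid, Counter counter);
    }

    /** Provider that removes one backup and leaves accounting to the manager. */
    static final class SingleBackupProvider implements Provider {
        @Override
        public void deleteBackup(String uuid, Counter counter) {
            // Storage removal would happen here; no accounting.
        }
    }

    /** Provider that removes a leaf plus swept ancestors and accounts for each. */
    static final class ChainProvider implements Provider {
        private final List<String> sweptAncestors;

        ChainProvider(List<String> sweptAncestors) { this.sweptAncestors = sweptAncestors; }

        @Override
        public boolean handlesChainDeleteResourceAccounting() { return true; }

        @Override
        public void deleteBackup(String leafUuid, Counter counter) {
            removeFileAndRow(leafUuid, counter);
            for (String ancestor : sweptAncestors) {
                removeFileAndRow(ancestor, counter);
            }
        }

        /** Single choke-point: one decrement per physically removed backup. */
        private void removeFileAndRow(String uuid, Counter counter) {
            counter.decrement();
        }
    }

    /** Manager side: decrement only when the provider does not own accounting. */
    static void deleteCheckedBackup(Provider provider, String uuid, Counter counter) {
        provider.deleteBackup(uuid, counter);
        if (!provider.handlesChainDeleteResourceAccounting()) {
            counter.decrement();
        }
    }
}
```

With a chain provider that sweeps one ancestor, a single delete call lowers the count by two; with the default provider it lowers it by one, so neither path double-counts.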

docs: add RFC for incremental NAS backup support (KVM)

f2a9202

Adds the design document for incremental NAS backups using QEMU dirty bitmaps and libvirt's backup-begin API. Reduces daily backup storage 80-95% for large VMs. Refs: apache#12899

jmsperu mentioned this pull request Apr 27, 2026

[RFC] Incremental NAS Backup Support for KVM Hypervisor #12899

Open

boring-cyborg Bot added the component:backup label Apr 27, 2026

boring-cyborg Bot added the component:kvm label Apr 27, 2026

jmsperu added 5 commits April 27, 2026 19:07

boring-cyborg Bot added component:integration-test Python Warning... Python code Ahead! labels Apr 27, 2026

jmsperu changed the title ~~[WIP] feat(backup): incremental NAS backup support for KVM~~ feat(backup): incremental NAS backup support for KVM Apr 27, 2026

jmsperu marked this pull request as ready for review April 27, 2026 16:26

winterhazel added this to the 4.23.0 milestone Apr 27, 2026

weizhouapache requested review from abh1sar, Copilot and sureshanaparti April 28, 2026 09:06

Copilot started reviewing on behalf of weizhouapache April 28, 2026 09:12

Copilot AI reviewed Apr 28, 2026


harikrishna-patnala linked an issue Apr 28, 2026 that may be closed by this pull request

[RFC] Incremental NAS Backup Support for KVM Hypervisor #12899

Open

bernardodemarco reviewed Apr 28, 2026


Comment thread docs/rfcs/incremental-nas-backup.md Outdated

abh1sar reviewed Jun 13, 2026


Comment thread plugins/backup/nas/src/main/java/org/apache/cloudstack/backup/NASBackupProvider.java Outdated

github-actions Bot mentioned this pull request Jun 13, 2026

[repo-status] Daily Status Report — June 13, 2026 #13416

Closed

jmsperu added 2 commits June 14, 2026 00:14

github-actions Bot mentioned this pull request Jun 14, 2026

[repo-status] Daily Status Report — June 14, 2026 #13418

Closed

abh1sar requested changes Jun 22, 2026


abh1sar reviewed Jun 22, 2026


jmsperu added 6 commits June 22, 2026 11:58

boring-cyborg Bot added the component:api label Jun 22, 2026

-        if (VirtualMachine.State.Stopped.equals(vm.getState())) {
+        if (VirtualMachine.State.Stopped.equals(vm.getState())) {
+            // Stopped-VM backups use the offline path and do not create checkpoints/bitmaps.
+            // Clear chain metadata so a full backup does not imply a bitmap was created.
+            command.setMode(null);
+            command.setBitmapNew(null);
+            command.setBitmapParent(null);
+            command.setParentPath(null);
