Updates to capacity management #2499

mike-tutkowski · 2018-03-21T20:37:31Z

Description

In StorageManagerImpl.storagePoolHasEnoughSpace, we need to update a couple areas of the algorithm that calculates if enough space is present when dealing with managed storage:

We can no longer rely on managed storage being exclusively at the zone level. Check whether the storage is managed (not whether it is at the zone level).
Invoke getBytesRequiredForTemplate not only for XenServer when getSupportsResigning resolves to true, but also if using VMware or KVM.

https://issues.apache.org/jira/browse/CLOUDSTACK-10335

Types of changes

Breaking change (fix or feature that would cause existing functionality to change)
New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Enhancement (improves an existing feature and functionality)
Cleanup (Code refactoring and cleanup, that may add test cases)

How Has This Been Tested?

Initially I noticed on VMware and KVM that templates were not being included in the space used for primary storage when that storage is managed. I made the necessary changes (included in this PR) and then checked space used to verify that the new calculated number was now accurate for managed storage when using those hypervisor types.

Checklist:

I have read the CONTRIBUTING document.
My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have added tests to cover my changes.
All new and existing tests passed.

@blueorangutan package

blueorangutan · 2018-03-21T20:37:59Z

@mike-tutkowski a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

mike-tutkowski · 2018-03-21T20:39:18Z

I plan to write a Marvin test for this, but in the meantime I wanted to get this PR opened so reviewers could provide comments on the production-focused code.

blueorangutan · 2018-03-21T21:06:09Z

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-1801

borisstoyanov · 2018-04-02T09:37:30Z

@blueorangutan package

blueorangutan · 2018-04-02T09:38:16Z

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

blueorangutan · 2018-04-02T10:44:17Z

Packaging result: ✔centos6 ✔centos7 ✖debian. JID-1860

borisstoyanov · 2018-04-02T11:07:49Z

@blueorangutan test

blueorangutan · 2018-04-02T11:08:16Z

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan · 2018-04-03T13:45:24Z

Trillian test result (tid-2446)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 93960 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2499-t2446-kvm-centos7.zip
Intermitten failure detected: /marvin/tests/smoke/test_certauthority_root.py
Intermitten failure detected: /marvin/tests/smoke/test_deploy_virtio_scsi_vm.py
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermitten failure detected: /marvin/tests/smoke/test_router_dnsservice.py
Intermitten failure detected: /marvin/tests/smoke/test_routers.py
Intermitten failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermitten failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 63 look OK, 4 have error(s)
Only failed tests results shown below:

Test	Result	Time (s)	Test File
test_02_vpc_privategw_static_routes	`Failure`	176.74	test_privategw_acl.py
test_03_vpc_privategw_restart_vpc_cleanup	`Failure`	244.54	test_privategw_acl.py
test_04_rvpc_privategw_static_routes	`Failure`	241.89	test_privategw_acl.py
test_04_restart_network_wo_cleanup	`Failure`	3.95	test_routers.py
test_01_cancel_host_maintenace_with_no_migration_jobs	`Failure`	0.11	test_host_maintenance.py
test_02_cancel_host_maintenace_with_migration_jobs	`Error`	873.09	test_host_maintenance.py
test_hostha_enable_ha_when_host_disconected	`Error`	940.00	test_hostha_kvm.py
test_hostha_enable_ha_when_host_in_maintenance	`Error`	4.58	test_hostha_kvm.py

mike-tutkowski · 2018-04-03T19:25:27Z

I'm pretty sure none of those test failures has to do with this PR. The PR code relates only to managed storage (which none of those tests test). On top of it, the code is really concerned with somewhat of a corner case in managed storage (which none of those tests would test either).

borisstoyanov · 2018-04-03T21:05:06Z

Yes @mike-tutkowski I think that's absolutely valid. It makes me sad to see these random failures occasionally... :(
@blueorangutan test

blueorangutan · 2018-04-03T21:05:37Z

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

rafaelweingartner

Everything seems to be ok here. There is room for improvement (code extraction and unit tests), though.

rafaelweingartner · 2018-04-05T15:38:48Z

server/src/main/java/com/cloud/storage/StorageManagerImpl.java

                        // This next call leads to CloudStack asking how many more bytes it will need for the template (if the template is
                        // already stored on the primary storage, then the answer is 0).

-                        if (clusterId != null && _clusterDao.getSupportsResigning(clusterId)) {
-                            totalAskingSize += getBytesRequiredForTemplate(tmpl, pool);
+                        if (clusterId != null) {


Would you mind extracting the block of this IF condition to a method? This would allow proper documentation and unit tests.

If the resigning is not supported this new method can return 0 as the value to be added to the totalAskingSize variable.
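A minimal sketch of what such an extracted method could look like, not the code that was merged. The names `HypervisorKind`, `TemplateSpaceSketch` and `bytesToAddForTemplate` are hypothetical, and the `LongSupplier` stands in for the real `getBytesRequiredForTemplate(tmpl, pool)` call. The condition follows the PR description (XenServer only when resigning is supported, VMware and KVM always) and is an assumption about how the final code is structured.

```java
import java.util.function.LongSupplier;

// Hypothetical stand-in for CloudStack's hypervisor type; only the
// hypervisors named in this PR are listed explicitly.
enum HypervisorKind { XENSERVER, VMWARE, KVM, OTHER }

final class TemplateSpaceSketch {
    private TemplateSpaceSketch() {
    }

    /**
     * Returns the number of bytes to add to totalAskingSize for a template.
     *
     * Returns 0 when there is no cluster, or when the hypervisor is XenServer
     * without resigning support, or when the hypervisor is none of
     * XenServer, VMware, or KVM.
     *
     * @param bytesRequiredForTemplate assumed wrapper around
     *        getBytesRequiredForTemplate(tmpl, pool); only invoked when needed
     */
    static long bytesToAddForTemplate(Long clusterId,
                                      HypervisorKind hypervisor,
                                      boolean supportsResigning,
                                      LongSupplier bytesRequiredForTemplate) {
        if (clusterId == null) {
            return 0L;
        }

        boolean templateCountsAgainstPool =
                (hypervisor == HypervisorKind.XENSERVER && supportsResigning)
                        || hypervisor == HypervisorKind.VMWARE
                        || hypervisor == HypervisorKind.KVM;

        return templateCountsAgainstPool ? bytesRequiredForTemplate.getAsLong() : 0L;
    }
}
```

With a method shaped like this, the caller reduces to `totalAskingSize += bytesToAddForTemplate(...)`, and each branch can be unit tested without a running management server.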

DaanHoogland · 2018-04-06T09:12:07Z

@blueorangutan package

blueorangutan · 2018-04-06T09:12:36Z

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

DaanHoogland · 2018-04-06T09:40:10Z

@blueorangutan test

blueorangutan · 2018-04-06T09:40:36Z

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan · 2018-04-06T09:40:46Z

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-1878

blueorangutan · 2018-04-06T14:46:19Z

Trillian test result (tid-2464)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 109766 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2499-t2464-kvm-centos7.zip
Intermitten failure detected: /marvin/tests/smoke/test_certauthority_root.py
Intermitten failure detected: /marvin/tests/smoke/test_loadbalance.py
Intermitten failure detected: /marvin/tests/smoke/test_primary_storage.py
Intermitten failure detected: /marvin/tests/smoke/test_public_ip_range.py
Intermitten failure detected: /marvin/tests/smoke/test_routers.py
Intermitten failure detected: /marvin/tests/smoke/test_snapshots.py
Intermitten failure detected: /marvin/tests/smoke/test_templates.py
Intermitten failure detected: /marvin/tests/smoke/test_usage.py
Intermitten failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermitten failure detected: /marvin/tests/smoke/test_volumes.py
Intermitten failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermitten failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermitten failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 58 look OK, 9 have error(s)
Only failed tests results shown below:

Test	Result	Time (s)	Test File
test_01_add_primary_storage_disabled_host	`Error`	0.81	test_primary_storage.py
test_01_primary_storage_nfs	`Error`	0.18	test_primary_storage.py
ContextSuite context=TestStorageTags>:setup	`Error`	0.32	test_primary_storage.py
test_04_restart_network_wo_cleanup	`Failure`	2.88	test_routers.py
test_02_list_snapshots_with_removed_data_store	`Error`	1.21	test_snapshots.py
test_04_extract_template	`Failure`	128.32	test_templates.py
ContextSuite context=TestISOUsage>:setup	`Error`	0.00	test_usage.py
test_08_migrate_vm	`Error`	36.67	test_vm_life_cycle.py
test_06_download_detached_volume	`Failure`	138.78	test_volumes.py
test_01_cancel_host_maintenace_with_no_migration_jobs	`Failure`	0.20	test_host_maintenance.py
test_02_cancel_host_maintenace_with_migration_jobs	`Error`	3.63	test_host_maintenance.py
test_hostha_enable_ha_when_host_in_maintenance	`Error`	2.02	test_hostha_kvm.py

mike-tutkowski · 2018-04-07T06:45:38Z

I've added an integration test.

mike-tutkowski · 2018-04-07T07:10:58Z

All test errors seem inapplicable to this PR. Here are some examples:

test_primary_storage.py: errorText:Failed to add data store: Storage pool nfs://10.2.0.16/acs/primary/pr2499-t2464-kvm-centos7/marvin_pri1 already in use by another pod (id=1)\n']

test_snapshots.py: errorText:Failed to add data store: Storage pool nfs://10.2.0.16/acs/primary/pr2499-t2464-kvm-centos7/nfs2 already in use by another pod (id=1)\n']

test_templates.py: 'AssertionError: Extract Template Failed with invalid URL http://192.168.100.96/userdata/99b8334e-ecaa-405b-9168-e902981a3c40.qcow2 (template id: 8cc43b7f-00e7-4250-acbc-53be1de58627)\n']

test_vm_life_cycle.py: errortext : u'Cannot migrate VM, destination host is not in correct state, has status: Up, state: Disabled'}, accountid : u'c600e427-38a5-11e8-a6b6-06db8e010701'}\n"]

test_volumes.py: 'AssertionError: Extract Volume Failed with invalid URL http://192.168.100.96/userdata/c146f89d-12e8-4a34-8087-79e66e110239.qcow2 (vol id: ab60d379-a5d3-471a-b17f-7df204e48e53)\n']

blueorangutan · 2018-04-07T14:56:36Z

Trillian test result (tid-2470)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 103783 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2499-t2470-kvm-centos7.zip
Intermitten failure detected: /marvin/tests/smoke/test_certauthority_root.py
Intermitten failure detected: /marvin/tests/smoke/test_primary_storage.py
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermitten failure detected: /marvin/tests/smoke/test_routers.py
Intermitten failure detected: /marvin/tests/smoke/test_snapshots.py
Intermitten failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermitten failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermitten failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 60 look OK, 7 have error(s)
Only failed tests results shown below:

Test	Result	Time (s)	Test File
test_01_add_primary_storage_disabled_host	`Error`	0.64	test_primary_storage.py
test_01_primary_storage_nfs	`Error`	0.08	test_primary_storage.py
ContextSuite context=TestStorageTags>:setup	`Error`	0.14	test_primary_storage.py
test_02_vpc_privategw_static_routes	`Failure`	258.12	test_privategw_acl.py
test_04_rvpc_privategw_static_routes	`Failure`	307.21	test_privategw_acl.py
test_04_restart_network_wo_cleanup	`Failure`	4.07	test_routers.py
test_02_list_snapshots_with_removed_data_store	`Error`	1.11	test_snapshots.py
test_08_migrate_vm	`Error`	21.72	test_vm_life_cycle.py
test_01_cancel_host_maintenace_with_no_migration_jobs	`Failure`	1.10	test_host_maintenance.py
test_02_cancel_host_maintenace_with_migration_jobs	`Error`	2.26	test_host_maintenance.py
test_hostha_enable_ha_when_host_in_maintenance	`Error`	3.50	test_hostha_kvm.py

mike-tutkowski · 2018-04-10T21:55:27Z

These test failures do not seem to be related to this PR. Here are some examples of the errors:

test_primary_storage.py:
errorText:Primary storage with id 5 cannot be disabled.
errorText:Failed to add data store: Storage pool nfs://10.2.0.16/acs/primary/pr2499-t2470-kvm-centos7/marvin_pri1 already in use by another pod (id=1)\n']

test_snapshots.py:
errorText:Failed to add data store: Storage pool nfs://10.2.0.16/acs/primary/pr2499-t2470-kvm-centos7/nfs2 already in use by another pod (id=1)\n']

test_vm_life_cycle.py:
errortext : u'Cannot migrate VM, destination host is not in correct state, has status: Up, state: Disabled'}, accountid : u'be964da7-397f-11e8-a179-06965801071a'}\n"]

borisstoyanov · 2018-04-11T07:19:52Z

Thanks for the integration test @mike-tutkowski, let me repackage and run them again
@blueorangutan package

blueorangutan · 2018-04-11T07:20:46Z

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

blueorangutan · 2018-04-11T07:45:52Z

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-1899

borisstoyanov · 2018-04-11T08:35:41Z

@blueorangutan test

blueorangutan · 2018-04-11T08:36:45Z

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan · 2018-04-12T08:57:24Z

Trillian test result (tid-2490)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 85871 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2499-t2490-kvm-centos7.zip
Intermitten failure detected: /marvin/tests/smoke/test_certauthority_root.py
Intermitten failure detected: /marvin/tests/smoke/test_primary_storage.py
Intermitten failure detected: /marvin/tests/smoke/test_routers.py
Intermitten failure detected: /marvin/tests/smoke/test_snapshots.py
Intermitten failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermitten failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermitten failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 61 look OK, 6 have error(s)
Only failed tests results shown below:

Test	Result	Time (s)	Test File
test_01_add_primary_storage_disabled_host	`Error`	0.65	test_primary_storage.py
test_01_primary_storage_nfs	`Error`	0.12	test_primary_storage.py
ContextSuite context=TestStorageTags>:setup	`Error`	0.23	test_primary_storage.py
test_04_restart_network_wo_cleanup	`Failure`	3.00	test_routers.py
test_02_list_snapshots_with_removed_data_store	`Error`	1.18	test_snapshots.py
test_08_migrate_vm	`Error`	16.91	test_vm_life_cycle.py
test_01_cancel_host_maintenace_with_no_migration_jobs	`Failure`	0.11	test_host_maintenance.py
test_02_cancel_host_maintenace_with_migration_jobs	`Error`	3.34	test_host_maintenance.py
test_hostha_enable_ha_when_host_in_maintenance	`Error`	2.71	test_hostha_kvm.py

mike-tutkowski · 2018-04-12T20:00:11Z

The test environment is having an issue when we try to put an NFS-based primary storage in maintenance mode. In test_primary_storage.py, the first error is related to that and then we later see other errors where adding a new primary storage with the same name fails because it's already in use (presumably we were originally going to delete the primary storage after putting it in maintenance mode, but putting it in maintenance mode failed).

Is this error scenario unique to this PR? It seems like the code in this PR wouldn't be responsible for such a situation.

On the up side, both Jenkins and Travis passed.

test_primary_storage.py:

errorText:Primary storage with id 5 cannot be disabled. Storage pool state : Maintenance\n'
errorText:Failed to add data store: Storage pool nfs://10.2.0.16/acs/primary/pr2499-t2490-kvm-centos7/marvin_pri1 already in use by another pod (id=1)\n' (two of these errors)

test_routers.py:

'AssertionError: Check uptime is less than 3 mins or not\n'

test_snapshots.py:

errorText:Failed to add data store: Storage pool nfs://10.2.0.16/acs/primary/pr2499-t2490-kvm-centos7/nfs2 already in use by another pod (id=1)\n'

test_vm_life_cycle.py:

errortext : u'Cannot migrate VM, destination host is not in correct state, has status: Up, state: Disabled'

mike-tutkowski · 2018-04-12T20:08:48Z

I looked at several of the error messages for a recent test run of #2486 and it seems the list is quite similar to the list of error messages for this PR. As such, I suggest it is likely that none of the errors that are listed for this test run are related to this PR.

borisstoyanov

LGTM

mike-tutkowski · 2018-04-13T20:20:12Z

Two LGTMs and regression tests looking good, so merging.

blueorangutan · 2018-04-13T20:21:08Z

@mike-tutkowski a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

blueorangutan · 2018-04-13T20:44:32Z

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-1927

rohityadavcloud · 2018-05-04T08:15:22Z

@mike-tutkowski I scanned PRs merged on master but not 4.11 and came across this PR. Do you think this would be useful for 4.11? If so, can you help create a backport PR for 4.11?

mike-tutkowski · 2018-05-05T03:04:07Z

Yes, @rhtyd, you are correct that this PR is a good candidate for 4.11.1. I can create a PR to backport it.

rohityadavcloud · 2018-05-05T16:43:36Z

Thanks @mike-tutkowski, looking forward to your port-PR.

mike-tutkowski · 2018-05-07T18:57:03Z

Hey @rhtyd - What's the official process for backporting a PR like this that has already been merged? Should I just cherry pick the commit to 4.11.1? Clearly there's not much tracking going on that way, but I'm not sure how we officially do this. Thanks!

rohityadavcloud · 2018-05-07T19:07:16Z

@mike-tutkowski yes, please cherry-pick or manually port your changes to the 4.11 branch. You can then either create a new PR or push to this backport PR #2621 (both rafael and I have pushed changes to that PR). There are no official guidelines around this; generally we should send bugfix PRs towards the LTS branch or the previous release's branch.

mike-tutkowski · 2018-05-07T19:26:41Z

I went ahead and pushed the cherry-picked commit to #2621.

mike-tutkowski force-pushed the calculate-storage-space branch from 2b57d32 to 0ab13da Compare March 22, 2018 18:50

mike-tutkowski force-pushed the calculate-storage-space branch from 0ab13da to f32fe94 Compare April 3, 2018 03:17

mike-tutkowski force-pushed the calculate-storage-space branch from f32fe94 to 42ef44b Compare April 5, 2018 15:39

rafaelweingartner approved these changes Apr 5, 2018

mike-tutkowski force-pushed the calculate-storage-space branch from 42ef44b to 6047333 Compare April 5, 2018 17:49

mike-tutkowski force-pushed the calculate-storage-space branch 2 times, most recently from a569d4f to 8c4a69b Compare April 7, 2018 06:44

Updates to capacity management

f527eae

mike-tutkowski force-pushed the calculate-storage-space branch from 8c4a69b to f527eae Compare April 10, 2018 19:34

borisstoyanov approved these changes Apr 13, 2018

mike-tutkowski merged commit 740adf4 into apache:master Apr 13, 2018

mike-tutkowski deleted the calculate-storage-space branch April 13, 2018 20:25

rohityadavcloud added this to the 4.12.0.0 milestone May 3, 2018

rohityadavcloud mentioned this pull request May 3, 2018

Backports for 4.11 branch #2621

Merged

12 tasks
