Updates to capacity management #2499

Merged

mike-tutkowski
Member

@mike-tutkowski mike-tutkowski commented Mar 21, 2018

Description

In StorageManagerImpl.storagePoolHasEnoughSpace, we need to update a couple of areas of the algorithm that calculates whether enough space is present when dealing with managed storage:

  1. We can no longer rely on managed storage being exclusively at the zone level. Check whether the storage is managed (not whether it is at the zone level).

  2. Invoke getBytesRequiredForTemplate not only for XenServer when getSupportsResigning resolves to true, but also if using VMware or KVM.
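
The two points above can be read as a small decision rule. The following is only a sketch of that rule as the description states it, not the code in this PR: the class, enum, and method names are hypothetical, and it assumes the extra template accounting applies to managed pools only and that hypervisors other than XenServer, VMware, and KVM are unaffected.

```java
// Illustrative sketch only: the class, enum, and method names are hypothetical
// and do not mirror StorageManagerImpl's real signatures.
final class TemplateAccountingSketch {

    enum Hypervisor { XENSERVER, VMWARE, KVM, OTHER }

    /**
     * Decides whether template bytes should be added to the requested size.
     *
     * @param poolIsManaged     whether the pool is managed storage (point 1:
     *                          the pool's zone or cluster scope is not consulted)
     * @param hypervisor        hypervisor type in use
     * @param supportsResigning what getSupportsResigning would report for the cluster
     */
    static boolean shouldCountTemplateBytes(boolean poolIsManaged,
                                            Hypervisor hypervisor,
                                            boolean supportsResigning) {
        if (!poolIsManaged) {
            return false; // assumption: the extra accounting is for managed storage only
        }
        switch (hypervisor) {
            case XENSERVER:
                return supportsResigning; // XenServer still depends on resigning support
            case VMWARE:
            case KVM:
                return true;              // newly covered, per point 2
            default:
                return false;             // assumption: other hypervisors are unaffected
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldCountTemplateBytes(true, Hypervisor.KVM, false));
        System.out.println(shouldCountTemplateBytes(true, Hypervisor.XENSERVER, false));
    }
}
```

When the rule returns true, the caller would add the result of getBytesRequiredForTemplate to the size being requested from the pool.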

https://issues.apache.org/jira/browse/CLOUDSTACK-10335

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

How Has This Been Tested?

Initially I noticed on VMware and KVM that templates were not being included in the space used for primary storage when that storage is managed. I made the necessary changes (included in this PR) and then checked space used to verify that the new calculated number was now accurate for managed storage when using those hypervisor types.

Checklist:

  • I have read the CONTRIBUTING document.
  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@blueorangutan package

@blueorangutan

@mike-tutkowski a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@mike-tutkowski
Member Author

I plan to write a Marvin test for this, but - in the meantime - wanted to get this PR opened so reviewers could provide comments on the production-focused code.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-1801

@mike-tutkowski mike-tutkowski force-pushed the calculate-storage-space branch from 2b57d32 to 0ab13da Compare March 22, 2018 18:50
@borisstoyanov
Contributor

@blueorangutan package

@blueorangutan

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✖debian. JID-1860

@borisstoyanov
Contributor

@blueorangutan test

@blueorangutan

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@mike-tutkowski mike-tutkowski force-pushed the calculate-storage-space branch from 0ab13da to f32fe94 Compare April 3, 2018 03:17
@blueorangutan

Trillian test result (tid-2446)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 93960 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2499-t2446-kvm-centos7.zip
Intermitten failure detected: /marvin/tests/smoke/test_certauthority_root.py
Intermitten failure detected: /marvin/tests/smoke/test_deploy_virtio_scsi_vm.py
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermitten failure detected: /marvin/tests/smoke/test_router_dnsservice.py
Intermitten failure detected: /marvin/tests/smoke/test_routers.py
Intermitten failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermitten failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 63 look OK, 4 have error(s)
Only failed tests results shown below:

Test | Result | Time (s) | Test File
--- | --- | --- | ---
test_02_vpc_privategw_static_routes | Failure | 176.74 | test_privategw_acl.py
test_03_vpc_privategw_restart_vpc_cleanup | Failure | 244.54 | test_privategw_acl.py
test_04_rvpc_privategw_static_routes | Failure | 241.89 | test_privategw_acl.py
test_04_restart_network_wo_cleanup | Failure | 3.95 | test_routers.py
test_01_cancel_host_maintenace_with_no_migration_jobs | Failure | 0.11 | test_host_maintenance.py
test_02_cancel_host_maintenace_with_migration_jobs | Error | 873.09 | test_host_maintenance.py
test_hostha_enable_ha_when_host_disconected | Error | 940.00 | test_hostha_kvm.py
test_hostha_enable_ha_when_host_in_maintenance | Error | 4.58 | test_hostha_kvm.py

@mike-tutkowski
Member Author

I'm pretty sure none of those test failures has to do with this PR. The PR code relates only to managed storage (which none of those tests test). On top of it, the code is really concerned with somewhat of a corner case in managed storage (which none of those tests would test either).

@borisstoyanov
Contributor

Yes @mike-tutkowski, I think that's absolutely valid. It makes me sad to see these random failures occasionally... :(
@blueorangutan test

@blueorangutan

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@mike-tutkowski mike-tutkowski force-pushed the calculate-storage-space branch from f32fe94 to 42ef44b Compare April 5, 2018 15:39
Member

@rafaelweingartner rafaelweingartner left a comment

Everything seems to be OK here. There is room for improvement (code extraction and unit tests), though.

// This next call leads to CloudStack asking how many more bytes it will need for the template (if the template is
// already stored on the primary storage, then the answer is 0).

if (clusterId != null && _clusterDao.getSupportsResigning(clusterId)) {
totalAskingSize += getBytesRequiredForTemplate(tmpl, pool);
if (clusterId != null) {
Member

@rafaelweingartner rafaelweingartner Apr 5, 2018

Would you mind extracting the block of this IF condition to a method? This would allow proper documentation and unit tests.

If the resigning is not supported this new method can return 0 as the value to be added to the totalAskingSize variable.
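
A minimal sketch of what such an extraction could look like, following only the wording of this suggestion. The class and method names are hypothetical, the template-size lookup is passed in as a supplier so the helper can be unit tested without a storage pool, and this is not necessarily the method that ended up in the PR.

```java
import java.util.function.LongSupplier;

// Hypothetical helper following the review suggestion; not the merged code.
final class ResigningTemplateBytesSketch {

    /**
     * Returns the bytes to add to totalAskingSize for the template, or 0 when
     * there is no cluster or the cluster does not support resigning.
     *
     * @param clusterId                cluster the pool belongs to, or null
     * @param supportsResigning        what getSupportsResigning would report
     * @param bytesRequiredForTemplate stand-in for getBytesRequiredForTemplate(tmpl, pool)
     */
    static long templateBytesOrZero(Long clusterId,
                                    boolean supportsResigning,
                                    LongSupplier bytesRequiredForTemplate) {
        if (clusterId == null || !supportsResigning) {
            return 0L; // nothing extra to account for
        }
        return bytesRequiredForTemplate.getAsLong();
    }

    public static void main(String[] args) {
        long totalAskingSize = 1_000L;
        // The caller always adds the result; the helper decides whether it is 0.
        totalAskingSize += templateBytesOrZero(1L, true, () -> 500L);
        totalAskingSize += templateBytesOrZero(1L, false, () -> 500L);
        System.out.println(totalAskingSize);
    }
}
```

With the condition moved into the helper, the caller no longer needs the IF block, and each branch can be covered by a unit test.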

Member Author

Done

@mike-tutkowski mike-tutkowski force-pushed the calculate-storage-space branch from 42ef44b to 6047333 Compare April 5, 2018 17:49
@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-1878

@blueorangutan

Trillian test result (tid-2464)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 109766 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2499-t2464-kvm-centos7.zip
Intermitten failure detected: /marvin/tests/smoke/test_certauthority_root.py
Intermitten failure detected: /marvin/tests/smoke/test_loadbalance.py
Intermitten failure detected: /marvin/tests/smoke/test_primary_storage.py
Intermitten failure detected: /marvin/tests/smoke/test_public_ip_range.py
Intermitten failure detected: /marvin/tests/smoke/test_routers.py
Intermitten failure detected: /marvin/tests/smoke/test_snapshots.py
Intermitten failure detected: /marvin/tests/smoke/test_templates.py
Intermitten failure detected: /marvin/tests/smoke/test_usage.py
Intermitten failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermitten failure detected: /marvin/tests/smoke/test_volumes.py
Intermitten failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermitten failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermitten failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 58 look OK, 9 have error(s)
Only failed tests results shown below:

Test | Result | Time (s) | Test File
--- | --- | --- | ---
test_01_add_primary_storage_disabled_host | Error | 0.81 | test_primary_storage.py
test_01_primary_storage_nfs | Error | 0.18 | test_primary_storage.py
ContextSuite context=TestStorageTags>:setup | Error | 0.32 | test_primary_storage.py
test_04_restart_network_wo_cleanup | Failure | 2.88 | test_routers.py
test_02_list_snapshots_with_removed_data_store | Error | 1.21 | test_snapshots.py
test_04_extract_template | Failure | 128.32 | test_templates.py
ContextSuite context=TestISOUsage>:setup | Error | 0.00 | test_usage.py
test_08_migrate_vm | Error | 36.67 | test_vm_life_cycle.py
test_06_download_detached_volume | Failure | 138.78 | test_volumes.py
test_01_cancel_host_maintenace_with_no_migration_jobs | Failure | 0.20 | test_host_maintenance.py
test_02_cancel_host_maintenace_with_migration_jobs | Error | 3.63 | test_host_maintenance.py
test_hostha_enable_ha_when_host_in_maintenance | Error | 2.02 | test_hostha_kvm.py

@mike-tutkowski mike-tutkowski force-pushed the calculate-storage-space branch 2 times, most recently from a569d4f to 8c4a69b Compare April 7, 2018 06:44
@mike-tutkowski
Member Author

I've added an integration test.

@mike-tutkowski
Member Author

All test errors seem inapplicable to this PR. Here are some examples:

test_primary_storage.py: errorText:Failed to add data store: Storage pool nfs://10.2.0.16/acs/primary/pr2499-t2464-kvm-centos7/marvin_pri1 already in use by another pod (id=1)\n']

test_snapshots.py: errorText:Failed to add data store: Storage pool nfs://10.2.0.16/acs/primary/pr2499-t2464-kvm-centos7/nfs2 already in use by another pod (id=1)\n']

test_templates.py: 'AssertionError: Extract Template Failed with invalid URL http://192.168.100.96/userdata/99b8334e-ecaa-405b-9168-e902981a3c40.qcow2 (template id: 8cc43b7f-00e7-4250-acbc-53be1de58627)\n']

test_vm_life_cycle.py: errortext : u'Cannot migrate VM, destination host is not in correct state, has status: Up, state: Disabled'}, accountid : u'c600e427-38a5-11e8-a6b6-06db8e010701'}\n"]

test_volumes.py: 'AssertionError: Extract Volume Failed with invalid URL http://192.168.100.96/userdata/c146f89d-12e8-4a34-8087-79e66e110239.qcow2 (vol id: ab60d379-a5d3-471a-b17f-7df204e48e53)\n']

@blueorangutan

Trillian test result (tid-2470)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 103783 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2499-t2470-kvm-centos7.zip
Intermitten failure detected: /marvin/tests/smoke/test_certauthority_root.py
Intermitten failure detected: /marvin/tests/smoke/test_primary_storage.py
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermitten failure detected: /marvin/tests/smoke/test_routers.py
Intermitten failure detected: /marvin/tests/smoke/test_snapshots.py
Intermitten failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermitten failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermitten failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 60 look OK, 7 have error(s)
Only failed tests results shown below:

Test | Result | Time (s) | Test File
--- | --- | --- | ---
test_01_add_primary_storage_disabled_host | Error | 0.64 | test_primary_storage.py
test_01_primary_storage_nfs | Error | 0.08 | test_primary_storage.py
ContextSuite context=TestStorageTags>:setup | Error | 0.14 | test_primary_storage.py
test_02_vpc_privategw_static_routes | Failure | 258.12 | test_privategw_acl.py
test_04_rvpc_privategw_static_routes | Failure | 307.21 | test_privategw_acl.py
test_04_restart_network_wo_cleanup | Failure | 4.07 | test_routers.py
test_02_list_snapshots_with_removed_data_store | Error | 1.11 | test_snapshots.py
test_08_migrate_vm | Error | 21.72 | test_vm_life_cycle.py
test_01_cancel_host_maintenace_with_no_migration_jobs | Failure | 1.10 | test_host_maintenance.py
test_02_cancel_host_maintenace_with_migration_jobs | Error | 2.26 | test_host_maintenance.py
test_hostha_enable_ha_when_host_in_maintenance | Error | 3.50 | test_hostha_kvm.py

@mike-tutkowski mike-tutkowski force-pushed the calculate-storage-space branch from 8c4a69b to f527eae Compare April 10, 2018 19:34
@mike-tutkowski
Member Author

These test failures do not seem to be related to this PR. Here are some examples of the errors:

test_primary_storage.py:
errorText:Primary storage with id 5 cannot be disabled.
errorText:Failed to add data store: Storage pool nfs://10.2.0.16/acs/primary/pr2499-t2470-kvm-centos7/marvin_pri1 already in use by another pod (id=1)\n']

test_snapshots.py:
errorText:Failed to add data store: Storage pool nfs://10.2.0.16/acs/primary/pr2499-t2470-kvm-centos7/nfs2 already in use by another pod (id=1)\n']

test_vm_life_cycle.py:
errortext : u'Cannot migrate VM, destination host is not in correct state, has status: Up, state: Disabled'}, accountid : u'be964da7-397f-11e8-a179-06965801071a'}\n"]

@borisstoyanov
Contributor

Thanks for the integration test, @mike-tutkowski. Let me repackage and run them again.
@blueorangutan package

@blueorangutan

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-1899

@borisstoyanov
Contributor

@blueorangutan test

@blueorangutan

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-2490)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 85871 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2499-t2490-kvm-centos7.zip
Intermitten failure detected: /marvin/tests/smoke/test_certauthority_root.py
Intermitten failure detected: /marvin/tests/smoke/test_primary_storage.py
Intermitten failure detected: /marvin/tests/smoke/test_routers.py
Intermitten failure detected: /marvin/tests/smoke/test_snapshots.py
Intermitten failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermitten failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermitten failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 61 look OK, 6 have error(s)
Only failed tests results shown below:

Test | Result | Time (s) | Test File
--- | --- | --- | ---
test_01_add_primary_storage_disabled_host | Error | 0.65 | test_primary_storage.py
test_01_primary_storage_nfs | Error | 0.12 | test_primary_storage.py
ContextSuite context=TestStorageTags>:setup | Error | 0.23 | test_primary_storage.py
test_04_restart_network_wo_cleanup | Failure | 3.00 | test_routers.py
test_02_list_snapshots_with_removed_data_store | Error | 1.18 | test_snapshots.py
test_08_migrate_vm | Error | 16.91 | test_vm_life_cycle.py
test_01_cancel_host_maintenace_with_no_migration_jobs | Failure | 0.11 | test_host_maintenance.py
test_02_cancel_host_maintenace_with_migration_jobs | Error | 3.34 | test_host_maintenance.py
test_hostha_enable_ha_when_host_in_maintenance | Error | 2.71 | test_hostha_kvm.py

@mike-tutkowski
Member Author

The test environment is having an issue when we try to put an NFS-based primary storage in maintenance mode. In test_primary_storage.py, the first error is related to that and then we later see other errors where adding a new primary storage with the same name fails because it's already in use (presumably we were originally going to delete the primary storage after putting it in maintenance mode, but putting it in maintenance mode failed).

Is this error scenario unique to this PR? It seems like the code in this PR wouldn't be responsible for such a situation.

On the up side, both Jenkins and Travis passed.

test_primary_storage.py:

errorText:Primary storage with id 5 cannot be disabled. Storage pool state : Maintenance\n'
errorText:Failed to add data store: Storage pool nfs://10.2.0.16/acs/primary/pr2499-t2490-kvm-centos7/marvin_pri1 already in use by another pod (id=1)\n' (two of these errors)

test_routers.py:

'AssertionError: Check uptime is less than 3 mins or not\n'

test_snapshots.py:

errorText:Failed to add data store: Storage pool nfs://10.2.0.16/acs/primary/pr2499-t2490-kvm-centos7/nfs2 already in use by another pod (id=1)\n'

test_vm_life_cycle.py:

errortext : u'Cannot migrate VM, destination host is not in correct state, has status: Up, state: Disabled'

@mike-tutkowski
Member Author

I looked at several of the error messages for a recent test run of #2486 and it seems the list is quite similar to the list of error messages for this PR. As such, I suggest it is likely that none of the errors that are listed for this test run are related to this PR.

Contributor

@borisstoyanov borisstoyanov left a comment

LGTM

@mike-tutkowski
Member Author

Two LGTMs and regression tests looking good, so merging.

@mike-tutkowski mike-tutkowski merged commit 740adf4 into apache:master Apr 13, 2018
@blueorangutan

@mike-tutkowski a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@mike-tutkowski mike-tutkowski deleted the calculate-storage-space branch April 13, 2018 20:25
@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-1927

@rohityadavcloud rohityadavcloud added this to the 4.12.0.0 milestone May 3, 2018
@rohityadavcloud rohityadavcloud mentioned this pull request May 3, 2018
12 tasks
@rohityadavcloud
Member

@mike-tutkowski I scanned PRs merged on master but not 4.11 and came across this PR. Do you think this would be useful for 4.11? If so, can you help create a backport PR for 4.11?

@mike-tutkowski
Member Author

Yes, @rhtyd, you are correct that this PR is a good candidate for 4.11.1. I can create a PR to back port it.

@rohityadavcloud
Member

Thanks @mike-tutkowski, looking forward to your port-PR.

@mike-tutkowski
Member Author

Hey @rhtyd - What's the official process for backporting a PR like this that has already been merged? Should I just cherry pick the commit to 4.11.1? Clearly there's not much tracking going on that way, but I'm not sure how we officially do this. Thanks!

@rohityadavcloud
Member

rohityadavcloud commented May 7, 2018

@mike-tutkowski yes, please cherry-pick or manually port your changes to the 4.11 branch. You can then either create a new PR or push to this backport PR, #2621 (both Rafael and I have pushed changes to that PR). There are no official guidelines around this; generally, we should send bugfix PRs towards the LTS branch or the previous release's branch.

@mike-tutkowski
Member Author

I went ahead and pushed the cherry-picked commit to #2621.

6 participants