Network Usage event model adjustments #10755

Merged

winterhazel
Member

@winterhazel winterhazel commented Apr 21, 2025

Description

This PR fixes the issues with the network-related Usage event model described in #10687:

  • NETWORK.CREATE is not published when the network is created. Instead, it is published after the network is implemented, which does not match what Usage expects and can result in duplicate NETWORK.CREATEs
  • NETWORK.UPDATE is published to indicate that the network is implemented only when the method that implements the network is called a second time, such as when deploying a second VM on the network, which does not make much sense
  • no event is published when the network shuts down and goes from Implemented to Allocated
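The intended model for an isolated network can be sketched as a small state machine. This is only an illustration in Python, not the CloudStack code: the `Network` class and its method names are hypothetical, and it assumes that a network starts in Allocated and that NETWORK.UPDATE is published only on an actual state change.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

ALLOCATED = "Allocated"
IMPLEMENTED = "Implemented"


@dataclass
class Network:
    """Toy stand-in for an isolated network (hypothetical, not a CloudStack class)."""
    name: str
    state: str = ALLOCATED
    events: List[Tuple[str, Optional[str]]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # NETWORK.CREATE is published at creation time, not after implementation.
        self._publish("NETWORK.CREATE")

    def _publish(self, event_type: str, state: Optional[str] = None) -> None:
        self.events.append((event_type, state))

    def implement(self) -> None:
        # May be called for every VM deployment; only a real transition emits an event.
        if self.state != IMPLEMENTED:
            self.state = IMPLEMENTED
            self._publish("NETWORK.UPDATE", IMPLEMENTED)

    def shutdown(self) -> None:
        # Implemented -> Allocated, e.g. when the garbage collector shuts the network down.
        if self.state != ALLOCATED:
            self.state = ALLOCATED
            self._publish("NETWORK.UPDATE", ALLOCATED)

    def destroy(self) -> None:
        # Destroying an implemented network first returns it to Allocated.
        self.shutdown()
        self._publish("NETWORK.DELETE")


net = Network("net1")
net.implement()  # first VM deployed
net.implement()  # second VM deployed: no new event
net.shutdown()   # all VMs stopped, network shut down
net.implement()  # a VM is started again
net.destroy()

for event in net.events:
    print(event)
```

The sketch does not cover shared networks, which do not change state when a VM is deployed on them.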

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

How Has This Been Tested?

How did you try to break this feature and the system with this change?

Isolated network:

  1. I created a network: a NETWORK.CREATE was published
  2. I deployed a VM on the network: a NETWORK.UPDATE (Implemented) was published
  3. I deployed a second VM on the network: no network-related event was published
  4. I shut down both VMs and waited until the network was shut down by the garbage collector: a NETWORK.UPDATE (Allocated) was published
  5. I started one of the VMs: a NETWORK.UPDATE (Implemented) was published
  6. I restarted the network (with and without clean-up): no network-related events were published
  7. I destroyed the network: a NETWORK.UPDATE (Allocated) and a NETWORK.DELETE were published

Error on isolated network deployment:

  1. I changed router.template.kvm to a template that does not exist so that the network implementation would fail, created a new network, and tried to deploy a VM on it: only NETWORK.CREATE was published because the implementation failed

Persistent networks:

  1. I created a persistent network: a NETWORK.CREATE and a NETWORK.UPDATE (Implemented) were published
  2. I deployed another VM on the network: no network-related events were published
  3. I destroyed the network: a NETWORK.UPDATE (Allocated) and a NETWORK.DELETE were published

Shared networks:

  1. I deployed a shared network and a VM on it: only the NETWORK.CREATE was published (which is expected, as there was no state change)
  2. I destroyed the network: a NETWORK.DELETE was published

L2 networks:

  1. I deployed an L2 network: a NETWORK.CREATE was published
  2. I deployed a VM on the network: a NETWORK.UPDATE (Implemented) was published
  3. I destroyed the network: a NETWORK.UPDATE (Allocated) and a NETWORK.DELETE were published

Kubernetes:

  1. I deployed a Kubernetes cluster without specifying the network: a NETWORK.CREATE and a NETWORK.UPDATE (Implemented) were published


codecov bot commented Apr 21, 2025

Codecov Report

Attention: Patch coverage is 8.00000% with 23 lines in your changes missing coverage. Please review.

Project coverage is 15.16%. Comparing base (f13cf59) to head (2ce77d3).
Report is 14 commits behind head on 4.19.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...src/main/java/com/cloud/event/UsageEventUtils.java | 0.00% | 15 Missing ⚠️ |
| ...tack/engine/orchestration/NetworkOrchestrator.java | 25.00% | 5 Missing and 1 partial ⚠️ |
| ...ain/java/com/cloud/network/NetworkServiceImpl.java | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               4.19   #10755      +/-   ##
============================================
- Coverage     15.17%   15.16%   -0.01%     
- Complexity    11332    11333       +1     
============================================
  Files          5415     5412       -3     
  Lines        474893   475043     +150     
  Branches      57920    57963      +43     
============================================
+ Hits          72046    72055       +9     
- Misses       394792   394932     +140     
- Partials       8055     8056       +1     
| Flag | Coverage Δ |
|---|---|
| uitests | 4.29% <ø> (+<0.01%) ⬆️ |
| unittests | 15.89% <8.00%> (-0.01%) ⬇️ |

@winterhazel winterhazel marked this pull request as ready for review April 21, 2025 16:55
@winterhazel
Member Author

@blueorangutan package

@blueorangutan

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@winterhazel winterhazel added this to the 4.19.3 milestone Apr 21, 2025
@blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 13123

@sureshanaparti
Contributor

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13141

Contributor

@DaanHoogland DaanHoogland left a comment

CLGTM. We can consider some follow-up as discussed, but this certainly addresses the initial issue.

@DaanHoogland
Contributor

@rajujith, can you see if this alleviates any of your usage issues?

@rajujith

@DaanHoogland This solves the problem for the time being. In the long term, though, we may want to record different usage events, as you mentioned in #10687 (comment), and create two separate usage types, as we already have for running and allocated VMs.

@DaanHoogland
Contributor

> @DaanHoogland This solves the problem for the time being. In the long term, though, we may want to record different usage events, as you mentioned in #10687 (comment), and create two separate usage types, as we already have for running and allocated VMs.

Thanks @rajujith, can you test this with your faulty use-case scenario?

@winterhazel would you agree with that requirement?

@winterhazel
Member Author

winterhazel commented Apr 25, 2025

> @DaanHoogland This solves the problem for the time being. In the long term, though, we may want to record different usage events, as you mentioned in #10687 (comment), and create two separate usage types, as we already have for running and allocated VMs.

@rajujith after looking into how the network Usage event model works (or was supposed to work), I no longer think it is necessary to add separate events for IMPLEMENT and STOP, at least in the context of Usage (it may make sense for the standard events). The network's new state is included in the NETWORK.UPDATE event, which can be used to identify when a network transitioned from Allocated to Implemented and vice versa.

Also, we include the network's state in the usage record during that period. This can be used to distinguish between allocated and implemented usage.

When the running and allocated virtual machine usage types were introduced, we did not have the state column. I'm not sure why it was done that way (two different usage types instead of including the state in the usage record), so a genuine question: is there an advantage to having two separate usage types just to distinguish between allocated and implemented? To me, it seems unnecessary.

@winterhazel
Member Author

@blueorangutan package

@blueorangutan

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@rajujith

> > @DaanHoogland This solves the problem for the time being. In the long term, though, we may want to record different usage events, as you mentioned in #10687 (comment), and create two separate usage types, as we already have for running and allocated VMs.
>
> @rajujith after looking into how the network Usage event model works (or was supposed to work), I no longer think it is necessary to add separate events for IMPLEMENT and STOP, at least in the context of Usage (it may make sense for the standard events). The network's new state is included in the NETWORK.UPDATE event, which can be used to identify when a network transitioned from Allocated to Implemented and vice versa.
>
> Also, we include the network's state in the usage record during that period. This can be used to distinguish between allocated and implemented usage.
>
> When the running and allocated virtual machine usage types were introduced, we did not have the state column. I'm not sure why it was done that way (two different usage types instead of including the state in the usage record), so a genuine question: is there an advantage to having two separate usage types just to distinguish between allocated and implemented? To me, it seems unnecessary.

@winterhazel I can only imagine a scenario where the operator wants to bill Allocated and Implemented networks at different rates or as separate products. If these usages can be differentiated in the API response, that should be enough. A separate usage type makes a clearer distinction in my head, but maybe it's not necessary.
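As a rough illustration of how the state carried by NETWORK.UPDATE could be used to separate the two kinds of usage, the sketch below folds an ordered event list into seconds spent per state. It is not how the Usage server aggregates records: the function name and tuple layout are hypothetical, it assumes a network is Allocated right after NETWORK.CREATE, and the sample timestamps are the ones from the usage_event rows posted in the review further down this thread.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def seconds_per_state(events):
    """Sum the time one network spent in each state.

    `events` is a chronologically ordered list of (created, type, state)
    tuples; `state` is only meaningful for NETWORK.UPDATE.
    """
    totals = defaultdict(int)
    current_state = None
    since = None
    for created, event_type, state in events:
        ts = datetime.strptime(created, FMT)
        if current_state is not None:
            totals[current_state] += int((ts - since).total_seconds())
        if event_type == "NETWORK.CREATE":
            current_state = "Allocated"  # assumption: new networks start Allocated
        elif event_type == "NETWORK.UPDATE":
            current_state = state
        elif event_type == "NETWORK.DELETE":
            current_state = None  # no usage after deletion
        since = ts
    return dict(totals)


events = [
    ("2025-04-25 08:03:40", "NETWORK.CREATE", None),
    ("2025-04-25 08:04:57", "NETWORK.UPDATE", "Implemented"),
    ("2025-04-25 08:08:58", "NETWORK.UPDATE", "Allocated"),
    ("2025-04-25 08:11:14", "NETWORK.UPDATE", "Implemented"),
    ("2025-04-25 08:14:31", "NETWORK.UPDATE", "Allocated"),
    ("2025-04-25 08:14:33", "NETWORK.DELETE", None),
]

print(seconds_per_state(events))  # {'Allocated': 215, 'Implemented': 438}
```

An operator who wanted different rates could multiply each total by its own price without needing a second usage type.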

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13171

@rohityadavcloud
Member

@blueorangutan test

@blueorangutan

@rohityadavcloud a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian Build Failed (tid-13148)

@blueorangutan

[SF] Trillian Build Failed (tid-13150)


@rajujith rajujith left a comment

LGTM.

Based on my tests, duplicate events are no longer created, and duplicate usage records are no longer generated either.

mysql> select * from usage_event where  resource_id=204 and type in ('NETWORK.CREATE','NETWORK.DELETE','NETWORK.UPDATE');
+----+----------------+------------+---------------------+---------+-------------+---------------+-------------+-------------+------+---------------+-----------+--------------+
| id | type           | account_id | created             | zone_id | resource_id | resource_name | offering_id | template_id | size | resource_type | processed | virtual_size |
+----+----------------+------------+---------------------+---------+-------------+---------------+-------------+-------------+------+---------------+-----------+--------------+
|  2 | NETWORK.CREATE |          2 | 2025-04-25 08:03:40 |       1 |         204 | net1          |          10 |        NULL | NULL | NULL          |         1 |         NULL |
|  6 | NETWORK.UPDATE |          2 | 2025-04-25 08:04:57 |       1 |         204 | net1          |          10 |        NULL | NULL | Implemented   |         1 |         NULL |
| 11 | NETWORK.UPDATE |          2 | 2025-04-25 08:08:58 |       1 |         204 | net1          |          10 |        NULL | NULL | Allocated     |         1 |         NULL |
| 12 | NETWORK.UPDATE |          2 | 2025-04-25 08:11:14 |       1 |         204 | net1          |          10 |        NULL | NULL | Implemented   |         1 |         NULL |
| 19 | NETWORK.UPDATE |          2 | 2025-04-25 08:14:31 |       1 |         204 | net1          |          10 |        NULL | NULL | Allocated     |         1 |         NULL |
| 21 | NETWORK.DELETE |          2 | 2025-04-25 08:14:33 |       1 |         204 | net1          |          10 |        NULL | NULL | NULL          |         1 |         NULL |
+----+----------------+------------+---------------------+---------+-------------+---------------+-------------+-------------+------+---------------+-----------+--------------+
6 rows in set (0.00 sec)

mysql> select account_id,usage_id,usage_type,start_date,end_date from cloud_usage.cloud_usage where usage_id=204;
+------------+----------+------------+---------------------+---------------------+
| account_id | usage_id | usage_type | start_date          | end_date            |
+------------+----------+------------+---------------------+---------------------+
|          2 |      204 |         30 | 2025-04-25 07:30:00 | 2025-04-25 08:29:59 |
+------------+----------+------------+---------------------+---------------------+
1 row in set (0.00 sec)
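A check like the one done by eye on the first query could be automated roughly as follows. This is a minimal sketch with a hypothetical helper name; it assumes the rows for a single network are already ordered by id and reduced to (type, state) pairs, where the state is the value shown in the resource_type column.

```python
def check_network_events(events):
    """Return a list of problems found in one network's ordered usage events."""
    problems = []
    types = [event_type for event_type, _ in events]

    # Exactly one NETWORK.CREATE, and it must come first.
    if types.count("NETWORK.CREATE") != 1:
        problems.append("expected exactly one NETWORK.CREATE")
    elif types[0] != "NETWORK.CREATE":
        problems.append("NETWORK.CREATE is not the first event")

    # At most one NETWORK.DELETE, and nothing after it.
    if types.count("NETWORK.DELETE") > 1:
        problems.append("more than one NETWORK.DELETE")
    elif "NETWORK.DELETE" in types and types[-1] != "NETWORK.DELETE":
        problems.append("events found after NETWORK.DELETE")

    # Consecutive NETWORK.UPDATEs should alternate between states.
    previous_state = None
    for event_type, state in events:
        if event_type == "NETWORK.UPDATE":
            if state == previous_state:
                problems.append(f"repeated NETWORK.UPDATE ({state})")
            previous_state = state
    return problems


# The six rows returned for resource_id=204 above.
rows = [
    ("NETWORK.CREATE", None),
    ("NETWORK.UPDATE", "Implemented"),
    ("NETWORK.UPDATE", "Allocated"),
    ("NETWORK.UPDATE", "Implemented"),
    ("NETWORK.UPDATE", "Allocated"),
    ("NETWORK.DELETE", None),
]

print(check_network_events(rows))  # []
```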

@rajujith rajujith removed their assignment Apr 25, 2025
@Pearl1594
Contributor

@blueorangutan test

@blueorangutan

@Pearl1594 a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-13143)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 47513 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10755-t13143-kvm-ol8.zip
Smoke tests completed. 132 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
|---|---|---|---|
| test_02_list_cpvm_vm | Failure | 0.06 | test_ssvm.py |
| test_04_cpvm_internals | Failure | 0.06 | test_ssvm.py |

@blueorangutan

[SF] Trillian test result (tid-13157)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 53512 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10755-t13157-kvm-ol8.zip
Smoke tests completed. 132 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
|---|---|---|---|
| test_01_secure_vm_migration | Error | 136.79 | test_vm_life_cycle.py |
| test_01_secure_vm_migration | Error | 136.80 | test_vm_life_cycle.py |

@DaanHoogland DaanHoogland merged commit 9d263cd into apache:4.19 Apr 26, 2025
22 of 26 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in ACS 4.20.1 Apr 26, 2025
Successfully merging this pull request may close these issues.

Duplicate NETWORK.CREATE usage events resulting in duplicate usage records
7 participants