Conversation

@max-au
Contributor

@max-au max-au commented Nov 18, 2021

Replace test_server:start_node with ?CT_PEER. Clean up
unused test case callbacks (e.g. testcase(suite)) while I'm at it.

@rickard-green rickard-green added the team:PS Assigned to OTP team PS label Nov 19, 2021
@max-au max-au force-pushed the max-au/peerify-snmp branch from 3066550 to ca16869 Compare December 25, 2021 21:59
@max-au
Contributor Author

max-au commented Dec 25, 2021

Rebased to fix merge conflict.

@bmk
Contributor

bmk commented Jan 19, 2022

A couple of points:

  1. ?ALIB:start_node/1 does more than just start a node.
    Your changes do not take that into account.
    Move the CT_PEER(...) stuff into the ?ALIB:start_node/1 function
    (replace START_NODE) and let it do the "same" as before.
  2. I am not fond of anonymous node names (which is why the
    nodes are usually prefixed with agent_, manager_, ...);
    it makes it easier when debugging tests.

@max-au max-au force-pushed the max-au/peerify-snmp branch from ca16869 to f76ebc6 Compare January 20, 2022 01:59
@max-au
Contributor Author

max-au commented Jan 20, 2022

?ALIB:start_node/1 does more than just start a node.
Your changes do not take that into account.

The extra actions taken are:

  • adding the extra code path filename:dirname(code:which(?MODULE)) - ?CT_PEER does this automatically
  • calling global:sync on the started node (via the command line) - this is done via arguments passed to ?CT_PEER (see the sketch below)
  • calling global:sync on the test runner node - this is done in init_all(Config) (most test cases do not need it)
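A minimal sketch of how this maps onto ?CT_PEER (illustrative only; the actual call sites in the PR may differ slightly):

%% ?CT_PEER adds the test-module code path automatically; global:sync on the
%% new node is requested via an ordinary VM command-line argument.
{ok, Peer, Node} = ?CT_PEER(["-s", "global", "sync"]),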

I am not fond of anonymous node names (which is why the
nodes are usually prefixed with: agent_ or manager_, ...)

I have updated the PR with extra "-mgr", "-agent", "-v3mgr" and "-v3agent" suffixes to make node names clearer.

In general, such clarity is exactly one of the design goals for ?CT_PEER. Nodes are named after the test suite and test case. To give an example:

-module(my_SUITE).
-include_lib("common_test/include/ct.hrl").
-export([my_test/1]).

my_test(Config) when is_list(Config) ->
    {ok, Peer, Node} = ?CT_PEER(),
    ct:print("Node: ~s", [Node]),
    peer:stop(Peer).

This would print something like my_SUITE-my_test-333-123456@hostname (where 333 is os:getpid() and 123456 comes from erlang:unique_integer()).

This way node names are guaranteed not to clash, even if I run two copies of the same test suite on the same host. It also guarantees that one test case that failed and left a node running will not affect another test case attempting to use the same node name.

I verified that the tests pass on Linux (Ubuntu 20), FreeBSD 13 and a Windows machine.

@bmk
Contributor

bmk commented Jan 20, 2022

You still do not start the snmp_test_sys_monitor process on all nodes,
which is the primary thing the start_node function did.
Either move the CT_PEER calls into this (start_node) function, or
explicitly add this to each call to CT_PEER.
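A minimal sketch of what such a wrapper could look like (illustrative only; the real start_node/1 also does other setup, and ExtraArgs is a made-up parameter):

%% Hypothetical wrapper keeping the old start_node entry point while
%% delegating node creation to ?CT_PEER; the system monitor and global:sync
%% are started via command-line arguments, as before.
start_node(ExtraArgs) ->
    ?CT_PEER(ExtraArgs ++ ["-s", "snmp_test_sys_monitor", "start",
                           "-s", "global", "sync"]).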

@max-au
Contributor Author

max-au commented Jan 20, 2022

This is done in init_all(Config), see line 132:

Args = ["-s", "snmp_test_sys_monitor", "start", "-s", "global", "sync"],

No existing test case calls start_node directly, only via init_all, so snmp_test_sys_monitor is started on both the manager and agent nodes.

I verified that the tests pass on Linux, FreeBSD, macOS and Windows. Is there any other OS or combination of settings that may not work with my changes?

@bmk
Contributor

bmk commented Jan 21, 2022

?ALIB:start_node is called directly in two places in snmp_agent_SUITE.erl.
There is also a wrapper function, start_node, which is called four times.
?ALIB:start_node is also called from snmp_agent_mibs_SUITE.erl.
And there is possible future use to consider.

@max-au max-au force-pushed the max-au/peerify-snmp branch from f76ebc6 to 54cd4c0 Compare January 21, 2022 22:37
@github-actions
Contributor

github-actions bot commented Jan 21, 2022

CT Test Results

    2 files, 24 suites, 47m 12s ⏱️
    481 tests: 453 passed ✔️, 28 skipped 💤, 0 failed
    622 runs:  489 passed ✔️, 133 skipped 💤, 0 failed

Results for commit a9d76c5.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally.

Artifacts

// Erlang/OTP Github Action Bot

@max-au
Contributor Author

max-au commented Jan 22, 2022

Thank you, I updated the PR, adding this to all peer nodes.

I was not able to figure out how snmp_test_sys_monitor is used in the test suites, but I believe the change didn't break anything; tests are still passing on every OS I can get my hands on.

@bmk
Contributor

bmk commented Jan 25, 2022

snmp_test_sys_monitor is started on each new node.
It subscribes to "system events". When it gets a system event,
it sends it to a global "collector" process.
When a test case fails (timeout), the test case code checks
whether there are any system events, and if so, those are most likely
the reason for the failure.
In the bad old days, we used to have 10-30 failed test cases
each night. Almost all of them were random timeouts (they could
happen in any test case). It turned out that in almost every case,
the VM had issues that made it basically stop for anything
from 5-60 seconds => request timeout!
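The idea is roughly like this (a sketch only, not the actual snmp_test_sys_monitor code; the collector name is made up):

start() ->
    {ok, spawn(fun init/0)}.

init() ->
    %% Subscribe to VM system events on the local node.
    erlang:system_monitor(self(), [busy_port, busy_dist_port,
                                   {long_gc, 1000}, {long_schedule, 1000}]),
    loop().

loop() ->
    receive
        {monitor, _Pid, _Type, _Info} = Event ->
            %% Forward every event to the globally registered collector;
            %% ignore the error if the collector is not (yet) registered.
            catch global:send(sys_monitor_collector, {node(), Event}),
            loop()
    end.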

@bmk bmk added the testing currently being tested, tag is used by OTP internal CI label Jan 25, 2022
@bmk
Contributor

bmk commented Jan 27, 2022

A lot of test cases failed in the last nightly test run(s).
I think many of them are unrelated transient disturbances,
but because of that, I cannot say whether this branch caused any issues.
I will let it run one more night.
I have a 'maint'-branch with various fixes and changes (to the tests).
I might need to merge that branch to resolve these issues (would require a
rebase of this branch). I will report progress tomorrow.

@bmk
Contributor

bmk commented Jan 28, 2022

There are still a lot of failing test cases.
I will take this PR out for the weekend, and instead add my (maint-based)
branch (which deals with a bunch of minor test issues).
Let that run over the weekend, and if it looks ok on Monday, I will merge then.
Then you can rebase on the new master and we will try again, hopefully with
all the spurious failures removed.

@bmk bmk removed the testing currently being tested, tag is used by OTP internal CI label Jan 28, 2022
@max-au
Contributor Author

max-au commented Jan 29, 2022

Thank you! Just in case, I re-ran the tests on Ubuntu 20, Windows, Mac OS X, Ubuntu ARM64 and FreeBSD 11; they still pass with no failures.

@bmk
Contributor

bmk commented Jan 31, 2022

It's the same for me when I run on my own machine(s).
Anyway, the tests ran without problems (from 140 to 0 failures).
I have merged my branch (to maint and master), so please
rebase and push and we can try again (with less noise).

@max-au max-au force-pushed the max-au/peerify-snmp branch from 54cd4c0 to a9d76c5 Compare February 1, 2022 00:38
@max-au
Contributor Author

max-au commented Feb 1, 2022

Rebased, verified Linux & FreeBSD passing.

@bmk bmk added the testing currently being tested, tag is used by OTP internal CI label Feb 1, 2022
@bmk
Contributor

bmk commented Feb 2, 2022

Some failing test case(s), but not nearly as many as before (so far).
Here is one. As far as I can see it's a peer:stop/1 that fails (I assume
the proxy process, and the node, have already died):

*** CT Error Notification 2022-02-02 04:47:06.433 ***

snmp_agent_SUITE:end_per_group failed
Reason: noproc

Full error description and stacktrace

=== Ended at 2022-02-02 04:47:06
=== Location: [{proc_lib,stop,1077},
{snmp_agent_test_lib,finish_all,248},
{snmp_agent_SUITE,end_per_group,922},
{test_server,ts_tc,1784},
{test_server,run_test_case_eval1,1381},
{test_server,run_test_case_eval,1225}]
=== === Reason: no such process or port

=== Total execution time of group: 864.345s

@bmk
Contributor

bmk commented Feb 2, 2022

Also, I noticed that there is a (test_server:)start_peer, called via the CT_PEER macro,
but no corresponding (test_server:)stop_peer. Instead, the peer module is called directly: peer:stop.

@bmk
Contributor

bmk commented Feb 2, 2022

Another "failure" (actually its the end per testcase that fails):
=== WARNING: end_per_testcase crashed!
Reason: {suite_failed,{rpc_failure,'snmp_manager_SUITE-simple_sync_get3-agent-14690-84161@*******'},
snmp_manager_SUITE,6359}
Line: [{snmp_test_lib,fail,962},
{snmp_manager_SUITE,exec,5509},
{snmp_manager_SUITE,fin_agent,5346},
{snmp_manager_SUITE,end_per_testcase,818},
{test_server,do_end_per_testcase,1628},
{test_server,run_test_case_eval1,1336},
{test_server,run_test_case_eval,1225}]

This is also a case where the node has already died (host name removed from the log).

@bmk
Contributor

bmk commented Feb 2, 2022

And another one:

*** [2022-02-02 05:20:50.881] INFO test_server@shelob <0.1659.0> ***
start_agent -> try (rpc) start sub agent on 'snmp_agent_test_lib-snmp_sa-66-30535@shelob'

*** CT Error Notification 2022-02-02 05:20:50.883 ***

snmp_agent_SUITE:init_per_group failed
Reason: {badmatch,{badrpc,nodedown}}

Full error description and stacktrace

=== Ended at 2022-02-02 05:20:50
=== Location: [{snmp_agent_test_lib,start_agent,718},
{snmp_agent_SUITE,otp_4394_init,7342},
{snmp_agent_SUITE,init_per_group,753},
{test_server,ts_tc,1784},
{test_server,run_test_case_eval1,1381},
{test_server,run_test_case_eval,1225}]
=== === Reason: no match of right hand side value {badrpc,nodedown}
in function snmp_agent_test_lib:start_agent/3 (snmp_agent_test_lib.erl, line 718)
in call from snmp_agent_SUITE:otp_4394_init/1 (snmp_agent_SUITE.erl, line 7342)
in call from snmp_agent_SUITE:init_per_group/2 (snmp_agent_SUITE.erl, line 753)
in call from test_server:ts_tc/3 (test_server.erl, line 1784)
in call from test_server:run_test_case_eval1/6 (test_server.erl, line 1381)
in call from test_server:run_test_case_eval/9 (test_server.erl, line 1225)

@max-au
Contributor Author

max-au commented Feb 3, 2022

I am not able to reproduce these test failures. Which operating system should I try, and how should I start these tests?
So far I am using Ubuntu Linux, macOS, Windows and FreeBSD. I run the tests as prescribed by the "HOW TO TEST" guide:

make
make tests
cd release/tests/test_server
$ERL_TOP/bin/erl
> ts:install().
> ts:run(snmp, [batch]).

I also verify that make snmp_test works on Linux (but this is also done by GitHub Actions, which reports no errors): https://erlang.github.io/prs/5417/ct_logs/index.html

Testing make_test_dir.snmp_test: *** SKIPPED {snmp_to_snmpnet_SUITE,init_per_suite} ***
Testing make_test_dir.snmp_test: TEST COMPLETE, 507 ok, 0 failed, 33 skipped of 540 test cases

@bmk
Contributor

bmk commented Feb 3, 2022

peer:stop
This could be a race condition (just a guess).
It (the failure in end_per_group for group major_tcs) occurred on 17 different test runs
(for example: darwin 18.7 x86, darwin 21.2 arm, solaris 11 x86, ubuntu 20.04 x86,
SLES 12sp2 x86, FreeBSD 12.2 x86, Windows 10, ...).
The machines are of varying age and capacity (some old and slow, others new and fast).
Also, our test runs use various VM configurations.
For instance, the darwin 18.7 run was labeled with offheapmsgq_meamin:
./configure --enable-smp-support --enable-darwin-64bit --disable-kernel-poll
And the environment variable: ERL_OTP25_FLAGS = +Meamin
And the darwin 21.2 run was labeled with jit_s8_a2_offheapmsgq_meamin:
./configure --disable-kernel-poll
And the environment variable: ERL_OTP25_FLAGS = +Meamin

@bmk
Contributor

bmk commented Feb 3, 2022

end_per_group(major_tcs): peer:stop
Ok, I think I know what is going on.
There is ongoing work on global, which "has some side effects".
This work has not been merged to master yet (as far as I know).
In one of the (failing) test runs, we get the following message(s):

=WARNING REPORT==== 2-Feb-2022::04:51:33.635222 ===
'global' at node 'snmp_agent_test_lib-snmp_mgr-35-19411@mallor' requested disconnect from node 'snmp_agent_SUITE-otp_16092_simple_start_and_stop-agent-11618-19411@mallor' in order to prevent overlapping partitions
=WARNING REPORT==== 2-Feb-2022::04:51:33.635231 ===
'global' at node 'snmp_agent_test_lib-snmp_sa-66-19411@mallor' requested disconnect from node 'snmp_agent_SUITE-otp_16092_simple_start_and_stop-agent-11618-19411@mallor' in order to prevent overlapping partitions.

Note that this happens before end_per_group(major_tcs), in another test case which is not actually using these nodes.
So when ?ALIB:finish_all attempts to stop the nodes (SaNode and MgrNode), they are already dead (as far as we and the
peer process know).

@max-au
Contributor Author

max-au commented Feb 4, 2022

Now I understand this (although I was not able to reproduce the behaviour even after pulling in the 2 commits implementing the global changes from the maint branch).

When the first node (the agent) stops, something happens in global and it decides to disconnect the manager node too (losing the peer). I am not sure whether this global behaviour is valid for this case.
I have two ways to work around it; both are somewhat... unfortunate.

  1. I can (catch peer:stop(Node)), ignoring peer node shutdown issues.
  2. I can run peer nodes with connection => 0 (which will keep these nodes running even when not connected via Erlang distribution).

I verified that both workarounds keep the tests passing, and I can make whichever change is preferred. But to be honest, I'm not sure whether global should make such decisions locally and disconnect nodes at will.
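Roughly, the two alternatives look like this (a sketch, not the exact diff):

%% Alternative 2: an out-of-band control connection (TCP on an automatically
%% selected port), so the peer stays controllable even if global disconnects
%% the node from the Erlang distribution.
{ok, Peer, Node} = ?CT_PEER(#{connection => 0}),
%% ... test body ...
%% Alternative 1: tolerate the control process already being gone at stop time.
catch peer:stop(Peer),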

@bmk
Contributor

bmk commented Feb 4, 2022

I would prefer alt. 1 (it was the workaround I used). Simple and no risk of strange "side effects".

@bmk
Contributor

bmk commented Feb 4, 2022

Another thing is that on at least one platform almost the entire agent suite is skipped,
because init_per_group (test_v1, ...) fails to create a mnesia table (friendsTable2).

            Skipping snmp_agent_SUITE(1360): {table_already_exist,friendsTable2}

This should be impossible on a newly created node, so I am not sure what is happening here,
unless it's an old node being reused. Could it be one of the nodes from a previous test run
(disconnected by global) that still lives?

@rickard-green
Contributor

@max-au

I verified that both workarounds keep the tests passing, and I can make whichever change is preferred. But to be honest, I'm not sure whether global should make such decisions locally and disconnect nodes at will.

One of global's main tasks is to maintain a fully connected network. Other applications such as mnesia rely on this. Currently, global fails at providing this service in the presence of network failures. The only way *) to provide this service when connections are lost is to remove other connections, so you end up with fully connected partitions.

If global is disabled (-connect_all false), it won't provide a fully connected network and will also not take down connections like this.
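For a test that does not depend on global at all, that could look roughly like this (a sketch, assuming ?CT_PEER/1 accepts a plain list of extra command-line arguments):

%% Start a peer with global's full-mesh maintenance disabled; it will neither
%% establish extra connections nor take existing ones down.
{ok, Peer, Node} = ?CT_PEER(["-connect_all", "false"]),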

Since this is quite a big change in behaviour compared to before, it will also be possible to disable this bugfix (-prevent_overlapping_partitions false), but then you might get intricate issues with applications relying on the fully connected network service being provided.

*) This is not completely true. You could also provide this service by implementing "virtual connections" routing signals over other connected nodes. We would, however, not be able to deliver such a solution in the near future. That is, that can only be a long term option. Note that I'm not saying whether or not "virtual connections" will be implemented, just that it could be an option for the future.

@max-au
Contributor Author

max-au commented Feb 4, 2022

I found out why I was not able to reproduce the failure. I merged the patch from the maint branch, but the feature (global disconnecting nodes it does not like) was not enabled in that branch. Likely the version merged to master in #5611 has it enabled by default.

Adding catch to all peer:stop calls makes the tests pass, but it means nearly all tests starting more than a single peer node would have to either disable global or swallow peer node shutdown errors. This makes the tests less useful, as it loses the signal previously delivered by unexpected peer node crashes.

It also prevents us from running test cases in parallel! I am currently running rpc_SUITE tests concurrently, completing the suite in just a few seconds. This is one of the improvements I'd like for the OTP tests (so that we can get a testing signal in just a few minutes for the entire OTP codebase).
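For context, running a group's test cases concurrently is just a group property in Common Test; a minimal sketch (suite, group and case names are made up):

groups() ->
    %% The parallel property makes Common Test run the group's cases concurrently.
    [{quick_tests, [parallel], [tc_one, tc_two, tc_three]}].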

@bmk
Contributor

bmk commented Feb 6, 2022

One of global's main tasks is to maintain a fully connected network. Other applications such as mnesia rely on this. Currently, global fails at providing this service in the presence of network failures. The only way *) to provide this service when connections are lost is to remove other connections, so you end up with fully connected partitions.

I should point out that all nodes in the snmp (agent) test suite are local (same host).

@max-au
Contributor Author

max-au commented Feb 6, 2022

I should point out that all nodes in the snmp (agent) test suite are local (same host).

From the test perspective, they are distributed. Given that the test cases call global:sync, it appears that global maintaining a full mesh is expected to work for these tests (otherwise it would be easier to just disable it).

@rickard-green
Contributor

@max-au wrote:

I found out why I was not able to reproduce the failure. I merged the patch from the maint branch, but the feature (global disconnecting nodes it does not like) was not enabled in that branch. Likely the version merged to master in #5611 has it enabled by default.

Yes, it is only enabled in our master tests. I'll soon make a PR available with the changes I've made in master which hopefully can be helpful.

Adding catch to all peer:stop calls makes the tests pass, but it means nearly all tests starting more than a single peer node would have to either disable global or swallow peer node shutdown errors. This makes the tests less useful, as it loses the signal previously delivered by unexpected peer node crashes.

I don't think that is the correct way to handle this. The termination is due to nodes behaving as old-style slave nodes. I think the correct way to handle this is to use connection => 0, making the nodes behave as old-style peer nodes. This is what I've done in the test cases that have needed it, which not only includes running tests with prevent_overlapping_partitions enabled, but also tests which perform explicit disconnects. This is also how disconnects were handled before the CT_PEER changes were introduced.

It also prevents us from running test cases in parallel! I am currently running rpc_SUITE tests concurrently, completing the suite in just a few seconds. This is one of the improvements I'd like for the OTP tests (so that we can get a testing signal in just a few minutes for the entire OTP codebase).

Yes, it would have been nice to be able to run everything in parallel, but that is not always possible. prevent_overlapping_partitions is not the only thing that prevents parallel testing.

When it comes to rpc_SUITE, I think I've disabled global, if I remember correctly. I don't think it is a problem to disable global if it is not involved in the testing. In that case, prevent_overlapping_partitions won't prevent parallel execution.

@rickard-green
Contributor

I'm copying the conversations related to prevent_overlapping_partitions in #5611 as well

@bmk
Contributor

bmk commented Feb 9, 2022

I am removing this from testing. We need to see what it looks like without this PR (again).

@bmk bmk removed the testing currently being tested, tag is used by OTP internal CI label Feb 9, 2022
@bmk bmk added the testing currently being tested, tag is used by OTP internal CI label Feb 17, 2022
@bmk bmk merged commit 6340a42 into erlang:master Mar 1, 2022
@bmk bmk removed the testing currently being tested, tag is used by OTP internal CI label Mar 1, 2022
@max-au max-au deleted the max-au/peerify-snmp branch September 30, 2022 20:12