The verify_busy_dist_port helper function cause_bdp:spam_nodes/1 was
recently changed to be more aggressive in triggering busy_dist_port
warnings. It now spawns 1 million processes to ensure the test
generates enough activity to trigger the warnings, but that number
exceeds the 256 thousand process limit that is the Riak default. One
consequence is that the rex server responsible for handling rpc calls
can crash. In some cases this causes the rpc calls that riak_test makes
to shut down the riak nodes involved in the test to hang indefinitely.
This change reduces the number of processes spawned to 200 thousand.
That should still be enough to trigger the busy_dist_port warnings
without exceeding the BEAM process limit.
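
As an illustration of staying under the limit, here is a minimal sketch that caps a requested spawn count against the running node's process limit. The module name `spawn_budget`, its function names, and the 10% headroom are assumptions made for this example; they are not part of cause_bdp.

```erlang
-module(spawn_budget).
-export([safe_spawn_count/1, spawn_workers/2]).

%% Cap the requested number of processes so the node keeps some
%% headroom below its process limit.
safe_spawn_count(Requested) when is_integer(Requested), Requested >= 0 ->
    Limit = erlang:system_info(process_limit),
    Current = erlang:system_info(process_count),
    %% Assumed margin: keep 10% of the limit free for rex and other
    %% system processes.
    Headroom = Limit div 10,
    Available = max(0, Limit - Current - Headroom),
    min(Requested, Available).

%% Spawn at most Requested processes running Fun, never more than the
%% capped count. Returns the list of spawned pids.
spawn_workers(Requested, Fun) when is_function(Fun, 0) ->
    N = safe_spawn_count(Requested),
    [spawn(Fun) || _ <- lists:seq(1, N)].
```

With a process limit of 256 thousand, a request for 1 million processes would be capped below the limit, while a request for 200 thousand would normally pass through unchanged.
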

As of Riak 2.0 the vm.args zdbbl setting defaults to 32768. Previously
the default of 1024 was used. Change the cause_bdp helper module for
the verify_busy_dist_port test to be more aggressive in order to
trigger a busy_dist_port message with the higher zdbbl setting.

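
To put the two settings side by side: the `+zdbbl` flag is given in kilobytes, so moving from 1024 to 32768 raises the distribution buffer busy limit by a factor of 32. The sketch below does that arithmetic and reads the limit of the running node. The module name `zdbbl_check` and its function names are assumptions made for this example.

```erlang
-module(zdbbl_check).
-export([zdbbl_bytes/1, scale_factor/2, current_limit_bytes/0]).

%% +zdbbl is specified in kilobytes; convert it to bytes.
zdbbl_bytes(KB) when is_integer(KB), KB > 0 ->
    KB * 1024.

%% How many times larger the new limit is than the old one.
scale_factor(OldKB, NewKB) when OldKB > 0, NewKB >= OldKB ->
    NewKB div OldKB.

%% Distribution buffer busy limit of the running node, in bytes.
current_limit_bytes() ->
    erlang:system_info(dist_buf_busy_limit).
```

Here `zdbbl_check:zdbbl_bytes(32768)` gives 33554432 bytes (32 MB) against 1048576 bytes (1 MB) for 1024, which is why the helper has to queue considerably more data before the port is reported busy.
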
* Install a riak_sysmon custom handler and monitor for the busy_dist_port event on the riak node. When the handler detects the event, it notifies the test process.
* Install riak_test_lager_backend on the riak node to capture log messages in memory. When the test process is notified that the busy_dist_port event fired on the riak node, it retrieves the logs from the backend using the added get_logs/0 function and then checks for the busy_dist_port message.
* I *think* there is a subtle race between lager and the riak_sysmon event handlers with this method of implementing the test. It should rarely come up in practice, but it may still require some form of retry with a maximum retry count.
* Check the logs with grep a maximum number of times, sleeping for some interval between each execution of grep.
* Actually delete the log file.
* Handle the case where the file does not exist when using grep (hackily).
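
The bounded retry mentioned in the list above could look roughly like the following sketch. The module name `retry_check`, both function names, and the assumption that log lines arrive as a list of strings or binaries are all hypothetical; the actual return format of get_logs/0 is not described here.

```erlang
-module(retry_check).
-export([wait_until/3, has_busy_dist_port/1]).

%% Call Fun up to MaxTries times, sleeping IntervalMs milliseconds
%% between attempts. Returns ok as soon as Fun() returns true, or
%% {error, max_retries} once the attempts are used up.
wait_until(Fun, MaxTries, IntervalMs)
  when is_function(Fun, 0), is_integer(MaxTries), MaxTries > 0 ->
    case Fun() of
        true ->
            ok;
        _ when MaxTries =:= 1 ->
            {error, max_retries};
        _ ->
            timer:sleep(IntervalMs),
            wait_until(Fun, MaxTries - 1, IntervalMs)
    end.

%% True if any captured log line mentions busy_dist_port.
%% Assumes Lines is a list of strings or binaries.
has_busy_dist_port(Lines) ->
    lists:any(fun(Line) ->
                      re:run(Line, "busy_dist_port") =/= nomatch
              end, Lines).
```

A test could then call `retry_check:wait_until(fun() -> retry_check:has_busy_dist_port(FetchLogs()) end, 10, 500)`, where `FetchLogs` is a hypothetical fun that retrieves the captured log lines. Bounding the attempts keeps a lost race from turning into an indefinite hang.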