Should resolve test failures that produce a message similar to the
following in the server logs:
@riak_core_cluster_conn:handle_info:402 Unmatch message {<20563.30238.10>,status}
Due to the refactor of the cluster manager/connection manager system to
use OTP behaviours, the raw-message method of getting stats has been
removed; stats are now retrieved via a call. To allow the riak_test to
check older clusters as well as ones using the new method, the function
was extended to try the new call first and then fall back to the old.
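A minimal sketch of that fallback, assuming a hypothetical function
name, call term, and timeout (the exact riak_test code differs):

```erlang
%% Hedged sketch: try the new gen_server call first and fall back to
%% the legacy raw-message protocol on older clusters.
get_cluster_conn_status(Pid) ->
    try
        %% Newer clusters: the connection process answers a
        %% synchronous call.
        gen_server:call(Pid, status, 5000)
    catch
        exit:_ ->
            %% Older clusters: fall back to the raw message exchange.
            Pid ! {self(), status},
            receive
                {Pid, Status} -> Status
            after 5000 ->
                {error, timeout}
            end
    end.
```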
Re-initiate fullsync after 100 failed checks for completion. The
number of retries of the 'start fullsync and then check for
completion' cycle is configurable using
repl_util:start_and_wait_until_fullsync_complete/4 and defaults to 20
retries. This change is to avoid spurious test failures due to a rare
condition where the rpc call to start fullsync fails to actually
initiate the fullsync. A very similar change to the version of
start_and_wait_until_fullsync_complete in the replication module,
introduced in 0a36f9974c, has been effective
at avoiding this condition for v2 replication tests.
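A minimal sketch of the retry cycle, assuming hypothetical helper
names and rpc targets:

```erlang
%% Hedged sketch of the 'start fullsync, then check for completion'
%% retry cycle; the argument list and helpers are assumptions.
start_and_wait_until_fullsync_complete(Node, Cluster, NotifyPid, Retries)
  when Retries > 0 ->
    rpc:call(Node, riak_repl_console, fullsync, [["start", Cluster]]),
    %% Poll for completion up to 100 times before concluding that the
    %% rpc call failed to actually initiate the fullsync.
    case check_fullsync_completed(Node, 100) of
        ok ->
            ok;
        failed ->
            start_and_wait_until_fullsync_complete(Node, Cluster,
                                                   NotifyPid, Retries - 1)
    end;
start_and_wait_until_fullsync_complete(_Node, _Cluster, _NotifyPid, 0) ->
    {error, fullsync_not_completed}.
```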
Update the calls to rt:systest_read in repl_util and
repl_aae_fullsync_util to treat identical siblings resulting from the
use of DVV as a single value. These changes are specifically to
address failures seen in the repl_aae_fullsync_custom_n and
replication_object_reformat tests, but should be generally useful for
replication tests that use the utility modules and have
allow_mult set to true.
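A hedged sketch of the sibling handling, with an illustrative helper
name:

```erlang
%% Collapse identical DVV siblings to a single logical value; genuinely
%% divergent siblings are still reported as an error.
maybe_squash_siblings(Obj) ->
    case riak_object:value_count(Obj) of
        1 ->
            {ok, riak_object:get_value(Obj)};
        _ ->
            %% usort collapses duplicate values, so identical siblings
            %% reduce to one element.
            case lists:usort(riak_object:get_values(Obj)) of
                [Value] -> {ok, Value};
                _       -> {error, siblings}
            end
    end.
```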
Changed intercept to explicitly return `{error, econnrefused}`. Moved
helper functions to `repl_util` and added a new helper to distinguish
between disconnects on `cluster_by_name` and `cluster_by_address`
connections.
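A hypothetical intercept in the riak_test intercept style; the module
and function names here are illustrative, not the exact intercept
used:

```erlang
%% Force connection attempts to fail as if the remote refused them.
-module(riak_core_connection_intercepts).
-compile(export_all).

connect_econnrefused(_Addr, _ClientSpec) ->
    {error, econnrefused}.
```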
Added asserts to all wait_for functions.
* add repl_util:wait_until_fullsync_started/1
* add repl_util:wait_until_fullsync_stopped/1
* remove timeouts and use above calls to confirm our test is in the
right state
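A sketch of the asserted waits, assuming rt:wait_until/1 and a
hypothetical is_fullsync_running/1 predicate; failing the wait now
fails the test immediately instead of timing out silently:

```erlang
-include_lib("eunit/include/eunit.hrl").

wait_until_fullsync_started(SourceLeader) ->
    ?assertEqual(ok, rt:wait_until(fun() ->
        is_fullsync_running(SourceLeader)
    end)).

wait_until_fullsync_stopped(SourceLeader) ->
    ?assertEqual(ok, rt:wait_until(fun() ->
        not is_fullsync_running(SourceLeader)
    end)).
```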
When performing the test of object reformatting through replication,
assert that if the format happens to be downgraded, the keys that have
already been replicated can still be read.
Change some of the helper functions in the repl_util module to handle
errors more sensibly so that cluster setup race conditions do not
cause unnecessary test failures.
The wait_until_leader_converge function could incorrectly return
success if the results of the get_leader rpc calls were all undefined
or all badrpc tuples. In either case the failure value ends up as the
sole unique value in the results list, and the success condition only
verifies that the list has length 1, regardless of what that value is.
Change the function to filter out results that indicate failure before
checking the success condition.
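A sketch of the fix, assuming the rpc target used elsewhere in
repl_util:

```erlang
%% Drop undefined and badrpc results before testing for convergence
%% on a single unique leader.
wait_until_leader_converge(Nodes) ->
    rt:wait_until(fun() ->
        Leaders = [rpc:call(N, riak_core_cluster_mgr, get_leader, [])
                   || N <- Nodes],
        Valid = [L || L <- Leaders,
                      L =/= undefined,
                      not is_badrpc(L)],
        %% Converged only when every node answered and all agree.
        length(Valid) =:= length(Nodes)
            andalso length(lists:usort(Valid)) =:= 1
    end).

is_badrpc({badrpc, _}) -> true;
is_badrpc(_)           -> false.
```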
Ensure that AAE replication is tested using all possible failure cases
when dealing with the riak_kv_index_hashtrees and failed connections.
First, use intercepts on riak_kv_vnode and riak_kv_index_hashtree to
ensure that we simulate errors on a per node basis, starting with the
source cluster and moving to the sink. Simulate ownership transfers,
locked and incomplete hashtrees. Verify partitions generate the correct
error count, after using a bounded set of retries, and finally remove
all intercepts and verify that the fullsync completes and all keys have
been migrated between the two clusters.
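Hypothetical intercepts illustrating the simulated failures; the
function names and return values are assumptions:

```erlang
%% Make hashtree lock acquisition fail in the ways the test exercises.
-module(riak_kv_index_hashtree_intercepts).
-compile(export_all).

%% Simulate a hashtree whose lock is already held.
get_lock_locked(_TreePid, _Type) ->
    already_locked.

%% Simulate a hashtree that is still building (incomplete).
get_lock_not_built(_TreePid, _Type) ->
    not_built.
```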
* After a fullsync, make sure that the number of successful exits equals the number of partitions, since each partition should have one successful fullsync source process exit.
* In the future, we could extend this test to cover the parts of replication2 that involve down nodes.
* Added a utility function to get the number of partitions from a node, plus a function to get fullsync status items.
* Test passes with changes from branch cet-successful-exists-fix
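A hedged sketch of those utility functions; the status proplist keys
are assumptions:

```erlang
%% Number of ring partitions owned by the cluster a node belongs to.
num_partitions(Node) ->
    {ok, Ring} = rpc:call(Node, riak_core_ring_manager, get_my_ring, []),
    rpc:call(Node, riak_core_ring, num_partitions, [Ring]).

%% Fetch a single item from the fullsync coordinator status for a
%% given sink cluster.
get_fullsync_status_item(Node, SinkName, ItemName) ->
    Status = rpc:call(Node, riak_repl_console, status, [quiet]),
    Coord = proplists:get_value(fullsync_coordinator, Status, []),
    SinkStatus = proplists:get_value(SinkName, Coord, []),
    proplists:get_value(ItemName, SinkStatus).
```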