Trying to use the repl features before newly started nodes have
riak_repl completely initialized leads to all sorts of nasty crashes and
noise. Frequently it makes fullsync stuck forever, which makes a lot of
the tests fail.
This also tweaks the AAE fullsync tests to remove assumptions about
failure stats when AAE transient errors occur. The behavior in the
handling of those errors has changed recently with the introduction of
soft exits.
Update the calls to rt:systest_read in repl_util and
repl_aae_fullsync_util to treat identical siblings resulting from the
use of DVV as a single value. These changes are specifically to
address failures seen in the repl_aae_fullsync_custom_n and
replication_object_reformat tests, but should be generally useful for
replication tests using the utility modules that and that have
allow_mult set to true.
Ensure that AAE replication is tested using all possible failure cases
when dealing with the riak_kv_index_hashtrees and failed connections.
First, use intercepts on riak_kv_vnode and riak_kv_index_hashtree to
ensure that we simulate errors on a per node basis, starting with the
source cluster and moving to the sink. Simulate ownership transfers,
locked and incomplete hashtrees. Verify partitions generate the correct
error count, after using a bounded set of retries, and finally remove
all intercepts and verify that the fullsync completes and all keys have
been migrated between the two clusters.
* Common code in the cluster setup and data setup is factored out into a utility module.
* AAE fullsync test and benchmark get skinnier and call the common setup code.
* Document parameters in the setup for tests.
* Add a new AAE fullsync test that changes the N value (source is 2, sink is 3).