Trying to use the repl features before newly started nodes have
riak_repl completely initialized leads to all sorts of nasty crashes
and noise. It frequently leaves fullsync stuck forever, which makes
many of the tests fail.
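For illustration, a minimal sketch of the kind of guard this implies,
assuming riak_test's rt:wait_until/2 and riak_core_node_watcher's
services/1 are available; it is a sketch, not the exact code used:

    %% Block until riak_repl is reported as a running service on Node
    %% before making any repl calls. rt:wait_until/2 retries the fun
    %% until it returns true or the harness timeout is hit.
    wait_for_repl_init(Node) ->
        rt:wait_until(Node,
                      fun(N) ->
                              Services = rpc:call(N, riak_core_node_watcher,
                                                  services, [N]),
                              is_list(Services) andalso
                                  lists:member(riak_repl, Services)
                      end).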
This also tweaks the AAE fullsync tests to remove assumptions about
failure stats when AAE transient errors occur; the handling of those
errors changed recently with the introduction of soft exits.
Avoid a race condition in the replication test module when checking
for site IP addresses in the replication status output. The test
waits for a connection on the leader, but it only queries the
replication status to check for the expected site IP addresses a
single time. Change the test to wait and re-check the status output,
giving greater assurance that when the expected site IP addresses are
not present it is due to a legitimate failure rather than a race in
checking the replication status. This change affects the replication
and replication_upgrade tests, as well as any other tests that call
the replication:replication function.
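A sketch of the wait-and-re-check approach, assuming rt:wait_until/2
and an rpc call to riak_repl_console:status/1; verify_site_ips/2 is a
hypothetical predicate, not a helper from the test suite:

    %% Re-check the replication status until the expected site IPs show
    %% up, instead of reading the status a single time.
    wait_for_site_ips(LeaderNode, ExpectedIPs) ->
        rt:wait_until(LeaderNode,
                      fun(Node) ->
                              Status = rpc:call(Node, riak_repl_console,
                                                status, [quiet]),
                              verify_site_ips(Status, ExpectedIPs)
                      end).

    %% Hypothetical predicate: true once every expected IP string
    %% appears somewhere in the formatted status output.
    verify_site_ips(Status, ExpectedIPs) when is_list(Status) ->
        Flat = lists:flatten(io_lib:format("~p", [Status])),
        lists:all(fun(IP) -> string:str(Flat, IP) > 0 end, ExpectedIPs);
    verify_site_ips(_BadRpc, _ExpectedIPs) ->
        false.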
* Do not attempt to cancel fullsync when the initial attempt to start
  it and wait for completion fails. The observed problem is not that
  fullsync starts and fails to complete in time, but that the initial
  call to start fullsync never takes effect, so the cancellation is
  unnecessary.
* Replace the call to repl_util:wait_for_connection/2 in the node
  upgrade process with a call to replication:wait_until_connection/1,
  which is geared towards v2 replication and should speed up test
  execution (see the sketch after this list).
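A sketch of the second point above; upgrade_and_wait/2 and the calls
to rt:upgrade/2 and rt:wait_for_service/2 are assumed stand-ins for
the actual upgrade helper, and only the swap of the wait call comes
from this change:

    upgrade_and_wait(Node, LeaderNode) ->
        rt:upgrade(Node, current),
        rt:wait_for_service(Node, riak_repl),
        %% previously: repl_util:wait_for_connection(LeaderNode, SiteName)
        replication:wait_until_connection(LeaderNode).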
Fix some race conditions in the cluster leader helper functions. Also
re-initiate fullsync after a certain number of completion checks: v2
replication has a problem where a call to
riak_repl_console:start_fullsync is effectively ignored and needs to
be retried.
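A sketch of that retry pattern; the check interval, retry cadence, and
the server_fullsyncs-based completion check are assumptions, not the
actual helper code:

    %% Poll for fullsync completion, re-issuing start_fullsync every
    %% RetryEvery checks because v2 sometimes ignores the first call.
    wait_for_fullsync(Leader, MaxChecks, RetryEvery) ->
        wait_for_fullsync(Leader, MaxChecks, RetryEvery, 0).

    wait_for_fullsync(_Leader, MaxChecks, _RetryEvery, N) when N >= MaxChecks ->
        {error, fullsync_never_completed};
    wait_for_fullsync(Leader, MaxChecks, RetryEvery, N) ->
        case fullsync_complete(Leader) of
            true ->
                ok;
            false ->
                case N > 0 andalso N rem RetryEvery =:= 0 of
                    true ->
                        %% kick fullsync again rather than waiting forever
                        rpc:call(Leader, riak_repl_console, start_fullsync, [[]]);
                    false ->
                        ok
                end,
                timer:sleep(1000),
                wait_for_fullsync(Leader, MaxChecks, RetryEvery, N + 1)
        end.

    %% Hypothetical completion check: assumes a server_fullsyncs counter
    %% in the leader's repl status.
    fullsync_complete(Leader) ->
        Status = rpc:call(Leader, riak_repl_console, status, [quiet]),
        is_list(Status) andalso
            proplists:get_value(server_fullsyncs, Status, 0) > 0.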
We do this by lowering diff_batch_size to 10 and then writing 10,000
keys to a 64-vnode ring, ensuring that each vnode will hold more than
100 keys (10,000 / 64 is roughly 156 keys per vnode).
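A sketch of the implied setup, assuming rt:deploy_nodes/2 and
rt:systest_write/2 from the harness and a riak_repl application
setting named diff_batch_size; the node count here is arbitrary:

    %% 64-vnode ring with a small diff batch size, then 10,000 keys so
    %% every vnode holds well over 100 of them.
    setup_diff_batches() ->
        Conf = [{riak_core, [{ring_creation_size, 64}]},
                {riak_repl, [{diff_batch_size, 10}]}],
        [Node | _] = rt:deploy_nodes(2, Conf),
        rt:systest_write(Node, 10000).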