* Add support for AAE hashing version 0
Additionally, change verify_aae to use 3 nodes for more complete
testing.
* Update repl_aae_fullsync.erl
* Update verify_aae.erl
* Do hashtree upgrade verification first
* Fix intercepts in aae_fullsync test
* Increase node count for more accurate test
Overriding default_bucket_props in advanced.config without
explicitly setting these properties now returns different values: with the
fix, allow_mult turns to true when an app.config file is present.
Cherry-picked from 619b24e
Trying to use the repl features before newly started nodes have
riak_repl completely initialized leads to all sorts of nasty crashes and
noise. Frequently it leaves fullsync stuck forever, which makes many of
the tests fail.
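One way to guard against this is to block until riak_repl is up on every node before driving repl. A minimal sketch, assuming riak_test's `rt:wait_until/2` helper; the readiness probe (checking that the riak_repl application is running on the node) is an illustrative assumption, not the exact check used here:

```erlang
%% Block until riak_repl is fully initialized on each node before
%% exercising repl features.
wait_for_repl(Nodes) ->
    [ok = rt:wait_until(Node,
         fun(N) ->
             %% Probe the remote node's running applications.
             Apps = rpc:call(N, application, which_applications, []),
             is_list(Apps) andalso lists:keymember(riak_repl, 1, Apps)
         end) || Node <- Nodes],
    ok.
```

Calling this right after cluster setup keeps the subsequent fullsync calls from racing riak_repl's startup.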
This also tweaks the AAE fullsync tests to remove assumptions about
failure stats when AAE transient errors occur. The behavior in the
handling of those errors has changed recently with the introduction of
soft exits.
Add test which ensures that the AAE source worker doesn't deadlock when
waiting for responses from the process which is computing the hashtree
differences.
Unfortunately, this test relies on timeouts; as the code currently
stands, I can't find a cleaner way to write it.
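The timeout-based check boils down to waiting a bounded time for the worker's reply and failing the test if it never arrives. A sketch of that pattern; the message shapes and names here are illustrative assumptions, not the actual protocol between the AAE source worker and the difference-computing process:

```erlang
%% Ask the worker for the hashtree differences and fail loudly if no
%% reply arrives within Timeout milliseconds (i.e. a suspected deadlock).
assert_no_deadlock(WorkerPid, Timeout) ->
    Ref = make_ref(),
    WorkerPid ! {compute_differences, self(), Ref},
    receive
        {Ref, _Differences} ->
            ok
    after Timeout ->
        erlang:error(aae_source_worker_deadlock)
    end.
```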
Improve test infrastructure by extracting common code into functions
that can be used to build out further test cases.
Label our existing unidirectional test as a "simple" test, and begin
building out a more exhaustive test case.
The exhaustive test case is not complete and currently a *work in
progress*. However, it currently configures three clusters and sets up
replication from one source to two sinks. This is as far as the test
gets, because it currently *fails* while setting up the initial
connections to the two clusters, causing fullsync to abort completely.
It appears, though it is unconfirmed, that a crash causes cluster_conn to
fail and lose the information about how to connect to the clusters,
likely because the remote cluster information is stored asynchronously in
the ring. (I believe this to be the issue, as Kelly and I discovered this
bug while debugging another repl-related race condition during startup.)
Removing the second cluster connection call and letting the test proceed
with a connection to only one cluster allows it to continue.
Ensure that AAE replication is tested using all possible failure cases
when dealing with the riak_kv_index_hashtrees and failed connections.
First, use intercepts on riak_kv_vnode and riak_kv_index_hashtree to
ensure that we simulate errors on a per node basis, starting with the
source cluster and moving to the sink. Simulate ownership transfers,
and locked and incomplete hashtrees. Verify that partitions generate the
correct error count after a bounded set of retries, and finally remove
all intercepts and verify that the fullsync completes and all keys have
been migrated between the two clusters.
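The per-node error simulation follows riak_test's intercept pattern. A sketch, assuming `rt_intercept:add/2` and `rt_intercept:clean/2`, where the `not_built` stand-in (which makes the hashtree lock report an incomplete tree) is one of the intercepts provided for riak_kv_index_hashtree; the exact intercept names are assumptions:

```erlang
%% Simulate a transient hashtree error on one node, then restore it.
simulate_hashtree_error(Node) ->
    %% Make get_lock/2 behave as if the hashtree is not yet built,
    %% so fullsync on this node hits the error path.
    ok = rt_intercept:add(Node,
        {riak_kv_index_hashtree, [{{get_lock, 2}, not_built}]}),
    %% ... run fullsync and assert on the bounded-retry error count ...
    %% Remove the intercept so a final fullsync can complete cleanly.
    ok = rt_intercept:clean(Node, riak_kv_index_hashtree).
```

Running the simulation first on each source node and then on each sink node, before the clean final pass, covers both directions of the failure matrix described above.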
* Common code in the cluster setup and data setup is factored out into a utility module.
* AAE fullsync test and benchmark get skinnier and call the common setup code.
* Document parameters in the setup for tests.
* Add a new AAE fullsync test that changes the N value (source is 2, sink is 3).