Wait for transfers to complete in
replication2_pg:test_pg_proxy. Replication tests that exercise the n_val=1
request option can fail with insufficient_vnodes errors if the cluster
setup does not include waiting for transfers to complete. Change the
test_pg_proxy test case to wait until transfers complete on the "A"
and "B" clusters before proceeding.
Once riak_ensemble_manager:enable() is called, we need to call
riak_core_ring_manager:force_update() so that the members will be
created and added to the ensembles trying to get a quorum. During ticks
in core, new members are created only if the ring has changed. There is
a race that can sometimes prevent the members from starting, so that
quorum is never achieved. This small change to the test
infrastructure resolves this issue, but it still requires fixing in
riak_core and/or riak_kv.
repl_consistent_object_filter calls riak_ensemble_manager:enable(), which
fails to bootstrap the ensemble because the ring has stabilized already.
An issue for this will be opened in riak_kv, but this quick fix will
allow the test to get beyond that point.
Add ensemble_basic4, ensemble_sync, and ensemble_interleave tests.
ensemble_sync tests the new AAE-based peer syncing logic. The test
checks various scenarios with different levels of data corruption.
ensemble_interleave tests a specific scenario where two peers become
corrupted one after the other. This tests the scenario where the
second peer becomes untrusted while the first peer may be syncing
with it.
Increase timeout for waiting for init:stop/0 to stop nodes from an extra
second to 10 extra seconds. Be sure to wait until the timeout expires
before using kill on any nodes that fail to stop. To avoid unconditionally
waiting the full timeout period, use kill -0 where possible to watch for
nodes stopping. Use kill -9 only after the full timeout period has elapsed
and the node still hasn't stopped. Fix setting of the cookie when
converting the riak_test node to a distributed node.
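
The stop sequence described above (poll with kill -0, use kill -9 only once
the full timeout has elapsed) can be sketched roughly as follows. This is an
illustrative Python sketch, not the Erlang code in the test harness; the
function names and the poll interval are assumptions made for the example.

```python
import os
import signal
import time


def pid_alive(pid):
    """Probe a process with signal 0, the equivalent of `kill -0 pid`."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: it has stopped
    except PermissionError:
        return True   # process exists but is owned by another user
    return True


def wait_for_exit(pid, timeout, poll_interval=0.1):
    """Return True as soon as pid is gone, False if it outlives timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return True
        time.sleep(poll_interval)
    return not pid_alive(pid)


def kill_if_still_running(pid, timeout=10.0):
    """Wait up to timeout seconds, then SIGKILL (kill -9) as a last resort.

    Returns True if the process stopped on its own, False if it was killed.
    """
    if wait_for_exit(pid, timeout):
        return True
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return True  # it exited between the last probe and the kill
    return False
```

Polling means a node that stops quickly does not cost the full timeout, while
a slow node still gets the whole period before it is killed. Note that a
process that has exited but has not yet been reaped by its parent still
answers the signal 0 probe.
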
As of Riak 2.0 the vm.args zdbbl setting defaults to 32768. Previously
the default of 1024 was used. Change the cause_bdp helper module for
the verify_busy_dist_port test to be more aggressive in order to
trigger a busy_dist_port message with the higher zdbbl setting.
There is a race condition that can cause the force-replace test case
in the verify_dynamic_ring test to fail. This issue is being tracked
by riak_core issue #570. This change replaces the force-replace
testing with another resize test. Once issue #570 is resolved this
change can be reverted.
Add ensemble_basic, ensemble_basic2, and ensemble_basic3 tests.
These tests verify that Riak generates proper consensus groups and
that these groups reach quorum, handle leader failures, etc.
ensemble_basic3 tests the basic consistent K/V API as well as behavior
during simple network partitions.
Fix some race conditions in the cluster leader helper functions. Also
re-initiate fullsync after a certain number of checks for
completion. V2 replication has problems where calling
riak_repl_console:start_fullsync is basically ignored and needs to be
retried.
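
The retry behavior described above (re-initiate fullsync after a certain
number of completion checks) might look roughly like the following. This is a
generic Python sketch under stated assumptions: start and is_complete are
hypothetical callables standing in for whatever starts fullsync and checks
for completion, and the counts and delay are illustrative values, not the
ones used by the test helpers.

```python
import time


def run_with_restarts(start, is_complete, checks_per_attempt=3,
                      max_attempts=5, delay=1.0):
    """Start an operation and poll for completion, restarting if it stalls.

    After checks_per_attempt unsuccessful checks the operation is started
    again, on the assumption that the earlier start request was ignored.
    Returns the number of start calls made before completion was observed.
    """
    for attempt in range(1, max_attempts + 1):
        start()
        for _ in range(checks_per_attempt):
            if is_complete():
                return attempt
            time.sleep(delay)
    raise TimeoutError("operation did not complete after %d attempts"
                       % max_attempts)
```
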
Using kill -9 on a node can leave it in a zombie process state, stuck in a
system call never to return. OS X Mavericks seems especially vulnerable to
this problem. Only a reboot can clear out such zombies. Change
rt:brutal_kill/1 to try a normal kill -15 first, and set a 5 second timer
to perform a kill -9 if the normal kill doesn't work. Change
rtdev:stop_all/1 to first try to connect to the nodes to shut them down via
an init:stop/0 rpc, and if that fails attempt to stop them via "riak stop"
instead. Then, ps is used to check for any stragglers, and those are killed
via kill -15 followed, after a 5 second wait, by kill -9.
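
The escalation described above (kill -15 first, with a timer that falls back
to kill -9) can be illustrated with a short Python sketch. It is not the
implementation of rt:brutal_kill/1; the name term_then_kill and the use of a
thread timer are assumptions made for the example.

```python
import os
import signal
import threading


def _force_kill(pid):
    """Send SIGKILL (kill -9), ignoring a process that is already gone."""
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass


def term_then_kill(pid, grace=5.0):
    """Send SIGTERM (kill -15) now and schedule SIGKILL after grace seconds.

    Returns the timer so a caller can join or cancel it, or None if the
    process did not exist when SIGTERM was attempted.
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return None
    timer = threading.Timer(grace, _force_kill, args=(pid,))
    timer.daemon = True  # do not keep the interpreter alive for the timer
    timer.start()
    return timer
```

One caveat of this simple form: if the process exits and its pid is reused
within the grace period, the delayed SIGKILL would hit the new process, so a
more careful version would cancel the timer once the exit is observed.
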
Improve the reliability of the proxy_get test by asserting that
ownership transfer completes before killing the leader node.
Related to basho/riak_repl#352.