Resolve failures caused by cuttlefish configuration changes in Riak 2.0.
Remove riak_control_upgrade, since riak_control should cover those use
cases completely.
The verify_busy_dist_port helper function cause_bdp:spam_nodes/1 was
recently changed to be more aggressive in triggering busy_dist_port
warnings. It now spawns 1 million processes to ensure the test
generates enough activity to trigger the warnings, but that number
exceeds the Riak default limit of 256 thousand processes. One
consequence is that the rex server, which handles rpc calls, can
crash. In some cases this causes the rpc calls that riak_test makes
to shut down the Riak nodes involved in the test to hang
indefinitely. This change reduces the number of processes spawned to
200 thousand, which should still be enough to trigger the
busy_dist_port warnings without exceeding the BEAM process limit.
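The idea behind this fix, keeping a helper's spawn count under the VM's
process limit, can be sketched in Erlang as below. The module name
bounded_spawn and the 10000-process headroom are hypothetical choices for
illustration; erlang:system_info(process_limit) and
erlang:system_info(process_count) are standard BIFs. The actual change
simply uses a fixed count of 200 thousand.

```erlang
%% Hypothetical module: caps how many extra processes a helper spawns.
-module(bounded_spawn).
-export([safe_count/1, spawn_idle/1]).

%% Assumed headroom for the rest of the node (rex, shells, Riak itself).
-define(HEADROOM, 10000).

%% Return Requested, capped so that spawning that many more processes
%% still leaves HEADROOM free slots under the process limit.
safe_count(Requested) when is_integer(Requested), Requested >= 0 ->
    Limit = erlang:system_info(process_limit),
    InUse = erlang:system_info(process_count),
    Available = max(0, Limit - InUse - ?HEADROOM),
    min(Requested, Available).

%% Spawn safe_count(N) idle processes that exit on receiving 'stop',
%% and return their pids so the caller can clean them up.
spawn_idle(N) ->
    Count = safe_count(N),
    [spawn(fun() -> receive stop -> ok end end) || _ <- lists:seq(1, Count)].
```

A caller would send stop to each returned pid once the load is no longer
needed, so the processes do not linger for the rest of the test.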
Allow intercept functions passed to rt_intercept:add/2 to be anonymous.
In compiled code they can either be plain anonymous functions, provided
they don't use any variables from the surrounding context, or 2-tuples
like this:
{[FreeVar1, ...],
fun(Arg1, ...) -> ... end}
where [FreeVar1, ...] is the list of free variables to close over so
that they can be used within the anonymous function. For interactive
calls to rt_intercept:add/2 from the Erlang shell, the plain anonymous
function form suffices even if it uses free variables, though the
2-tuple form is also accepted.
For compiled code, support for anonymous intercept functions is
implemented via a parse transform, so to use anonymous functions the
intercept structure(s) containing them must be written inline as part
of the final argument to rt_intercept:add/2; they cannot first be
assigned to a variable that is then used within the argument. A parse
transform operates on the code at compile time, so the value of such a
variable might not be visible to it.
Add a description of anonymous function intercepts to the README.
Prior to Riak 1.4.8, replication registers as a service before
completing all of its initialization tasks, including establishing
realtime connections to sink clusters. This leads to a race condition
in the replication_upgrade and replication2_upgrade tests: the test
may begin writing data to the source cluster to verify realtime
replication before the most recently upgraded node has connected to
the sink cluster. The realtime replication system then silently
discards that data, and the test fails because not all of the
expected data is replicated and readable on the sink cluster. Change
the replication_upgrade and replication2_upgrade tests to explicitly
wait for the realtime connection to be established after each source
cluster node is upgraded before proceeding with the test.
Establish a new PB connection to the legacy node after it is upgraded
in order to avoid a test failure. The existing PB connection may close
if the node upgrade takes too long, and reusing it in that case fails
the test because calls on that pid return {error, disconnected}.
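The change itself opens a new connection unconditionally after the
upgrade. A related pattern, shown here only as an alternative and not as
what the test does, is to retry an operation once on a fresh connection
when the old one reports {error, disconnected}. Op and Connect are
hypothetical caller-supplied funs (for example, wrappers around
riakc_pb_socket calls), and the module name is made up for illustration.

```erlang
%% Hypothetical retry-on-disconnect wrapper; Op and Connect are
%% caller-supplied funs standing in for real PB client calls.
-module(reconnect_sketch).
-export([with_reconnect/3]).

%% Run Op(Conn). If it reports {error, disconnected}, get a new
%% connection from Connect() and run Op once more on it. Returns
%% {Result, ConnInUse} so the caller keeps using the live connection.
with_reconnect(Op, Conn, Connect) ->
    case Op(Conn) of
        {error, disconnected} ->
            NewConn = Connect(),
            {Op(NewConn), NewConn};
        Result ->
            {Result, Conn}
    end.
```

Reconnecting unconditionally, as the change does, is simpler and avoids
masking other kinds of connection errors.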
Add missing riak_test options to the completion script. Allow
completion to work when riak_test is invoked as ./riak_test or a
similar pathname. Change the grep for "confirm" to "confirm/0", which
should appear in a test module's export list, so that it doesn't
accidentally match the plain word "confirm" in a comment in a module
that shouldn't be part of the test name completion. The
_get_comp_words_by_ref helper function is not available by default on
OS X, so add a fallback for that case. Use a local variable to capture
the grep output rather than overwriting a global variable.
In the test of object reformatting through replication, assert that
if the object format is downgraded, the keys that have been replicated
can still be read.
Wait for transfers to complete in replication2_pg:test_pg_proxy.
Replication tests that exercise the n_val=1 request option can fail
with insufficient_vnodes errors if the cluster setup does not wait for
transfers to complete. Change the test_pg_proxy test case to wait
until transfers complete on the "A" and "B" clusters before
proceeding.
Fix an error that can cause the replication2_pg:test_pg_proxy test
case to fail. The test establishes a protocol buffers connection to a
node in the "B" cluster, shuts down that cluster's leader node, and
then uses the connection to exercise proxy_get. If the connection was
established to the leader, shutting the leader down causes the test to
stall and eventually fail. This change makes the test establish a new
connection to a node remaining in the "B" cluster for the proxy_get,
which prevents the stall.
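Choosing the node for the new connection amounts to excluding the
stopped leader from the cluster's node list before connecting. A
hypothetical sketch, where Connect stands in for whatever function the
test uses to open the protocol buffers connection:

```erlang
%% Hypothetical helper: connect to any node of Nodes except the
%% leader that was shut down. Connect is a caller-supplied fun(Node).
-module(proxy_conn_sketch).
-export([connect_to_survivor/3]).

connect_to_survivor(Nodes, StoppedLeader, Connect) ->
    case lists:delete(StoppedLeader, Nodes) of
        [Node | _] -> {Node, Connect(Node)};
        [] -> {error, no_surviving_node}
    end.
```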
Once riak_ensemble_manager:enable() is called, we need to call
riak_core_ring_manager:force_update() so that the members are created
and added to the ensembles trying to reach a quorum. During riak_core
ticks, new members are created only if the ring has changed, and a
race can prevent the members from starting, so that quorum is never
reached. This small change to the test infrastructure works around the
issue, but it still needs to be fixed in riak_core and/or riak_kv.
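The call ordering described above can be sketched as follows. The module
name and the injectable Rpc fun are assumptions made so the sequence can
be exercised without a running cluster; by default it uses rpc:call/4,
and the two remote functions are the ones named in the message.

```erlang
%% Hypothetical sketch of the enable-then-force_update sequence.
-module(ensemble_enable_sketch).
-export([enable/1, enable/2]).

%% Run the sequence on Node over distributed Erlang.
enable(Node) ->
    enable(Node, fun rpc:call/4).

%% Call riak_ensemble_manager:enable() and then
%% riak_core_ring_manager:force_update() on Node, in that order, so
%% the ensemble members get created. Rpc has the shape of rpc:call/4.
enable(Node, Rpc) ->
    EnableResult = Rpc(Node, riak_ensemble_manager, enable, []),
    UpdateResult = Rpc(Node, riak_core_ring_manager, force_update, []),
    {EnableResult, UpdateResult}.
```

The two calls are bound to variables one after the other rather than
built inside a single tuple expression, because Erlang does not guarantee
the evaluation order of tuple elements.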
repl_consistent_object_filter calls riak_ensemble_manager:enable(), which
fails to bootstrap the ensemble because the ring has already
stabilized. An issue for this will be opened in riak_kv, but this quick
fix allows the test to get past that point.
Add ensemble_basic4, ensemble_sync, and ensemble_interleave tests.
ensemble_sync tests the new AAE-based peer syncing logic, checking
various scenarios with different levels of data corruption.
ensemble_interleave tests a specific scenario in which two peers become
corrupted one after the other, so that the second peer becomes
untrusted while the first peer may still be syncing with it.