Change replication_ssl to use the wait_for_site_ips function from the
replication module introduced in
297090ded6 instead of the defunct
verify_site_ips function.
Avoid a race condition in the replication test module when checking
for site IP addresses in the replication status output. The test
waits for a connection on the leader, but it only queries the
replication status to check for the expected site IP addresses a
single time. Change the test to wait and re-check the status output to
give greater assurance that if the expected site IP addresses are not
present it is due to legitimate failure and not a race condition in
checking the replication status. This change affects the replication
and replication_upgrade tests as well as any other tests that call the
replication:replication function.
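A minimal sketch of the polling approach, assuming the replication status is read over rpc from riak_repl_console:status/1; the helper name and the per-site status key are illustrative:

    wait_for_site_ips(Leader, Site, ExpectedIPs) ->
        rt:wait_until(fun() ->
            Status = rpc:call(Leader, riak_repl_console, status, [quiet]),
            %% Key name is illustrative; keep re-checking until every
            %% expected IP shows up in the status output.
            Actual = proplists:get_value(Site ++ "_ips", Status, ""),
            lists:all(fun(IP) -> string:str(Actual, IP) > 0 end, ExpectedIPs)
        end).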
Prevent a situation where the auto-reconnect has not yet triggered,
causing the next operation after reconnecting to return an error
instead of ok. Force a disconnect and reconnect to make sure the
test is deterministic.
Change to fetching the list of peers first, then checking whether the
riak_kv service is up. If the service is up, then check the peers.
Otherwise it is possible to see the service down and then the peers up,
because the service came up in the interim.
Also, make the KV vnode delay configurable.
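Roughly, the reordered check looks like the sketch below, assuming the riak_kv service state is read via riak_core_node_watcher; get_peers/1 and check_peers/1 are stand-ins for the test's actual helpers:

    peers_consistent_with_service(Node) ->
        %% Capture the peer list before looking at the riak_kv service so a
        %% service that comes up in the interim cannot skew the comparison.
        Peers = get_peers(Node),
        Services = rpc:call(Node, riak_core_node_watcher, services, [Node]),
        case lists:member(riak_kv, Services) of
            true  -> check_peers(Peers);  %% service is up: peers must be up too
            false -> true                 %% service is down: don't judge the peers yet
        end.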
Now ensemble peers are prevented from starting up until the riak_kv
service is up to avoid nasty races that could even lead to node crashes
as the ensembles frantically query for data that isn't ready.
Re-initiate fullsync after 100 failed checks for completion. The
number of retries of the 'start fullsync and then check for
completion' cycle is configurable using
repl_util:start_and_wait_until_fullsync_complete/4 and defaults to 20
retries. This change is to avoid spurious test failures due to a rare
condition where the rpc call to start fullsync fails to actually
initiate the fullsync. A very similar change to the version of
start_and_wait_until_fullsync_complete in the replication module
introduced in 0a36f9974c has had good
effect at avoiding this condition for v2 replication tests.
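The retry scheme is roughly as sketched below; this is a simplified two-argument form, and wait_for_completion/2 stands in for the existing "100 checks for completion" polling:

    start_and_wait_until_fullsync_complete(_Node, 0) ->
        {error, fullsync_not_completed};
    start_and_wait_until_fullsync_complete(Node, Retries) ->
        %% Kick off fullsync; the rpc occasionally fails to take effect,
        %% which is why the whole start-and-check cycle is retried.
        rpc:call(Node, riak_repl_console, fullsync, [["start"]]),
        case wait_for_completion(Node, 100) of
            ok    -> ok;
            retry -> start_and_wait_until_fullsync_complete(Node, Retries - 1)
        end.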
Part of the condition checking done in the replication_object_reformat
test is to validate the results of a fullsync using
repl_util:validate_completed_fullsync/6. The way in which the
function is called from the test expects fullsync to complete with 0
error_exit or retry_exit conditions occurring. This requires that the
sink cluster be in a steady state with all partitions available. The test
failed to wait for such conditions to occur and instead relied on
performing a node downgrade asynchronously and waiting for up to 60
seconds for a completion message before continuing with the test. The
test was continually failing after a node was downgraded to `previous`
due to partitions being reported as `down` on that node. To resolve
the issue the node downgrade process is now done in the primary test
process instead of in a separate spawned process. After the version
downgrade is complete, the test now waits for the riak_repl and the
riak_kv services, calls rt:wait_until_nodes_ready/1, calls
rt:wait_until_no_pending_changes/1, and finally waits for the
riak_repl2_fs_node_reserver named process to be registered on the
downgraded node. This process is responsible for handling partition
reservation requests and is key to determining that the new node is
able to handle a fullsync without partition errors.
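The final wait can be expressed as a simple poll for the registered process, e.g.:

    wait_for_fs_node_reserver(Node) ->
        rt:wait_until(Node, fun(N) ->
            %% Ready only once riak_repl2_fs_node_reserver is registered on N.
            is_pid(rpc:call(N, erlang, whereis, [riak_repl2_fs_node_reserver]))
        end).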
Update the calls to rt:systest_read in repl_util and
repl_aae_fullsync_util to treat identical siblings resulting from the
use of DVV as a single value. These changes are specifically to
address failures seen in the repl_aae_fullsync_custom_n and
replication_object_reformat tests, but should be generally useful for
replication tests that use those utility modules and have
allow_mult set to true.
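The intent is roughly the following (illustrative, not the exact rt:systest_read change): siblings whose values are byte-identical collapse to one value, so only genuinely conflicting siblings are counted as errors.

    value_count(Obj) ->
        %% Identical DVV siblings deduplicate to a single element, so only
        %% genuinely different sibling values are counted.
        length(lists:usort(riak_object:get_values(Obj))).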
Fix a problem with the cacertdir specification in the replication_ssl test.
The code used to load cert files in v2 replication expects the path specified
by the cacertdir key to be a directory only. With v3 replication the
code used is flexible enough to allow either a directory or a file. Also
correct a typo in the certfile path for the SSLConfig1 configuration.
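Illustrative config fragment (paths are placeholders): under v2 replication the cacertdir key must point at a directory containing the CA certs, not at a single file.

    {riak_repl, [
        {ssl_enabled, true},
        {certfile,    "/certs/site1-cert.pem"},
        {keyfile,     "/certs/site1-key.pem"},
        %% v2 requires a directory here; v3 also accepts a file path
        {cacertdir,   "/certs/cacerts"}
    ]}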
* Do not attempt to cancel fullsync if the initial attempt to start
and wait for completion fails. The observed problem is not that
fullsync starts and fails to complete in time, but rather that the
initial call to start fullsync does not take effect. Therefore the
cancellation is unnecessary.
* Replace the call to repl_util:wait_for_connection/2 in the node
upgrade process with a call to
replication:wait_until_connection/1. This function is geared towards
v2 replication and should speed up test execution.
Replace use of a 40 second sleep in the test_supervision test case
with a wait condition to better handle variances in the time it takes
to progress through 10 retry attempts.
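A sketch of the replacement, with retry_count/1 standing in for however the test reads the supervisor's retry counter:

    %% Replaces timer:sleep(40000): poll until 10 retry attempts have been
    %% observed, however long that takes within rt:wait_until's own timeout.
    wait_for_retries(Node) ->
        rt:wait_until(fun() -> retry_count(Node) >= 10 end).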
AAE is disabled. (defect https://github.com/basho/riak_kv/issues/959)
- Adds additional console output to reset-current-env to explain
configuration and steps being executed
- Adds the -n option to the reset-current-env script to specify the
number of nodes to build. By default, 5 will be created.
As of commit 3044839456 tests that
return something other than the prescribed success atom 'pass' to
indicate success result in test failure. Change the
replication_upgrade and replication2_upgrade tests that return the
result of a call to lists:foreach/2 to instead return 'pass' to
indicate success.
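The shape of the fix, with make_cluster/0 and upgrade_and_verify/1 as stand-ins for the tests' existing setup and per-node logic:

    confirm() ->
        Nodes = make_cluster(),
        lists:foreach(fun(Node) -> upgrade_and_verify(Node) end, Nodes),
        %% lists:foreach/2 returns 'ok'; the harness now requires 'pass'.
        pass.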
Prior to this commit, the various riak_ensemble related tests would
manually enable the consensus system on one-and-only-one node in a
given cluster in order to work around issue basho/riak_core#571.
This commit changes the tests to work properly after the above issue
has been fixed.
In addition to removing the call to riak_ensemble_manager:enable()
that is now handled automatically by Riak, this commit also removes
a few wait_until_stable/2 checks against 1-node clusters. These
checks no longer apply, since Riak is now designed to only enable
the consensus system after the cluster contains at least 3 nodes.
Changed intercept to explicitly return `{error, econnrefused}`. Moved
helper functions to `repl_util` and added a new helper to distinguish
between disconnects on `cluster_by_name` and `cluster_by_address`
connections.
Added asserts to all wait_for functions.
ensemble_remove_node2 uses an intercept to prevent a riak_ensemble
related transition that is necessary for nodes to completely exit and
shutdown after removal. In fact, testing for this scenario is the
entire point of this test, since it is testing logic that was added to
solve basho/riak_core#572 and that logic prevents nodes from exiting
until that transition occurs.
However, even without this new logic, there is an unrelated
riak_ensemble related bug that can trigger a race condition that also
prevents nodes from shutting down.
The good news is that other changes made as part of the solution to
solve basho/riak_core#572 also fix this unrelated bug. Therefore this
commit extends ensemble_remove_node2 to remove the intercept at the
end of the test and verify that the removed nodes do actually end up
exiting as expected. Thus, the test now tests for both the negative
and positive scenarios and serves as a test against future regressions
that stall node removal/shutdown.
Add a test to verify that a bucket type is visible from a number of nodes,
since active status is reported as soon as the claimant sees it, but
requests to other nodes can still end up hitting the dreaded {error, no_type}.
Also add a general utility that can be used for bucket type checks and
for general verification of bucket properties across nodes.
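The utility amounts to a poll across all nodes, along the lines of the sketch below:

    wait_until_bucket_type_visible(Nodes, Type) ->
        rt:wait_until(fun() ->
            lists:all(fun(Node) ->
                %% undefined means this node cannot see the type yet.
                rpc:call(Node, riak_core_bucket_type, get, [Type]) =/= undefined
            end, Nodes)
        end).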
an undefined bucket type is specified. (defect #875)
- Adds a description of the reset-current-env.sh script and its
usage to README.md
- Corrects a spelling mistake in an information message emitted by
the reset-current-env.sh script
While r16b02-basho5 did not need the cacertfile path to be set, r16b03 did.
The test still passes on r16b02-basho5 with the added cacertfile line. Since
there is no harm in including it, it is better for forward compatibility to
set it.
Remove an assertion based on reading keys a single time after realtime
replication is re-enabled in the test. Instead just rely on the wait
condition that already followed the assertion to read and verify the
same keys.
There are different cert chains for pb_cipher_suites and http_security.
The certs were not fully cleaned between tests, so it would cause the
test to fail. Using a different directory for each test's certs
better isolates the tests.
* add repl_util:wait_until_fullsync_started/1
* add repl_util:wait_until_fullsync_stopped/1
* remove timeouts and use the above calls (sketched below) to confirm our
  test is in the right state
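A rough sketch of what such a helper can look like; the status key checked here is illustrative and not necessarily the one the real repl_util helper inspects, and wait_until_fullsync_stopped/1 is simply the inverse check:

    wait_until_fullsync_started(SourceLeader) ->
        rt:wait_until(fun() ->
            Status = rpc:call(SourceLeader, riak_repl_console, status, [quiet]),
            %% The status key checked here is illustrative.
            proplists:get_value(fullsyncs_running, Status, 0) > 0
        end).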
As of commit 3044839456 tests that
return something other than the prescribed success atom 'pass' to
indicate success result in test failure. Change tests that return the
atom 'ok' or some other value to instead return 'pass' to indicate
success.
This reverses an earlier change to support a feature that has been stripped
(for now). When said feature is put back in, it should support multi.
Setting this to allow_mult = true allows for more confidence in tests.
Resolve failures with cuttlefish configuration changes in Riak 2.0.
Remove riak_control_upgrade, since riak_control should cover those use
cases completely.
The verify_busy_dist_port helper function cause_bdp:spam_nodes/1
recently changed to be more aggressive in triggering busy_dist_port
warnings. The function changed to spawn 1 million processes to ensure
the test generated enough activity to trigger the warnings, but that
number of processes exceeds the 256 thousand process limit that is the
Riak default. One consequence of this can be that the rex server
responsible for handling rpc calls crashes. In some cases this causes the
rpc calls riak_test makes to shut down the Riak nodes involved in the
test to hang indefinitely. This change reduces the number of processes
spawned to 200 thousand. This should still be enough processes to
trigger the busy_dist_port warnings, but without exceeding the beam
process limit.
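Not the actual cause_bdp:spam_nodes/1 implementation, but a sketch of the idea: remotely spawn on the order of 200 thousand short-lived senders per node, enough traffic over the distribution channel to provoke busy_dist_port warnings while staying under the default 256 thousand process limit.

    spam_nodes(TargetNodes) ->
        [spawn(Node, fun() ->
             %% Each process pushes a message to every other node so the
             %% distribution buffers fill up and busy_dist_port fires.
             [{busy_dist_port_sink, Other} ! ping
              || Other <- TargetNodes, Other =/= Node]
         end) || Node <- TargetNodes, _ <- lists:seq(1, 200000)].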
Prior to Riak 1.4.8, replication registers as a service before
completing all of its initialization tasks, including establishing realtime
connections to sink clusters. This leads to a race condition in the
replication_upgrade and replication2_upgrade tests where the test may
begin writing data to the source cluster to verify the function of
realtime replication before the most recently upgraded node
establishes a connection to the sink cluster. The result of this is
that the data is silently discarded by the realtime replication system
and the test fails because all of the expected data is not replicated
and able to be read on the sink cluster. Change the
replication_upgrade and replication2_upgrade tests to explicitly wait
for the realtime connection to be established after each source
cluster node is upgraded before proceeding with the test.
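The per-node upgrade step now looks roughly like the sketch below (rt:upgrade/2 and rt:wait_for_service/2 shown for context; the essential addition is the wait on the realtime connection):

    upgrade_and_wait(Node, SinkClusterName) ->
        rt:upgrade(Node, current),
        rt:wait_for_service(Node, [riak_kv, riak_repl]),
        %% Do not start writing verification data until the upgraded node
        %% has re-established its realtime connection to the sink cluster.
        repl_util:wait_for_connection(Node, SinkClusterName).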
Verify that upgrading Riak with Bitcask to 2.0 or later will trigger
an upgrade mechanism that will end up merging all existing bitcask
files. This is necessary so that old style tombstones are reaped,
which might otherwise stay around for a very long time. This version
writes tombstones that can be safely dropped during a merge. Bitcask
could resurrect old values easily when reaping tombstones during a
partial merge if a restart happened later.
Establish a new PB connection to the legacy node after it is upgraded
in order to avoid a failure. The PB connection may close if the node
upgrade takes too long, and reusing it in that case can lead to test
failure because operations on the stale pid return {error, disconnected}.
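The shape of the change, using rt:pbc/1 to open a fresh connection once the upgraded node is serving riak_kv again:

    reconnect_after_upgrade(LegacyNode, Version) ->
        rt:upgrade(LegacyNode, Version),
        rt:wait_for_service(LegacyNode, riak_kv),
        %% Open a fresh PB connection; operations on the pre-upgrade pid
        %% may now only return {error, disconnected}.
        rt:pbc(LegacyNode).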
When performing the test of object reformatting through replication,
assert that if we happen to downgrade the format we can still read the
keys which have been replicated.
Wait for transfers to complete in
replication2_pg:test_pg_proxy. Replication tests that test the n_val=1
request option can fail with insufficient_vnodes errors if the cluster
setup does not include waiting for transfers to complete. Change the
test_pg_proxy test case to wait until transfers complete on the "A"
and "B" clusters before proceeding.
Fix an error that can lead to failure of tests using
replication2_pg:test_pg_proxy test case. A protocol buffers connection
is established to a node in the "B" cluster, the leader node from that
cluster is shut down, and then that protocol buffers connection is
used to exercise proxy_get. If the connection was established to the
former leader, which is subsequently shut down, it can cause the test
to stall and eventually fail. This changes the test to establish a
new connection to a node remaining in the "B" cluster to use for the
proxy_get and prevents the test from stalling.