even in the rare and pathological case where the cluster is partitioned before all 3 nodes
have received the update. riakc_flag:disable(F) requires context, which isn’t there in the
new map that would be created on the side of the partition with no data.
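For reference, a minimal sketch of the flag-disable path, assuming the usual riakc_pb_socket / riakc_map API: the context that riakc_flag:disable/1 needs comes from fetching the existing map, which is exactly what the empty side of the partition cannot provide. The field name below is illustrative.

```erlang
%% Sketch only: disabling a flag inside a map with the Erlang client.
%% The causal context travels with the fetched map; a map built
%% client-side on the empty partition has no context to offer.
disable_flag(Pid, BucketAndType, Key) ->
    {ok, Map} = riakc_pb_socket:fetch_type(Pid, BucketAndType, Key),
    Map2 = riakc_map:update({<<"enabled">>, flag},
                            fun(F) -> riakc_flag:disable(F) end,
                            Map),
    riakc_pb_socket:update_type(Pid, BucketAndType, Key,
                                riakc_map:to_op(Map2)).
```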
This test is a little confused in the sources as-is: its output reads as
if the check is based on the number of requests, even though the actual
comparison is against a function of THRESHOLD. I've reverted to the
comparison currently in use, since it looks to me like this test should
really expect ~NUM_REQUESTS processes and a vnode queue pretty close to
THRESHOLD. I'd appreciate review here, though, particularly if anyone
recalls the original intent of these comparison numbers.
Previously we used a sort of fuzzy 'metric' where we expected the
number of successful requests/fsms to be less than some fudge
factor over the overload threshold. This tends to kick up spurious
failures on the test board without offering much more in the way
of assurance about overload's functionality.
This change instead bases test success on the number of requests
only, not the threshold — if some amount of work was shed at all
we consider that a passing test.
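Roughly, the new pass condition amounts to the sketch below (names are illustrative, not the test's actual bindings):

```erlang
%% Illustrative only: pass as long as some work was shed, i.e. fewer
%% request FSMs / queued vnode messages than requests issued.
assert_some_work_shed(NumRequests, NumFsms, VnodeQueueLen) ->
    true = NumFsms < NumRequests,
    true = VnodeQueueLen < NumRequests,
    ok.
```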
In the future we should revisit this and change the request
accounting machinery to just explicitly track denials instead of
fsm processes / vnode queue depth.
In the case that no advanced.config file exists (every case!), rt
would not add any advanced config settings to the conf.
This PR teaches rtdev to create an advanced.config file if none exists
so that tests may set advanced config.
In this case we set ring_size and also the `crdt_mixed_versions` app env.
Still have not completed upgrade and feature-flag switch.
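A sketch of the kind of per-test setup this enables; the app/key placement below (riak_core/ring_creation_size for the ring size, riak_kv for `crdt_mixed_versions`) is an assumption, not lifted from this PR:

```erlang
%% Illustrative: per-test advanced config, now written out by rtdev even
%% when no advanced.config file previously existed.
setup_cluster() ->
    Config = [{riak_core, [{ring_creation_size, 16}]},
              {riak_kv,   [{crdt_mixed_versions, true}]}],
    rt:build_cluster(4, Config).
```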
I changed the versions from atoms to "2.0.2" and "2.0.4"; we can
bikeshed that with the build/test czars on Monday.
Added some useful logging statements to the plain-upgrade test.
Removed unnecessary clean_cluster and systest_read calls.
This test is getting big, and there is still a lot to add
(see comments at the end of the test).
Maybe we should break it out into a few tests; there are also some
open questions.
* Checks that the list of stats keys returned from the HTTP endpoint
is complete, delineating between riak and riak_ee. The test will
fail if the list returned from the HTTP endpoint does not exactly match
the expected list. This behavior acts as a forcing function to ensure
that the expected list is properly maintained as stats are added and
removed (a sketch of this check follows the list).
* Modifies reset-current-env to properly clean dependencies when a
full clean is requested and remove the current directory in the
target test instance.
* Adds logging to verify_riak_stats to explain the additional steps
being performed
* Adds rt:product/1 to determine whether a node is running riak,
riak_ee, or riak_cs
* Adds tools.mk support and eunit scaffolding to rebar.config
* Modifies reset-current-env.sh to remove the current directory in
the target test instance
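A sketch of the exact-match stats check from the first item above (variable and helper names are illustrative):

```erlang
%% Illustrative: the expected key list must match the endpoint's output
%% exactly, so adding or removing a stat forces an update to the list.
verify_stat_keys(Stats, ExpectedKeys) ->
    ActualKeys = lists:sort([K || {K, _V} <- Stats]),
    ExpectedSorted = lists:sort(ExpectedKeys),
    ExpectedSorted = ActualKeys,   %% badmatch (test failure) on mismatch
    ok.
```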
Because list-keys and list-buckets use coverage, we might hit latent
replicas depending on the coverage plan. This gives each call some extra
tries to complete successfully.
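For example, a retry wrapper along these lines (riakc_pb_socket:list_keys/2 and rt:wait_until/1 are existing APIs; the helper name is illustrative):

```erlang
%% Re-run list-keys until the coverage plan lands on replicas that have
%% the expected data.
wait_for_list_keys(Pid, Bucket, ExpectedCount) ->
    rt:wait_until(fun() ->
        {ok, Keys} = riakc_pb_socket:list_keys(Pid, Bucket),
        length(Keys) =:= ExpectedCount
    end).
```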
It was previously possible for the 'minority' network partition to
become the majority network partition by a naive network partitioning
strategy. Previously, when a preference list of 5 keyspace partitions
was created on only four distinct nodes, it became possible for a 2 node
'minority' network partition group to actually have a majority of
keyspace partitions because 2 keyspace partitions were assigned to 1
node in the 'minority' group. This was fixed so that the 'majority'
group now always has a majority of keyspace partitions by preventing
nodes with greater than 1 keyspace partition from becoming part of the
'minority' group.
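The gist of the fix, as a sketch (the preference list is the usual list of {Index, Node} pairs; the helper name is illustrative):

```erlang
%% Only nodes owning exactly one keyspace partition in the preflist are
%% eligible for the 2-node 'minority' side, so the 'majority' side is
%% guaranteed to keep a majority of the keyspace partitions.
eligible_minority_nodes(PrefList) ->
    Counts = lists:foldl(fun({_Idx, Node}, Acc) ->
                             orddict:update_counter(Node, 1, Acc)
                         end, orddict:new(), PrefList),
    [Node || {Node, 1} <- orddict:to_list(Counts)].
```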
Now green when run in sequence:
Test Results:
pb_cipher_suites-bitcask: pass
pb_security-bitcask : pass
---------------------------------------------
0 Tests Failed
2 Tests Passed
That's 100.0% for those keeping score
This changes the test assertion so that it retries fetching the value
from the second cluster until it is the expected value, at which point
the test will either pass if the sibling count is reasonable or fail if
it is too damn high.
Fetch the sink object on each iteration of the wait_until, just in case
the entire set of siblings didn't make it across the repl link.
This also gives read-repair a chance to happen, in case the version the
sink wrote didn't make it to all replicas.
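Concretely, the wait is of the form sketched below (the helper name is illustrative; the rt and riakc calls are existing APIs):

```erlang
%% Re-fetch from the sink until the expected value is among the
%% siblings, giving repl and read-repair time to catch up.
wait_for_sink_value(Pid, Bucket, Key, ExpectedVal) ->
    rt:wait_until(fun() ->
        case riakc_pb_socket:get(Pid, Bucket, Key) of
            {ok, Obj} -> lists:member(ExpectedVal, riakc_obj:get_values(Obj));
            _Other    -> false
        end
    end).
```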
Trying to use the repl features before newly started nodes have
riak_repl completely initialized leads to all sorts of nasty crashes and
noise. Frequently it leaves fullsync stuck forever, which makes a lot of
the tests fail.
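A minimal sketch of the guard this implies, using riak_test's existing service wait (whether the fix uses this exact helper is an assumption):

```erlang
%% Block until riak_kv and riak_repl report ready on every node before
%% driving any repl operations.
wait_for_repl_ready(Nodes) ->
    [rt:wait_for_service(N, [riak_kv, riak_repl]) || N <- Nodes],
    ok.
```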
This also tweaks the AAE fullsync tests to remove assumptions about
failure stats when AAE transient errors occur. The behavior in the
handling of those errors has changed recently with the introduction of
soft exits.
Reduces the probability of a race condition between the calculation of
spiral/histogram metrics and the SNMP stat cache refresh by reducing the
SNMP poll interval to 1 second during test execution.
Don't wait for convergence of the ring, because bucket properties are no
longer stored in the ring; instead, wait until the property changes,
which means the gossip has stabilized.
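A sketch of the property wait, polling each node over rpc (the helper name is illustrative; riak_core_bucket:get_bucket/1 and rt:wait_until/2 are existing APIs):

```erlang
%% Poll bucket properties on each node until the changed property is
%% visible, instead of waiting on ring convergence.
wait_for_bucket_prop(Nodes, Bucket, Prop, Value) ->
    [rt:wait_until(N, fun(Node) ->
         Props = rpc:call(Node, riak_core_bucket, get_bucket, [Bucket]),
         is_list(Props) andalso proplists:get_value(Prop, Props) =:= Value
     end) || N <- Nodes],
    ok.
```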
Add an rt:admin/3 function that accepts a list of options as the third
parameter. Currently the only valid option is return_exit_code. The
rtdev, rtssh, and rt_cs_dev harnesses have been updated to support
this option. If return_exit_code is specified, the return from a
?HARNESS:admin call is a pair with the exit code as the first member
and the command output as the second member. Finally, the
basic_command_line test has been changed to use return_exit_code to
verify the changes.
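Usage then looks roughly like this sketch (the command and assertion are illustrative):

```erlang
%% With return_exit_code, rt:admin/3 returns {ExitCode, Output} instead
%% of just the command output.
check_admin_status(Node) ->
    {ExitCode, _Output} = rt:admin(Node, ["status"], [return_exit_code]),
    0 = ExitCode,
    ok.
```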
Due to the refactor of the cluster manager/connection manager system to
use OTP behaviours, the raw-message method of getting stats has been
ousted; it now uses a call. To allow riak_test to check older clusters
as well as newer ones, the function was extended to try the new method
first and then fall back to the old one.
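In outline, the extended function does something like the sketch below; the registered name and message used here are placeholders, not the cluster manager's real interface:

```erlang
%% Try the post-refactor gen_server call first; fall back to the legacy
%% path when talking to an older node.
get_cluster_mgr_stats(Node) ->
    case rpc:call(Node, gen_server, call, [cluster_mgr_placeholder, status]) of
        {badrpc, _Reason} -> legacy_raw_stats(Node);
        Status            -> Status
    end.

%% Placeholder standing in for the old raw-message stats request.
legacy_raw_stats(_Node) ->
    {error, legacy_path_not_shown}.
```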
Add testing of the handoff heartbeat change from the following pull
request: https://github.com/basho/riak_core/pull/560. Add an intercept
module for the riak_core_handoff_sender module to introduce artificial
delay on item visitation during a handoff fold. This delay along with
the changes to the verify_handoff test induces test failure when run
without the heartbeat change. The handoff_receive_timeout is exceeded,
handoff stalls, and the test eventually fails due to timeout. The test
succeeds when run with the heartbeat change.
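The intercept is along these lines, following riak_test's usual intercept layout; the intercepted function name below is an assumption about the sender's fold callback, not copied from the change:

```erlang
-module(riak_core_handoff_sender_intercepts).
-compile(export_all).
-include("intercept.hrl").
-define(M, riak_core_handoff_sender_orig).

%% Delay each visited item so that, without the heartbeat change, the
%% handoff_receive_timeout is exceeded and handoff stalls.
delayed_visit_item_3(K, V, Acc) ->
    timer:sleep(1000),
    ?M:visit_item_orig(K, V, Acc).
```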
Ensemble_ring_changes tests writing a value, expanding the cluster, then
updating and reading that value after ring expansion has completed. It
also creates a bucket using a bucket type with a different n_val from
the default bucket type. The latter tests basho/riak_kv#1008 and its
corresponding riak_core PR.
Use riak_test_runner:metadata/0 to get the configured backend instead of
defaulting to bitcask. Additionally we use rt:clean_data_dir/2 to safely
remove backend directories.
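A sketch of the pattern; the metadata key and the backend-to-directory mapping are illustrative assumptions:

```erlang
%% Resolve the backend from the runner metadata, defaulting to bitcask,
%% then clean only that backend's data directories.
clean_backend_data(Nodes) ->
    Backend = proplists:get_value(backend, riak_test_runner:metadata(), bitcask),
    rt:clean_data_dir(Nodes, data_dir_for(Backend)).

%% Illustrative mapping from backend name to on-disk directory.
data_dir_for(bitcask)  -> "bitcask";
data_dir_for(eleveldb) -> "leveldb";
data_dir_for(_Other)   -> "".
```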
This is the first iteration of byzantine dataloss tests that show both
recoverable errors and unrecoverable but detectable errors. The test
covers the following scenarios.
* Lose one partition worth of data, but no synctrees and recover.
* Lose all but one partition of ensemble data, but no synctrees and
recover.
* Lose minority of synctrees. Only the peers with the missing
synctrees are restarted. System remains available.
* Loss of majority of synctrees. Majority peers are restarted. System
recovers when they all come back online.
* Loss of majority of synctrees with one node partitioned. All peers
restarted except partitioned one. System does not recover with that
node partitioned. When the partition is healed the system recovers.
* Loss of all data and synctree except on one peer recovers.
* Backing up and restoring old data but not synctrees results in
detected errors. Restoring newer data fixes this.
* Delete all data on all nodes, but not synctrees. This is detected and
  an error is returned to the user.
Change the ACL test case in the replication_ssl and replication2_ssl
tests to use certificates generated within the tests instead of
relying on certificates created outside the test that are prone to
expire and cause spurious test failure.
Also change the replication_ssl and replication2_ssl tests to avoid a
cycle of standing up the test clusters and then immediately restarting
them before any test cases execute. This should make the test
execution slightly faster for both test modules.
This commit also changes the tests to be a bit more robust in checking
for cluster state when restarting nodes and removes an unnecessary
five second sleep call in the replication_ssl test.
Change replication_ssl to use the wait_for_site_ips function from the
replication module introduced in
297090ded6 instead of the defunct
verify_site_ips function.
Avoid a race condition in the replication test module when checking
for site IP addresses in the replication status output. The test
waits for a connection on the leader, but it only queries the
replication status to check for the expected site IP addresses a
single time. Change the test to wait and re-check the status output to
give greater assurance that if the expected site IP addresses are not
present it is due to legitimate failure and not a race condition in
checking the replication status. This change affects the replication
and replication_upgrade tests as well as any other tests that call the
replication:replication function.
Prevent a situation where auto-reconnect hasn't triggered yet, causing
the next operation after reconnecting to return an error instead of ok.
Force a disconnect and reconnect to make sure the test is deterministic.
Change to fetching the list of peers first, then checking whether the
riak_kv service is up; only if the service is up are the peers checked.
Otherwise it is possible to see the service down and then the peers up,
because the service came up in the interim.
Also, make the KV vnode delay configurable.
Now ensemble peers are prevented from starting up until the riak_kv
service is up to avoid nasty races that could even lead to node crashes
as the ensembles frantically query for data that isn't ready.
Re-initiate fullsync after 100 failed checks for completion. The
number of retries of the 'start fullsync and then check for
completion' cycle is configurable using
repl_util:start_and_wait_until_fullsync_complete/4 and defaults to 20
retries. This change is to avoid spurious test failures due to a rare
condition where the rpc call to start fullsync fails to actually
initiate the fullsync. A very similar change to the version of
start_and_wait_until_fullsync_complete in the replication module,
introduced in 0a36f9974c, has had good
effect at avoiding this condition for v2 replication tests.
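The retry shape, reduced to a generic sketch (the real code lives in repl_util:start_and_wait_until_fullsync_complete/4; the function arguments here are placeholders):

```erlang
%% Start fullsync, poll for completion, and re-initiate the whole cycle
%% when the completion check fails, up to Retries times.
fullsync_with_retries(_StartFun, _IsCompleteFun, 0) ->
    {error, retries_exhausted};
fullsync_with_retries(StartFun, IsCompleteFun, Retries) ->
    ok = StartFun(),
    case IsCompleteFun() of
        true  -> ok;
        false -> fullsync_with_retries(StartFun, IsCompleteFun, Retries - 1)
    end.
```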
Part of the condition checking done in the replication_object_reformat
test is to validate the results of a fullsync using
repl_util:validate_completed_fullsync/6. The way in which the
function is called from the test expects fullsync to complete with 0
error_exit or retry_exit conditions occurring. This requires that the
sink cluster be in a steady state with all partitions available. The test
failed to wait for such conditions to occur and instead relied on
performing a node downgrade asynchronously and waiting for up to 60
seconds for a completion message before continuing with the test. The
test was continually failing after a node was downgraded to `previous`
due to partitions being reported as `down` on that node. To resolve
the issue the node downgrade process is now done in the primary test
process instead of in a separate spawned process. After the version
downgrade is complete, the test now waits for the riak_repl and the
riak_kv services, calls rt:wait_until_nodes_ready/1, calls
rt:wait_until_no_pending_changes/1, and finally waits for the
riak_repl2_fs_node_reserver named process to be registered on the
downgraded node. This process is responsible for handling partition
reservation requests and is key to determining that the new node is
able to handle a fullsync without partition errors.
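The final wait is of roughly this shape (the helper name is illustrative):

```erlang
%% Poll the downgraded node until riak_repl2_fs_node_reserver is
%% registered, i.e. the node can accept partition reservations.
wait_for_node_reserver(Node) ->
    rt:wait_until(Node, fun(N) ->
        is_pid(rpc:call(N, erlang, whereis, [riak_repl2_fs_node_reserver]))
    end).
```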
Update the calls to rt:systest_read in repl_util and
repl_aae_fullsync_util to treat identical siblings resulting from the
use of DVV as a single value. These changes are specifically to
address failures seen in the repl_aae_fullsync_custom_n and
replication_object_reformat tests, but should be generally useful for
replication tests that use the utility modules and have
allow_mult set to true.
Fix a problem with the cacertdir specification in the replication_ssl
test. The code used to load cert files in v2 replication expects the
path specified by the cacertdir key to be a directory only. With v3
replication the code used is flexible enough to allow a directory or a
file. Also correct a typo in the certfile path for the SSLConfig1
configuration.
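For reference, the v2 configuration needs a shape like the sketch below; the paths and the ssl_enabled key are illustrative rather than copied from the test:

```erlang
%% cacertdir must name a directory under v2 replication; certfile and
%% keyfile point at the individual PEM files.
ssl_config() ->
    [{riak_repl, [{ssl_enabled, true},
                  {certfile,  "/tmp/certs/site1-cert.pem"},
                  {keyfile,   "/tmp/certs/site1-key.pem"},
                  {cacertdir, "/tmp/certs/cacerts"}]}].
```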
* Do not attempt to cancel fullsync if the initial attempt to start
  and wait for completion fails. The observed problem is not that
  fullsync starts and fails to complete in time, but rather that the
  initial call to start fullsync does not take effect. Therefore the
  cancellation is unnecessary.
* Replace the call to repl_util:wait_for_connection/2 in the node
upgrade process with a call to
replication:wait_until_connection/1. This function is geared towards
v2 replication and should speed up test execution.
Replace use of a 40 second sleep in the test_supervision test case
with a wait condition to better handle variances in the time it takes
to progress through 10 retry attempts.
AAE is disabled. (defect https://github.com/basho/riak_kv/issues/959)
- Adds additional console output to reset-current-env to explain
configuration and steps being executed
- Adds the -n option to the reset-current-env script to specify the
number of nodes to build. By default, 5 will be created.
As of commit 3044839456, tests that
return something other than the prescribed success atom 'pass' to
indicate success are treated as failures. Change the
replication_upgrade and replication2_upgrade tests, which returned the
result of a call to lists:foreach/2, to instead return 'pass' to
indicate success.
Prior to this commit, the various riak_ensemble related tests would
manually enable the consensus system on one-and-only-one node in a
given cluster in order to work around issue basho/riak_core#571.
This commit changes the tests to work properly after the above issue
has been fixed.
In addition to removing the call to riak_ensemble_manager:enable()
that is now handled automatically by Riak, this commit also removes
a few wait_until_stable/2 checks against 1-node clusters. These
checks no longer apply, since Riak is now designed to only enable
the consensus system after the cluster contains at least 3 nodes.
Changed intercept to explicitly return `{error, econnrefused}`. Moved
helper functions to `repl_util` and added a new helper to distinguish
between disconnects on `cluster_by_name` and `cluster_by_address`
connections.
Added asserts to all wait_for functions.
ensemble_remove_node2 uses an intercept to prevent a riak_ensemble
related transition that is necessary for nodes to completely exit and
shutdown after removal. In fact, testing for this scenario is the
entire point of this test, since it is testing logic that was added to
solve basho/riak_core#572 and that logic prevents nodes from exiting
until that transition occurs.
However, even without this new logic, there is an unrelated
riak_ensemble related bug that can trigger a race condition that also
prevents nodes from shutting down.
The good news is that other changes made as part of the solution to
solve basho/riak_core#572 also fix this unrelated bug. Therefore this
commit extends ensemble_remove_node2 to remove the intercept at the
end of the test and verify that the removed nodes do actually end up
exiting as expected. Thus, the test now tests for both the negative
and positive scenarios and serves as a test against future regressions
that stall node removal/shutdown.
Adding a test to verify that a bucket type is visible from a number of
nodes, since active status is reported as soon as the claimant sees it,
but requests to other nodes can still end up hitting the dreaded
{error, no_type}.
Also added a general utility that can be used for bucket type checks and
for general verification of bucket properties across nodes.
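The per-node check in that utility looks roughly like this sketch (the helper name is illustrative; riak_core_bucket_type:get/1 and rt:wait_until/2 are existing APIs):

```erlang
%% The type must be visible from every node, not just the claimant that
%% reported it active; undefined here corresponds to {error, no_type}
%% at the client API level.
wait_until_type_visible(Nodes, Type) ->
    [rt:wait_until(N, fun(Node) ->
         rpc:call(Node, riak_core_bucket_type, get, [Type]) =/= undefined
     end) || N <- Nodes],
    ok.
```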
an undefined bucket type is specified. (defect #875)
- Adds a description of the reset-current-env.sh script and its
usage to README.md
- Corrects a spelling mistake in an information message emitted by
the reset-current-env.sh script
While r16b02-basho5 did not need the cacertfile path put in, r16b03 does.
The test still passes on r16b02-basho5 with the added cacertfile line.
Since there is no harm in putting it in, it is better to include it for
forwards compatibility.