As of commit 3044839456, tests that indicate success by returning
something other than the prescribed success atom 'pass' now result in
test failure. Change tests that return the atom 'ok' or some other
value to instead return 'pass' to indicate success.
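A minimal sketch of the expected shape (the module name here is illustrative):

    -module(example_test).
    -export([confirm/0]).

    %% riak_test invokes confirm/0; returning the atom 'pass' (rather
    %% than 'ok') is what now counts as success.
    confirm() ->
        %% ... exercise the cluster here ...
        pass.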
This reverses an earlier change to support a feature that has been stripped
(for now). When said feature is put back in, it should support allow_mult.
Setting allow_mult = true allows for more confidence in tests.
Resolve failures with cuttlefish configuration changes in Riak 2.0.
Remove riak_control_upgrade, since riak_control should cover those use
cases completely.
The verify_busy_dist_port helper function cause_bdp:spam_nodes/1
recently changed to be more aggressive in triggering busy_dist_port
warnings. The function was changed to spawn 1 million processes to
ensure the test generated enough activity to trigger the warnings, but
that number of processes exceeds the 256 thousand process limit that
is the Riak default. One consequence of this is that the rex server
responsible for handling rpc calls can crash. In some cases this
causes the rpc calls riak_test makes to shut down the Riak nodes
involved in the test to hang indefinitely. This change reduces the
number of processes spawned to 200 thousand. This should still be
enough processes to trigger the busy_dist_port warnings without
exceeding the beam process limit.
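A rough sketch of the reduced spawn count; the spamming body here is
illustrative, not the actual cause_bdp implementation:

    -module(cause_bdp_sketch).
    -export([spam_nodes/1]).

    -define(NUM_PROCS, 200000).  %% stays below the 256k default beam process limit

    spam_nodes(TargetNodes) ->
        %% Spawn enough senders per target node to trigger busy_dist_port
        %% warnings without exhausting the VM's process table.
        [spawn(fun() -> rpc:cast(Node, erlang, node, []) end)
         || Node <- TargetNodes, _ <- lists:seq(1, ?NUM_PROCS)].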
Prior to Riak 1.4.8, replication registers as a service before
completing all of its initialization tasks, including establishing
realtime connections to sink clusters. This leads to a race condition
in the replication_upgrade and replication2_upgrade tests where the
test may begin writing data to the source cluster to verify realtime
replication before the most recently upgraded node has established a
connection to the sink cluster. The result is that the data is
silently discarded by the realtime replication system and the test
fails because not all of the expected data is replicated and readable
on the sink cluster. Change the replication_upgrade and
replication2_upgrade tests to explicitly wait for the realtime
connection to be established after each source cluster node is
upgraded before proceeding with the test.
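A sketch of the per-node wait, assuming the repl_util:wait_for_connection/2
helper used elsewhere in the replication tests:

    upgrade_and_wait(Node, NewVersion) ->
        %% Upgrade the source node, wait for it to come back, then block
        %% until its realtime connection to the sink cluster "B" is
        %% re-established before writing any verification data.
        rt:upgrade(Node, NewVersion),
        rt:wait_until_pingable(Node),
        repl_util:wait_for_connection(Node, "B").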
Verify that upgrading Riak with Bitcask to 2.0 or later triggers an
upgrade mechanism that ends up merging all existing Bitcask files.
This is necessary so that old-style tombstones are reaped, which might
otherwise stay around for a very long time. This version writes
tombstones that can be safely dropped during a merge. With the old
style, Bitcask could easily resurrect old values when reaping
tombstones during a partial merge if a restart happened later.
Establish a new PB connection to the legacy node after it is upgraded
in order to avoid a failure. The PB connection may close if the node
upgrade takes too long, and reusing it in that case can lead to test
failure because operations on the pid return {error, disconnected} errors.
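A minimal sketch of the reconnect, using the rt:pbc/1 helper:

    reconnect_after_upgrade(OldPid, Node, NewVersion) ->
        %% Drop the pre-upgrade PB client and open a fresh one once the
        %% upgraded node is serving riak_kv again.
        riakc_pb_socket:stop(OldPid),
        rt:upgrade(Node, NewVersion),
        rt:wait_for_service(Node, riak_kv),
        rt:pbc(Node).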
When performing the test of object reformatting through replication,
assert that if we happen to downgrade the format we can still read the
keys which have been replicated.
Wait for transfers to complete in
replication2_pg:test_pg_proxy. Replication tests that test the n_val=1
request option can fail with insufficient_vnodes errors if the cluster
setup does not include waiting for transfers to complete. Change the
test_pg_proxy test case to wait until transfers complete on the "A"
and "B" clusters before proceeding.
Fix an error that can lead to failure of tests using the
replication2_pg:test_pg_proxy test case. A protocol buffers connection
is established to a node in the "B" cluster, the leader node from that
cluster is shut down, and then that protocol buffers connection is
used to exercise proxy_get. If the connection was established to the
former leader and that node is subsequently shut down, the test can
stall and eventually fail. This change makes the test establish a new
connection to a node remaining in the "B" cluster to use for the
proxy_get, which prevents the test from stalling.
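A sketch of picking a surviving "B" cluster node for the proxy_get connection:

    pb_to_survivor(BNodes, StoppedLeader) ->
        %% Connect to a node that is still running in cluster "B" after
        %% the old leader has been shut down.
        [Survivor | _] = BNodes -- [StoppedLeader],
        rt:pbc(Survivor).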
Once riak_ensemble_manager:enable() is called, we need to call
riak_core_ring_manager:force_update() so that the members will be
created and added to the ensembles trying to get a quorum. During ticks
in riak_core, new members are created only if the ring has changed. There
is a race that can sometimes prevent the members from starting, and thus
prevent the quorum from ever being achieved. This small change to the test
infrastructure resolves the issue, but it still requires a fix in
riak_core and/or riak_kv.
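A sketch of the test-side workaround described above:

    enable_ensembles(Node) ->
        %% Force a ring update right after enabling ensembles so the
        %% riak_core tick sees a changed ring and creates the members.
        rpc:call(Node, riak_ensemble_manager, enable, []),
        rpc:call(Node, riak_core_ring_manager, force_update, []).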
repl_consistent_object_filter calls riak_ensemble_manager:enable() which
fails to bootstrap the ensemble because the ring has stabilized already.
An issue for this will be opened in riak_kv, but this quick fix will
allow the test to get beyond that point.
Add ensemble_basic4, ensemble_sync, and ensemble_interleave tests.
ensemble_sync tests the new AAE-based peer syncing logic. The test
checks various scenarios with different levels of data corruption.
ensemble_interleave tests a specific scenario where two peers become
corrupted one after the other. This tests the scenario where the
second peer becomes untrusted while the first peer may be syncing
with it.
As of Riak 2.0 the vm.args zdbbl setting defaults to 32768. Previously
the default of 1024 was used. Change the cause_bdp helper module for
the verify_busy_dist_port test to be more aggressive in order to
trigger a busy_dist_port message with the higher zdbbl setting.
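For reference, the new default corresponds to this vm.args entry (the
value is in kilobytes):

    +zdbbl 32768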
There is a race condition that can cause the force-replace test case
in the verify_dynamic_ring test to fail. This issue is being tracked
as riak_core issue #570. This change replaces the force-replace
testing with another resize test. Once issue #570 is resolved this
change can be reverted.
Add ensemble_basic, ensemble_basic2, and ensemble_basic3 tests.
These tests verify that Riak correctly generates proper consensus
groups, that the groups reach quorum, that they handle leader
failures, etc. ensemble_basic3 tests the basic consistent K/V API as
well as behavior during simple network partitions.
Fix some race conditions in the cluster leader helper functions. Also
re-initiate fullsync after a certain number of checks for
completion. V2 replication has problems where calling
riak_repl_console:start_fullsync is basically ignored and needs to be
retried.
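A rough sketch of the retry loop; fullsync_complete/1 is a hypothetical
stand-in for whatever status check the helper performs:

    wait_for_fullsync(Leader) ->
        %% Poll fullsync status; if it has not completed yet, re-issue
        %% start_fullsync, since v2 replication may ignore the first call.
        rt:wait_until(fun() ->
            Status = rpc:call(Leader, riak_repl_console, status, [quiet]),
            case fullsync_complete(Status) of
                true ->
                    true;
                false ->
                    rpc:call(Leader, riak_repl_console, start_fullsync, [[]]),
                    false
            end
        end).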
Improve the reliability of the proxy_get test by asserting that
ownership transfer completes before killing the leader node.
Related to basho/riak_repl#352.
Change some of the helper functions in the repl_util module to handle
errors more sensibly so that cluster setup race conditions do not
cause unnecessary test failures.
* Switch to a user name of "Hindi" in Devanagari to test protobuf unicode
* Turn "user" into a variable in http_security so we can test utf-8 usernames when we fix HTTP support for authentication of same
Well, that's not true. They break riak_kv's context operations on Maps.
This change works around that breakage by turning the context off for
the operations in this test. It is a temporary measure; when the context
fix work has been done, we'll change it back.
The heartbeat timeout enforcement was recently updated to be specified
in seconds, matching the documentation for that option. The
repl_rt_heartbeat test has been failing ever since because it still
specified the timeout in milliseconds. This change makes the test use
seconds for the heartbeat timeout, which gets the test passing again.
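A sketch of setting the heartbeat in seconds; the riak_repl app env key
names here are assumptions, not verified against the repl code:

    %% 15 now means 15 seconds, not 15000 milliseconds.
    Conf = [{riak_repl, [{rt_heartbeat_interval, 15},
                         {rt_heartbeat_timeout, 15}]}],
    Nodes = rt:deploy_nodes(2, Conf).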
The wait_until_leader_converge function could incorrectly return
success if all of the results from the get_leader rpc calls were
either undefined or badrpc tuples. In either case the failure value
ends up as the sole unique value in a list, and the success condition
only verifies that the list has length 1, regardless of the value of
its member. Change the function to filter out results that indicate
failure before checking the success condition.
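A sketch of the filtering idea; get_leader/1 stands in for whatever rpc
the helper issues:

    leaders_converged(Nodes) ->
        %% Drop failure results first, so a list consisting only of
        %% 'undefined' or {badrpc, _} values no longer counts as success.
        Results = [get_leader(Node) || Node <- Nodes],
        Valid = [R || R <- Results, R =/= undefined, not is_badrpc(R)],
        length(Valid) =:= length(Nodes) andalso
            length(lists:usort(Valid)) =:= 1.

    is_badrpc({badrpc, _}) -> true;
    is_badrpc(_) -> false.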
This test verifies that AAE repairs replicas of values without relying
on passive read repair. This includes missing replicas and replicas
with divergent values. It will also repair entirely lost KV partitions,
and, if configured to rebuild trees, it will recover from AAE data loss
and corruption.
This version differs from the original 1.4 test only in the handling of
siblings. It does a get before each put for modifications and merges
values by choosing the longest one, since modifications in this test
append bits.
Replication of consistent objects is not currently supported. Add a
test to ensure that fullsync replication filters these objects. No
testing is necessary for realtime replication at this time because the
postcommit hook mechanism it uses is not invoked in the consistent
object code path.
cluster_meta_basic has been intermittently failing [1]. This commit
includes two improvements, the second of which addresses this
intermittent failure.
The first change modifies the test to "wait_until_object_count"
instead of reading the object count at a given moment and getting a
possibly stale, or soon to be updated, value. This alone does not
cause the test to pass reliably. However, it highlights that the
underlying race condition is one where the object count will never
reach the expected value.
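A sketch of the wait_until_object_count idea; object_count/1 stands in
for however the test reads the count:

    wait_until_object_count(Node, Expected) ->
        %% Retry until the count converges instead of sampling it once
        %% and possibly reading a stale value.
        rt:wait_until(fun() -> object_count(Node) =:= Expected end).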
The second change modifies the test to avoid the race, which was
caused by resolving on two different nodes concurrently, each of which
in turn wrote and broadcast the resolved result. If an interleaving
occurs such that both writes succeed before either node is notified of
the modification on the other, the failure occurs. The test has been
changed to only perform the write/broadcast on a single node, ensuring
that we eventually converge to the expected value and object count.
[1]
http://giddyup.basho.com/#/projects/riak_ee/scorecards/73/73-1556-cluster_meta_basic-centos-6-64/35530/artifacts/532185
Add test which ensures that the AAE source worker doesn't deadlock when
waiting for responses from the process which is computing the hashtree
differences.
Unfortunately, this test uses timeouts because as the code currently
stands, I can't figure out a way to make it any cleaner.