Add testing of the handoff heartbeat change from the following pull
request: https://github.com/basho/riak_core/pull/560. Add an intercept
module for the riak_core_handoff_sender module to introduce artificial
delay on item visitation during a handoff fold. This delay, combined
with the changes to the verify_handoff test, induces a test failure when
run without the heartbeat change: the handoff_receive_timeout is
exceeded, handoff stalls, and the test eventually fails due to timeout.
The test succeeds when run with the heartbeat change.
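
A minimal sketch of such an intercept, following riak_test's intercept
conventions (the intercept machinery renames the original module to
riak_core_handoff_sender_orig); the intercepted function and the delay
value here are illustrative assumptions, not necessarily what the PR
uses:

    %% Intercept sketch: delay every item visited during the handoff
    %% fold so that, without the heartbeat change, the receiver's
    %% handoff_receive_timeout fires and handoff stalls.
    -module(riak_core_handoff_sender_intercepts).
    -compile(export_all).
    -include("intercept.hrl").

    %% ?M points at the renamed original module.
    -define(M, riak_core_handoff_sender_orig).

    slow_visit_item(K, V, Acc) ->
        timer:sleep(1000),
        ?M:visit_item_orig(K, V, Acc).

A test would install it with something like
rt_intercept:add(Node, {riak_core_handoff_sender,
[{{visit_item, 3}, slow_visit_item}]}).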
Something in Riak has changed such that the previous approach to
setting public IPs no longer appears to work for rtssh. Thus, rtssh
users such as rtcloud cannot provision clusters that can talk on
anything other than 127.0.0.1.
This commit adds code that explicitly sets the PB/HTTP IPs when
Cuttlefish is being used, which seems to fix the issue.
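
The fix has roughly the following shape: write the node's public IP
into the standard riak.conf listener keys. In this sketch, set_conf/2
stands in for the harness's existing riak.conf updater, and the helper
name is an assumption:

    %% Sketch: pin the PB/HTTP listeners to the node's public IP when
    %% Cuttlefish is in use. Host is the provisioned public IP string.
    set_listeners(Node, Host) ->
        rtssh:set_conf(Node, [
            {"listener.protobuf.internal", Host ++ ":8087"},
            {"listener.http.internal",     Host ++ ":8098"}
        ]).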
The ensemble_ring_changes test writes a value, expands the cluster, then
updates and reads that value after ring expansion has completed. It
also creates a bucket using a bucket type with an n_val different from
the default bucket type. The latter exercises basho/riak_kv#1008 and
its corresponding riak_core PR.
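
A sketch of that flow, assuming the usual riak_test and riakc helpers;
the bucket type name and properties (consistent, n_val 5) are
illustrative, and the actual test's details will differ:

    %% Sketch: write, expand, update/read, then use a non-default n_val.
    run_sketch([Node1 | _] = Nodes) ->
        PB = rt:pbc(Node1),
        ok = riakc_pb_socket:put(PB,
            riakc_obj:new(<<"b">>, <<"k">>, <<"v1">>)),
        rt:join_cluster(Nodes),
        ok = rt:wait_until_no_pending_changes(Nodes),
        %% Update and read the value once expansion has settled.
        {ok, Obj} = riakc_pb_socket:get(PB, <<"b">>, <<"k">>),
        ok = riakc_pb_socket:put(PB, riakc_obj:update_value(Obj, <<"v2">>)),
        %% Bucket type with an n_val different from the default type.
        rt:create_and_activate_bucket_type(Node1, <<"nval5">>,
                                           [{consistent, true}, {n_val, 5}]),
        ok = riakc_pb_socket:put(PB,
            riakc_obj:new({<<"nval5">>, <<"b">>}, <<"k2">>, <<"v">>)).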
Use riak_test_runner:metadata/0 to get the configured backend instead of
defaulting to bitcask. Additionally, use rt:clean_data_dir/2 to safely
remove backend directories.
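
Roughly, the change has this shape; metadata/0 and clean_data_dir/2 are
the APIs named above, while the proplist key and the backend-to-directory
mapping are assumptions:

    %% Determine the configured backend from the test metadata, falling
    %% back to bitcask only when none is set, then clear its data dirs.
    clean_backend_data(Nodes) ->
        Md = riak_test_runner:metadata(),
        Backend = proplists:get_value(backend, Md, bitcask),
        rt:clean_data_dir(Nodes, backend_dir(Backend)).

    %% On-disk directory per backend (the mapping is an assumption).
    backend_dir(bitcask)  -> "bitcask";
    backend_dir(eleveldb) -> "leveldb".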
This is the first iteration of Byzantine data-loss tests that show both
recoverable and unrecoverable, but detectable, errors. The tests cover
the following scenarios (a sketch of the common shape follows the list).
* Lose one partition's worth of data, but no synctrees, and recover.
* Lose all but one partition of ensemble data, but no synctrees, and
  recover.
* Lose a minority of synctrees. Only the peers with the missing
  synctrees are restarted. The system remains available.
* Lose a majority of synctrees. The majority peers are restarted. The
  system recovers when they all come back online.
* Lose a majority of synctrees with one node partitioned. All peers are
  restarted except the partitioned one. The system does not recover
  while that node is partitioned. When the partition heals, the system
  recovers.
* Lose all data and synctrees except on one peer; the system recovers.
* Back up and restore old data, but not synctrees; this results in
  detected errors. Restoring newer data fixes this.
* Delete all data on all nodes, but not synctrees. This is detected and
  an error is returned to the user.
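
Each scenario reduces to stopping some peers, deleting data and/or
synctree directories behind them, restarting, and asserting on
recovery. A hedged sketch of that common shape, where the backend
directory name and the stability helper's signature are assumptions:

    %% Common scenario shape: lose backend data (but keep synctrees)
    %% behind some peers, restart them, and wait for recovery.
    lose_and_recover(Nodes, Victims) ->
        [rt:stop_and_wait(N) || N <- Victims],
        %% Remove backend data, leaving ensemble synctrees intact;
        %% the "leveldb" subdirectory name is an assumption.
        rt:clean_data_dir(Victims, "leveldb"),
        [rt:start_and_wait(N) || N <- Victims],
        %% Reads should succeed again once the ensembles stabilize.
        ensemble_util:wait_until_stable(hd(Nodes), length(Nodes)).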