This is the first iteration of Byzantine data-loss tests that cover
both recoverable errors and unrecoverable but detectable errors. The
tests cover the following scenarios:
* Lose one partition's worth of data, but no synctrees, and recover.
* Lose all but one partition of ensemble data, but no synctrees, and
recover.
* Lose a minority of synctrees. Only the peers with the missing
synctrees are restarted. System remains available.
* Lose a majority of synctrees. The majority peers are restarted.
System recovers when they all come back online.
* Lose a majority of synctrees with one node partitioned. All peers
are restarted except the partitioned one. System does not recover
while that node is partitioned. When the partition is healed, the
system recovers.
* Lose all data and synctrees except on one peer. System recovers.
* Backing up and restoring old data but not synctrees results in
detected errors. Restoring newer data fixes this.
* Delete all data on all nodes, but not synctrees. This is detected and
an error is returned to the user.

Change the ACL test case in the replication_ssl and replication2_ssl
tests to use certificates generated within the tests instead of
relying on certificates created outside the tests, which are prone to
expiring and causing spurious test failures.

Also change the replication_ssl and replication2_ssl tests to avoid a
cycle of standing up the test clusters and then immediately restarting
them before any test cases execute. This should make test execution
slightly faster for both test modules.

This commit also makes the tests more robust when checking cluster
state while restarting nodes, and removes an unnecessary five-second
sleep call in the replication_ssl test.

Change replication_ssl to use the wait_for_site_ips function from the
replication module, introduced in 297090ded6, instead of the defunct
verify_site_ips function.

Avoid a race condition in the replication test module when checking
for site IP addresses in the replication status output. The test
waits for a connection on the leader, but it queries the replication
status for the expected site IP addresses only once. Change the test
to wait and re-check the status output, giving greater assurance that
if the expected site IP addresses are missing, the cause is a
legitimate failure and not a race condition in checking the
replication status. This change affects the replication and
replication_upgrade tests as well as any other tests that call the
replication:replication function.
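
The wait-and-re-check approach described above can be sketched roughly
as follows. This is an illustrative Python model only, not the Erlang
code in the replication module: wait_until, site_ips_present, and
make_fake_status are hypothetical names, the status strings are made
up, and plain substring matching is a simplification.

```python
import time


def wait_until(check, retries=10, delay=0.5, sleep=time.sleep):
    """Call check() until it returns True or the retries run out."""
    for attempt in range(retries):
        if check():
            return True
        if attempt < retries - 1:
            sleep(delay)  # give the cluster time before re-checking
    return False


def site_ips_present(status, expected_ips):
    """True when every expected IP appears in the status text.

    Substring matching is a simplification for this sketch.
    """
    return all(ip in status for ip in expected_ips)


def make_fake_status(outputs):
    """Hypothetical stand-in for querying replication status.

    Returns each canned output in turn, then repeats the last one.
    """
    remaining = iter(outputs)
    last = [""]

    def fetch():
        last[0] = next(remaining, last[0])
        return last[0]

    return fetch


# The expected IPs only show up on the third query. A single check would
# report a failure here; polling does not.
fetch = make_fake_status([
    "",
    "site1: 10.0.0.1",
    "site1: 10.0.0.1 10.0.0.2",
])
expected = ["10.0.0.1", "10.0.0.2"]
found = wait_until(lambda: site_ips_present(fetch(), expected),
                   retries=5, delay=0)
```

A bounded retry count keeps a genuine failure from hanging the test
while still tolerating status output that lags behind the connection.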

Prevent a situation where the auto-reconnect has not triggered yet,
causing the next operation after reconnecting to return an error
instead of ok. Force a disconnect and reconnect to make sure the test
is deterministic.
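
The difference between relying on auto-reconnect and forcing a
reconnect can be illustrated with a toy model. This is a hypothetical
Python sketch, not the client used by the test: ToyClient and all of
its methods are invented here to show the timing issue.

```python
class ToyClient:
    """Toy client whose auto-reconnect happens at some later point."""

    def __init__(self):
        self.connected = True

    def server_restarted(self):
        # The connection is gone; auto-reconnect has not run yet.
        self.connected = False

    def auto_reconnect_tick(self):
        # Stands in for the background reconnect, which fires at a time
        # the test does not control.
        self.connected = True

    def force_reconnect(self):
        # Explicit disconnect followed by an immediate reconnect.
        self.connected = False
        self.connected = True

    def ping(self):
        return "ok" if self.connected else "error"


# Racy version: the result depends on whether the tick ran first.
racy = ToyClient()
racy.server_restarted()
racy_result = racy.ping()  # "error", because no tick has happened yet

# Deterministic version: the test forces the reconnect itself.
forced = ToyClient()
forced.server_restarted()
forced.force_reconnect()
forced_result = forced.ping()
```

In the forced version the outcome no longer depends on when the
background reconnect happens to fire.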