Add ensemble_basic4, ensemble_sync, and ensemble_interleave tests.
ensemble_sync tests the new AAE-based peer syncing logic. The test
checks various scenarios with different levels of data corruption.
ensemble_interleave tests a specific scenario where two peers become
corrupted one after the other, so that the second peer becomes
untrusted while the first peer may be syncing with it.
Increase the timeout for waiting on init:stop/0 to stop nodes from 1 extra
second to 10 extra seconds. Be sure to wait until the timeout expires
before using kill on any nodes that fail to stop. To avoid unconditionally
waiting the full timeout period, use kill -0 where possible to watch for
nodes stopping. Use kill -9 only after the full timeout period has elapsed
and the node still hasn't stopped. Fix setting of the cookie when
converting the riak_test node to a distributed node.
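
The polling approach described above can be sketched roughly as follows. This is a Python illustration, not the riak_test code: the names pid_alive and wait_then_kill and the poll interval are assumptions made for the example. Signal 0 plays the role of kill -0, and SIGKILL (kill -9) is sent only once the full timeout has elapsed.

```python
import os
import signal
import time


def pid_alive(pid):
    """Probe a process with signal 0, the equivalent of `kill -0`."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: it has stopped
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def wait_then_kill(pid, timeout=10.0, poll=0.1):
    """Return early once the process is gone; SIGKILL only after the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return "stopped"
        time.sleep(poll)
    if pid_alive(pid):
        os.kill(pid, signal.SIGKILL)  # kill -9 as a last resort
        return "killed"
    return "stopped"
```

Polling means a node that stops quickly costs only one poll interval rather than the full timeout.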
As of Riak 2.0 the vm.args zdbbl setting defaults to 32768. Previously
the default of 1024 was used. Change the cause_bdp helper module for
the verify_busy_dist_port test to be more aggressive in order to
trigger a busy_dist_port message with the higher zdbbl setting.
There is a race condition that can cause the force-replace test case
in the verify_dynamic_ring test to fail. This issue is being tracked
by riak_core issue #570. This change replaces the force-replace
testing with another resize test. Once issue #570 is resolved this
change can be reverted.
Add ensemble_basic, ensemble_basic2, and ensemble_basic3 tests.
These tests verify that Riak generates correct consensus groups and
that these groups reach quorum, handle leader failures, etc.
ensemble_basic3 tests basic consistent K/V API as well as behavior
during simple network partitions.
Fix some race conditions in the cluster leader helper functions. Also
re-initiate fullsync after a certain number of checks for
completion. V2 replication has problems where calling
riak_repl_console:start_fullsync is basically ignored and needs to be
retried.
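
The retry behaviour described above might look roughly like the sketch below. It is a Python illustration only: wait_for_fullsync, its parameters, and the start and is_done callables are hypothetical names standing in for the start-fullsync call and the completion check, and the check counts are arbitrary.

```python
import time


def wait_for_fullsync(start, is_done, checks_per_attempt=3,
                      max_attempts=5, delay=0.0):
    """Start a sync, poll for completion, and re-initiate it if it stalls.

    `start` and `is_done` are caller-supplied callables (hypothetical).
    Returns the number of start attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        start()  # may be silently ignored, hence the retry loop
        for _ in range(checks_per_attempt):
            if is_done():
                return attempt
            time.sleep(delay)
    raise TimeoutError("fullsync did not complete after %d attempts" % max_attempts)
```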
Using kill -9 on a node can leave it in a zombie process state, stuck in a
system call never to return. OS X Mavericks seems especially vulnerable to
this problem. Only a reboot can clear out such zombies. Change
rt:brutal_kill/1 to try a normal kill -15 first, and set a 5 second timer
to perform a kill -9 if the normal kill doesn't work. Change
rtdev:stop_all/1 to first try to connect to the nodes to shut them down via
an init:stop/0 rpc, and if that fails attempt to stop them via "riak stop"
instead. Then use ps to find any stragglers and kill them via kill -15
followed, after a 5 second wait, by kill -9.
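
The escalation from kill -15 to kill -9 can be sketched as below. This is a Python illustration for a child process, not the rt:brutal_kill/1 implementation; stop_gracefully and its grace parameter are names invented for the example.

```python
import subprocess


def stop_gracefully(proc, grace=5.0):
    """Send SIGTERM first and fall back to SIGKILL after a grace period."""
    proc.terminate()  # SIGTERM, i.e. kill -15
    try:
        proc.wait(timeout=grace)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL, i.e. kill -9
        proc.wait()  # reap the child so no zombie is left behind
        return "killed"
```

Giving the process a chance to exit on SIGTERM first means SIGKILL is reserved for processes that ignore or cannot handle the polite request.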
Improve the reliability of the proxy_get test by asserting that
ownership transfer completes before killing the leader node.
Related to basho/riak_repl#352.
Change some of the helper functions in the repl_util module to handle
errors more sensibly so that cluster setup race conditions do not
cause unnecessary test failures.
* Change group leader for cover_server while generating reports, so the
'includes data from imported files' message can be suppressed.
* Append a phash of the test metadata to the coverdata filename to keep
them unique.
* Removed unused maybe_stop function.
* Switch to a user name of "Hindi" in Devanagari to test protobuf unicode
* Turn "user" into a variable in http_security so we can test UTF-8
usernames once HTTP authentication supports them
To let us see the *combined* coverage of our unit and
integration tests, modify riak_test and the smoke_test runner to capture
coverage data per-test and post it as a giddyup artifact.
To maintain the current riak_test behaviour where the *combined*
coverage is reported on at the end of a run, each test writes its own
.coverdata file, cover is reset and then once all tests are run, the
coverdata files are all loaded and the total coverage is reported.
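
The write-per-test, combine-at-the-end flow can be sketched as follows. This is a Python illustration under loud assumptions: real .coverdata files are written by Erlang's cover tool, whereas this sketch stores JSON hit counts, and a SHA-1 digest of the test metadata stands in for whatever hash keeps the real filenames unique. All function names are hypothetical.

```python
import hashlib
import json
from collections import Counter
from pathlib import Path


def coverdata_path(out_dir, test_name, metadata):
    """Build a per-test filename made unique by a hash of the metadata."""
    blob = json.dumps(metadata, sort_keys=True).encode("utf-8")
    digest = hashlib.sha1(blob).hexdigest()[:8]
    return Path(out_dir) / f"{test_name}-{digest}.coverdata"


def write_test_coverage(out_dir, test_name, metadata, hits):
    """Write one test's hit counts (JSON stand-in for real cover data)."""
    path = coverdata_path(out_dir, test_name, metadata)
    path.write_text(json.dumps(hits))
    return path


def combined_coverage(out_dir):
    """Load every per-test file and sum the hit counts per module."""
    total = Counter()
    for path in sorted(Path(out_dir).glob("*.coverdata")):
        total.update(json.loads(path.read_text()))
    return dict(total)
```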
Recently Scott was running into an issue running `verify_handoff`
where his old data was not being properly reset when running
`setup_harness`. I noticed we were using `os:cmd` which doesn't check
the exit code of the command. I modified `run_git` to use `cmd` as
well as verify the exit code is 0. This allowed Scott to catch the
real issue which turned out to be a bad path in his config.
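
The idea of failing loudly on a non-zero exit code, rather than discarding it, can be illustrated with this Python sketch. It is not the `run_git` or `cmd` code itself; run_checked is a hypothetical helper written for the example.

```python
import subprocess


def run_checked(args, cwd=None):
    """Run a command and raise if its exit code is not 0."""
    result = subprocess.run(args, cwd=cwd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{args!r} exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout
```

With a check like this, a bad path in a config surfaces as an immediate error instead of leaving stale data in place.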
While making this modification I noticed a bug in the pipe cleanup
code. The `file:del_dir` call is actually returning `{error, eexist}`
because there is a `bin` directory under each pipe dir which had not
yet been deleted. Rather than spend time writing a recursive delete in
Erlang I changed the code to use `cmd` and to confirm an exit of 0.
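
The same failure mode exists outside Erlang: removing a directory that still has children fails, and a recursive removal is needed. The Python sketch below illustrates this with a hypothetical remove_pipe_dir helper; it uses shutil.rmtree where the commit used a shell command with an exit-code check, so it is an analogy rather than the actual fix.

```python
import os
import shutil


def remove_pipe_dir(path):
    """Try a plain directory removal first; report whether it sufficed."""
    try:
        os.rmdir(path)  # fails if the directory still has entries
        return "removed empty dir"
    except OSError:
        shutil.rmtree(path)  # recursive delete handles nested dirs
        return "removed recursively"
```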
I modified `stop_all`, which is used by `setup_harness`, to also use
`cmd` and check exit codes.
Finally, I made sure that `spawn_cmd` flattens the list before passing
it along, as `open_port` wants a string, not an iolist.
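
To show what flattening means here, the Python sketch below models an iolist as nested lists of strings, bytes, and integer character codes, and collapses it into one flat string. flatten_iolist is a hypothetical name, and the model is only an approximation of Erlang's iolist type.

```python
def flatten_iolist(term):
    """Collapse a nested list of str, bytes, and int code points into a str."""
    if isinstance(term, str):
        return term
    if isinstance(term, bytes):
        return term.decode("utf-8")
    if isinstance(term, int):
        return chr(term)  # a single character code, as in an Erlang string
    return "".join(flatten_iolist(part) for part in term)
```

For example, `flatten_iolist(["git ", ["reset ", [45, 45], "hard"], b" HEAD"])` yields the single string `git reset --hard HEAD`.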
Well, that's not true. They break riak_kv's context operations on Maps.
This change works around that breakage by turning the context off for
the operations in this test. This is temporary: once the context fix
work is done, we'll change back.