Fix race condition in partition_repair

If we just wait for the old vnode process to die, there is no guarantee that the
new one has been started and registered with the vnode manager yet, so the
subsequent test code can end up calling into the old, dead vnode. We saw a
couple of test failures in giddyup recently that I believe were caused by this
race condition.

To fix this, we wait for the vnode manager to return a new pid instead of just
waiting for the old pid to die.
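
In other words, the sequence we want looks roughly like the sketch below. The
kill_and_wait_for_new_vnode helper name is only illustrative and not part of
this commit; rpc:call/4, rt:wait_until/1, and
riak_core_vnode_manager:get_vnode_pid/2 are the same calls used in the diff.

    %% Illustrative sketch only; Pid is the vnode pid that was looked up from
    %% the vnode manager before the kill, as in kill_repair_verify below.
    kill_and_wait_for_new_vnode(Node, Partition, VNodeName, Pid) ->
        %% Kill the running vnode process on the remote node.
        true = rpc:call(Node, erlang, exit, [Pid, kill_for_test]),
        %% Don't just wait for Pid to die; wait until the vnode manager hands
        %% out a different pid, i.e. the replacement vnode is registered.
        rt:wait_until(fun() ->
                              {ok, Pid} =/= rpc:call(Node, riak_core_vnode_manager,
                                                     get_vnode_pid,
                                                     [Partition, VNodeName])
                      end).
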
Nick Marino 2016-09-22 13:02:04 -04:00
parent 95b8747c58
commit bff9ddc872

@@ -128,7 +128,14 @@ kill_repair_verify({Partition, Node}, DataSuffix, Service) ->
[Partition, VNodeName]),
?assert(rpc:call(Node, erlang, exit, [Pid, kill_for_test])),
rt:wait_until(Node, fun(N) -> not(rpc:call(N, erlang, is_process_alive, [Pid])) end),
%% We used to wait for the old pid to die here, but there is a delay between
%% the vnode process dying and a new one being registered with the vnode
%% manager. If we don't wait for the manager to return a new vnode pid, it's
%% possible for the test to fail with a gen_server:call timeout.
rt:wait_until(fun() -> {ok, Pid} =/=
                           rpc:call(Node, riak_core_vnode_manager, get_vnode_pid,
                                    [Partition, VNodeName])
              end),
lager:info("Verify data is missing"),
?assertEqual(0, count_data(Service, {Partition, Node})),