Fix race in repl_rt_heartbeat due to short timeout

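The sleep in the repl_rt_heartbeat test that waits for a missed
heartbeat to be detected was slightly too short: it waited ?HB_TIMEOUT
seconds plus 2 seconds, but detection can take up to 2*?HB_TIMEOUT
seconds if heartbeats are suspended right after one has fired. When the
timings lined up just right, the test occasionally failed spuriously.
This PR bumps the sleep to 2*?HB_TIMEOUT seconds plus 1 second, which
should prevent this from happening again.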

I would really like to do a proper fix for this, which would use
intercepts or something similar to confirm that the timeout is actually
being hit in the code, but we don't really have time for that, and a
half fix is better than no fix, I suppose.
This commit is contained in:
Nick Marino 2015-11-23 11:09:21 -05:00
parent 0fc3f7721b
commit 2be9c2f83b

@@ -65,8 +65,14 @@ confirm() ->
 suspend_heartbeat_messages(LeaderA),
 %% sleep longer than the HB timeout interval to force re-connection;
-%% and give it time to restart the RT connection. Wait an extra 2 seconds.
-timer:sleep(timer:seconds(?HB_TIMEOUT) + 2000),
+%% and give it time to restart the RT connection.
+%% Since it's possible we may disable heartbeats right after a heartbeat has been fired,
+%% it can take up to 2*?HB_TIMEOUT seconds to detect a missed heartbeat. The extra second
+%% is to avoid rare race conditions due to the timeouts lining up exactly. Not the prettiest
+%% solution, but it failed so rarely at 2*HB_TIMEOUT, that this should be good enough
+%% in practice, and it beats having to write a bunch of fancy intercepts to verify that
+%% the timeout has been hit internally.
+timer:sleep(timer:seconds(?HB_TIMEOUT*2) + 1000),
 %% Verify that RT connection has restarted by noting that it's Pid has changed
 RTConnPid2 = get_rt_conn_pid(LeaderA),
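
The 2*?HB_TIMEOUT bound in the new comment can be illustrated with a small timing model. This is only a sketch under an assumed model, not the actual Riak heartbeat code: the heartbeat timer is taken to fire at exact multiples of the timeout, replies are taken to be instantaneous, and a tick reports a miss only when the heartbeat sent on the previous tick went unanswered. The module name `hb_worst_case` and both function names are hypothetical.

```erlang
-module(hb_worst_case).
-export([detection_delay/2, sleep_ms/1]).

%% Delay in milliseconds between suspending heartbeats at SuspendAtMs and
%% the tick that notices the miss, under the assumed model described above.
detection_delay(TimeoutMs, SuspendAtMs) when TimeoutMs > 0, SuspendAtMs >= 0 ->
    %% Last tick at or before the suspension; its heartbeat was answered.
    LastTick = (SuspendAtMs div TimeoutMs) * TimeoutMs,
    %% The next tick still sees that reply, so the miss is only noticed
    %% on the tick after it.
    (LastTick + 2 * TimeoutMs) - SuspendAtMs.

%% The sleep used by the patched test: twice the timeout (given in
%% seconds) plus one extra second of slack, in milliseconds.
sleep_ms(TimeoutSecs) ->
    timer:seconds(TimeoutSecs * 2) + 1000.
```

With a hypothetical 2-second timeout, suspending exactly on a tick gives `detection_delay(2000, 4000) = 4000`, the full 2*timeout worst case, while `sleep_ms(2)` is 5000, which leaves one second of slack.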