“We disabled the tests because they were intermittent.”
I was somewhat alarmed to hear this.
To an extent, it was understandable - these particular tests were failing a lot,
and they were preventing us from deploying some basic maintenance updates. They
were also doing things we hadn’t done before, so it was possible the approach
itself was flawed - fixing them would require some investigation.
But these tests were important - they were a brand new regression test suite,
a last line of sanity-checks against unexpected changes in our journal formats.
So putting them back was a high priority.
The setup was quite simple: we write out journals of various events, and these
files are then handed over to a pipeline which processes them for reporting
purposes. We want to make sure the journal format stays consistent, and to know
exactly what changes we’re making when we change it intentionally.
So, our tests were quite simple. We take a snapshot of the journals, then we
perform some action, then we look at the journals again, filter out anything
we’d already seen, and compare what’s left to what we expect to be left.
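In test form, that pattern might look something like this minimal sketch - assuming JUnit 5, and with the journal path, the event format, and the performPaymentAction helper all invented for illustration:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import org.junit.jupiter.api.Test;

class JournalFormatTest {

    // Hypothetical location of the journal under test
    private static final Path JOURNAL = Path.of("/var/log/app/events.journal");

    @Test
    void paymentEventIsJournalledInExpectedFormat() throws IOException {
        // 1. Snapshot the journal before acting
        Set<String> alreadySeen = new HashSet<>(Files.readAllLines(JOURNAL));

        // 2. Perform the action under test
        performPaymentAction();

        // 3. Read the journal again and filter out anything we'd already seen
        List<String> newLines = Files.readAllLines(JOURNAL).stream()
                .filter(line -> !alreadySeen.contains(line))
                .collect(Collectors.toList());

        // 4. Compare what's left to what we expect to be left
        assertEquals(1, newLines.size());
        assertTrue(newLines.get(0)
                .matches("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|PAYMENT\\|OK"));
    }

    private void performPaymentAction() {
        // drive the system under test so that it writes a payment event
    }
}
```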
When they failed, we were seeing nothing in the actual output. It was as if the
events didn’t happen. We had some theories - were we trying to read from the
logs before buffers had been flushed? Were we seeing journal rotation happening
between our action and our read operation? We couldn’t reproduce any of these
cases.
The key was that these tests were failing a lot in the full build, but it was
difficult to work out what was going on because they were passing when we
debugged them. They were also passing when we just ran them in the IDE.
Those are not the characteristics of an intermittent test.
Those are the characteristics of an interferent test.
Our tests each reused essentially the same input, and then ran different
assertions on the outcome. Our journals contained timestamps, but only to the
nearest second.
So, when we ran the full suite, one test would run and add some lines to the
journal; the next test would sample that journal, run the same scenario, and
generate exactly the same output, identical right down to those second-granularity
timestamps because the tests ran back to back. It would then remove every line
matching what it had seen before - and since the new lines were indistinguishable
from the old ones, that included the lines it had just generated. And hey - no
journal lines for this action.
Our tests were interfering with each other.
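To make that concrete, here’s a hypothetical pair of tests running within the same second (the journal format is invented for illustration):

```text
Journal when the second test takes its snapshot (written by the first test):
  2024-01-01 12:00:01|PAYMENT|OK

Journal after the second test performs the same action:
  2024-01-01 12:00:01|PAYMENT|OK    <- the first test's line, in the snapshot
  2024-01-01 12:00:01|PAYMENT|OK    <- the second test's line, identical to it

Left over after filtering out everything matching the snapshot: nothing.
```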
The solution was quite simple - make sure the tests each generated different
output. Seed the input with a different characteristic piece of data, and then
the output from each test case will be distinct and identifiable.
Furthermore, if, rather than filtering out the lines we’d already seen, we
select for the lines associated with an individual test, that opens up the
possibility of running such tests in parallel in the future.
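A sketch of that approach, under the same assumptions as before, with a randomly generated identifier seeded into the scenario and then selected for:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

import org.junit.jupiter.api.Test;

class IsolatedJournalFormatTest {

    private static final Path JOURNAL = Path.of("/var/log/app/events.journal");

    @Test
    void paymentEventIsJournalledInExpectedFormat() throws IOException {
        // Seed the scenario with a unique marker, so this test's journal lines
        // can never collide with output from any other test (or an earlier run)
        String paymentReference = "test-" + UUID.randomUUID();

        performPaymentAction(paymentReference);

        // Select *for* our own lines rather than filtering out a "before" snapshot;
        // no shared state is consulted, so these tests could run in parallel
        List<String> ourLines = Files.readAllLines(JOURNAL).stream()
                .filter(line -> line.contains(paymentReference))
                .collect(Collectors.toList());

        assertEquals(1, ourLines.size());
        // ...then assert on the format of ourLines.get(0) exactly as before
    }

    private void performPaymentAction(String paymentReference) {
        // drive the system under test with the seeded identifier
    }
}
```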
In summation
If your tests pass in isolation, but not en masse, they’re interfering with
each other. Somewhere, you have a shared dependence on mutable state - and
for end-to-end tests, that could mean files, databases, other services: all the
sorts of mutable state that applications exist to manage.
If your tests are interfering with each other, then you need to find some way
of isolating them - of ensuring the bits of state that this test case interacts
with are distinctly separable from all other state.
If your tests aren’t interfering with each other, there’s a good possibility
they will in the future. Any time you’re generating persistent state in tests,
work out on what basis you want to isolate one test from another.
Ideally, isolate on something you can randomly generate at runtime, so that when
people copy-and-paste an existing test to do something new in the future, the
right sort of isolation just magically happens.
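One way to make that happen “magically” - a sketch, again assuming JUnit 5 and a hypothetical shared base class - is to generate the isolating value in common setup, so anything copied from an existing test inherits a fresh one:

```java
import java.util.UUID;

import org.junit.jupiter.api.BeforeEach;

// Hypothetical base class: every test method gets a fresh, unique identifier to
// seed its scenario with, so copy-pasted tests stay isolated without anyone
// having to remember to do it
abstract class IsolatedJournalTest {

    protected String testRunId;

    @BeforeEach
    void generateIsolationToken() {
        testRunId = "run-" + UUID.randomUUID();
    }
}
```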