The Continuous Integration Mystery

Published 03 December 2020 in Vancouver, BC, Canada (~5min read)

Yesterday I faced a version control situation I rarely face.

It showed me that I may rely a little too much on the green light from CI tools like CircleCI and GitHub Actions when deciding whether it’s safe to merge a branch.

Here’s what happened:

My colleague added new expect clauses to a test, plus the code to pass it.
They merged this into the main branch via a PR.
Later, I forked off of the main branch.
I added more new clauses to the test my colleague had earlier modified.
I worked on it for the rest of the day.
I made a pull request.
Next, two colleagues reviewed my work and approved it on GitHub.
All the pre-merge checks on CircleCI were passing, including tests and style checks.
I rebased and decided to save the merge for the morning.
Soon after, an error was found related to the code the first colleague had deployed.
They reverted the PR they had merged on Day 1.
Not long after, I checked my PR again - there were no merge conflicts.
I merged my code.

An hour or so later, another colleague tells me that, according to CircleCI, a test I wrote was failing on the main branch. How could this be, they said? It appears to have been passing on the branch it came from!

What is the cause of this mysterious failure?

I’d recently touched that test, so I looked at the error and quickly worked out what it was: The test I’d added to was failing because one of it’s expect clauses relied on code which had been been reverted - it was no longer a valid expectation. GitHub didn’t run a new diff on my PR after the removal of the clause in question from the main branch; the reverted test code simply looks ‘unchanged’ in the diff, as if it had been and was still there.

I didn’t remove it, I didn’t rebase, and my PR ended up re-adding the recently-removed expect clause even though it didn’t appear to.

Is there a lesson to be learned here?

On one hand, the process is working. There was a merge error caused by the git equivalent of a race condition, we were told about it, and we were able to resolve it. Why bother running tests on the main branch before you deploy unless you are concerned that there is a chance they’ll fail?

Maybe the lesson is to keep on doing what we’re doing.

On the other hand, the process felt like it was disrupting the order of things. Something like this is often said: “if you have a reliable QA and CI process, then if something is on the main branch it should be deployable.” And yet here was an anomalous case which suggested otherwise.

Perhaps, then, the lesson is that our QA and CI process isn’t robust enough. Should CI create a merged branch behind-the-scenes and run tests on that before allowing the branch to be merged?

I’m not sure. In the meantime, while I try to decide what the lesson is, I think I’ll just rebase my branches more often.

A discussion has begun on Hacker News, please share your thoughts if you have any.