“There’s a fatal error on the front page, I’m rolling back the build!”
No one wants this to happen to them. In my case, it was my bug that went out. It’s software that I wrote with my own two hands that threw a fatal “oh shit” error tonight. This is not the first time, nor will it be the last. I should do a blameless postmortem with myself, if I learned anything from those training classes. Since I’m pretty much the only person working on this software, though, it’s pretty much all my fault. I’ll call it a blameful postmortem.
This happened not on one of my direct projects, but on a project I’m consulting for. It’s a small MVP with tight deadlines and high hopes for feature count. You know, every startup you’ve ever worked at. We want to move fast and get things out the door so we can show our beta testers (and hopefully future customers) a product that they might want to spend money on.
The Initial Bug
The bug was caused by incomplete information about Facebook’s API, a slight mix-up over how json_decode handles deeply nested arrays, and a complete lack of automated testing. That’s also my fault. I can try to defend myself, and I have my reasons for not spending time writing tests, but all in all, that’s my fault. I wrote this crap. If I had to leave now, I would feel sorry for the person who inherited it.
So, then the scramble. I’m given some screenshots of the error (this doesn’t happen locally or in test, as I can’t get bookface to give me fake data on this API, so I’m left to collect it from real production output). Luckily it’s Laravel, and I can figure out what’s wrong fairly quickly. But there were a few stumbling blocks along the way, and it took ~3 different pushes to get the errors to finally stop happening. Travis CI decides that now is the time to take FOREVER doing builds. Seriously taking like 45 minutes to push something out today. We’re getting impatient.
So, the error finally goes away. Woo! What was this build for again? Oh right, we’re supposed to be showing some Facebook API data on the screen…where’s that change?
What is HAPPENING?
The change, at the end of the day, should have worked. While the scramble was going down, I had asked someone to pull some data out of production for me, and I loaded it onto my local machine. I tested the feature, and it worked. I sounded like every junior engineer when a bug comes their way: “this works on my machine”.
I also tested it before I pushed each of my new merge requests off this branch. I knew that the feature still worked, despite the bug, which was in a separate part of the code. What in the ever loving hell is going on tonight?
So, as I mentioned, we’re on Travis CI for deployments. It’s set up so that commits into master get deployed to production automatically. Generally the commit that gets deployed is a merge after a pull request, but technically it doesn’t have to be. It could also be a git revert.
Git is Strange
Some of you are probably getting the gist of what’s happening here right now. For those that aren’t as git savvy, a git revert is sort of like an undo for a commit. It takes whatever the changes were in the commit you’ve selected, does the opposite change (meaning, goes back to the old way), and recommits that as a new change. And that works well for commits along a straight line. But it’s not quite an undo, right? Like git still stores that you did the first commit, and then did another commit to undo that. It didn’t undo your changes, it made new changes that happened to be the exact opposite of the old ones.
This gets really murky when you try to revert a merge commit. When you do that, the merge has still happened: all those commits on your feature branch have been added to your main branch. And then the revert doesn’t delete those commits, it just makes an equal and opposite change to all of them (on the main branch only) and commits that.
So then, when you go back to work on your feature branch and fix your bugs (it had bugs, but it has most of the code that we still want, of course), and try to merge it in again, just like you normally do, your main branch only gets the bug fixes. As far as git is concerned, the earlier feature commits were already merged, so it doesn’t apply them again, and the revert stays in effect: the main changes are still gone. This behavior makes sense when you know what happened, but if you didn’t notice the revert on main, you (I) would be utterly confused for about 20-30 minutes until you went and looked at the past builds.
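Here is a minimal sketch of that sequence in a throwaway repository, assuming git is installed. The file names, branch names, and commit messages are all made up for illustration; this is not the real project history. It merges a feature, reverts the merge, fixes a bug on the feature branch, merges again, and checks at each step whether the original feature file is present on main.

```shell
#!/bin/sh
set -eu

# Throwaway repository. Every name here (app.txt, feature.txt, fix.txt,
# the "feature" branch) is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the first branch "main"
git config user.name "Demo"
git config user.email "demo@example.com"
git config commit.gpgsign false

# Reports whether the feature's file exists in the working tree.
has_feature() { if [ -e feature.txt ]; then echo yes; else echo no; fi; }

echo "base" > app.txt
git add app.txt
git commit -q -m "Initial commit"

# The feature branch carries the bulk of the feature.
git checkout -q -b feature
echo "feature code" > feature.txt
git add feature.txt
git commit -q -m "Add feature"

# Merge into main (the commit that would get deployed).
git checkout -q main
git merge -q --no-ff -m "Merge feature" feature
after_first_merge=$(has_feature)

# Production breaks, so someone reverts the merge commit.
# -m 1 tells git to treat main's side of the merge as the mainline.
git revert --no-edit -m 1 HEAD >/dev/null
revert_commit=$(git rev-parse HEAD)
after_revert=$(has_feature)

# Back on the feature branch, the bug fix lands in a separate file.
git checkout -q feature
echo "bug fix" > fix.txt
git add fix.txt
git commit -q -m "Fix bug"

# Merge again, just like normal. Only the new commit comes across.
git checkout -q main
git merge -q --no-ff -m "Merge feature again" feature
after_remerge=$(has_feature)

# One way out: revert the revert, which re-applies the original changes.
git revert --no-edit "$revert_commit" >/dev/null
after_revert_of_revert=$(has_feature)

echo "feature.txt on main after first merge:      $after_first_merge"
echo "feature.txt on main after reverting merge:  $after_revert"
echo "feature.txt on main after merging again:    $after_remerge"
echo "feature.txt on main after revert of revert: $after_revert_of_revert"
```

In this sketch the second merge brings fix.txt onto main but not feature.txt, which matches the symptom described above. The last step, reverting the revert commit, is one way I know of to get the original changes back; it is not what was done on the night in question.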
So, blameful postmortem, what’d we learn?
- Write tests, you idiot.
- I need to plan out a way to gather data before we start using it, in order to be able to test more thoroughly. I should have never written a piece of code against a data schema that I wasn’t sure of. That’s bad planning.
- Start a new branch from the main branch if I’ve already merged in. I would have caught the problem earlier, and been able to do a cherry pick or something else in git.
- Maybe don’t use git reverts to revert a build. It’d probably be better to just point Travis CI at the previous build label.
- Everyone means well. I was trying to get a bug fix out quickly. The original error was glaring and on the front page of the site. I wanted it fixed, even though I didn’t have all the information, so I took a best guess and was wrong. There’s nothing inherently wrong with that. When shit hit the fan, we wanted to revert quickly, and git revert seemed like the best option. Again, nothing inherently wrong, it just ended up causing some problems that kept me up past midnight.
- Failure is always an option. When you’re moving quickly, you need to keep in the back of your head “What could go wrong? What am I not thinking of?” This slows you down, but this tiny bit of self-doubt will save you when the time comes. You are not infallible.
I’ve gone on record saying that I am a pretty good engineer, and I’ve been doing this for a long time. I still think that, despite the currently bruised ego. But having experience doesn’t keep you from making mistakes. It only teaches you how to avoid the ones you’ve made (sometimes), and how to fix them when they happen.