What My Team Did

My team built and extended the microservices and backend components critical to Shotgun software (now ShotGrid software): permissions, database query optimizations, email notifications, file uploading, transcoding, virus scanning, watermarking, and managing the data model and schema changes for new API consumers and new API endpoints. We worked closely with security and DevOps on shared points of interest like our cloud infrastructure, our deployment infrastructure, and ongoing efforts at container hardening.

Good Software Shouldn’t Have to Depend on Heroics

I'm proud that my team worked on complicated, important stuff, and of our commitment to each other as teammates. The work we did wasn’t always straightforward, and we often needed to continue late into the night and into the weekends. We spent a lot of time collaborating on Slack: during planning, deployments, production tuning, rollbacks, and postmortems. Maybe because of this way of carrying on, there was a stronger than normal predisposition to behave like owners rather than employees, which I’m also proud of.

But possessing a whatever-it-takes mindset, as admirable as that may be, doesn’t scale well beyond a tight-knit group of 4 or 5 people. When we first formed as a team, we tried like crazy to meet our sprint commitments, all for the sake of making the burndown charts as steep as possible. Occasionally, the late nights resulted from bad estimates. More often, they were due to an unforeseen complication with interservice communication, unplanned work from a tangential area of the software (like a change in deployment infrastructure), or trying to get a PR approved before a release-related code freeze deadline.

After several successive high-stress sprints, we were feeling the pain. During a quarterly planning session, my team did a round table to discuss the cumulative impact the work was having on our personal lives. It was an eye opener. We later found out that our experiences were corroborated by those of other engineering teams. So we made a decision to change how we worked. We wanted to accomplish our goals AND ensure that everyone could actually enjoy dinner with their family, so that the need for heroics would eventually go away. We wanted to increase the likelihood of fixing bugs and releasing features quickly, to be able to try new ideas safely, and to course-correct early on, all at a sustainable pace and even while being interrupted with all the unforeseen stuff. We basically wanted CI/CD and Kanban. Yesterday.

What We Focused On to Improve Software Quality (and Quality of Life)

FAST LANE/SLOW LANE BRANCHING

While a true CI/CD implementation was not possible for us, we were able to take a gigantic step in that direction with fast lane/slow lane branching. Doing things this way was far superior to, and simpler than, our previous way of working. There were only 2 branches that ever got deployed to clients: Fast Lane and Slow Lane. Here’s how it worked:

Fast Lane (aka: master)

We deployed Fast Lane to clients 4 times a week (or more often if there was a need)
This branch was always in a good state because only PRs that met 5 strict criteria could be merged into it
- The PR had to be free of migrations
- All changes had to be fully covered with tests
- Any feature (if this applied) had to be hidden behind a flag and/or disabled
- Any bugfix (if this applied) had to be low or medium risk
- Backend only: the user's session had to remain valid (no need to clear it)
Because the frequency of deployments was high, there was no (artificial) deadline-induced stress to hurry up just to get your bugfix/feature into the next client release
Rollbacks (if necessary) were pretty simple because the deltas were small and there were never any risky data migrations
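
The five criteria above lend themselves to an automated merge gate. What follows is only a minimal sketch of how such a check might look, not a description of the tooling we actually used. The `PullRequest` fields and the function names are hypothetical, invented for illustration; the branch names `master` and `slow` are the ones mentioned in this post.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    """Hypothetical summary of a PR; every field name here is an assumption."""
    has_migrations: bool
    changes_fully_tested: bool
    is_feature: bool
    feature_flagged_off: bool   # hidden behind a flag and/or disabled
    is_bugfix: bool
    bugfix_risk: str            # "low", "medium", or "high"
    invalidates_sessions: bool  # backend only: would users need to log in again?


def fast_lane_violations(pr: PullRequest) -> list[str]:
    """Return the Fast Lane criteria this PR fails (empty list means eligible)."""
    violations = []
    if pr.has_migrations:
        violations.append("contains migrations")
    if not pr.changes_fully_tested:
        violations.append("changes are not fully covered by tests")
    if pr.is_feature and not pr.feature_flagged_off:
        violations.append("feature is not hidden behind a flag or disabled")
    if pr.is_bugfix and pr.bugfix_risk not in ("low", "medium"):
        violations.append("bugfix is not low or medium risk")
    if pr.invalidates_sessions:
        violations.append("would invalidate the user's session")
    return violations


def target_branch(pr: PullRequest) -> str:
    """Route eligible PRs to Fast Lane (master); everything else to Slow Lane (slow)."""
    return "master" if not fast_lane_violations(pr) else "slow"
```

A small, well-tested, low-risk bugfix would be routed to `master`, while anything carrying a migration would land in `slow`, along with a readable list of reasons that could be posted back on the PR.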

Slow Lane (aka: slow)
- We deployed Slow Lane every 4 weeks
- Included major feature work and/or workflow changes and database migrations
- Observed the code freeze cutoff date
- Built-in 1 week stabilization and regression testing period
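
The Slow Lane cadence can be written down as a tiny date calculation. This is a sketch under one stated assumption: that the code freeze falls exactly one week before each deployment, so that the stabilization period fills the gap. The function name and the dates in the usage note are hypothetical.

```python
from datetime import date, timedelta

SLOW_LANE_CYCLE = timedelta(weeks=4)  # Slow Lane deployed every 4 weeks
STABILIZATION = timedelta(weeks=1)    # assumed: freeze is 1 week before deploy


def slow_lane_schedule(first_deploy: date, cycles: int) -> list[dict]:
    """List the code freeze and deploy dates for the next `cycles` releases."""
    schedule = []
    for i in range(cycles):
        deploy = first_deploy + i * SLOW_LANE_CYCLE
        schedule.append({"code_freeze": deploy - STABILIZATION, "deploy": deploy})
    return schedule
```

For example, `slow_lane_schedule(date(2021, 3, 1), 2)` yields a freeze on 2021-02-22 for the 2021-03-01 deploy, then a freeze on 2021-03-22 for the 2021-03-29 deploy.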

STOP SAYING “THIS WILL JUST TAKE 10 MINUTES”

In general, when someone asks for a favor, most people are inclined to drop what they’re doing to see if they can help, especially when the request is accompanied by Knock Brush (a Slack notification sound). For my team, there was almost never a question that could actually be answered in 10 minutes. An example of this kind of question would be: why does this tiny video file that is only 3 frames long give this strange error, and transcode with 2 extra frames at the end?

Ticket everything that costs time: investigations, spikes, consultation, etc. If unplanned work is non-negotiable, shine a bright spotlight on it so that dips in velocity are understood.

GROOM THE BACKLOG WITH A PURPOSE

Don’t be afraid to close issues as won’t fix. If only 2 out of 5,000 users have complained about an edge-case bug, there is more important work to do
Aggressively identify tickets that you will not work on (low priority, edge case) and close them accordingly. Just don’t do it in a vacuum: involve stakeholders and present a reasoned case for why a ticket in the backlog is being marked as won’t fix
Scour the backlog for tech debt work or client-reported bugs that thematically overlap with what you’re currently working on

FOCUS ON AUTOMATION & OTHER “SMALL BALL” EFFICIENCIES

Automate everything that a machine can do better than a person, and have zero tolerance for unstable tests
Adopt a definition of done that explicitly answers the question: does this have automated tests?
Write tickets that are concise, but not terse. Highlight your assumptions and explain your setup steps so that people who are not subject matter experts can read and understand what you’re talking about
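
Zero tolerance for unstable tests starts with being able to spot them. One common heuristic, sketched here as an illustration rather than as the tooling we used, is to flag any test that has both passed and failed against the same commit. The input format (tuples of commit, test name, and pass/fail) and the function name are assumptions.

```python
from collections import defaultdict


def find_unstable_tests(runs):
    """Return test names that both passed and failed on the same commit.

    `runs` is an iterable of (commit, test_name, passed) tuples, a
    hypothetical format standing in for whatever the CI system exports.
    """
    outcomes = defaultdict(set)
    for commit, test_name, passed in runs:
        outcomes[(commit, test_name)].add(passed)
    # Mixed results on identical code mean the test itself is unstable.
    unstable = {test for (_, test), seen in outcomes.items() if len(seen) > 1}
    return sorted(unstable)
```

A test that fails consistently on a commit is a real failure, not an unstable test, so it is deliberately not reported here.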

CODE REVIEWS & PULL REQUESTS À LA MARIE KONDO

Keep pull requests as small as possible
Keep feature branches as short-lived as possible

Miscellaneous

We also moved our scrum from 11:15am to 2:00pm. Since our team was almost all on East Coast time, this small change had the positive side effect of giving everyone a full morning of uninterrupted focus time without cutting into any significant productive time in the afternoon. This benefit was not at all anticipated, but definitely welcomed!

Acknowledge the Importance of Writing

Document everything of high value with the assumption that you will not be around to explain it. Things like how-to guides, setup guides, debugging guides, the purpose of an investigation, the reason for a failure, the steps that were taken to mitigate the failure, etc.

The article How to Build Good Software includes many great observations and anecdotes, but the one I could not stop thinking about was the paradoxical scenario in which a group of talented people who work well together make bad software. I mention this article because my reading of it left me with a deeper appreciation for the role of writing in codifying a development team’s collective understanding, something that is particularly important during periods of change.

I’ve worked through a company acquisition, several reorgs, dozens of changes in dev environments and deployments, the phasing in of CI/CD, GDPR audits, and various other forms of disruption. Engineering teams with strong writing practices (i.e., documentation, postmortems, incident reports, etc.) tend to have greater resiliency and a higher bus factor in the face of organizational stress and uncertainty.

“[as software grows] system complexity increases as a whole. Software should be treated not as a static product, but as a living manifestation of the development team’s collective understanding.”

— Hongyi Li