Changing our thinking about who owns quality

In June 2019, Shotgun Software had a top-down approach to solving our software automation challenge that looked like this: appoint a dev manager (or two) to direct developers to improve the state of automation (somehow). The problem is that, in most cases, dev managers’ lack of proximity to the nitty-gritty of engineering work makes them poorly placed to translate their goals into a concrete set of steps. In fact, I would argue that we wasted several years on this exact approach by allowing a single developer to build out an incomprehensible Selenium test harness that we wound up scrapping completely.

In an organization attempting to reinvent something (effective UI test automation) that it had never actually invented in the first place, we needed the direct involvement of the people who do the work of building and testing the features and bugfixes that get shipped to our customers. We needed to fundamentally change our thinking about who is in charge of automation.


Formation of the Cypress Guild

In July 2019, I made my opinion known to our director of Engineering and pitched an idea: one that embraced a bottom-up approach and built on the success of the Cypress initiative I had started in May 2018. Specifically:

  • Get the dev managers out of the way

  • Announce the formation of a new task force, and set the expectation that this group will do the following:

    • Build up a solid base of test coverage, improve our tools, and develop an internal talent pool we currently don’t have

    • Meet informally on a TBD meeting schedule, and solicit participation from engineers from any scrum team

    • Coordinate efforts to produce and share tangible results in the short term that are easy to understand

  • Clarify that although the primary objective is to expand test coverage, the long-term automation strategy is to develop a test coverage model rooted in measurable value, as understood from both a business and a customer perspective

He accepted this proposal and, on August 4, announced it in an all-hands email, informally christening the group the Cypress guild. To be honest, I was not crazy about the name at first, but it’s grown on me.


In August, we had our first meeting

There were about 10 developers and QA engineers in attendance, representing 3 kinds of stakeholders.

  1. The back-end teams: my team and my previous team, whose main focus was the APIs, file uploads, permissions, and microservice-y stuff like email notifications, the event service, and webhooks. We had the most experience, since my work had been the genesis of the Cypress initiative.

  2. The front-end teams: their main focus was delivering high-visibility UI work in the form of Reactified components. This group had the highest number of Node.js experts and, because they did UI work, were the most likely to see regressions.

  3. The automation team: their focus was to optimize build times and empower the dev teams to write test automation with the best tools available. This team was the most knowledgeable about the capabilities of the Jenkins build server and the specifics of the CloudOS infrastructure that the build server was part of.


Challenge #1: Figuring out how we’d work together

One of the first things we asked was for each person to identify their desired level of participation in the guild - either as a test writer, a documentation writer, or as a reviewer. We agreed to organize our meetings on a biweekly cadence, and follow this general pattern:

  • Anyone checking in tests would “stay in their lane”, meaning each of us would be responsible for building test coverage for the features and bugs in our own team’s backlog

  • Any test automation work to be discussed would get a Jira ticket linked to either a bug or a user story, and would follow a simple tagging convention: cypress_taskforce

  • We’d kick off each meeting with a quick standup to see which ticket(s) everyone was working on, followed by any blockers

  • We’d wrap with discussion, then action items (e.g., tickets to be created, owners to be found)


Challenge #2: Figuring out what to work on

The overall aim of the meetings was to keep things tightly focused, avoid pointless brainstorming, and push every discussion point to its logical conclusion: is this idea good, and if so, how much work does it create, and for whom? General topics were:

  • Test stability

    • Were any tests failing non-deterministically? If so, what improvements could be made?

    • Could something be changed about the way Jenkins executed the tests to increase stability?

  • Cleaning up and optimizing

    • Improving the speed of the Docker builds

    • Removing dead fixtures and unused files

    • Consolidating the two cypress directories into one, and organizing the merged folders into subdirectories corresponding to the software features they cover

    • Identifying good patterns and conventions to follow - what is working and what is not?

  • Increasing high-value coverage

    • Identifying high-value workflows and features that are underrepresented in existing test coverage

    • Discussing recent regressions and feature work that might necessitate new or modified tests

  • Documentation

    • Coding conventions around spec file names, variable names, and function names

    • How to use the custom commands (found in the support files)

    • How to install and build the Cypress Docker container

    • How to run tests locally, and in the Docker container
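
To make the test stability question above more concrete, here is a minimal sketch of one way to spot non-deterministic failures: treat a spec as flaky if it has both passed and failed against the same commit. The `TestRun` shape and the `findFlakySpecs` function are hypothetical names chosen for illustration, and the sketch assumes run history can be exported from the build server in roughly this form. It is not a description of the guild's actual tooling.

```typescript
// Hypothetical record of one spec execution; field names are assumptions.
interface TestRun {
  spec: string;    // spec file name
  commit: string;  // commit the run was executed against
  passed: boolean; // outcome of this run
}

// A spec is treated as flaky if, for at least one commit, it has both a
// passing and a failing run: same code, different outcome.
function findFlakySpecs(runs: TestRun[]): string[] {
  // Track which outcomes were seen for each (spec, commit) pair.
  const byKey = new Map<string, { spec: string; passed: boolean; failed: boolean }>();
  for (const run of runs) {
    const key = run.spec + "\u0000" + run.commit;
    const entry = byKey.get(key) || { spec: run.spec, passed: false, failed: false };
    if (run.passed) {
      entry.passed = true;
    } else {
      entry.failed = true;
    }
    byKey.set(key, entry);
  }

  // Collect each spec once if any of its commits showed mixed outcomes.
  const flaky: string[] = [];
  byKey.forEach((entry) => {
    if (entry.passed && entry.failed && flaky.indexOf(entry.spec) === -1) {
      flaky.push(entry.spec);
    }
  });
  return flaky.sort();
}
```

A spec that fails consistently on a commit is not reported here, since that points to a real regression rather than instability.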


Short Term Results (3 months)

Most of the effort in the first few months was heavily tilted toward code cleanup, documentation, and additional Jenkins integrations to make test execution faster and debugging easier. As expected, the improvement in documentation encouraged more developers to write tests, which led to greater team engagement with test automation. Some of the specific benefits we saw come out of this period were:

  • Code cleanup and consolidation

  • Documentation in several areas, making it easier for developers to install and build the Cypress Docker image, then get started writing tests

  • Identification and correction of unstable tests, as well as the creation of a process for quarantining unstable tests

  • Better developer experience around failed tests (better logging and automatic Slack notifications)

  • Optimized parallelization and reduction of total test time

  • Increase in the total number of tests from 465 to 903 (roughly 94% more)
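
As an illustration of what a quarantine process for unstable tests can look like, here is a minimal sketch, assuming a quarantine list is maintained by hand and spec results are available after a run. The `SpecResult` and `BuildVerdict` types and the `evaluateBuild` function are hypothetical; the idea is only that failures in quarantined specs get reported without failing the build.

```typescript
// Hypothetical result of running one spec file.
interface SpecResult {
  spec: string;
  passed: boolean;
}

// Hypothetical summary used to decide whether the build should fail.
interface BuildVerdict {
  buildPassed: boolean;          // false only if a non-quarantined spec failed
  blockingFailures: string[];    // failures that should fail the build
  quarantinedFailures: string[]; // failures that are reported but do not block
}

// Split failures into blocking and quarantined, based on a quarantine list.
function evaluateBuild(results: SpecResult[], quarantined: string[]): BuildVerdict {
  const blockingFailures: string[] = [];
  const quarantinedFailures: string[] = [];
  for (const result of results) {
    if (result.passed) {
      continue;
    }
    if (quarantined.indexOf(result.spec) !== -1) {
      quarantinedFailures.push(result.spec);
    } else {
      blockingFailures.push(result.spec);
    }
  }
  return {
    buildPassed: blockingFailures.length === 0,
    blockingFailures,
    quarantinedFailures,
  };
}
```

Keeping quarantined failures in the verdict, rather than skipping those specs entirely, means an unstable test stays visible until someone fixes it and removes it from the list.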


Next steps

The work of the Cypress guild has unquestionably had a dramatic positive impact on the agile teams’ confidence in releases, and is laying the foundation for a truly customer-oriented test coverage model that will allow us to fully implement CI/CD. Though we are still in the short-term phase of building a strong base of test coverage, it’s clear that our next steps will be qualitatively different. We’ve started to think about different ways of measuring coverage (not just in terms of lines of code executed), how we want to persist test data, what sorts of triggers should fire when failures occur, and how to effectively connect the dots between customer value, tests, and coverage.