I was browsing r/devops the other day and came across a good post. The post asked the question:

What would you do if you had full authority over DevOps at your company?

It struck a chord with me because we’ve been discussing internal organizational issues at work and what can be done about them. This post is my answer to that question. Our internal discussions are not specific to DevOps, but more about how we can organize more effectively to achieve our goals.

My answer is more organizational and process focused.

Situation Report

There are two large divisions in the organization. There are “central” teams that support the “local” teams. The central team has 3 offices across the globe. The central team includes product development (PD). Product development contains all the engineers, product owners, designers, and engineering managers. This is where my team sits. Product development contains a few agile teams (engineering manager, mix of developers, a product owner, a designer, and one tester). Then there are two horizontal support teams: SRE (my team) and a Data/Analytics team. The business is set up so that the central team builds product that is configured for one market and the local teams handle business operations.

The central teams build product and run it in production. Business operations are administered by the local businesses. There are four local teams distributed across Asia and Africa. Coordination with the local teams happens through higher-level management; however, one member of the product team represents one local team. The local teams handle customer support and staffing. The PD teams handle technical operations.

Here is the same expressed in tree form:

  • Organization
    • Central
      • PD
        • Agile Teams (mix of backend, frontend, testers, etc)
        • SRE <-- I lead this team. Team leads make up the PD management team headed by the CTO.
        • Data
      • Marketing
      • C-Level Positions
    • Local Teams
      • Marketing
      • Customer Support
      • Business Development

My position as a team lead gives me a view into both what’s going on in PD and what requests are coming in from the local teams. However, most people outside the local teams are really unaware of how those organizations function. Admittedly, I’m still largely unaware compared to higher management, but I’m vastly more informed than people below me.

PD does high-level quarterly planning. This produces a roadmap with estimated delivery dates for business objectives. This trickles down to each person through individual quarterly goals, which hopefully align throughout the hierarchy. In theory, the C-level positions ultimately set the priority for business objectives, which turn into quarterly goals for the CTO. These translate into quarterly goals for the CTO’s reports, and so on all the way down.

The company quasi-practices continuous delivery. I say quasi because individual components are tested, but there is still a manual integration process performed by dedicated QA staff with manual deploys. However, deployments themselves are automated and can be triggered for any component at any time. Certain components use continuous deployment.

Product development mainly uses two-week sprints. My team is the exception; it works solely off a prioritized backlog.

Problems

Overall, things are working OK. We are not operating at our potential, in my opinion. There are a few key problems:

  1. Too Much in Flight Work. Management is overcommitting on expected deliverables by 3x.
  2. Competing Priorities. Each team and individual is given their own quarterly goals. However, what does a person do when they want to meet their goal but cannot do it alone? How can one person ask another to work together when that’s taking them away from their own deliverables? This is especially problematic in a growing organization where there are new engineers who do not have enough technical experience to operate independently. Large amounts of in-flight work multiply this problem.
  3. Quarterly Planning. This doesn’t work and has never worked for the organization in the past 3 years. Software development does not time box itself into pre-packaged chunks. Sprints do not help with this problem either. Committing to quarterly deliverables with deadlines is madness. Assumptions made at the beginning never hold true for months. This requires constant rejuggling during the sprints to adjust priorities and even shift people between teams.
  4. PD is Largely Disconnected from Business Impacts. Unfortunately this is by design. The PD team is disconnected from the local teams and their needs. However, this is entirely where the company generates revenue. The PD team is unaware of the day-to-day problems the local teams have operating the software, the conditions they work in, and the larger business operational issues they face. Engineers do not see their impact on end users or how changes to internal systems impact or improve the situation for other employees.
  5. Horizontal vs Vertical Teams. The PD team tries to enforce hard technical and process boundaries between teams. This doesn’t work because significant business objectives require collaboration between many engineers with different skill sets. A large feature cannot go out without development, testing, data tracking, and operational input.
  6. Long Feedback Cycles. This is a corollary to quarterly planning. Given that things are committed to for quarters (or even longer!), projects tend to be large. Large software development projects are astoundingly hard to manage successfully (especially if you’re trying to hit a deadline!). The business is afraid to ship small changes because there may be too many bugs or the change may not be functional enough for a user. This creates much longer iterations.
  7. Continuous Deployment is too Scary. I do not hide it. Continuous deployment, in my opinion, is the best way to build and ship modern software. We had a reorg about one and a half years ago that created the structure I described earlier. Previously PD had a web, Android/iOS, and backend team (I led this team). The web and backend teams were practicing continuous deployment. It was deemed too scary to continue.
  8. Data is Produced but not Analyzed. Teams are responsible for reporting KPIs. These KPIs are not used in planning (specifically, nobody states which KPIs will be impacted), nor are the KPIs being verified after releasing. This may be happening in some places in the organization. If it is, it’s not visible to everyone in the organization.

My Vision

The question focused on “DevOps”, so I’ll focus on some DevOps principles and how I’d apply them. The goal is to create an organization that delivers business objectives at high velocity, without regressions, and in a sustainable way.

My first change is to replace quarterly planning and the somewhat arbitrary deadlines with a single prioritized backlog. I hope that the local market requests will be prioritized against all other requests until the engineering team is large enough to have separate PD and local backlogs. The engineering organization should grow to support all business stakeholders. The existing team leads would each be given a single priority from this list and see it all the way through to production. When an objective is achieved, the team will disband and a new team will form around the next priority based on currently available staffing.

Second, replace the concept of vertical and horizontal teams with elastic teams formed to tackle individual business objectives. Different business objectives require a different mix of technical and product knowledge. Teams will be formed to tackle the objective at hand based on need. Engineers will naturally gravitate to their interest area. Semantically, I’d like to remove the different “Web”, “Android”, “Services”, “SRE”, etc. from job titles, as I don’t see the team like that. Everyone is an engineer who can 1) write code, 2) maintain a test suite for that code, 3) deploy that code, and finally 4) run production operations.

This change would happily abolish the SRE (my team) and Data teams. They cannot fulfill their mission without heavy collaboration with other teams and specific engineers. This would make the overall engineering organization responsible for all facets of production operations, rather than singling out individual engineers as responsible for each individual stage. Ultimately this is everyone’s responsibility.
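To make the first two changes concrete, here is a minimal sketch of how teams could be formed from a single prioritized backlog. It is only an illustration: the `Objective` fields, the `engineers_needed` estimate, the objective names, and the rule that lower priorities wait rather than jump the queue are all my assumptions, not an existing process or tool.

```python
from dataclasses import dataclass


@dataclass
class Objective:
    name: str
    priority: int          # 1 is the most important
    engineers_needed: int  # hypothetical staffing estimate


def form_teams(backlog, available_engineers):
    """Staff objectives in strict priority order until engineers run out."""
    teams = {}
    pool = list(available_engineers)
    for objective in sorted(backlog, key=lambda o: o.priority):
        if len(pool) < objective.engineers_needed:
            # Assumption: lower priorities wait instead of skipping ahead,
            # which is what caps the amount of in-flight work.
            break
        teams[objective.name] = pool[:objective.engineers_needed]
        pool = pool[objective.engineers_needed:]
    return teams, pool


# Hypothetical backlog and staff, purely for illustration.
backlog = [
    Objective("objective-b", priority=2, engineers_needed=3),
    Objective("objective-a", priority=1, engineers_needed=4),
    Objective("objective-c", priority=3, engineers_needed=2),
]
engineers = ["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8"]

teams, unassigned = form_teams(backlog, engineers)
for name, members in teams.items():
    print(name, members)
print("waiting for the next objective:", unassigned)
```

With eight engineers, the two highest priorities are staffed and the third waits until a team disbands. A real version would also need to account for the mix of skills each objective requires.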

My third change is to shorten the critical path to production. A manager, product owner, engineer, and QA staff are not all required to ship every type of change. Engineers backed by a strong test suite (unit/integration tests, etc. for individually deployable components, a cross-component end-to-end suite, and no bugfixes without a regression test) can fearlessly deploy to production. Certain roles are not required for certain types of changes, so they should be ruthlessly dropped from the critical path.

The fourth change is to reinstate continuous deployment. This requires the team to create an automated end-to-end test suite for user facing functionality across multiple clients.

My fifth change has three parts. First, reduce the number of KPIs from hundreds to a maximum of five. This set must connect to every business objective and be tracked in real time. Everyone in the organization should be able to map their efforts onto changes in these KPIs. Second, impact on KPIs must be considered before starting new work, and expected changes must be verified after releasing. Third, the KPIs must be visible to everyone in the company with minimal effort.
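The second part of that change can be sketched in a few lines: work declares an expected KPI movement before it starts, and the release is checked against it afterwards. This is a rough illustration under my own assumptions. The `KpiExpectation` type, the KPI name `orders_per_day`, and the use of a relative change threshold are hypothetical, and the check only covers KPIs that are supposed to go up.

```python
from dataclasses import dataclass

MAX_KPIS = 5  # the cap proposed above


@dataclass
class KpiExpectation:
    kpi: str
    min_relative_change: float  # e.g. 0.05 means "at least +5%"


def can_start(expectations, company_kpis):
    """Work may start only if it names at least one tracked company KPI."""
    if len(company_kpis) > MAX_KPIS:
        raise ValueError("too many company KPIs to keep everyone aligned")
    return bool(expectations) and all(e.kpi in company_kpis for e in expectations)


def verify_release(expectation, before, after):
    """Return True if the KPI rose at least as much as was promised."""
    if before == 0:
        raise ValueError("baseline must be non-zero to compute a relative change")
    actual = (after - before) / before
    return actual >= expectation.min_relative_change


# Hypothetical KPI and numbers, purely for illustration.
company_kpis = {"orders_per_day"}
expectation = KpiExpectation("orders_per_day", min_relative_change=0.05)

print(can_start([expectation], company_kpis))
print(verify_release(expectation, before=100.0, after=110.0))
```

The point is less the arithmetic than the habit: no story starts without naming a KPI, and no story is finished until the KPI has been checked after release.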

Solving Problems

I raised 8 problems. Here’s how my vision addresses each of them.

  1. Too Much in Flight Work. Organizing around priorities naturally enforces a cap. Concurrent work will happen as long as there are enough engineers to achieve each objective. The only way to overcommit is to spread resources thin, in which case things may complete but will certainly take longer.
  2. Competing Priorities. Organizing around priorities addresses this issue by giving teams a sole priority. Reorganizing and shifting resources when new priorities come up is encouraged because everything is truly priority driven. If you are an engineer working on a priority 10 item and a priority 1 item requires your attention, you should happily adjust your efforts because you’re having a larger impact on the organization.
  3. Quarterly Planning. Organizing around priorities removes this problem. If things happen to complete in a given quarter, then great, but they are no longer forced into pre-defined time boxes. There is no need for quarterly goals and the related structure because everything is replaced by a singular priority.
  4. PD is Largely Disconnected from Business Impacts. Organizing around priorities addresses (but does not solve) this problem. People will work on priorities. Growing the team large enough to support central and local team requests will further address this issue because engineers will be working on tasks impacting more business areas.
  5. Horizontal vs Vertical Teams. Creating elastic teams, removing predefined role boundaries, and shifting to a “you build it, you run it” approach generally turns this problem on its head.
  6. Long Feedback Cycles. Focusing on KPIs and starting continuous deployment will improve the ability to ship small changes. Stakeholders should be motivated to move the KPIs with an 80% solution sooner instead of a 100% solution over longer iterations. Large, long-running efforts would become less preferred than shorter, effective, and measured iterations.
  7. Continuous Deployment is too Scary. It isn’t. It’s only scary because of its association with low-quality software. Continuous deployment is the exact opposite. Automate, assert correct functionality, and refute regressions. Measures of technical and product quality will increase.
  8. Data is Produced but not Analyzed. Moving KPIs to the forefront of the organization, combined with making KPI impact an acceptance test for stories, changes the decision-making process.

What Would You Do?

Let me come back to the original question:

What would you do if you had full authority over DevOps at your company?

I encourage you to analyze your own situation and consider what you would do differently. Do not make the mistake of thinking that DevOps equates to purely technical changes. Technical changes are only a manifestation of organizational culture. So what would you do? What tech would you change, and how would you refactor your organization or culture? Please let me know, I’m curious to learn about your experience.

Good luck out there. Happy shipping!