One of the nice side effects of cloud services is that they enable engineering teams to provision their own infrastructure without having to go through a gatekeeper. This could be seen as an instance of the general trend towards shifting left: Enabling teams to spin up infrastructure according to their needs removes friction from the process and enables faster iterations.
However, this newly gained freedom also comes with the expectation that engineering teams manage the cost of their infrastructure needs responsibly.
I want to argue that engineers can generally be trusted with this responsibility and that re-introducing control structures to manage cloud spending would be an understandable but flawed reflex. Instead, to avoid waste, we need more transparency. The best way forward is to make this new responsibility easier to deal with for engineering teams.

Not An Incentive Problem
One argument that you could make is that placing the responsibility of cost control into the hands of engineering teams is dangerous because their incentive is to solve the technical problem at any cost. Therefore, the infrastructure costs incurred are an externality that the organisation has to carry.
This view does not match my experience. I would argue that in a healthy organisation, individuals are incentivised to provide value to their employer. This means using the organisation’s resources efficiently to generate the best possible ROI. Keeping an eye on the cost is an important component of this responsibility.
Think about it this way: You already ask your engineers to manage their own time efficiently – and for many teams, engineering time is arguably a larger cost factor than the cloud spend they are asked to keep track of.
The risk with the misaligned-incentives school of thought is that it leads to the conclusion that the fix must be to require approval from the person who owns the team’s budget – effectively creating a gatekeeper. This gatekeeper might eliminate some waste, but they also re-introduce the very problem you had before moving to the cloud: Having to seek approval creates drag and limits the speed at which the engineering team can iterate.
It’s a Transparency Problem
If the conclusion is that you can in principle trust your engineers to manage the organisation’s resources effectively, then what’s the problem? Well, I have seen plenty of cloud resources used inefficiently – and I have been the culprit in several instances. May the engineer who never accidentally left an expensive cluster running cast the first stone!
However, in my experience, all examples of wasteful use of cloud resources can be traced back to honest mistakes. The two leading causes seem to be simply forgetting about a resource that is still running (and still costing money) and misjudging computational capacity requirements.
To me these look like very typical byproducts of shifting left: Yes, you do move more responsibilities into your engineering teams. If that additional load gets too large, it is not surprising that the ball gets dropped from time to time.
By the same token, the response is similar to the one we use to tackle the challenges of other left-shifts: We hope that empowering engineers will lead to more automation, which eventually reduces the burden of these new responsibilities to near zero.
Existing Tools
There are, of course, existing tools to help. AWS notably provides its Cost Management tools, the Application Cost Profiler and the Billing pages. These systems are building blocks, though, not solutions.
This is much like CloudWatch: it can help with monitoring your cloud resources, but you still have to do the work of assembling a monitoring solution for your system out of the available tools.
Directions To Explore
To enable teams to control their own costs, I think we will need to create more transparency. A lot of existing solutions rely on dashboards for cost tracking. That can be part of the answer, but I don’t think it is sufficient on its own.
To avoid the oops-I-left-the-machine-running problem, for example, I would suggest a combination of a clear ownership system and a smart notification scheme.
Ownership
All large cloud providers support resource tagging. This can be a simple way of establishing ownership: Require that every resource be tagged with the name of the person responsible. Note that resources include test systems, development environments and production systems – everything should have a clear owner.
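To make this concrete, here is a minimal sketch of an ownership audit, assuming boto3 and an `owner` tag key as the team’s convention (the tag name and the default region are choices for this example, not an AWS requirement). It lists EC2 instances that nobody has claimed yet; the same idea extends to any taggable resource type.

```python
# Minimal ownership audit: find EC2 instances with no "owner" tag.
# Assumes AWS credentials are configured for boto3; the "owner" tag key
# is a convention chosen for this sketch.
import boto3

def instances_without_owner(region: str = "eu-west-1") -> list[str]:
    ec2 = boto3.client("ec2", region_name=region)
    missing = []
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                if "owner" not in tags:
                    missing.append(instance["InstanceId"])
    return missing

if __name__ == "__main__":
    print("Instances without an owner tag:", instances_without_owner())
```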
Any ownership system needs to be linked to a hierarchy of backups to make sure the system works if the person in charge is unavailable when action is required. A simple extension would be to bubble ownership up the org-chart: If an individual contributor is not available to respond to a request, their manager will get called instead.
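As an illustration of what bubbling ownership up might look like, here is a small sketch. The `manager_of` and `is_available` lookups are hypothetical hooks into your directory and on-call tooling, not real APIs:

```python
# Resolve a resource owner to someone who can actually respond right now,
# walking up the org chart when the primary owner is unavailable.
from typing import Callable, Optional

def resolve_owner(
    owner: str,
    manager_of: Callable[[str], Optional[str]],   # hypothetical directory lookup
    is_available: Callable[[str], bool],          # hypothetical on-call/absence check
    max_hops: int = 3,
) -> Optional[str]:
    current: Optional[str] = owner
    for _ in range(max_hops + 1):
        if current is None:
            break
        if is_available(current):
            return current
        current = manager_of(current)  # bubble up one level
    return None  # nobody reachable – escalate some other way
```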
A natural extension to this scheme would be virtual ownership. For example: “The owner of this resource is whoever is currently on-call”.
There are many ways of implementing such a scheme. What matters is the following: At any point in time and for every resource it must be clear who is going to make sure that the resource is not generating waste.
Notifications
The second ingredient is a non-intrusive notification scheme. This is where things get more complicated. The simplest implementation would be an email or Slack message listing the resources you own that are currently running and asking for confirmation that they are still required. However, we are all well aware of the problem with notifications: If they are not targeted and actionable, they will go straight to the spam bucket or be glossed over. This part therefore hinges on keeping the signal-to-noise ratio as high as possible. Some ideas would be the following:
Cluster related resources. Instead of listing every subnet and container instance, the report should say something meaningful like “that test environment you spun up yesterday is still running”. An easy way of achieving that is to take resource hierarchies into account and to prefer higher-order objects such as resource groups or CloudFormation stacks (a grouping sketch follows after this list).
Prioritise resources by cost. If your Lambda function is not generating any cost, who cares that it is still deployed? As with any optimisation, we want to focus first and foremost on the big fish – that 200-node EMR cluster, for example. A notification system needs to convey the so-what: “resource X has incurred a cost of Y so far (Z/day)”. The biggest consumers should sit at the top of the list (see the cost sketch after this list).
Exponential back-off. Some resources are expected to be short-lived, others are expected to run for a long time (e.g. your prod environment). Every time you confirm that a resource is still needed, the system can decrease the frequency of notifications for that resource – a snooze function with exponential back-off, if you will. This means that new resources are checked frequently while long-running resources are only revisited on occasion (see the back-off sketch after this list).
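To illustrate the grouping idea, here is a sketch that collapses running EC2 instances into their CloudFormation stacks, so a report can say “stack X is still running” instead of enumerating every instance. It relies on the `aws:cloudformation:stack-name` tag that CloudFormation attaches to the resources it creates; anything created outside a stack is reported individually.

```python
# Group running EC2 instances by their CloudFormation stack, falling back to
# the instance id for resources that were not created by a stack.
from collections import defaultdict
import boto3

def instances_by_stack(region: str = "eu-west-1") -> dict[str, list[str]]:
    ec2 = boto3.client("ec2", region_name=region)
    groups: dict[str, list[str]] = defaultdict(list)
    for page in ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                stack = tags.get("aws:cloudformation:stack-name", instance["InstanceId"])
                groups[stack].append(instance["InstanceId"])
    return dict(groups)
```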
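For the cost prioritisation, the Cost Explorer API can supply the so-what. The sketch below ranks owners by their spend over the last week; it assumes the `owner` tag has been activated as a cost allocation tag in the billing console, otherwise grouping by it returns nothing useful.

```python
# Rank owners by unblended cost over the last 7 days via Cost Explorer.
from datetime import date, timedelta
import boto3

def spend_by_owner(tag_key: str = "owner") -> list[tuple[str, float]]:
    ce = boto3.client("ce")
    end = date.today()
    start = end - timedelta(days=7)
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": tag_key}],
    )
    totals: dict[str, float] = {}
    for day in response["ResultsByTime"]:
        for group in day["Groups"]:
            key = group["Keys"][0]  # e.g. "owner$alice"
            totals[key] = totals.get(key, 0.0) + float(
                group["Metrics"]["UnblendedCost"]["Amount"]
            )
    # Biggest fish first, so the notification leads with what matters.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```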
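Finally, the back-off idea fits in a few lines. The intervals below are illustrative defaults, not recommendations.

```python
# Exponential back-off for "is this still needed?" notifications:
# every confirmation doubles the interval, up to a cap.
from datetime import datetime, timedelta

BASE_INTERVAL = timedelta(days=1)
MAX_INTERVAL = timedelta(days=90)

def next_check(last_confirmed: datetime, confirmations: int) -> datetime:
    interval = min(BASE_INTERVAL * 2 ** confirmations, MAX_INTERVAL)
    return last_confirmed + interval

# A brand-new resource is checked again after a day; one that has been
# confirmed five times is only revisited after about a month.
```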
Conclusion
The above system does not yet address the second typical problem of cost control, that of misjudged capacity requirements. There are similar strategies that we can apply to tackle this issue, though. The gist is the following: Giving teams autonomy to manage their own resources is a good thing. This comes with additional responsibilities which do cost time and can lead to mistakes. Our strategy should be to provide the tools and automation to enable these teams to rise to the challenge – and not to install yet another gatekeeper.