I say SRE... you say?
Thought experiment - what are you picturing when you think “SRE”? I reckon most people see a dashboard and someone looking at it. I say the SRE function should not exist in an organisation, at least not the dashboard viewing + human alerter version that I see most often.
What should exist is infrastructure that tells you when it’s warming up in all the wrong ways, followed by systems that allow on-calls to triage the problem and find causality. Mean Time To Recovery (MTTR) must be the metric that any team minding the infrastructure is held to.
No dashboard viewing human is going to help you improve this.
What’s the playbook?
How do we get to this mythical infrastructure goodness? The playbook in itself is as follows :
- For all resources (APIs, databases, message buses, vendors, whatever…)
- Document It : What’s where and what value does it add?
- Add Monitors & Tune It : What to measure, when is it yellow, when is it red?
- Create On-Calls : You build it, you support it, place skin in the game
- Add Alerts : Let a machine call your on-call who must triage and drive incidents
- Incident Management : Institute an incident management protocol
- Causality : It’s smoking in the chimney, but the fire is in the basement. Logs / dashboards are your friends
- Sleep better at night
The playbook is fairly straightforward; the discipline to get it right is harder. The TL;DR ends here.
Starting with the basics, can you identify all the moving parts that make your product? This includes pretty much everything you can think of that is needed to deliver value to your customer. Oddly, this is not readily available - most people will volunteer only their own services, since you’re usually only talking to your back-end teams. The key item to remember here is - what can disrupt customer value realisation?
I haven’t found a standardised way to describe all the moving parts of your system. There’s stuff like BackStage and so on, but you still need your teams to put all this together. Whatever you use, be it Excel, YAML or some open source goodness, it will still require your team to pull it together. This is step 1. Do this incompletely and you will discover fires through your customers and irate executives.
Document It With Thresholds
I’m a big believer in “Forward Debt”, which effectively means, remember what you’re building, even when intermediate steps look nothing like your end result. So, if you are aggregating the parts of your empire, also, while you have the attention of your team, put in operating thresholds.
This is the building block of the first step - which is to create monitors. To know if you’re speeding, you need a speedometer. We’re going to connect speedometers. First, we must identify where our “tyres” are.
Here’s a starter checklist, whatever you can fill is gravy :
- Name of the thing
- What does it do?
- How to find it?
- Environment URI’s
- How to get help?
- Team Oncall Email / Bot / Pager
- Team Wiki Home
- Team Dashboard Home (Level 1/2 Performance and Log viewers)
- Deployment Topology
- [Low | High] Resource Count Amber
- [Low | High] Resource Count Red
- p90 (or whatever your high bar needs to be)
- Amber | Red
- Amber | Red
- Amber | Red
- Connection Count
- Amber | Red
- External Reachability
- Ping Tests
- Smoke Metric (ONE metric, that is your early warning sign)
- Vendor provided data point / one lead service or app metric
This should give you a sense of the things you want to chase. Shard by vendor, type of resource, as needed. This must be a shared, living document, whatever you use. It will form an invaluable platform from which many vertical applications can be created.
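As a rough illustration of the checklist, here is a minimal sketch of one registry entry in Python. Every field name, the example service, its URI and its threshold values are made up for the example; your registry can just as well live in a spreadsheet or YAML. The point is only that thresholds sit next to the "what is it and who owns it" details, and that blanks are easy to spot.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Threshold:
    """Amber and red limits for one metric, in the metric's own units."""
    amber: float
    red: float


@dataclass
class RegistryEntry:
    """One row of the shared registry. Field names are illustrative only."""
    name: str
    purpose: str
    environment_uris: Dict[str, str]
    oncall_contact: str
    wiki_home: str
    dashboard_home: str
    smoke_metric: str
    thresholds: Dict[str, Threshold] = field(default_factory=dict)

    def missing_fields(self) -> List[str]:
        """Return the checklist items that are still blank."""
        return [k for k, v in vars(self).items() if v in ("", {}, None)]


# A hypothetical entry with a couple of gaps still to fill.
payments_api = RegistryEntry(
    name="payments-api",
    purpose="Takes card payments at checkout",
    environment_uris={"prod": "https://payments.example.internal"},
    oncall_contact="payments-oncall@example.com",
    wiki_home="",
    dashboard_home="",
    smoke_metric="checkout_error_rate",
    thresholds={"p90_latency_ms": Threshold(amber=400, red=800)},
)

print(payments_api.missing_fields())  # ['wiki_home', 'dashboard_home']
```

A report of missing fields per team is a cheap first "vertical application" on top of the registry.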
Monitor & Tune Them Thresholds
Onwards to the wiring bit. Whatever flavour you end up using, recognise your metric, as articulated earlier, emit it, aggregate it and project it out, so that machines can begin to reason over the data domain of the metric and help you to move to the next step.
Monitors will be noisy. This is a rite of passage. You think you know thresholds, but, “hello temporal behaviour”. Get ready to field a lot of false positives, don’t give up - use this phase to learn what the true range of amber and red is.
This is the step where most teams fall off. We wire up emails and Slack integrations, and there is so much inbound that we begin to ignore our monitors, or it’s just not humanly possible to look at all the “look-at-me” data that a machine can generate. The senior engineers and architects in the group must earn their stripes here and persist, cutting through the noise to find the right tuning.
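One common way to cut down the noise, sketched here under my own assumptions rather than as the one true method, is to only change a monitor's state once a reading has held for several consecutive samples. The class name, the window of 3 and the latency numbers are all hypothetical.

```python
from collections import deque
from typing import Deque


def classify(value: float, amber: float, red: float) -> str:
    """Map one reading to green, amber or red (higher is worse)."""
    if value >= red:
        return "red"
    if value >= amber:
        return "amber"
    return "green"


class SustainedMonitor:
    """Report a new state only after it holds for `window` readings in a row."""

    def __init__(self, amber: float, red: float, window: int = 3) -> None:
        self.amber = amber
        self.red = red
        self.recent: Deque[str] = deque(maxlen=window)
        self.state = "green"

    def observe(self, value: float) -> str:
        self.recent.append(classify(value, self.amber, self.red))
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and len(set(self.recent)) == 1:
            self.state = self.recent[-1]
        return self.state


# Hypothetical p90 latency readings in ms: one spike, then a real degradation.
monitor = SustainedMonitor(amber=400, red=800, window=3)
readings = [120, 950, 130, 450, 470, 480, 900, 910, 920]
states = [monitor.observe(r) for r in readings]
print(states)
```

The lone spike to 950 never flips the state, while the sustained climb goes amber and then red. The trade-off is a slower reaction, so the window itself is one more thing to tune.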
Create an On-call Schedule
While you’re improving the Signal : Noise ratio, you will be traversing a cultural hurdle. Who should support? Who should look at the inbound messages that your system is now sending you? My principle is old school, if you build it, you support it. Every senior engineer and higher grade in the organisation must be on an on-call schedule.
On-calls graduate from being team on-calls to platform on-calls. Yes, an Android engineer should be expected to be a platform on-call. Your mileage may vary. The spirit is that understanding the platform, while hard, will help to build empathy and stronger products for your customer. None of this stuff is easy.
On-call culture is hard to enforce, so look to spread the workload. Weekly on-calls, with your turn coming every 3-4 months or longer, is a reasonable expectation from senior engineers in your team.
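To make the arithmetic concrete: with weekly shifts, your turn comes around every N weeks, where N is the size of the roster, so a 3-4 month gap needs roughly 13-17 people. Below is a minimal round-robin sketch; the roster names and start date are invented, and it deliberately ignores swaps and holidays.

```python
from datetime import date, timedelta
from typing import List, Tuple


def weekly_rotation(
    engineers: List[str], start: date, weeks: int
) -> List[Tuple[date, str]]:
    """Assign one engineer per week, round-robin, beginning on `start`."""
    return [
        (start + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


# Hypothetical roster of 14 senior engineers: one turn every 14 weeks.
roster = [f"engineer-{n}" for n in range(1, 15)]
schedule = weekly_rotation(roster, start=date(2024, 1, 1), weeks=16)

for week_start, engineer in schedule[:3]:
    print(week_start.isoformat(), engineer)

# The first engineer's second shift starts 14 weeks after the first one.
second_turn = [d for d, e in schedule if e == "engineer-1"][1]
print((second_turn - schedule[0][0]).days // 7, "weeks between turns")
```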
Not all team members will like this. Some will quit and there will be a lot of complaining. At this point, leaders will toy with the idea of an “SRE Team”: “hey, let’s hire a group of people who didn’t build this to be responsible for it.”
This is your cultural hurdle. Each organisation will find what is right for it. There is no silver bullet. This problem statement can be re-phrased as - “who is responsible for quality?”, if the answer is “QA Team”, then that’s the DNA of your team, you delegate responsibility. Separate blog.
Engineers are responsible for quality, they are responsible for the services they put out in production. While we create DevOps, Automation teams, their focus is to help the builders be more efficient and shift risk “left”.
Though, this is your cultural hurdle. Each leader will process this differently.
Alert On It, Act on it
We’ve got monitors that are tuned (we think), we have on-calls (yay!). Now the fun begins. Stop relying on humans to look at thousands of monitor messages and figure out if something is wrong. Invest in a product that calls you - phone / sms / push notifications / smoke signals - whatever.
This is money worth spending: something that manages on-call schedules, deals with swaps and holidays, detects that something is off and routes that alert to the right on-call. If they don’t answer, it knows how to escalate all the way up to whatever your comfort level is. I still get woken up in the middle of the night with pages.
If you are relying on any human dependent alerting, you’re not reaping the benefits of all the work done so far. Granted, some startups might not be able to spend money on such tooling - but re-phrase the spend as - “what is the cost of customer downtime?”.
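The escalation behaviour described above boils down to a small loop. This sketch is not any vendor's API: `page` is a stand-in for whatever your paging product does (contact the person, wait for the acknowledgement timeout, report back), and the chain of names is hypothetical.

```python
from typing import Callable, List, Optional


def page_with_escalation(
    chain: List[str], page: Callable[[str], bool]
) -> Optional[str]:
    """Page each person in order until one acknowledges.

    `page` should return True if that person acknowledged in time.
    Returns who picked it up, or None if the whole chain stayed silent.
    """
    for person in chain:
        if page(person):
            return person
    return None


# Simulated pager: the primary sleeps through it, the secondary answers.
acknowledges = {"secondary-oncall"}
attempts: List[str] = []


def fake_page(person: str) -> bool:
    attempts.append(person)
    return person in acknowledges


picked_up_by = page_with_escalation(
    ["primary-oncall", "secondary-oncall", "engineering-manager"], fake_page
)
print(picked_up_by, attempts)
```

A `None` result means nobody answered, which is a signal that the chain is too short for your comfort level.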
Why do we fall, Bruce? So we can learn to pick ourselves up.
This is a skill where it helps to have someone who has done it before. However, each of your team members will need to earn their stripes.
No matter what the incident
- Don’t Panic
- Mitigate First, Stop the bleeding
- Ride out the storm
Some hygiene here :
- Every incident has ONE owner, she orchestrates actions
- Every incident has ONE note taker, several people can update a shared incident log, but one person oversees it
- Every incident MUST have a chronology maintained, the note taker is best placed to record everything that happened.
- Maintain sanity, control conversations and theories, fork people into groups rather than have everyone talk over each other in the “main group”, where the incident owner drives.
- Incident owners have absolute decision making power. Everyone can make recommendations, but follow a strict decision making protocol.
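The hygiene rules above map onto a very small data structure: one owner, one note taker, and a timestamped chronology. Here is a minimal sketch, with invented names and incidents, that also shows how MTTR falls out of the same records once incidents are opened and resolved with timestamps.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Incident:
    """One incident: ONE owner, ONE note taker, one chronology."""
    title: str
    owner: str
    note_taker: str
    opened_at: datetime
    resolved_at: Optional[datetime] = None
    chronology: List[Tuple[datetime, str, str]] = field(default_factory=list)

    def log(self, at: datetime, who: str, what: str) -> None:
        """Record an event and keep the chronology in time order."""
        self.chronology.append((at, who, what))
        self.chronology.sort(key=lambda entry: entry[0])

    def resolve(self, at: datetime, who: str, what: str) -> None:
        self.log(at, who, what)
        self.resolved_at = at


def mean_time_to_recovery(incidents: List[Incident]) -> timedelta:
    """Average of (resolved - opened) over resolved incidents."""
    durations = [i.resolved_at - i.opened_at for i in incidents if i.resolved_at]
    if not durations:
        raise ValueError("no resolved incidents to average")
    return sum(durations, timedelta()) / len(durations)


# Two hypothetical incidents, recovered in 40 and 20 minutes.
t0 = datetime(2024, 3, 1, 2, 0)
first = Incident("Checkout errors", owner="asha", note_taker="ben", opened_at=t0)
first.log(t0 + timedelta(minutes=5), "asha", "Rolled back payments-api deploy")
first.resolve(t0 + timedelta(minutes=40), "asha", "Error rate back under amber")

second = Incident(
    "Queue backlog", owner="carl", note_taker="dee",
    opened_at=t0 + timedelta(days=7),
)
second.resolve(second.opened_at + timedelta(minutes=20), "carl", "Consumers scaled up")

print(mean_time_to_recovery([first, second]))  # 0:30:00
```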
There is no substitute for drills here, but the real thing is the real thing. So, you just need to live through a bunch of these. While it’s not in the playbook, hygiene on “system mutation” is a life saver. Be draconian about documenting when something changed in production, why it changed and who changed it. When something breaks, this chronology will help you.
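A change record needs very little: when, what, why and who. The sketch below, with made-up resources and people, shows the one query that matters during an incident, namely "what changed in production shortly before this started?". The two-hour lookback is an arbitrary assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Change:
    """One production mutation: when, where, what, why and who."""
    at: datetime
    resource: str
    what: str
    why: str
    who: str


def changes_before(
    log: List[Change], incident_start: datetime, lookback: timedelta
) -> List[Change]:
    """Changes made within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    recent = [c for c in log if window_start <= c.at <= incident_start]
    return sorted(recent, key=lambda c: c.at, reverse=True)


# Hypothetical change log and incident start time.
change_log = [
    Change(datetime(2024, 3, 1, 9, 0), "orders-db", "Added index", "Slow query", "dee"),
    Change(datetime(2024, 3, 1, 13, 30), "payments-api", "Deployed v42", "New retry logic", "asha"),
    Change(datetime(2024, 3, 1, 14, 10), "message-bus", "Raised partition count", "Backlog", "carl"),
]
incident_start = datetime(2024, 3, 1, 14, 30)

for change in changes_before(change_log, incident_start, timedelta(hours=2)):
    print(change.at.time(), change.resource, "-", change.what, f"({change.who})")
```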
Now we’re getting into MTTR (Mean Time To Recovery) territory. Knowing where to look is a big factor in how quickly you will recover. A chronology of change control is a good place to begin. Being able to look at your system topology, with flow logs and an indication of where things are not meeting the bar, is super valuable. Unless you are lucky, the service or resource showing the symptoms is unlikely to be the cause of the trouble.
Building a topology and its flow control is an improvement; however, leveraging your system registry to look up key dashboards and metrics to identify the “smoke” will help. In its most basic form, a dashboard of just your key P0 service/resource/vendor metrics will be your first port of call. Leverage a dashboard aggregator service to pull this one key dashboard, if you do nothing else. P0 is defined as the bare minimum number of things that must keep running to continue to deliver customer value.
Creating a culture of ownership around platform stability and customer facing value falls to senior leadership. While everything in this play book requires discipline and execution excellence, ultimately the tone must be set right at the top. If your DNA is to fire-fight, then no matter what you do, you’ll fire-fight.
If you want to not be a fire-station, then this playbook or some variant of it that fits your organisation will help you. This particular discipline requires cultural alignment.
Operational Intelligence > Early Warning Systems
Over time, your system traffic flow logs become sources of raw data to analyse how your system behaves and how traffic flows temporally, and ultimately help to build intelligence on “curves” of traffic that let you detect anomalies. The best incidents are those that were caught early and snuffed out before they became fires. Leveraging machine models, one can build these systems out and train them on the operational data collected through the course of building an efficient observability function.
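You do not need a trained model to see the idea. As a deliberately simple stand-in for the machine models mentioned above, this sketch keeps a per-hour-of-day baseline (mean and standard deviation of past request counts) and flags readings that stray too far from it. The traffic numbers and the limit of 3 standard deviations are assumptions for the example.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def build_baseline(history: Dict[int, List[float]]) -> Dict[int, Tuple[float, float]]:
    """Map hour-of-day to (mean, standard deviation) of past readings."""
    return {
        hour: (mean(values), stdev(values))
        for hour, values in history.items()
        if len(values) >= 2  # stdev needs at least two samples
    }


def is_anomalous(
    baseline: Dict[int, Tuple[float, float]],
    hour: int,
    observed: float,
    z_limit: float = 3.0,
) -> bool:
    """True if `observed` is more than `z_limit` deviations from the norm."""
    mu, sigma = baseline[hour]
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > z_limit


# Hypothetical requests-per-minute seen at 03:00 and 12:00 on past days.
history = {
    3: [100, 110, 90, 105, 95],
    12: [1000, 1100, 900, 1050, 950],
}
baseline = build_baseline(history)

# The same reading is normal at noon and alarming at 3am.
print(is_anomalous(baseline, hour=12, observed=900))  # False
print(is_anomalous(baseline, hour=3, observed=900))   # True
```

This is the “hello temporal behaviour” problem in miniature: a single static threshold cannot tell those two cases apart, while a baseline keyed on time can.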
I say SRE, you say I saw it coming
Somewhere up there we talked about sleeping better. When I think SRE, I picture myself being paged when something is starting to smoke.
Let your system call you when a human needs to intervene, it doesn’t need a nanny.