Update to infra oncall process (#10162)


Co-authored-by: Mike McNeil <mikermcneil@users.noreply.github.com>
Zay Hanlon 2023-02-28 16:19:03 -05:00 committed by GitHub
parent 4593c49ec4
commit c9576a135f


@@ -269,7 +269,7 @@ Estimating bugs and outages can be helpful in getting on the same page about how
[⚗️ Roadmap](https://github.com/orgs/fleetdm/projects/41) - Planning for the next release (shared with the Product team).
## Scaling GOTCHAS
## Scaling gotchas
### Overall
@@ -876,9 +876,9 @@ The following rituals are engaged in by the directly responsible individual (DRI
| Release testing/QA | Every three weeks | Every release cycle, by end of day Wednesday of release week, all issues move to "Ready for release" on the 🚀Release board. | Reed Haynes |
| Release testing/QA report | Every three weeks | Every release cycle, on the Monday of release week, the DRI for the release ritual is updated on status of testing. | Reed Haynes |
## Fleet Sandbox Oncall
## 24/7 on-call
Oncall for Sandbox will happen in shifts of 1 week. The people involved in them will be:
The 24/7 on-call (aka infrastructure on-call) team is responsible for alarms related to fleetdm.com, Fleet Sandbox, and Fleet managed cloud, as well as for delivering 24/7 support to Fleet Ultimate customers. The infrastructure (24/7) on-call responsibility happens in one-week shifts. The people involved in these shifts will be:
First responders:
@@ -887,23 +887,26 @@ First responders:
Escalations (in order):
- Zach Wasserman
- Zay Hanlon
- Zach Wasserman (Fleet app)
- Eric Shaw (fleetdm.com)
- Mike McNeil
The first responder oncall will take ownership of the @infrastructure-oncall alias in Slack for ease first thing Monday morning.
The first responder oncall will take ownership of the @infrastructure-oncall alias in Slack first thing Monday morning. The previous week's oncall will provide a summary in the #g-customer-experience Slack channel with an update on alarms that came up the week before, open issues with or without direct end-user impact, and other things to keep an eye out for.
Expected response times: during business hours, 1 hour. Outside of business hours 4 hours.
Expected response times: during business hours, 1 hour. Outside of business hours, <4 hours.
If the issue is non user facing (provisioner/deprovisioner/temporary errors in osquery/etc), we'll proceed to address the issue. If the issue is user facing (as in, the user noticed this error first hand through the Fleet UI), then we'll proceed to identify the user and ping Mike McNeil in #help-p1 with information about the error (see below) so that he can contact them letting them know that we are aware of this issue and working on it.
For fleetdm.com and sandbox alarms, if the issue is not user-facing (e.g., provisioner/deprovisioner/temporary errors in osquery), the oncall engineer will proceed to address the issue. If the issue is user-facing (e.g., the user noticed the error firsthand through the Fleet UI), then the oncall engineer will identify the user and contact them to let them know that we are aware of the issue and working on a resolution. They may also request more information from the user if it is needed. They will cc the VP of Customer Success on any user correspondence.
We should collect both the email for the customer and information for the error. If the error happens in work hours, we should make a best effort to understand where in the app the error might have occurred. For this, the oncall engineer will post in `#help-engineering` the data they know of the issue and see if anybody in the frontend team can help identify what might be causing it. Its more helpful to say “we saw that you saw an error while trying to create a query” than to say “your POST /api/blah failed”, but sometimes its not exactly clear where the API might be used exactly.
For Fleet managed cloud alarms that are user-facing, the first responder should collect the customer's email address and all available information on the error. If the error occurs during business hours, the first responder should make a best effort to understand where in the app the error might have occurred. Assistance can be requested in `#help-engineering` by posting the data known about the issue; when available, a frontend or backend engineer can help identify what might be causing the problem. If the error occurs outside of business hours, the oncall engineer will contact the user to let them know that we are aware of the issue and working on a resolution. It's more helpful to say something like “we saw that you saw an error while trying to create a query” than to say “your POST /api/blah failed”.
Escalation of issues will be done by hand for the moment.
Escalation of issues will be done manually by the first responder according to the escalation contacts mentioned above. An outage issue (template available) should be created in the Fleet confidential repo addressing the following: Who was affected, and for how long? What expected behavior occurred? How do you know? What near-term resolution can be taken to recover the affected user? What is the underlying reason or suspected reason for the outage? What are the next steps Fleet will take to address the root cause?
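A minimal sketch of what such an outage issue body could look like, assuming a markdown template whose headings are hypothetical and simply mirror the questions above (the actual template lives in the Fleet confidential repo):

```md
## Outage summary
<!-- Who was affected, and for how long? -->

## Expected behavior and evidence
<!-- What expected behavior occurred? How do you know? -->

## Near-term resolution
<!-- What near-term resolution can be taken to recover the affected user? -->

## Root cause (known or suspected)
<!-- What is the underlying reason or suspected reason for the outage? -->

## Next steps
<!-- What are the next steps Fleet will take to address the root cause? -->
```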
All infrastructure alarms will go to #help-p1.
All infrastructure alarms (fleetdm.com, Fleet managed cloud, and sandbox) will go to #help-p1.
The information needed to evaluate and potentially fix any issues is documented in the [runbook](https://github.com/fleetdm/fleet/blob/main/infrastructure/sandbox/readme.md).
There will not be updates on the Fleet version running in sandbox on Fridays unless there's a critical issue being fixed.
When an infrastructure oncall engineer is out of office, Mike McNeil will serve as a backup to oncall in #help-p1. All absences must be communicated in advance to Zay Hanlon and Mike McNeil.
## Slack channels