Engineering

This handbook page details processes specific to working with and within this department.

What we do

The 🚀 Engineering department at Fleet is directly responsible for writing and maintaining the code for Fleet's core product and infrastructure.

Team

Role Contributor(s)
Director of Product Development Luke Heath (@lukeheath)
Engineering Managers George Karr (@georgekarrv), Sharon Katz (@sharon-fdm)
Product Quality Specialists Reed Haynes (@xpkoala), Sabrina Coy (@sabrinabuckets)
Infrastructure Engineers Robert Fairburn (@rfairburn)
Developers See "Current product groups"

Contact us

Scrum at Fleet

Fleet product groups employ scrum, an agile methodology, as a core practice in software development. This process is designed around sprints, which last three weeks to align with our release cadence.

New tickets are estimated, specified, and prioritized on the roadmap.

Scrum items

Our scrum boards are exclusively composed of four types of scrum items:

  1. User stories: These are simple and concise descriptions of features or requirements from the user's perspective, marked with the story label. They keep our focus on delivering value to our customers. Occasionally, due to ZenHub's ticket sub-task structure, the term 'epic' may be seen. However, we treat these as regular user stories.

  2. Sub-tasks: These smaller, more manageable tasks contribute to the completion of a larger user story. Sub-tasks are labeled as ~sub-task and enable us to break down complex tasks into more detailed and easier-to-estimate work units. Sub-tasks are always assigned to exactly one user story.

  3. Timeboxes: Tasks that are specified to complete within a pre-defined amount of time are marked with the timebox label. Timeboxes are research or investigation tasks necessary to move a prioritized user story forward, sometimes called "spikes" in scrum methodology. We use the term "timebox" because it better communicates its purpose. Timeboxes are always assigned to exactly one user story.

  4. Bugs: Representing errors or flaws that result in incorrect or unexpected outcomes, bugs are marked with the bug label. Like user stories and sub-tasks, bugs are documented, prioritized, and addressed during a sprint. Bugs may be estimated or left unestimated, as determined by the product group's engineering manager.

Our sprint boards do not accommodate any other type of ticket. By strictly adhering to these four types of scrum items, we maintain an organized and focused workflow that consistently adds value for our users.
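
Because each item type maps to a GitHub label, you can pull a quick view of a sprint board from the command line. A hedged sketch using the GitHub CLI (label and product group names are taken from this page; adjust for your group):

# List open user stories and bugs for a product group (sketch; requires `gh auth login`).
gh issue list --repo fleetdm/fleet --state open --label "story" --label "#g-mdm"
gh issue list --repo fleetdm/fleet --state open --label "bug" --label ":release"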

Sprint ceremonies

Each sprint is marked by five essential ceremonies:

  1. Sprint kickoff: On the first day of the sprint, the team, along with stakeholders, select items from the backlog to work on. The team then commits to completing these items within the sprint.
  2. Daily standup: Every day, the team convenes for updates. During this session, each team member shares what they accomplished since the last standup, their plans until the next meeting, and any blockers they are experiencing. Standups should last no longer than fifteen minutes. If additional discussion is necessary, it takes place after the standup with only the required participants.
  3. Weekly estimation sessions: The team estimates backlog items once a week (three times per sprint). These sessions help to schedule work completion and align the roadmap with business needs. They also provide estimated work units for upcoming sprints. The EM is responsible for the point values assigned to each item and ensures they are as realistic as possible.
  4. Sprint demo: On the last day of each sprint, all engineering teams and stakeholders come together to review the next release. Engineers are allotted 3-10 minutes to showcase features, improvements, and bug fixes they have contributed to the upcoming release. We focus on changes that can be demoed live and avoid overly technical details so the presentation is accessible to everyone. Feature demos should show what the feature makes possible, and bug demos should explain how the bug might have impacted existing customers and how the fix resolves that impact. (These meetings are recorded and posted publicly to YouTube or other platforms, so participants should avoid mentioning customer names. For example, instead of "Fastly", you can say "a publicly-traded hosting company", or use the customer's codename.)
  5. Sprint retrospective: Also held on the last day of the sprint, this meeting encourages discussions among the team and stakeholders around three key areas: what went well, what could have been better, and what the team learned during the sprint.

Meetings

Principles

  • Support the Maker Schedule by keeping meetings to a minimum.
  • Each individual must have a weekly or biweekly sync 1:1 meeting with their manager. This is key to making sure each individual has a voice within the organization.
  • Favor async communication when possible. This is very important to make sure every stakeholder on a project can have a clear understanding of what's happening or what was decided, without needing to attend every meeting (e.g., if a person is sick, on vacation, or otherwise unavailable).
  • If an async conversation is not proving to be effective, never hesitate to hop on or schedule a call. Always document the decisions made in a ticket, document, or whatever makes sense for the conversation.

Eng Together

This meeting is to disseminate engineering-wide announcements, promote cohesion across groups within the engineering team, and connect with engineers (and the "engineering-curious") in other departments. Held monthly for one hour.

Participants

Everyone at the company is welcome to attend. All engineers are asked to attend. The subject matter is focused on engineering.

Agenda

  • Announcements
  • Engineering KPIs review
  • “Tech talks”
    • At least one engineer from each product group demos or discusses a technical aspect of their recent work.
    • Everyone is welcome to present on a technical topic. Add your name and tech talk subject in the agenda doc included in the Eng Together calendar event.
  • Social
    • Structured and/or unstructured social activities

User story discovery

User story discovery meetings are scheduled as needed to align on large or complicated user stories. Before a discovery meeting is scheduled, the user story must be prioritized for product drafting and go through the design and specification process. When the user story is ready to be estimated, a user story discovery meeting may be scheduled to provide more dedicated, synchronous time for the team to discuss the user story than is available during weekly estimation sessions.

All participants are expected to review the user story and associated designs and specifications before the discovery meeting.

Participants

  • Product Manager
  • Product Designer
  • Engineering Manager
  • Backend Software Engineer
  • Frontend Software Engineer
  • Product Quality Specialist

Agenda

  • Product Manager: Why this story has been prioritized
  • Product Designer: Walk through user journey wireframes
  • Engineering Manager: Review specifications and any defined sub-tasks
  • Software Engineers: Clarifying questions and implementation details
  • Product Quality Specialist: Testing plan

Group weeklies

A chance for deeper, synchronous discussion on topics relevant across product groups like “Frontend weekly”, “Backend weekly”, etc.

Participants

Anyone who wishes to participate.

Sample agenda (Frontend weekly)

  • Discuss common patterns and conventions in the codebase
  • Review difficult frontend bugs
  • Write engineering-initiated stories

Engineering-initiated stories

Engineering-initiated stories are types of user stories created by engineers to make technical changes to Fleet. Technical changes should improve the user experience or contributor experience. For example, optimizing SQL that improves the response time of an API endpoint improves user experience by reducing latency. A script that generates common boilerplate, or automated tests to cover important business logic, improves the quality of life for contributors, making them happier and more productive, resulting in faster delivery of features to our customers.

It is important to frame engineering-initiated user stories the same way we frame all user stories. Stay focused on how this technical change will drive value for our users.

Engineering-initiated stories follow the user story drafting process. Once your user story is created using the new story template, add the ~engineering-initiated label, assign it to yourself, and work with an EM or PM to progress the story through the drafting process.

We prefer the term engineering-initiated stories over technical debt because the user story format helps keep us focused on our users.

Creating an engineering-initiated story

  1. Create a new feature request issue in GitHub.
  2. Ensure it is labeled with ~engineering-initiated and the relevant product group. Remove any ~customer-request label.
  3. Assign it to yourself. You will own this user story until it is either prioritized or closed.
  4. Schedule a time with an EM and/or PM to present your story. Iterate based on feedback.
  5. You, your EM or PM can bring this to Feature Fest for consideration. All engineering-initiated changes go through the same drafting process as any other story.
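
If you prefer the command line to the GitHub web form, here is a hedged sketch of steps 1-3 using the GitHub CLI. The title and product group label are placeholders, and creating the issue this way skips the web template, so paste the new story template into the body yourself:

# Create, label, and self-assign the issue (sketch; gh prompts for the body interactively).
gh issue create --repo fleetdm/fleet \
  --title "Engineering-initiated: <short description>" \
  --label "~engineering-initiated" --label "#g-mdm" \
  --assignee "@me"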

We aspire to dedicate 20% of each sprint to technical changes, but may allocate less based on customer needs and business priorities.

Documentation for contributors

Fleet's documentation for contributors can be found in the Fleet GitHub repo.

Release process

This section outlines the release process at Fleet.

The current release cadence is once every three weeks and is concentrated around Wednesdays.

Release freeze period

To ensure release quality, Fleet has a freeze period for testing beginning the Tuesday before the release at 9:00 AM Pacific. Effective at the start of the freeze period, new feature work will not be merged into main.

Bugs are exempt from the release freeze period.

Freeze day

To begin the freeze, open the repo on Merge Freeze and click the "Freeze now" button. This will freeze the main branch and require any PRs to be manually unfrozen before merging. PRs can be manually unfrozen in Merge Freeze using the PR number.

Any Fleetie can unfreeze PRs on Merge Freeze if the PR contains documentation changes or bug fixes only. If the PR contains other changes, please confirm with your manager before unfreezing.

Check dependencies

Before kicking off release QA, confirm that we are using the latest versions of dependencies we want to keep up-to-date with each release. Currently, those dependencies are:

  1. Go: Latest minor release

In Go versioning, the number after the first dot is the "major" version, while the number after the second dot is the "minor" version. For example, in Go 1.19.9, "19" is the major version and "9" is the minor version. Major version upgrades are assessed separately by engineering.

  2. macadmins-extension: Latest release

Note: Some new versions of the macadmins-extension include updates that require code changes in Fleet. Make sure to note in the bug that the update should be checked for any changes, like new tables, that require code changes in Fleet.

Our goal is to keep these dependencies up-to-date with each release of Fleet. If a release is going out with an old dependency version, it should be treated as a critical bug to make sure it is updated before the release is published.
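
A hedged sketch for spot-checking both dependencies from the command line (the go.dev version endpoint and the macadmins/osquery-extension repository slug are assumptions):

# Compare the Go toolchain pinned in go.mod with the latest published Go release.
grep '^go ' go.mod
curl -fsS "https://go.dev/VERSION?M=text" | head -n 1
# Check the latest macadmins-extension release.
gh release list --repo macadmins/osquery-extension --limit 1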

Create release QA issue

Next, create a new GitHub issue using the Release QA template. Add the release version to the title, and assign the quality assurance members of the MDM and Endpoint ops product groups.

Merging during the freeze period

We merge bug fixes and documentation changes during the freeze period, but we do not merge other code changes. This minimizes code churn and helps ensure a stable release. To merge a bug fix, you must first unfreeze the PR in Merge Freeze, and click the "Unfreeze 1 pull request" text link.

It is sometimes necessary to delay the release to allow time to complete partially merged feature work. In these cases, an exception process must be followed before merging during the freeze period.

  1. The engineer requesting the feature work merge exception during freeze notifies their Engineering Manager.
  2. The Engineering Manager notifies the QA lead for the product group and the release ritual DRI.
  3. The Engineering Manager, QA lead, and release ritual DRI must all approve the feature work PR before it is unfrozen and merged.

Release readiness

After each product group finishes their QA process during the freeze period, the EM @ mentions the release ritual DRI in the #help-qa Slack channel. When all EMs have certified that they are ready for release, the release ritual DRI begins the release process.

Release day

Documentation on completing the release process can be found here.

Deploying to dogfood

After each Fleet release, the new release is deployed to Fleet's dogfood (internal) instance.

How to deploy a new release to dogfood:

  1. Head to the Tags page on the fleetdm/fleet Docker Hub: https://hub.docker.com/r/fleetdm/fleet/tags
  2. In the Filter tags search bar, type in the latest release (e.g., v4.19.0).
  3. Locate the tag for the new release and copy the image name. An example image name is "fleetdm/fleet:v4.19.0".
  4. Head to the "Deploy Dogfood Environment" action on GitHub: https://github.com/fleetdm/fleet/actions/workflows/dogfood-deploy.yml
  5. Select Run workflow and paste the image name into the "The image tag wished to be deployed." field.

Note that this action will not handle down migrations. Always deploy a newer version than is currently deployed.

Note that "fleetdm/fleet:main" is not a image name, instead use the commit hash in place of "main".

Milestone release ritual

Immediately after publishing a new release, we close out the associated GitHub issues and milestones.

Update milestone in GitHub

  1. Rename current milestone: In GitHub, change the current milestone name from 4.x.x-tentative to 4.x.x. For example, 4.37.0-tentative becomes 4.37.0.
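
The rename can also be done with the GitHub CLI. A hedged sketch (the milestone number 123 is a placeholder; look it up first):

# List milestone numbers and titles, then rename the tentative milestone (sketch).
gh api repos/fleetdm/fleet/milestones --jq '.[] | "\(.number)\t\(.title)"'
gh api --method PATCH repos/fleetdm/fleet/milestones/123 -f title='4.37.0'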

ZenHub housekeeping

  1. Update product group boards: In ZenHub, go to each product group board tracking the current release. Usually, these are #g-endpoint-ops and #g-mdm.

  2. Remove milestone from unfinished items: If you see any items in columns other than "Ready for release" tagged with the current milestone, remove that milestone tag. These items didn't make it into the release.

  3. Prep release items: Make sure all items in the "Ready for release" column have the current milestone and sprint tags. If not, select all items in the column and apply the appropriate tags.

  4. Move user stories to drafting board: Select all items in "Ready for release" that have the story label. Apply the :product label and remove the :release label. These items will move back to the product drafting board.

  5. Confirm and close: Make sure that all items with the story label have left the "Ready for release" column. Select all remaining items in the "Ready for release" column and move them to the "Closed" column. This will close the related GitHub issues.

  6. Confirm and celebrate: Now, head to the Drafting board. Find all story issues with the current milestone (these are the ones you just moved). Move them to the "Confirm and celebrate" column. Product will close the issues during their confirm and celebrate ritual.

  7. Close GitHub milestone: Visit GitHub's milestone page and close the current milestone.

  8. Create next milestone: Create a new milestone for the next versioned release, 4.x.x-tentative.

  9. Remove the freeze: Open the repo in Merge Freeze and click the "Unfreeze" button.

  10. Announce that main is unfrozen and the milestone has been closed in #help-engineering.

Oncall rotation

The rotation

See the internal Google Doc for the engineers in the rotation.

Fleet team members can also subscribe to the shared calendar for calendar events.

New engineers are added to the oncall rotation by their manager after they have completed onboarding and at least one full release cycle. We aim to alternate the rotation between product groups when possible.

The oncall rotation may be adjusted with approval from the EMs of any product groups affected. Any changes should be made before the start of the sprint so that capacity can be planned accordingly.

Responsibilities

Second-line response

The oncall engineer is a second-line responder to questions raised by customers and community members.

The community contact (Kathy) is responsible for the first response to GitHub issues, pull requests, and Slack messages in the #fleet channel of osquery Slack, and other public Slacks. Kathy and Zay are responsible for the first response to messages in private customer Slack channels.

We respond within one hour (during business hours) to interactions and ask the oncall engineer to address any questions sent their way promptly. When Kathy is unavailable, the oncall engineer may sometimes be asked to take over the first response duties. Note that we do not need to have answers within one hour -- we need to at least acknowledge and collect any additional necessary information, while researching/escalating to find answers internally. See Escalations for more on this.

Response SLAs help us measure and guarantee the responsiveness that a customer can expect from Fleet. But SLAs aside, when a Fleet customer has an emergency or other time-sensitive situation ongoing, it is Fleet's priority to help them find a solution quickly.

PR reviews

PRs from Fleeties are reviewed by auto-assignment of codeowners, or by selecting the group or reviewer manually.

PRs should remain in draft until they are ready to be reviewed for final approval; this means the feature is complete and tests have already been added. This helps keep our active list of PRs relevant and focused. It is OK and encouraged to request feedback while a PR is in draft to engage the team.

All PRs from the community are routed through the oncall engineer. For documentation changes, the community contact (Kathy) is assigned by the oncall engineer. For code changes, if the oncall engineer has the knowledge and confidence to review, they should do so. Otherwise, they should request a review from an engineer with the appropriate domain knowledge. It is the oncall engineer's responsibility to monitor community PRs and make sure that they are moved forward (either by review with feedback or merge).

Customer success meetings

The oncall engineer is encouraged to attend some of the customer success meetings during the week. Post a message to the #g-endpoint-ops Slack channel requesting invitations to upcoming meetings.

This has a dual purpose of providing more context for how our customers use Fleet. The engineer should actively participate and provide input where appropriate (if not sure, please ask your manager or organizer of the call).

Improve documentation

The oncall engineer is asked to read, understand, test, correct, and improve at least one doc page per week. Our goal is to (1) ensure accuracy and verify that our deployment guides and tutorials are up to date and work as expected, and (2) improve the readability, consistency, and simplicity of our documentation with empathy towards first-time users. See Writing documentation for writing guidelines, and don't hesitate to reach out to #g-digital-experience on Slack for writing support. A backlog of documentation improvement needs is kept here.

Clearing the plate

Engineering managers are asked to be aware of the oncall rotation and schedule a light workload for engineers while they are oncall. While it varies week to week considerably, the oncall responsibilities can sometimes take up a substantial portion of the engineer's time.

We aspire to clear sprint work for the oncall engineer, but due to capacity or other constraints, sometimes the oncall engineer is required for sprint work. When this is the case, the EM will work with the oncall engineer to take over support requests or @oncall assignment completely when necessary.

The remaining time after fulfilling the responsibilities of oncall is free for the engineer to choose their own path. Please choose something relevant to your work or Fleet's goals to focus on. If unsure, speak with your manager.

Some ideas:

  • Do training/learning relevant to your work.
  • Improve the Fleet developer experience.
  • Hack on a product idea. Note: Experiments are encouraged, but not all experiments will ship! Check in with the product team before shipping user-visible changes.
  • Create a blog post (or other content) for fleetdm.com.
  • Try out an experimental refactor.

How to reach the oncall engineer

Oncall engineers do not need to actively monitor Slack channels, except when called in by the Community or Customer teams. Members of those teams are instructed to @oncall in #help-engineering to get the attention of the oncall engineer to continue discussing any issues that come up. In some cases, the Community or Customer representative will continue to communicate with the requestor. In others, the oncall engineer will communicate directly (team members should use their judgment and discuss on a case-by-case basis how to best communicate with community members and customers).

Escalations

When the oncall engineer is unsure of the answer, they should follow this process for escalation.

To achieve quick "first-response" times, you are encouraged to say something like "I don't know the answer and I'm taking it back to the team," or "I think X, but I'm confirming that with the team (or by looking in the code)."

How to escalate:

  1. Spend 30 minutes digging into the relevant code (osquery, Fleet) and/or documentation (osquery, Fleet). Even if you don't know the codebase (or even the programming language), you can sometimes find good answers this way. At the least, you'll become more familiar with each project. Try searching the code for relevant keywords, or filenames.

  2. Create a new thread in the #help-engineering channel, tagging @zwass and provide the information turned up in your research. Please include possibly relevant links (even if you didn't find what you were looking for there). Zach will work with you to craft an appropriate answer or find another team member who can help.

Handoff

The oncall engineer changes each week on Wednesday.

A Slack reminder should notify the oncall of the handoff. Please do the following:

  1. The new oncall engineer should change the @oncall alias in Slack to point to them. In the search box, type "people" and select "People & user groups." Switch to the "User groups" tab. Click @oncall. In the right sidebar, click "Edit Members." Remove the former oncall, and add yourself.

  2. Hand off newer conversations (Slack threads, issues, PRs, etc.). For more recent threads, the former oncall can unsubscribe from the thread, and the new oncall should subscribe. The former oncall should explicitly share each of these threads and the new oncall can select "Get notified about new replies" in the "..." menu. The former oncall can select "Turn off notifications for replies" in that same menu. It can be helpful for the former oncall to remain available for any conversations they were deeply involved in, so use your judgment on which threads to hand off. Anything not clearly handed off remains the responsibility of the former oncall engineer.

In the Slack reminder thread, the oncall engineer includes their retrospective. Please answer the following:

  1. What were the most common support requests over the week? This can potentially give the new oncall an idea of which documentation to focus their efforts on.

  2. Which documentation page did you focus on? What changes were necessary?

  3. How did you spend the rest of your oncall week? This is a chance to demo or share what you learned.

Incident postmortems

At Fleet, we take customer incidents very seriously. After working with customers to resolve issues, we will conduct an internal postmortem to determine any documentation or coding changes to prevent similar incidents from happening in the future. Why? We strive to make Fleet the best osquery management platform globally, and we sincerely believe that starts with sharing lessons learned with the community to become stronger together.

At Fleet, we do postmortem meetings for every production incident, whether it's a customer's environment or on fleetdm.com.

Postmortem document

Before running the postmortem meeting, copy this Postmortem Template document and populate it with some initial data to enable a productive conversation.

Postmortem meeting

Invite all stakeholders, typically the team involved and QA representatives.

Follow the document topic by topic. Keep the goal in mind: capture action items that address the root cause and make sure a similar incident will not happen again.

Distinguish between the root cause of the bug, which by that time has been solved and released, and the root cause of why the issue reached our customers; these can be different issues. For example, the root cause of the bug may be a coding issue, while the root causes (plural) of the event may be that the test plan did not cover a specific scenario, a lack of testing, and a lack of metrics to identify the issue quickly.

Example Finished Document

Postmortem action items

Each action item will have an owner who is responsible for creating a GitHub issue promptly after the meeting. This GitHub issue should be prioritized with the relevant PM/EM.

Outages

At Fleet, we consider an outage to be a situation where new features or previously stable features are broken or unusable.

  • Occurrences of outages are tracked in the Outages spreadsheet.
  • Fleet encourages embracing the inevitability of mistakes and discourages blame games.
  • Fleet stresses the critical importance of avoiding outages because they make customers' lives worse instead of better.

Scaling Fleet

Fleet, as a Go server, scales horizontally very well. It's not very CPU- or memory-intensive. However, there are some specific gotchas to be aware of when implementing new features. Visit our scaling Fleet page for tips on scaling Fleet as efficiently and effectively as possible.

Load testing

The load testing page outlines the process we use to load test Fleet, and contains the results of our semi-annual load test.

Version support

To provide the most accurate and efficient support, Fleet will only target fixes based on the latest released version. Fixes land in the current version; Fleet will not backport to older releases.

Community version supported for bug fixes: Latest version only

Community support for support/troubleshooting: Current major version

Premium version supported for bug fixes: Latest version only

Premium support for support/troubleshooting: All versions

Reviewing PRs from the community

If you're assigned a community pull request for review, it is important to keep things moving for the contributor. The goal is to not go more than one business day without following up with the contributor.

A PR should be merged if:

  • It's a change that is needed and useful.
  • The CI is passing.
  • Tests are in place.
  • Documentation is updated.
  • Changes file is created.

For PRs that aren't ready to merge:

  • Thank the contributor for their hard work and explain why we can't merge the changes yet.
  • Encourage the contributor to reach out in the #fleet channel of osquery Slack to get help from the rest of the community.
  • Offer code review and coaching to help get the PR ready to go (see note below).
  • Keep an eye out for any updates or responses.

Sometimes (typically for Fleet customers), a Fleet team member may add tests and make any necessary changes to merge the PR.

If everything is good to go, approve the review.

For PRs that will not be merged:

  • Thank the contributor for their effort and explain why the changes won't be merged.
  • Close the PR.

Merging community PRs

When merging a pull request from a community contributor:

  • Ensure that the checklist for the submitter is complete.
  • Verify that all necessary reviews have been approved.
  • Merge the PR.
  • Thank and congratulate the contributor.
  • Share the merged PR with the team in the #help-promote channel of Fleet Slack to be publicized on social media. Those who contribute to Fleet and are recognized for their contributions often become great champions for the project.

Changes to tables' schema

Whenever a PR is proposed for making changes to our tables' schema (e.g., to schema/tables/screenlock.yml), it also has to be reflected in our osquery_fleet_schema.json file.

The website team will periodically update the JSON file with the latest changes. If the changes should be deployed sooner, you can generate the new JSON file yourself by running these commands:

cd website
./node_modules/sails/bin/sails.js run generate-merged-schema

When adding a new table, make sure it does not already exist with the same name. If it does, consider changing the new table name or merging the two tables if it makes sense.

If a table is added to our ChromeOS extension but does not exist in osquery, or if it is a table added by fleetd, add a note that mentions it, as in this example.

Quality

Human-oriented QA

Fleet uses a human-oriented quality assurance (QA) process to make sure the product meets the standards of users and organizations.

Automated tests are important, but they can't catch everything. Many issues are hard to notice until a human looks empathetically at the user experience, whether in the user interface, the REST API, or the command line.

The goal of quality assurance is to identify corrections and improvements before release:

  • Bugs
  • Edge cases
  • Error message UX
  • Developer experience using the API/CLI
  • Operator experience looking at logs
  • API response time latency
  • UI comprehensibility
  • Simplicity
  • Data accuracy
  • Perceived data freshness

Finding bugs

To try Fleet locally for QA purposes, run fleetctl preview, which defaults to running the latest stable release.

To target a different version of Fleet, use the --tag flag to target any tag in Docker Hub, including any git commit hash or branch name. For example, to QA the latest code on the main branch of fleetdm/fleet, you can run: fleetctl preview --tag=main.

To start a preview without starting the simulated hosts, use the --no-hosts flag (e.g., fleetctl preview --no-hosts).
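
For reference, the preview commands described above:

fleetctl preview                  # latest stable release
fleetctl preview --tag=main       # latest code on the main branch
fleetctl preview --no-hosts       # skip starting simulated hosts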

For each bug found, please use the bug report template to create a new bug report issue.

For unreleased bugs in an active sprint, a new bug is created with the ~unreleased bug label. The :release label and associated product group label are added, and the engineer responsible for the feature is assigned. If QA is unsure who the bug should be assigned to, it is assigned to the EM. Fixing the bug becomes part of the story.

Debugging

You can read our guide to diagnosing issues in Fleet on the debugging page.

Bug process

All bugs in Fleet are tracked by QA on the bugs board in ZenHub.

Bug states

The lifecycle stages of a bug at Fleet are:

  1. Inbox
  2. Reproduced
  3. In product drafting (as needed)
  4. In engineering
  5. Awaiting QA

The above are all the possible states for a bug as envisioned in this process. These states each correspond to a set of GitHub labels, assignees, and boards.

See Bug states and filters at the end of this document for descriptions of these states and links to each GitHub filter.

Inbox

Quickly reproducing bug reports is a priority for Fleet.

When a new bug is created using the bug report form, it is in the "inbox" state.

In this state, the bug review DRI (QA) is responsible for going through the inbox and documenting reproduction steps, asking for more reproduction details from the reporter, or asking the product team for more guidance. QA has 1 business day to move the bug to the next step (reproduced).

For community-reported bugs, this may require QA to gather more information from the reporter. QA should reach out to the reporter if more information is needed to reproduce the issue. Reporters are encouraged to provide timely follow-up information for each report. If two weeks pass since the last communication, QA will ping the reporter for an update on the status of the issue. After four weeks without a response, QA will close the issue. Reporters are welcome to re-open the closed issue if more investigation is warranted.

Once reproduced, QA documents the reproduction steps in the description and moves it to the reproduced state. If QA or the engineering manager feels the bug report may be expected behavior, or if clarity is required on the intended behavior, it is assigned to the group's product manager. See on GitHub.

Weekly bug review

QA has a weekly check-in with product to go over the inbox items. QA is responsible for proposing “not a bug”, closing due to lack of response (with a nice message), or raising other relevant questions. All of these require product agreement.

QA may also propose that a reported bug is not actually a bug. A bug is defined as “behavior that is not according to spec or implied by spec.” If agreed that it is not a bug, then it's assigned to the relevant product manager to determine its priority.

Reproduced

QA has reproduced the issue successfully. It should now be transferred to engineering.

Remove the “reproduce” label, add the label of the relevant team (e.g. #g-endpoint-ops, #g-mdm, #g-infra, #g-website), and assign it to the relevant engineering manager. (Make your best guess as to which team. The EM will re-assign if they think it belongs to another team.) See on GitHub.
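
A hedged sketch of this handoff using the GitHub CLI (the issue number, team label, and assignee are placeholders):

# Move a reproduced bug to engineering (sketch).
gh issue edit 12345 --repo fleetdm/fleet \
  --remove-label "reproduce" \
  --add-label "#g-mdm" \
  --add-assignee "georgekarrv"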

Fast track for Fleeties

Fleeties do not have to wait for QA to reproduce the bug. If you're confident it's reproducible, it's a bug, and the reproduction steps are well-documented, it can be moved directly to the reproduced state.

In product drafting (as needed)

If a bug requires input from product, the :product label is added, it is assigned to the product group's PM, and the bug is moved to the "Product drafting" column of the bugs board. It will stay in this state until product closes the bug, or removes the :product label and assigns to an EM.

In engineering

A bug is in engineering after it has been reproduced and assigned to an EM. If a bug meets the criteria for a critical bug, the :release and ~critical bug labels are added, and it is moved to the "Current release" column of the bugs board. If the bug is a ~critical bug, the EM follows the critical bug notification process.

If the bug does not meet the criteria of a critical bug, the EM will determine if there is capacity in the current sprint for this bug. If so, the :release label is added, and it is moved to the "Current release" column on the bugs board. If there is no available capacity in the current sprint, the EM will move the bug to the "Sprint backlog" column where it will be prioritized for the next sprint.

When fixing the bug, if the proposed solution requires changes that would affect the user experience (UI, API, or CLI), notify the EM and PM to align on the acceptability of the change.

Engineering teams coordinate on bug fixes with the product team during the joint sprint kick-off review. If one team is at capacity and a bug needs attention, another team can step in to assist by following these steps:

For MDM support on Endpoint ops bugs:

  • Remove the #g-endpoint-ops label and add #g-mdm label.
  • Add ~assisting g-endpoint-ops to clarify the bug's origin.

For Endpoint ops support on MDM bugs:

  • Remove the #g-mdm label and add #g-endpoint-ops label.
  • Add ~assisting g-mdm to clarify the bug's origin.

Fleet always prioritizes bugs into a release within six weeks. If a bug is not prioritized in the current release, and it is not prioritized in the next release, it is removed from the "Sprint backlog" and placed back in the "Product drafting" column with the :product label. Product will determine if the bug should be closed as accepted behavior, or if further drafting is necessary.

Awaiting QA

Bugs will be verified as fixed by QA when they are placed in the "Awaiting QA" column of the relevant product group's sprint board. If the bug is verified as fixed, it is moved to the "Ready for release" column of the sprint board. Otherwise, the remaining issues are noted in a comment, and it is moved back to the "In progress" column of the sprint board.

All bugs

Bugs opened this week

This filter returns all "bug" issues opened after the specified date. Simply replace the date with a YYYY-MM-DD equal to one week ago. See on GitHub.

Bugs closed this week

This filter returns all "bug" issues closed after the specified date. Simply replace the date with a YYYY-MM-DD equal to one week ago. See on GitHub.
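
To compute the date for either filter, a hedged shell sketch (GNU and BSD/macOS date flags differ):

date -d '7 days ago' +%Y-%m-%d    # GNU/Linux
date -v-7d +%Y-%m-%d              # macOS / BSD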

Release testing

When a release is in testing, QA should use the Slack channel #help-qa to keep everyone aware of issues found. All bugs found should be filed as bug reports first and then reported in the channel.

When a critical bug is found, the Fleetie who labels the bug as critical is responsible for following the critical bug notification process below.

All unreleased bugs are addressed before publishing a release. Released bugs that are not critical may be addressed during the next release per the standard bug process.

Release blockers

Product may add the ~release blocker label to user stories to indicate that the story must be completed to publish the next version of Fleet. Bugs are never labeled as release blockers.

Critical bugs

A critical bug is a bug with the ~critical bug label. A critical bug is defined as behavior that:

  • Blocks the normal use of a workflow
  • Prevents upgrades to Fleet
  • Causes irreversible damage, such as data loss
  • Introduces a security vulnerability

Critical bug notification process

We need to inform customers and the community about critical bugs immediately so they don't trigger it themselves. When a bug meeting the definition of critical is found, the bug finder is responsible for raising an alarm. Raising an alarm means pinging @here in the #help-product-design channel with the filed bug.

If the bug finder is not a Fleetie (e.g., a member of the community), then whoever sees the critical bug should raise the alarm. (We would expect this to be Customer success in the community Slack or QA in the bug inbox, though it could be anyone.) Note that the bug finder here is NOT necessarily the first person who sees the bug. If you come across a bug you think is critical, but it has not been escalated, raise the alarm!

Once raised, product confirms whether or not it's critical and defines expected behavior. When outside of working hours for the product team, or if no one from product responds within 1 hour, fall back to #help-p1.

Once the critical bug is confirmed, Customer success needs to ping both customers and the community to warn them. If Customer success is not available, the oncall engineer is responsible for doing this. If a quick fix workaround exists, that should be communicated as well for those who are already upgraded.

When a critical bug is identified, we will then follow the patch release process in our documentation.

Measurement

We track the success of this process by observing the throughput of issues through the system and identifying where buildups (and therefore bottlenecks) are occurring. The metrics are:

  • Number of bugs opened this week
  • Total # bugs open
  • Bugs in each state (inbox, acknowledged, reproduced)
  • Number of bugs closed this week

Each week these are tracked and shared in the weekly KPI sheet by Luke Heath.

Definitions

In the above process, any reference to "product" refers to: Noah Talerman, Head of Product Design. In the above process, any reference to "QA" refers to: Reed Haynes, Product Quality Specialist.

Infrastructure

The infrastructure product group is responsible for deploying, supporting, and maintaining all Fleet-managed cloud deployments.

The following are quick links to infrastructure-related README files in both public and private repos that can be used as a quick reference for infrastructure-related code:

Best practices

The infrastructure team follows industry best practices when designing and deploying infrastructure. For containerized infrastructure, Google has created a reference document that serves as an ideal guide for these practices.

Many of these practices must be implemented in Fleet directly, and engineering will work to ensure that feature implementation follows these practices. The infrastructure team will make itself available to provide guidance as needed. If a feature is not compatible with these practices, an issue will be created with a request to correct the implementation.

24/7 on-call

The 24/7 on-call (aka infrastructure on-call) is responsible for alarms related to fleetdm.com and Fleet managed cloud, as well as delivering 24/7 support for Fleet Ultimate customers. The infrastructure (24/7) on-call responsibility happens in shifts of one week. The people involved in them will be:

First responders:

  • Zachary Winnerman
  • Robert Fairburn

Escalations (in order):

  • Luke Heath
  • Zach Wasserman (Fleet app)
  • Eric Shaw (fleetdm.com)
  • Mike McNeil

The first responder on-call will take ownership of the @infrastructure-oncall alias in Slack first thing Monday morning. The previous week's on-call will provide a summary in the #g-infra Slack channel with an update on alarms that came up the week before, open issues with or without direct end-user impact, and other issues to keep an eye out for.

Expected response times: 1 hour during business hours, and under 4 hours outside of business hours.

For fleetdm.com alarms, if the issue is not user-facing, the on-call engineer will proceed to address the issue. If the issue is user-facing (e.g. the user noticed this error first-hand through the Fleet UI), then the on-call engineer will proceed to identify the user and contact them letting them know that we are aware of the issue and working on a resolution. They may also request more information from the user if it is needed. They will cc the EM and PM of the #g-infra group on any user correspondence.

For Fleet managed cloud alarms that are user-facing, the first responder should collect the email address of the customer and all available information on the error. If the error occurs during business hours, the first responder should make their best effort to understand where in the app the error might have occurred. Assistance can be requested in #help-engineering by including the data they know regarding the issue, and when available, a frontend or backend engineer can help identify what might be causing the problem. If the error occurs outside of business hours, the on-call engineer will contact the user letting them know that we are aware of the issue and working on a resolution. It's more helpful to say something like “we saw that you received an error while trying to create a query” than to say “your POST /api/blah failed”.

Escalation of issues will be done manually by the first responder according to the escalation contacts mentioned above. An outage issue (template available) should be created in the Fleet confidential repo addressing:

  1. Who was affected and for how long?
  2. What expected behavior occurred?
  3. How do you know?
  4. What near-term resolution can be taken to recover the affected user?
  5. What is the underlying reason or suspected reason for the outage?
  6. What are the next steps Fleet will take to address the root cause?
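
A hedged sketch for opening that issue from the command line (the confidential repository slug is an assumption; gh prompts for the body, where the outage template answers to the questions above should be pasted):

gh issue create --repo fleetdm/confidential \
  --title "Outage: <short description> (<date>)"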

All infrastructure alarms (fleetdm.com and Fleet managed cloud) will go to #help-p1.

The information needed to evaluate and potentially fix any issues is documented in the runbook.

When an infrastructure on-call engineer is out of the office, Zach Wasserman will serve as a backup to on-call in #help-p1. All absences must be communicated in advance to Luke Heath and Zach Wasserman.

Accounts

Engineering is responsible for managing third-party accounts required to support engineering infrastructure.

Apple developer account

We use the official Fleet Apple developer account to notarize installers we generate for Apple devices. Whenever Apple releases new terms of service, we are unable to notarize new packages until the new terms are accepted.

When this occurs, we will begin receiving the following error message when attempting to notarize packages: "You must first sign the relevant contracts online." To resolve this error, follow the steps below.

  1. Visit the Apple developer account login page.

  2. Log in using the credentials stored in 1Password under "Apple developer account".

  3. Contact the Head of Business Operations to determine which phone number to use for 2FA.

  4. Complete the 2FA process to log in.

  5. Accept the new terms of service.

Responsibilities

Work in progress. Contributions welcome; please make only one small change per PR. See https://fleetdm.com/handbook/company/leadership#vision-for-dept-handbook-pages for more information.

Interview a developer candidate

As a hiring manager, ensure the interview process follows these steps in order. This process comes after creating a new position and receiving job applications. Once the position is approved, manage this process per candidate in the hiring pipeline.

  1. Reach out: If you are not already the primary contact for this candidate, send an email or LinkedIn message introducing yourself and your intent to start the interview process. Include a link to the position and ask if they are comfortable completing a coding exercise.
  2. Deliver code prompt: After receiving confirmation that they are interested, download the zip of the code challenge and ask them to complete it and send their entry back within 5 business days.
  3. Test code prompt: Verify the code runs and can complete the challenge correctly. Check the code for good style and tests that match our standards here at Fleet.
  4. Schedule manager interview: Send the candidate a Calendly link for a 1-hour conversation with you to screen whether they are a good fit for this role and our culture.
  5. Schedule technical interview: Send the candidate a Calendly link for a 1-hour conversation with a senior engineer on your team; the goal is to understand the technical capabilities of the candidate. An additional engineer can optionally join if available.
  6. Schedule DOPD interview: Send the candidate a Calendly link for a 30-minute conversation with the Director of Product Development, @lukeheath.
  7. Schedule CTO interview: Send the candidate a Calendly link for a 30-minute conversation with our CTO, @zwass.

If the candidate passes all of these steps, continue with hiring a new team member.

Renew MDM certificate signing request (CSR)

The certificate signing request (CSR) certificate expires every year and needs to be renewed before it expires. The team is notified by the MDM calendar event "IMPORTANT: Renew MDM CSR certificate".

Steps to renew the certificate:

  1. Visit the Apple developer account login page.
  2. Log in using the credentials stored in 1Password under Apple developer account.
  3. Verify you are using the Enterprise subaccount for Fleet Device Management Inc.
  4. Generate a new certificate following the instructions in MicroMDM.
  5. Note: mdmctl (a MicroMDM command for MDM vendors) will generate a VendorPrivateKey.key and a VendorCertificateRequest.csr using an appropriate shared email relay and a passphrase (suggested generation method: pwgen -s 32 -1vcy; pwgen is available via brew / apt / yum).
  6. Upload VendorCertificateRequest.csr to Apple and download the corresponding mdm.cer file.
  7. Convert the downloaded cert to PEM with openssl x509 -inform DER -outform PEM -in mdm.cer -out server.crt.pem
  8. Update the Config vars in Heroku:
  • Update sails_custom__mdmVendorCertPem with the server.crt.pem generated in step 7
  • Update sails_custom__mdmVendorKeyPassphrase with the passphrase used in step 4
  • Update sails_custom__mdmVendorKeyPem with VendorPrivateKey.key from step 4
  9. Store the updated values in the Confidential 1Password vault.
  10. Verify by logging into a normal Apple account (not billing@...) and generating a new push certificate following our MDM setup steps; verify the expiration date is one year from today.
  11. Adjust the calendar event to be 2-4 weeks before the next expiration.
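
For step 8 ("Update the Config vars in Heroku"), a hedged sketch with the Heroku CLI (the app name is a placeholder):

heroku config:set --app <heroku-app-name> \
  sails_custom__mdmVendorCertPem="$(cat server.crt.pem)" \
  sails_custom__mdmVendorKeyPem="$(cat VendorPrivateKey.key)" \
  sails_custom__mdmVendorKeyPassphrase='<passphrase from step 4>'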

Rituals

Stubs

Scrum boards

Please see 📖handbook/company/engineering#contact-us