mirror of https://github.com/empayre/fleet.git synced 2024-11-06 08:55:24 +00:00

README.md

Engineering

This handbook page details processes specific to working with and within this department.

Team

Role	Contributor(s)
Head of Product Engineering	Luke Heath (@lukeheath)
Engineering Manager	George Karr (@georgekarrv), Sharon Katz (@sharon-fdm)
Product Quality Specialist	Reed Haynes (@xpkoala), Sabrina Coy (@sabrinabuckets)
Developer	See "Current product groups"

Contact us

- Any community member can file a 🦟 "Bug report". If urgent, mention a team member in the #help-engineering Slack channel.
- Any Fleet team member can view the Endpoint ops, MDM, or Website kanban boards including the status on all reported bugs.
- Please use issue comments and GitHub mentions to communicate follow-ups or answer questions related to your request.

Responsibilities

The 🚀 Engineering department at Fleet is directly responsible for writing and maintaining the code for Fleet's core product and infrastructure.

Record engineering KPIs

We track the success of this process by observing the throughput of issues through the system and identifying where buildups (and therefore bottlenecks) are occurring. The metrics are:

Number of bugs opened this week
Total # bugs open
Bugs in each state (inbox, acknowledged, reproduced)
Number of bugs closed this week

Each week these are tracked and shared in the weekly KPI sheet by Luke Heath.
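
As an illustration of the four metrics above, here is a minimal sketch that computes them from a list of bug records. The record shape (the `opened`, `closed`, and `state` keys) and the `bug_kpis` name are assumptions made for this example; they are not taken from Fleet's KPI sheet or tooling.

```python
from collections import Counter
from datetime import date, timedelta


def bug_kpis(bugs, week_start):
    """Compute the weekly bug metrics from hypothetical bug records.

    Each record is assumed to look like:
    {"opened": date, "closed": date or None, "state": "inbox" | "acknowledged" | "reproduced"}
    """
    week_end = week_start + timedelta(days=7)

    def in_week(day):
        # True when the day falls inside the seven-day window starting at week_start.
        return day is not None and week_start <= day < week_end

    open_bugs = [bug for bug in bugs if bug["closed"] is None]
    return {
        "opened_this_week": sum(in_week(bug["opened"]) for bug in bugs),
        "total_open": len(open_bugs),
        "open_by_state": dict(Counter(bug["state"] for bug in open_bugs)),
        "closed_this_week": sum(in_week(bug["closed"]) for bug in bugs),
    }
```

A growing count in one state from week to week is the kind of buildup the paragraph above describes as a bottleneck.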

Begin a merge freeze

To ensure release quality, Fleet has a freeze period for testing beginning the Tuesday before the release at 9:00 AM Pacific. Effective at the start of the freeze period, new feature work will not be merged into main.
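
The freeze start can be derived from the release date. The sketch below assumes that "the Tuesday before the release" means the most recent Tuesday strictly before the release day, and it returns a naive datetime that should be read as Pacific wall-clock time. The `freeze_start` name is hypothetical.

```python
from datetime import date, datetime, time, timedelta


def freeze_start(release_day: date) -> datetime:
    """Return the freeze start (9:00 AM, Pacific wall-clock time) for a release day."""
    # date.weekday() numbers Monday as 0, so Tuesday is 1.
    days_back = (release_day.weekday() - 1) % 7
    # Assumption: a release that itself falls on a Tuesday freezes one week earlier.
    if days_back == 0:
        days_back = 7
    return datetime.combine(release_day - timedelta(days=days_back), time(9, 0))


print(freeze_start(date(2024, 2, 26)))  # 2024-02-20 09:00:00
```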

Bugs are exempt from the release freeze period.

To begin the freeze, open the repo on Merge Freeze and click the "Freeze now" button. This will freeze the main branch and require any PRs to be manually unfrozen before merging. PRs can be manually unfrozen in Merge Freeze using the PR number.

Any Fleetie can unfreeze PRs on Merge Freeze if the PR contains documentation changes or bug fixes only. If the PR contains other changes, please confirm with your manager before unfreezing.

Merge a pull request during the freeze period

We merge bug fixes, documentation changes, and website updates during the freeze period, but we do not merge other code changes. This minimizes code churn and helps ensure a stable release. To merge a bug fix, you must first unfreeze the PR in Merge Freeze, and click the "Unfreeze 1 pull request" text link.

To allow a stable release test, the final 24 hours before release is a deep freeze when only bugs with the ~release-blocker or ~unreleased-bug labels are merged.
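
The merge rules for the freeze and deep freeze can be summarized as a small decision function. This is a sketch only: the change-type strings and the `may_merge` name are made up for illustration, and the label names are copied as written in the paragraph above. It does not model the exception process for feature work.

```python
# Change types that may merge during the regular freeze (from the text above).
FREEZE_ALLOWED = {"bug fix", "documentation", "website"}
# Labels that qualify a bug fix for merging during the deep freeze.
DEEP_FREEZE_LABELS = {"~release-blocker", "~unreleased-bug"}


def may_merge(change_type, labels=(), deep_freeze=False):
    """Return True if a PR of this type may be unfrozen and merged."""
    if deep_freeze:
        # Final 24 hours: only bug fixes carrying one of the two labels.
        return change_type == "bug fix" and bool(DEEP_FREEZE_LABELS & set(labels))
    # Regular freeze: bug fixes, documentation changes, and website updates.
    return change_type in FREEZE_ALLOWED
```

For example, `may_merge("documentation")` is true during the regular freeze, while `may_merge("bug fix", deep_freeze=True)` is false unless one of the two labels is present.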

If there is partially merged feature work when freeze begins, the previously merged code must be reverted. If there is an exceptional, business-critical need to merge feature work during freeze, as determined by the release ritual DRI, the following exception process may be followed:

The engineer requesting the feature work merge exception during freeze notifies their Engineering Manager.
The Engineering Manager notifies the QA lead for the product group and the release ritual DRI.
The Engineering Manager, QA lead, and release ritual DRI must all approve the feature work PR before it is unfrozen and merged.

This exception process should be avoided whenever possible. Any feature work merged during freeze will likely result in a significant release delay.

Confirm latest versions of dependencies

Before kicking off release QA, confirm that we are using the latest versions of dependencies we want to keep up-to-date with each release. Currently, those dependencies are:

Go: Latest minor release

Check the version included in Fleet.
Check the latest minor version of Go. For example, if we are using go1.19.8, and there is a new minor version go1.19.9, we will upgrade.
If the latest minor version is greater than the version included in Fleet, file a bug and assign it to the release ritual DRI and the current on-call engineer. Add the ~release blocker label. We must upgrade to the latest minor version before publishing the next release.
If the latest major version is greater than the version included in Fleet, create a story and assign it to the release ritual DRI and the current on-call engineer. This will be considered for an upcoming sprint. The release can proceed without upgrading the major version.
Note that major version upgrades also require an update to go.mod.

In Go versioning, the number after the first dot is the "major" version, while the number after the second dot is the "minor" version. For example, in Go 1.19.9, "19" is the major version and "9" is the minor version. Major version upgrades are assessed separately by engineering.
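
The decision above can be sketched as a comparison of two version strings, using this page's terminology (in go1.19.9, "19" is the major version and "9" is the minor version). The function names and return strings are hypothetical, and only plain versions such as go1.19.8 or go1.20 are handled.

```python
import re


def parse_go_version(version: str) -> tuple[int, int, int]:
    """Parse 'go1.19.8' into (1, 19, 8); a missing last part counts as 0."""
    match = re.fullmatch(r"go(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized Go version: {version!r}")
    first, major, minor = match.groups()
    return int(first), int(major), int(minor or 0)


def classify_go_upgrade(included: str, latest: str) -> str:
    """Map a version difference to the follow-up described in the steps above."""
    current = parse_go_version(included)
    newest = parse_go_version(latest)
    if newest[:2] > current[:2]:
        # Newer major version: create a story; the release can proceed.
        return "story"
    if newest > current:
        # Newer minor version: file a bug with the ~release blocker label.
        return "release blocker bug"
    return "up to date"


print(classify_go_upgrade("go1.19.8", "go1.19.9"))  # release blocker bug
```

In practice both situations can apply at once (a newer minor release of the current line and a newer major release), so the check would be run once against each.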

macadmins-extension: Latest release

Check the latest version of the macadmins-extension.
Check the version included in Fleet.
If the latest stable version of the macadmins-extension is greater than the version included in Fleet, file a bug and assign it to the release ritual DRI and the current on-call engineer.
Add the ~release blocker label.

Note: Some new versions of the macadmins-extension include updates that require code changes in Fleet. Make sure to note in the bug that the update should be checked for any changes, like new tables, that require code changes in Fleet.

Our goal is to keep these dependencies up-to-date with each release of Fleet. If a release is going out with an old dependency version, it should be treated as a critical bug to make sure it is updated before the release is published.

osquery: Latest release

Check the latest version of osquery.
Check the version included in Fleet.
If the latest release of osquery is greater than the version included in Fleet, file a bug and assign it to the release ritual DRI and the current on-call engineer.
Do not add the ~release blocker label.
Update the bug description to note that changes to osquery command-line flags require updates to Fleet's flag validation and related documentation as shown in this pull request.

Vulnerability data sources

Check the NIST National Vulnerability Database website for any announcements that might impact our NVD data feed.
Check the CISA website for any news or announcements that might impact our CISA data feed.

If an announcement is found for either data source that may impact data feed availability, notify the current on-call engineer. Notify them that it is their responsibility to investigate and file a bug or take further action as necessary.

Fleetd components

Check for code changes to Orbit or Desktop since the last orbit-* tag was published.
Check for code changes to the fleetd-chrome extension since the last fleetd-chrome-* tag was published.

If code changes are found for any fleetd components, create a new release QA issue to update fleetd. Create and assign the release QA issue to a corresponding GitHub milestone for each tag that will be issued (fleet-, orbit-, fleetd-chrome-).

Create release QA issue

Next, create a new GitHub issue using the Release QA template. Add the release version to the title, and assign the quality assurance members of the MDM and Endpoint ops product groups.

The issue's template will contain validation steps for Fleet and individual fleetd components. Remove any instructions that do not apply to this release.

Indicate your product group is release-ready

Once a product group completes its QA process during the freeze period, its QA lead moves the smoke testing ticket to the "Ready for release" column on their ZenHub board. They then notify the release ritual DRI by tagging them in a comment, indicating that their group is prepared for release. The release ritual DRI starts the release process after all QA leads have made these updates and confirmed their readiness for release.

Prepare Fleet release

Documentation on completing the release process can be found here.

Deploy a new release to dogfood

After each Fleet release, the new release is deployed to Fleet's "dogfood" (internal) instance.

How to deploy a new release to dogfood:

Head to the Tags page on the fleetdm/fleet Docker Hub: https://hub.docker.com/r/fleetdm/fleet/tags
In the Filter tags search bar, type in the latest release (ex. v4.19.0).
Locate the tag for the new release and copy the image name. An example image name is "fleetdm/fleet:v4.19.0".
Head to the "Deploy Dogfood Environment" action on GitHub: https://github.com/fleetdm/fleet/actions/workflows/dogfood-deploy.yml
Select "Run workflow" and paste the image name in the "The image tag wished to be deployed." field.

Note that this action will not handle down migrations. Always deploy a newer version than is currently deployed.

Note that "fleetdm/fleet:main" is not an image name. Use the commit hash in place of "main" instead.
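
The two notes above can be expressed as simple pre-flight checks before running the workflow. This is an illustrative sketch: the helper names are hypothetical, and the version comparison only applies to release tags such as v4.19.0, not to commit hashes.

```python
def dogfood_image(ref: str) -> str:
    """Build the image name for a release tag or commit hash."""
    if ref == "main":
        # "fleetdm/fleet:main" is not an image name; a commit hash is needed.
        raise ValueError('use a commit hash in place of "main"')
    return f"fleetdm/fleet:{ref}"


def parse_release(tag: str) -> tuple[int, ...]:
    """Turn 'v4.19.0' into (4, 19, 0) so versions compare numerically."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def is_newer_release(candidate: str, deployed: str) -> bool:
    """The deploy action does not handle down migrations, so only move forward."""
    return parse_release(candidate) > parse_release(deployed)


print(dogfood_image("v4.19.0"))  # fleetdm/fleet:v4.19.0
```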

Conclude current milestone

Immediately after publishing a new release, we close out the associated GitHub issues and milestones.

Rename current milestone: In GitHub, change the current milestone name from 4.x.x-tentative to 4.x.x. For example, 4.37.0-tentative becomes 4.37.0.
Update product group boards: In ZenHub, go to each product group board tracking the current release. Usually, these are #g-endpoint-ops and #g-mdm.
Remove milestone from unfinished items: If you see any items in columns other than "Ready for release" tagged with the current milestone, remove that milestone tag. These items didn't make it into the release.
Prep release items: Make sure all items in the "Ready for release" column have the current milestone and sprint tags. If not, select all items in the column and apply the appropriate tags.
Move user stories to drafting board: Select all items in "Ready for release" that have the story label. Apply the :product label and remove the :release label. These items will move back to the product drafting board.
Confirm and close: Make sure that all items with the story label have left the "Ready for release" column. Select all remaining items in the "Ready for release" column and move them to the "Closed" column. This will close the related GitHub issues.
Confirm and celebrate: Now, head to the Drafting board. Find all story issues with the current milestone (these are the ones you just moved). Move them to the "Confirm and celebrate" column. Product will close the issues during their confirm and celebrate ritual.
Close GitHub milestone: Visit GitHub's milestone page and close the current milestone.
Create next milestone: Create a new milestone for the next versioned release, 4.x.x-tentative.
Remove the freeze: Open the repo in Merge Freeze and click the "Unfreeze" button.
Announce that main is unfrozen and the milestone has been closed in #help-engineering.
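
The board clean-up steps above follow a fixed set of rules, sketched here over plain dictionaries. Nothing in this example calls GitHub or ZenHub; the item shape and function names are assumptions for illustration, and sprint tags are left out.

```python
def final_milestone_name(tentative: str) -> str:
    """'4.37.0-tentative' becomes '4.37.0'."""
    return tentative.removesuffix("-tentative")


def conclude_milestone(items, milestone):
    """Apply the clean-up rules to hypothetical board items.

    Each item is assumed to be a dict with "column", "labels" (a set),
    and "milestone" keys.
    """
    for item in items:
        if item["column"] != "Ready for release":
            # Unfinished work did not make the release: drop the milestone.
            if item.get("milestone") == milestone:
                item["milestone"] = None
            continue
        item["milestone"] = milestone
        if "story" in item["labels"]:
            # Stories return to the drafting board for confirm and celebrate.
            item["labels"].discard(":release")
            item["labels"].add(":product")
            item["column"] = "Confirm and celebrate"
        else:
            # Everything else is closed, which closes the GitHub issue.
            item["column"] = "Closed"
    return items
```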

Update the Fleet releases calendar

The Fleet releases Google calendar is kept up-to-date by the release ritual DRI. Any change to targeted release dates is reflected on this calendar.

Review a community pull request

If you're assigned a community pull request for review, it is important to keep things moving for the contributor. The goal is to not go more than one business day without following up with the contributor.

A PR should be merged if:

It's a change that is needed and useful.
The CI is passing.
Tests are in place.
Documentation is updated.
Changes file is created.

For PRs that aren't ready to merge:

Thank the contributor for their hard work and explain why we can't merge the changes yet.
Encourage the contributor to reach out in the #fleet channel of osquery Slack to get help from the rest of the community.
Offer code review and coaching to help get the PR ready to go (see note below).
Keep an eye out for any updates or responses.

Sometimes (typically for Fleet customers), a Fleet team member may add tests and make any necessary changes to merge the PR.

If everything is good to go, approve the review.

For PRs that will not be merged:

Thank the contributor for their effort and explain why the changes won't be merged.
Close the PR.

Merge a community pull request

When merging a pull request from a community contributor:

Ensure that the checklist for the submitter is complete.
Verify that all necessary reviews have been approved.
Merge the PR.
Thank and congratulate the contributor.
Share the merged PR with the team in the #help-promote channel of Fleet Slack to be publicized on social media. Those who contribute to Fleet and are recognized for their contributions often become great champions for the project.

Schedule developer on-call workload

Engineering managers are asked to be aware of the on-call rotation and schedule a light workload for engineers while they are on-call. While it varies week to week considerably, the on-call responsibilities can sometimes take up a substantial portion of the engineer's time.

We aspire to clear sprint work for the on-call engineer, but due to capacity or other constraints, sometimes the on-call engineer is required for sprint work. When this is the case, the EM will work with the on-call engineer to take over support requests or @oncall assignment completely when necessary.

The remaining time after fulfilling the responsibilities of on-call is free for the engineer to choose their own path. Please choose something relevant to your work or Fleet's goals to focus on. If unsure, speak with your manager.

Some ideas:

Do training/learning relevant to your work.
Improve the Fleet developer experience.
Hack on a product idea. Note: Experiments are encouraged, but not all experiments will ship! Check in with the product team before shipping user-visible changes.
Create a blog post (or other content) for fleetdm.com.
Try out an experimental refactor.

Assume developer on-call alias

The on-call developer is responsible for:

Knowing the on-call rotation.
Performing the on-call responsibilities.
Escalating community questions and issues.
Successfully transferring the on-call persona to the next developer.

Notify community members about a critical bug

We inform customers and the community about critical bugs immediately so they don’t trigger them themselves. When a bug meeting the definition of critical is found, the bug finder is responsible for raising an alarm. Raising an alarm means pinging @here in the #help-product-design channel with the filed bug.

If the bug finder is not a Fleetie (e.g., a member of the community), then whoever sees the critical bug should raise the alarm. (We would expect this to be Customer success in the community Slack or QA in the bug inbox, though it could be anyone.) Note that the bug finder here is NOT necessarily the first person who sees the bug. If you come across a bug you think is critical, but it has not been escalated, raise the alarm!

Once raised, product confirms whether or not it's critical and defines expected behavior. Outside of working hours for the product team, or if no one from product responds within 1 hour, fall back to the #help-p1 channel.

Once the critical bug is confirmed, Customer success needs to ping both customers and the community to warn them. If Customer success is not available, the on-call engineer is responsible for doing this. If a quick fix workaround exists, that should be communicated as well for those who are already upgraded.

When a critical bug is identified, we will then follow the patch release process in our documentation.

After a critical bug is fixed, an incident postmortem is scheduled by the EM of the product group that fixed the bug.

Notify stakeholders when a user story is pushed to the next release

User stories are intended to be completed in a single sprint. When a user story selected for a release has not merged into main by the time the merge freeze begins, it is the product group EM's responsibility to notify stakeholders:

Add the ~pushed label to the user story.
Update the user story's milestone to the next minor version milestone.
Comment on the GitHub issue and at-mention the PM and anyone listed in the requester field.
If customer- labels are applied to the user story, at-mention the VP of Customer Success.
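
The notification steps above can be sketched as a function that works out the label, milestone, and mentions for a pushed story. The story dictionary, its keys, and the example values are hypothetical; the next milestone name is passed in rather than computed.

```python
def pushed_story_update(story: dict, next_milestone: str) -> dict:
    """Describe the changes for a story that missed the merge freeze.

    `story` is assumed to have "labels", "pm", and "requesters" keys.
    """
    mentions = [story["pm"], *story.get("requesters", [])]
    # Stories with customer- labels also need the VP of Customer Success.
    if any(label.startswith("customer-") for label in story["labels"]):
        mentions.append("VP of Customer Success")
    return {
        "labels": sorted(set(story["labels"]) | {"~pushed"}),
        "milestone": next_milestone,
        "mentions": mentions,
    }
```

The returned dictionary is only a checklist of what to change on the GitHub issue; applying it is still a manual step for the product group EM.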

Run Fleet locally for QA purposes

To try Fleet locally for QA purposes, run fleetctl preview, which defaults to running the latest stable release.

To target a different version of Fleet, use the --tag flag to target any tag in Docker Hub, including any git commit hash or branch name. For example, to QA the latest code on the main branch of fleetdm/fleet, you can run: fleetctl preview --tag=main.

To start a preview without starting the simulated hosts, use the --no-hosts flag (e.g., fleetctl preview --no-hosts).

For each bug found, please use the bug report template to create a new bug report issue.

For unreleased bugs in an active sprint, a new bug is created with the ~unreleased bug label. The :release label and associated product group label are added, and the engineer responsible for the feature is assigned. If QA is unsure who the bug should be assigned to, it is assigned to the EM. Fixing the bug becomes part of the story.

Accept new Apple developer account terms

Engineering is responsible for managing third-party accounts required to support engineering infrastructure. We use the official Fleet Apple developer account to notarize installers we generate for Apple devices. Whenever Apple releases new terms of service, we are unable to notarize new packages until the new terms are accepted.

When this occurs, we will begin receiving the following error message when attempting to notarize packages: "You must first sign the relevant contracts online." To resolve this error, follow the steps below.

Visit the Apple developer account login page.
Log in using the credentials stored in 1Password under "Apple developer account".
Contact the Head of Business Operations to determine which phone number to use for 2FA.
Complete the 2FA process to log in.
Accept the new terms of service.

Interview a developer candidate

As a hiring manager, make sure the interview process follows these steps in order. This process must come after creating a new position and receiving job applications. Once the position is approved, manage this process for each candidate in the hiring pipeline.

Reach out: If you are not already the primary contact for this candidate, send an email or LinkedIn message introducing yourself and explaining that you would like to start the interview process. Include the link to the position and ask if they are comfortable with completing a coding exercise.
Deliver code prompt: After receiving confirmation that they are interested, download the zip of the code challenge and ask them to complete it and send their entry back within 5 business days.
Test code prompt: Verify the code runs and can complete the challenge correctly. Check the code for good style and tests that match our standards here at Fleet.
Schedule manager interview: Send the candidate a Calendly link for a 1-hour talk with you to screen whether they are a good fit for this role and our culture.
Schedule technical interview: Send the candidate a Calendly link for a 1-hour talk with a senior engineer on your team, where the goal is to understand the technical capabilities of the candidate. An additional engineer can optionally join if available.
Schedule DOPD interview: Send the candidate a Calendly link for a 30-minute talk with the Director of Product Development @lukeheath.
Schedule CTO interview: Send the candidate a Calendly link for a 30-minute talk with our CTO @zwass.

If the candidate passes all of these steps, continue with hiring a new team member.

Renew MDM certificate signing request (CSR)

The certificate signing request (CSR) certificate expires every year and needs to be renewed before it expires. The team is notified by the MDM calendar event "IMPORTANT: Renew MDM CSR certificate".

Steps to renew the certificate:

Visit the Apple developer account login page.
Log in using the credentials stored in 1Password under Apple developer account.
Verify you are using the Enterprise subaccount for Fleet Device Management Inc.
Generate a new certificate following the instructions in MicroMDM.
Note: mdmctl (a micromdm command for MDM vendors) will generate a VendorPrivateKey.key and VendorCertificateRequest.csr using an appropriate shared email relay and a passphrase. The suggested way to generate the passphrase is pwgen, available in brew / apt / yum: pwgen -s 32 -1vcy
Upload VendorCertificateRequest.csr to Apple and download the corresponding mdm.cer file.
Convert the downloaded cert to PEM with: openssl x509 -inform DER -outform PEM -in mdm.cer -out server.crt.pem
Update the Config vars in Heroku:

Update sails_custom__mdmVendorCertPem with the resulting server.crt.pem from the PEM conversion step.
Update sails_custom__mdmVendorKeyPassphrase with the passphrase used when generating the certificate request.
Update sails_custom__mdmVendorKeyPem with the VendorPrivateKey.key generated in that same step.

Store the updated values in the Confidential 1Password vault.
Verify by logging in to a normal Apple account (not billing@...), generating a new push certificate following our setup MDM steps, and confirming that the expiration date is 1 year from today.
Adjust the calendar event to fall 2-4 weeks before the next expiration.
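
Two of the steps above are mechanical and can be sketched in Python for reference. The first helper is an alternative to the openssl command, not a replacement for it: it re-encodes the DER bytes as PEM with the standard library and does not validate the certificate. The second computes the 2-4 week reminder window. Both function names are hypothetical.

```python
import ssl
from datetime import date, timedelta
from pathlib import Path


def der_to_pem(der_path: str, pem_path: str) -> None:
    """Re-encode a DER certificate file (e.g. mdm.cer) as PEM (e.g. server.crt.pem)."""
    der_bytes = Path(der_path).read_bytes()
    # ssl.DER_cert_to_PEM_cert base64-encodes the bytes and adds the PEM header/footer.
    Path(pem_path).write_text(ssl.DER_cert_to_PEM_cert(der_bytes))


def reminder_window(expiration: date) -> tuple[date, date]:
    """Return the earliest and latest dates for the renewal calendar event."""
    return expiration - timedelta(weeks=4), expiration - timedelta(weeks=2)


print(reminder_window(date(2025, 2, 15)))
# (datetime.date(2025, 1, 18), datetime.date(2025, 2, 1))
```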

Perform an incident postmortem

At Fleet, we take customer incidents very seriously. After working with customers to resolve issues, we will conduct an internal postmortem to determine any process, documentation, or coding changes to prevent similar incidents from happening in the future. Why? We strive to make Fleet the best osquery management platform globally, and we sincerely believe that starts with sharing lessons learned with the community to become stronger together.

At Fleet, we do postmortem meetings for every service or feature outage and every critical bug, whether it's a customer's environment or on fleetdm.com.

Postmortem documentation: Before running the postmortem meeting, copy this Postmortem Template document and populate it with some initial data to enable a productive conversation.
Postmortem meeting: Invite all stakeholders, typically the team involved and QA representatives.

Follow the document topic by topic. Keep the goal in mind: to identify action items that address the root cause and make sure a similar incident will not happen again.

Distinguish between the root cause of the bug, which by that time was solved and released, and the root cause of why this issue reached our customers. These could be different issues. (e.g. the root cause of the bug was a coding issue, but the root causes (plural) of the event may be that the test plan did not cover a specific scenario, a lack of testing, and a lack of metrics to identify the issue quickly).

Example Finished Document

Postmortem action items: Each action item will have an owner who is responsible for creating a GitHub issue promptly after the meeting. This GitHub issue should be prioritized with the relevant PM/EM.