fleet/docs/Using-Fleet/Monitoring-Fleet.md

# Monitoring Fleet
- [Health checks](#health-checks)
- [Metrics](#metrics)
  - [Alerting](#alerting)
  - [Graphing](#graphing)
- [Fleet server performance](#fleet-server-performance)
  - [Horizontal scaling](#horizontal-scaling)
  - [Availability](#availability)
  - [Debugging performance issues](#debugging-performance-issues)
    - [MySQL and Redis](#mysql-and-redis)
    - [Fleet server](#fleet-server)

## Health checks

Fleet exposes a basic health check at the `/healthz` endpoint. This is the interface to use for simple monitoring and load-balancer health checks.

The `/healthz` endpoint will return an `HTTP 200` status if the server is running and has healthy connections to MySQL and Redis. If there are any problems, the endpoint will return an `HTTP 500` status. Details about failing checks are logged in the Fleet server logs.

Individual checks can be run by providing the `check` URL parameter (e.g., `/healthz?check=mysql` or `/healthz?check=redis`).
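
As a rough illustration, the checks described above could be probed from a shell as sketched below. The address `https://fleet.example.com` and the function names are placeholders, not part of Fleet. Only the `/healthz` path, the `check` parameter, and the 200/500 statuses come from the description above.

```sh
# Base address of the Fleet server or load balancer (placeholder; override as needed).
FLEET_URL="${FLEET_URL:-https://fleet.example.com}"

# Translate an HTTP status code returned by /healthz into a short verdict.
healthz_verdict() {
  case "$1" in
    200) echo "healthy" ;;
    500) echo "unhealthy" ;;
    *)   echo "unknown" ;;
  esac
}

# Probe /healthz, optionally limited to one check ("mysql" or "redis").
# curl reports status 000 when the request itself fails (e.g., no connection).
probe_healthz() {
  check="${1:-}"
  url="$FLEET_URL/healthz"
  if [ -n "$check" ]; then
    url="$url?check=$check"
  fi
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url") || true
  echo "${check:-all}: $(healthz_verdict "$code")"
}
```

Running `probe_healthz`, `probe_healthz mysql`, and `probe_healthz redis` in turn helps narrow down which dependency is failing before turning to the Fleet server logs.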

## Metrics

Fleet exposes server metrics in a format compatible with [Prometheus](https://prometheus.io/). A simple example Prometheus configuration is available in [tools/app/prometheus.yml](https://github.com/fleetdm/fleet/blob/194ad5963b0d55bdf976aa93f3de6cabd590c97a/tools/app/prometheus.yml).

Prometheus can be configured to use a wide range of service discovery mechanisms within AWS, GCP, Azure, Kubernetes, and more. See the Prometheus [configuration documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/) for more information.
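
Before wiring up Prometheus, it can help to look at the exported metrics by hand. The sketch below assumes the metrics are served at `/metrics` (the Prometheus default scrape path) on a placeholder address. Both the address and the `metric_names` helper are illustrative, so adjust them if your deployment differs.

```sh
# Reduce Prometheus text-format output on stdin to a sorted list of metric names.
# Comment lines (# HELP, # TYPE) and blank lines are dropped, then labels and
# sample values are stripped from each remaining line.
metric_names() {
  grep -v -e '^#' -e '^$' | sed 's/[{ ].*//' | sort -u
}

# Example (placeholder address; assumes the default /metrics path):
#   curl -s https://fleet.example.com/metrics | metric_names
```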

### Alerting

#### Prometheus

Prometheus has built-in support for alerting through [Alertmanager](https://prometheus.io/docs/alerting/latest/overview/).

Consider building alerts for:

- Changes from expected levels of host enrollment
- Increased latency on HTTP endpoints
- Increased error levels on HTTP endpoints

```
TODO (Seeking Contributors)
Add example alerting configurations
```

#### CloudWatch alarms

CloudWatch alarms can be configured to support a wide variety of metrics and anomaly detection mechanisms. The Terraform reference architecture includes some example alarms (see `monitoring.tf`). These reference alarms evaluate healthy targets and response times, and the reference architecture also uses target-tracking alarms to manage auto-scaling.

For the metrics available from each service, see:

* [Monitoring RDS (MySQL)](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/monitoring-cloudwatch.html)
* [ElastiCache for Redis](https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/CacheMetrics.WhichShouldIMonitor.html)
* [Monitoring ECS](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch-metrics.html)

### Graphing

Prometheus provides basic graphing capabilities and integrates tightly with [Grafana](https://prometheus.io/docs/visualization/grafana/) for sophisticated visualizations.

## Fleet server performance

Fleet is designed to scale to hundreds of thousands of online hosts. The Fleet server scales horizontally to support higher load.

### Horizontal scaling

To scale Fleet horizontally, run more Fleet server processes connected to the same MySQL and Redis backing stores. Typically, operators front the Fleet server nodes with a load balancer that distributes requests across the servers. All Fleet APIs are designed to work in this arrangement: configure clients to connect to the load balancer rather than to an individual server.

### Availability

The Fleet/osquery system is resilient to loss of availability. Osquery agents will continue executing the existing configuration and buffering result logs during downtime due to lack of network connectivity, server maintenance, or any other reason. Buffering in osquery can be configured with the `--buffered_log_max` flag.
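
For illustration, the buffering limit could be set through an osquery flagfile as sketched below. The flagfile path, the helper name, and the value used in the usage line are arbitrary examples, not recommendations. Consult the osquery documentation for the flag's default and exact semantics.

```sh
# Path of the osquery flagfile to edit (placeholder; override as needed).
FLAGFILE="${FLAGFILE:-./osquery.flags}"

# Append a --buffered_log_max setting to the flagfile.
# $1: maximum number of logs osquery should keep buffered
set_buffered_log_max() {
  printf '%s=%s\n' '--buffered_log_max' "$1" >> "$FLAGFILE"
}

# Example:
#   set_buffered_log_max 500000
#   osqueryd --flagfile "$FLAGFILE"
```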

Note that short downtimes are expected during [Fleet server upgrades](../Deploying/Upgrading-Fleet.md) that require database migrations.

### Debugging performance issues

#### MySQL and Redis

If you encounter performance issues with the MySQL or Redis servers, use the extensive resources available online to understand and optimize them. Please also [file an issue](https://github.com/fleetdm/fleet/issues/new/choose) with details about the problem so that Fleet developers can work to fix it.

#### Fleet server

For performance issues in the Fleet server process, please [file an issue](https://github.com/fleetdm/fleet/issues/new/choose) with details about the scenario, and attach a debug archive. Debug archives can also be submitted confidentially through other support channels.

##### Generate debug archive (Fleet 3.4.0+)

Use the `fleetctl debug archive` command to generate an archive of Fleet's full suite of debug profiles. See the [fleetctl setup guide](./fleetctl-CLI.md) for details on configuring `fleetctl`.

The generated `.tar.gz` archive will be available in the current directory.

##### Targeting individual servers

In most configurations, the `fleetctl` client is configured to make requests to a load balancer that will proxy the requests to each server instance. This can be problematic when trying to debug a performance issue on a specific server. To target an individual server, create a new `fleetctl` context that uses the direct address of the server.

For example:

```sh
fleetctl config set --context server-a --address https://server-a:8080
fleetctl login --context server-a
fleetctl debug archive --context server-a
```

##### Confidential information

The `fleetctl debug archive` command retrieves information generated by Go's [`net/http/pprof`](https://golang.org/pkg/net/http/pprof/) package. In most scenarios this should not include sensitive information. However, it does include the command line arguments passed to the Fleet server. If the Fleet server receives sensitive credentials via CLI arguments (rather than environment variables or a config file), scrub this information from the `cmdline` file in the archive before sharing it.
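
One way to scrub the archive is sketched below. It is deliberately blunt: it assumes the file is literally named `cmdline` somewhere inside the archive and replaces its whole contents rather than editing out individual arguments. The `scrub_cmdline` helper and the file names in the usage line are hypothetical.

```sh
# Replace the contents of every file named "cmdline" in a debug archive with a
# placeholder, then write the result to a new archive.
# $1: path to the original .tar.gz   $2: path for the scrubbed .tar.gz
scrub_cmdline() {
  src="$1"
  dst="$2"
  workdir=$(mktemp -d) || return 1
  tar -xzf "$src" -C "$workdir" || return 1
  # Overwrite rather than delete so the set of files in the archive is unchanged.
  find "$workdir" -type f -name 'cmdline' \
    -exec sh -c 'echo "[redacted]" > "$1"' _ {} \;
  tar -czf "$dst" -C "$workdir" . || return 1
  rm -rf "$workdir"
}

# Example (file names are placeholders):
#   scrub_cmdline fleet-debug.tar.gz fleet-debug-scrubbed.tar.gz
```

Listing the result with `tar -tzf` and inspecting the `cmdline` file once more before attaching the archive to an issue is a cheap final check.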

<meta name="pageOrderInSection" value="700">