Zwinnerman loadtesting doc updates (#4765)

* Update documentation with new loadtesting findings

* Add FAQ changes with redis findings

* fixup

* Update docs/Deploying/FAQ.md

Co-authored-by: Zach Wasserman <zach@fleetdm.com>

* fixup

* Fix the instance size due to a mistake during loadtesting

* Update docs/Deploying/FAQ.md

Co-authored-by: Martin Angers <martin.n.angers@gmail.com>

* Update docs/Deploying/Load-testing.md

Co-authored-by: Martin Angers <martin.n.angers@gmail.com>

* Update docs/Deploying/FAQ.md

Co-authored-by: Martin Angers <martin.n.angers@gmail.com>

* Update price estimate since I forgot

Co-authored-by: Zach Wasserman <zach@fleetdm.com>
Co-authored-by: Martin Angers <martin.n.angers@gmail.com>
Commit by Zachary Winnerman, 2022-03-30 13:26:36 -04:00, committed by GitHub
parent 22dda3adf5
commit 83b689ae37
3 changed files with 40 additions and 32 deletions


@@ -74,6 +74,14 @@ The exact solution to this depends on the request client you are using. For exam
NODE_TLS_REJECT_UNAUTHORIZED=0 sails console
```
+## I'm only getting partial results from live queries
+Redis has an internal buffer limit for pubsub that Fleet uses to communicate query results. If this buffer is filled, extra data is dropped. To fix this, we recommend disabling the buffer size limit. Most installs of Redis should have plenty of spare memory to not run into issues. More info about this limit can be found [here](https://redis.io/topics/clients#:~:text=Pub%2FSub%20clients%20have%20a,64%20megabyte%20per%2060%20second.) and [here](https://raw.githubusercontent.com/redis/redis/unstable/redis.conf) (search for client-output-buffer-limit).
+We recommend a config like the following:
+```
+client-output-buffer-limit pubsub 0 0 60
+```
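The `client-output-buffer-limit` line above goes in `redis.conf`, but the same limit can also be applied to a running Redis without a restart via `redis-cli` (a sketch; note that managed services such as AWS ElastiCache expose this setting through a parameter group rather than `CONFIG SET`):

```shell
# Apply the recommended pubsub buffer limit at runtime.
# "0 0 60" = no hard limit, no soft limit, 60-second soft window.
redis-cli CONFIG SET client-output-buffer-limit "pubsub 0 0 60"

# Verify the active setting.
redis-cli CONFIG GET client-output-buffer-limit
```

This is a runtime-only change; persist it in `redis.conf` (or the parameter group) so it survives restarts.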
## When do I need to deploy a new enroll secret to my hosts?


@@ -14,18 +14,18 @@ A test is deemed successful when the Fleet server is able to receive and make re
## Results
-### 1,000 hosts
+### 2,500 hosts
-With the following infrastructure, 1,000 hosts successfully communicate with Fleet. The Fleet server is able to run live queries against all hosts.
+With the following infrastructure, 2,500 hosts successfully communicate with Fleet. The Fleet server is able to run live queries against all hosts.
|Fleet instances| CPU Units |RAM |
|-------|-------------------------|----------------|
-| 1 Fargate task | 256 CPU Units |512 MB of memory|
+| 1 Fargate task | 512 CPU Units | 4GB of memory |
|&#8203;| Version |Instance type |
|-------|-------------------------|--------------|
-| Redis | 5.0.6 |cache.m5.large|
-| MySQL | 5.7.mysql_aurora.2.10.0 | db.t4g.medium|
+| Redis | 5.0.6 | cache.t4g.medium |
+| MySQL | 5.7.mysql_aurora.2.10.0 | db.t4g.small |
### 150,000 hosts
@@ -33,14 +33,16 @@ With the infrastructure listed below, 150,000 hosts successfully communicate wit
|Fleet instance | CPU Units |RAM |
|-------|-------------------------|----------------|
-| 25 Fargate tasks | 1024 CPU units |2048 MB of memory|
+| 20 Fargate tasks | 1024 CPU units | 4GB of memory |
-|&#8203;| Version |Instance type |
-|-------|-------------------------|--------------|
-| Redis | 5.0.6 |cache.m5.large|
-| MySQL | 5.7.mysql_aurora.2.10.0 | db.t4g.medium|
+|&#8203;| Version |Instance type |
+|-------|-------------------------|----------------|
+| Redis | 5.0.6 | cache.m6g.large |
+| MySQL | 5.7.mysql_aurora.2.10.0 | db.r6g.4xlarge |
-The above setup auto scaled based on CPU usage. After a while, the task count ended up in 25 instances even while live querying or adding a new label.
+In the above setup, the read replica was the same size as the writer node.
+The above setup auto scaled based on CPU usage. After a while, the task count ended up at 25 instances even while live querying or adding a new label.
## How we are simulating osquery
@@ -56,21 +58,19 @@ After the hosts have been enrolled, you can add `-only_already_enrolled` to make
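The hunk context above mentions osquery-perf's `-only_already_enrolled` flag. A sketch of how a simulation run might look (only `-only_already_enrolled` appears in this doc; the other flag names, the server URL, and the secret are assumptions to double-check against `cmd/osquery-perf` in the repo):

```shell
# Hypothetical first run: enroll simulated hosts against the load-test server.
go run ./cmd/osquery-perf \
  -server_url https://fleet.example.com \
  -enroll_secret "$ENROLL_SECRET" \
  -host_count 5000

# Hypothetical subsequent runs: reuse the already-enrolled hosts.
go run ./cmd/osquery-perf \
  -server_url https://fleet.example.com \
  -enroll_secret "$ENROLL_SECRET" \
  -host_count 5000 \
  -only_already_enrolled
```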
## Infrastructure setup
-The deployment of Fleet was done through the example [terraform provided in the repo](https://github.com/fleetdm/fleet/tree/main/tools/terraform) with the following command:
+The deployment of Fleet was done through the loadtesting [terraform maintained in the repo](https://github.com/fleetdm/fleet/tree/main/tools/loadtesting/terraform) with the following command:
```bash
-terraform apply \
-  -var domain_fleetctl=<your domain here> \
-  -var domain_fleetdm=<alternative domain here> \
-  -var s3_bucket=<log bucket name> \
-  -var fleet_image="fleetdm/fleet:<tag targeted>" \
-  -var vulnerabilities_path="" \
-  -var fleet_max_capacity=100 \
-  -var fleet_min_capacity=5
+terraform apply -var tag=<your tag here>
```
Scaling differences were done by directly modifying the code and reapplying.
+Infrastructure for the loadtest is provided in the loadtesting code (via an ECS Fargate service and an internal load balancer for cost savings). Each instance of the ECS service corresponds to 5000 hosts.
+They are sized to be the smallest that Fargate allows, so it is still cost effective to run 30+ instances of the service.
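Since each ECS service instance simulates 5000 hosts, the desired task count for a target fleet size is a ceiling division. The cluster and service names in the commented `aws ecs update-service` call are placeholders, not names from the loadtesting terraform:

```shell
# Hosts simulated per osquery-perf task (from the setup described above).
per_task=5000
target_hosts=150000

# Ceiling division: tasks needed to simulate target_hosts.
tasks=$(( (target_hosts + per_task - 1) / per_task ))
echo "$tasks"   # 30

# Hypothetical scaling call (cluster/service names are placeholders):
# aws ecs update-service --cluster loadtest --service osquery-perf --desired-count "$tasks"
```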
## Limitations
The [osquery-perf](https://github.com/fleetdm/fleet/tree/main/cmd/osquery-perf) tool doesn't simulate all data that's included when a real device communicates to a Fleet instance. For example, system users and software inventory data are not yet simulated by osquery-perf.
-<meta name="pageOrderInSection" value="500">
+<meta name="pageOrderInSection" value="500">


@@ -30,7 +30,7 @@ _**Note if using [terraform reference architecture](https://github.com/fleetdm/f
assume On-Demand pricing (savings are available through Reserved Instances). Calculations do not take into account NAT gateway charges or other networking related ingress/egress costs.**_
### Example configuration breakpoints
-#### [Up to 1000 hosts](https://calculator.aws/#/estimate?id=ae7d7ddec64bb979f3f6611d23616b1dff0e8dbd)
+#### [Up to 2500 hosts](https://calculator.aws/#/estimate?id=591e3b8176acb074f40959b2ed651947fc8e8388)
| Fleet instances | CPU Units | RAM |
|-----------------|---------------|-----|
@@ -38,31 +38,31 @@ assume On-Demand pricing (savings are available through Reserved Instances). Cal
| Dependencies | Version | Instance type |
|--------------|-------------------------|---------------|
-| Redis | 6 | t4g.small |
-| MySQL | 5.7.mysql_aurora.2.10.0 | db.t3.small |
+| Redis | 5.0.6 | cache.t4g.medium |
+| MySQL | 5.7.mysql_aurora.2.10.0 | db.t4g.small |
-#### [Up to 25000 hosts](https://calculator.aws/#/estimate?id=4a3e3168275967d1e79a3d1fcfedc5b17d67a271)
+#### [Up to 25000 hosts](https://calculator.aws/#/estimate?id=855c3796002c329de1cfa7d4628a9b1cc03d1db6)
| Fleet instances | CPU Units | RAM |
|-----------------|---------------|-----|
-| 10 Fargate task | 1024 CPU Units | 4GB |
+| 5 Fargate task | 1024 CPU Units | 4GB |
| Dependencies | Version | Instance type |
|--------------|-------------------------|---------------|
-| Redis | 6 | m6g.large |
+| Redis | 5.0.6 | m6g.large |
| MySQL | 5.7.mysql_aurora.2.10.0 | db.r6g.large |
-#### [Up to 150000 hosts](https://calculator.aws/#/estimate?id=6a852ef873c0902f0c953045dec3e29fcd32aef8)
+#### [Up to 150000 hosts](https://calculator.aws/#/estimate?id=eed68e1f1854ff2b0eacdacddb5803022101c207)
| Fleet instances | CPU Units | RAM |
|-----------------|----------------|-----|
-| 30 Fargate task | 1024 CPU Units | 4GB |
+| 20 Fargate task | 1024 CPU Units | 4GB |
| Dependencies | Version | Instance type | Nodes |
|--------------|-------------------------|----------------|-------|
-| Redis | 6 | m6g.large | 3 |
-| MySQL | 5.7.mysql_aurora.2.10.0 | db.m6g.8xlarge | 1 |
+| Redis | 5.0.6 | m6g.large | 3 |
+| MySQL | 5.7.mysql_aurora.2.10.0 | db.r6g.4xlarge | 1 |
## Cloud providers
@ -302,4 +302,4 @@ services:
```
-<meta name="pageOrderInSection" value="600">
+<meta name="pageOrderInSection" value="600">