Reference Architectures (#3712)

Added reference architectures using https://docs.gitlab.com/ee/administration/reference_architectures/ as inspiration.

- updated terraform based on usage feedback
- pinned the Fleet Docker version in terraform so that applying does not pull unexpected upgrades
- updated some documentation around applying migration tasks
This commit is contained in:
Benjamin Edwards 2022-01-21 19:27:55 -05:00 committed by GitHub
parent 9dd6968c5d
commit d650423be0
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
5 changed files with 312 additions and 65 deletions


@@ -0,0 +1,302 @@
# Reference Architectures
You can easily run Fleet on a single VPS capable of supporting hundreds, if not thousands, of hosts, but
this page details an [opinionated view](https://github.com/fleetdm/fleet/tree/main/tools/terraform) of running Fleet in a production environment, as
well as different configuration strategies to enable High Availability (HA).
## Availability Components
There are a few strategies that can be used to ensure high availability:
- Database HA
- Traffic load balancing
### Database HA
Fleet recommends RDS Aurora MySQL when running on AWS. More details about backups/snapshots can be found
[here](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Managing.Backups.html). It is also
possible to dynamically scale read replicas to increase performance and [enable database fail-over](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.AuroraHighAvailability.html).
For more advanced configurations, [Aurora Global](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-global-database.html) can
span multiple regions (_not included in the [reference terraform](https://github.com/fleetdm/fleet/tree/main/tools/terraform)_).
In some cases adding a read replica can increase database performance for specific access patterns. Scenarios that
automate the API or make heavy use of `fleetctl` can benefit from the improved read performance.
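If a read replica is available, Fleet can be configured to route reads to it. A minimal sketch of the relevant Fleet configuration, assuming hypothetical Aurora endpoints and a Fleet version that supports the `mysql_read_replica` settings (check the Fleet configuration docs for the keys your version accepts):

```yaml
mysql:
  address: fleet-primary.cluster-xxxx.us-east-2.rds.amazonaws.com:3306
  database: fleet
  username: fleet
  password: insert-password-here
mysql_read_replica:
  address: fleet-replica.cluster-ro-xxxx.us-east-2.rds.amazonaws.com:3306
  database: fleet
  username: fleet
  password: insert-password-here
```

The same settings are also available as `FLEET_MYSQL_READ_REPLICA_*` environment variables when running in a container.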
### Traffic load balancing
Load balancing distributes request traffic across many instances of the backend application. Using an AWS Application
Load Balancer can also [offload SSL termination](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/create-https-listener.html), freeing Fleet to dedicate the majority of its allocated compute
to its core functionality. More details about the ALB can be found [here](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/introduction.html).
_**Note: if using the [terraform reference architecture](https://github.com/fleetdm/fleet/tree/main/tools/terraform#terraform), all configurations can dynamically scale based on load (CPU/memory), and all configurations
assume On-Demand pricing (savings are available through Reserved Instances). Calculations do not take into account NAT gateway charges or other networking-related ingress/egress costs.**_
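As a sketch of what TLS offload at the load balancer looks like in terraform (not a verbatim excerpt from the reference architecture; the resource names `aws_alb.main`, `aws_alb_target_group.main`, and the `cert_arn` variable are assumptions), an HTTPS listener terminates TLS with an ACM certificate and forwards plain HTTP to the Fleet target group:

```hcl
resource "aws_alb_listener" "https" {
  load_balancer_arn = aws_alb.main.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS-1-2-2017-01"
  certificate_arn   = var.cert_arn # ACM certificate for your Fleet domain

  default_action {
    type             = "forward"
    target_group_arn = aws_alb_target_group.main.arn
  }
}
```

Because TLS terminates at the ALB, the Fleet tasks themselves run with `FLEET_SERVER_TLS=false`.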
### Example Configuration breakpoints
#### [Up to 1,000 hosts](https://calculator.aws/#/estimate?id=ae7d7ddec64bb979f3f6611d23616b1dff0e8dbd)

| Fleet instances | CPU Units     | RAM |
|-----------------|---------------|-----|
| 1 Fargate task  | 512 CPU Units | 4GB |

| Dependencies | Version                 | Instance type |
|--------------|-------------------------|---------------|
| Redis        | 6                       | t4g.small     |
| MySQL        | 5.7.mysql_aurora.2.10.0 | db.t3.small   |
#### [Up to 25,000 hosts](https://calculator.aws/#/estimate?id=4a3e3168275967d1e79a3d1fcfedc5b17d67a271)

| Fleet instances  | CPU Units      | RAM |
|------------------|----------------|-----|
| 10 Fargate tasks | 1024 CPU Units | 4GB |

| Dependencies | Version                 | Instance type |
|--------------|-------------------------|---------------|
| Redis        | 6                       | m6g.large     |
| MySQL        | 5.7.mysql_aurora.2.10.0 | db.r6g.large  |
#### [Up to 150,000 hosts](https://calculator.aws/#/estimate?id=6a852ef873c0902f0c953045dec3e29fcd32aef8)

| Fleet instances  | CPU Units      | RAM |
|------------------|----------------|-----|
| 30 Fargate tasks | 1024 CPU Units | 4GB |

| Dependencies | Version                 | Instance type  | Nodes |
|--------------|-------------------------|----------------|-------|
| Redis        | 6                       | m6g.large      | 3     |
| MySQL        | 5.7.mysql_aurora.2.10.0 | db.m6g.8xlarge | 1     |
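The breakpoints above can be read as a rough hosts-per-task heuristic. A back-of-envelope sketch only, assuming capacity scales roughly linearly with task count (real sizing depends on query load, live query usage, and logging volume):

```python
# Capacity points taken from the tables above: (Fargate tasks, hosts supported).
BREAKPOINTS = [(1, 1_000), (10, 25_000), (30, 150_000)]

def estimate_tasks(hosts: int) -> int:
    """Estimate a Fargate task count by scaling the nearest tested breakpoint."""
    for tasks, capacity in BREAKPOINTS:
        if hosts <= capacity:
            # Scale this breakpoint's task count down proportionally, minimum one task.
            return max(1, round(tasks * hosts / capacity))
    # Beyond the largest tested breakpoint, extrapolate linearly from it.
    tasks, capacity = BREAKPOINTS[-1]
    return round(tasks * hosts / capacity)

print(estimate_tasks(5_000))    # a small deployment
print(estimate_tasks(200_000))  # beyond the tested range
```

Treat the result as a starting point for the autoscaling bounds, not a guarantee; the reference terraform scales on CPU/memory anyway.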
## Cloud Providers
### AWS
AWS reference architecture can be found [here](https://github.com/fleetdm/fleet/tree/main/tools/terraform). This configuration includes:
- VPC
- Subnets
- Public & Private
- ACLs
- Security Groups
- ECS as the container orchestrator
- Fargate for underlying compute
- Task roles via IAM
- RDS Aurora MySQL 5.7
- ElastiCache Redis engine
- Firehose osquery log destination
- S3 bucket sync to allow further ingestion/processing
- [Monitoring via Cloudwatch alarms](https://github.com/fleetdm/fleet/tree/main/tools/terraform/monitoring)
Some AWS services used in the reference architecture, such as Firehose, are billed as pay-per-use. This means that osquery scheduled query frequency directly
correlates with how much these services cost, which is something to keep in mind when configuring Fleet in AWS.
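As an illustration of that correlation, the osquery `schedule` controls how often results flow to the log destination; the query name, SQL, and interval below are hypothetical:

```json
{
  "schedule": {
    "processes_snapshot": {
      "query": "SELECT name, path, pid FROM processes;",
      "interval": 3600,
      "snapshot": true
    }
  }
}
```

Halving the `interval` roughly doubles the result volume shipped through Firehose, and with pay-per-use billing that shows up directly on the bill.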
#### AWS Terraform CI/CD IAM Permissions
The following permissions are the minimum required to apply AWS terraform resources:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "cloudwatch:*",
        "s3:*",
        "lambda:*",
        "ecs:*",
        "rds:*",
        "rds-data:*",
        "secretsmanager:*",
        "pi:*",
        "ecr:*",
        "iam:*",
        "aps:*",
        "vpc:*",
        "kms:*",
        "elasticloadbalancing:*",
        "ce:*",
        "cur:*",
        "logs:*",
        "cloudformation:*",
        "ssm:*",
        "sns:*",
        "elasticache:*",
        "application-autoscaling:*",
        "acm:*",
        "route53:*",
        "dynamodb:*",
        "kinesis:*",
        "firehose:*"
      ],
      "Resource": "*"
    }
  ]
}
```
### GCP
Coming soon
### Azure
Coming soon
### Render
Fleet can be deployed using [Render's IaC](https://render.com/docs/infrastructure-as-code); see [the repository](https://github.com/edwardsb/fleet-on-render) for full details.
```yaml
services:
  - name: fleet
    plan: standard
    type: web
    env: docker
    healthCheckPath: /healthz
    envVars:
      - key: FLEET_MYSQL_ADDRESS
        fromService:
          name: fleet-mysql
          type: pserv
          property: hostport
      - key: FLEET_MYSQL_DATABASE
        fromService:
          name: fleet-mysql
          type: pserv
          envVarKey: MYSQL_DATABASE
      - key: FLEET_MYSQL_PASSWORD
        fromService:
          name: fleet-mysql
          type: pserv
          envVarKey: MYSQL_PASSWORD
      - key: FLEET_MYSQL_USERNAME
        fromService:
          name: fleet-mysql
          type: pserv
          envVarKey: MYSQL_USER
      - key: FLEET_REDIS_ADDRESS
        fromService:
          name: fleet-redis
          type: pserv
          property: hostport
      - key: FLEET_SERVER_TLS
        value: false
      - key: PORT
        value: 8080
  - name: fleet-mysql
    type: pserv
    env: docker
    repo: https://github.com/render-examples/mysql
    branch: mysql-5
    disk:
      name: mysql
      mountPath: /var/lib/mysql
      sizeGB: 10
    envVars:
      - key: MYSQL_DATABASE
        value: fleet
      - key: MYSQL_PASSWORD
        generateValue: true
      - key: MYSQL_ROOT_PASSWORD
        generateValue: true
      - key: MYSQL_USER
        value: fleet
  - name: fleet-redis
    type: pserv
    env: docker
    repo: https://github.com/render-examples/redis
    disk:
      name: redis
      mountPath: /var/lib/redis
      sizeGB: 10
```
### Digital Ocean
Using DigitalOcean's [App Spec](https://docs.digitalocean.com/products/app-platform/concepts/app-spec/) to deploy the app on the [App Platform](https://docs.digitalocean.com/products/app-platform/):
```yaml
alerts:
- rule: DEPLOYMENT_FAILED
- rule: DOMAIN_FAILED
databases:
- cluster_name: fleet-redis
  engine: REDIS
  name: fleet-redis
  production: true
  version: "6"
- cluster_name: fleet-mysql
  db_name: fleet
  db_user: fleet
  engine: MYSQL
  name: fleet-mysql
  production: true
  version: "8"
domains:
- domain: demo.fleetdm.com
  type: PRIMARY
envs:
- key: FLEET_MYSQL_ADDRESS
  scope: RUN_TIME
  value: ${fleet-mysql.HOSTNAME}:${fleet-mysql.PORT}
- key: FLEET_MYSQL_PASSWORD
  scope: RUN_TIME
  value: ${fleet-mysql.PASSWORD}
- key: FLEET_MYSQL_USERNAME
  scope: RUN_TIME
  value: ${fleet-mysql.USERNAME}
- key: FLEET_MYSQL_DATABASE
  scope: RUN_TIME
  value: ${fleet-mysql.DATABASE}
- key: FLEET_REDIS_ADDRESS
  scope: RUN_TIME
  value: ${fleet-redis.HOSTNAME}:${fleet-redis.PORT}
- key: FLEET_SERVER_TLS
  scope: RUN_AND_BUILD_TIME
  value: "false"
- key: FLEET_REDIS_PASSWORD
  scope: RUN_AND_BUILD_TIME
  value: ${fleet-redis.PASSWORD}
- key: FLEET_REDIS_USE_TLS
  scope: RUN_AND_BUILD_TIME
  value: "true"
jobs:
- envs:
  - key: DATABASE_URL
    scope: RUN_TIME
    value: ${fleet-redis.DATABASE_URL}
  image:
    registry: fleetdm
    registry_type: DOCKER_HUB
    repository: fleet
    tag: latest
  instance_count: 1
  instance_size_slug: basic-xs
  kind: PRE_DEPLOY
  name: fleet-migrate
  run_command: fleet prepare --no-prompt=true db
  source_dir: /
name: fleet
region: nyc
services:
- envs:
  - key: FLEET_VULNERABILITIES_DATABASES_PATH
    scope: RUN_TIME
    value: /home/fleet
  - key: FLEET_BETA_SOFTWARE_INVENTORY
    scope: RUN_TIME
    value: "1"
  health_check:
    http_path: /healthz
    http_port: 8080
  image:
    registry: fleetdm
    registry_type: DOCKER_HUB
    repository: fleet
    tag: latest
  instance_count: 1
  instance_size_slug: basic-xs
  name: fleet
  routes:
  - path: /
  run_command: fleet serve
  source_dir: /
```


@@ -13,9 +13,11 @@ terraform {
}
}
}
provider "aws" {
region = "us-east-2"
}
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
@@ -278,7 +280,7 @@ resource "aws_cloudwatch_metric_alarm" "redis-current-connections" {
alarm_name = "redis-current-connections-${each.key}-${terraform.workspace}"
alarm_description = "Redis current connections for node ${each.key}"
comparison_operator = "LessThanLowerOrGreaterThanUpperThreshold"
-evaluation_periods = "3"
+evaluation_periods = "5"
threshold_metric_id = "e1"
alarm_actions = [aws_sns_topic.cloudwatch_alarm_topic.arn]
ok_actions = [aws_sns_topic.cloudwatch_alarm_topic.arn]
@@ -286,7 +288,7 @@ resource "aws_cloudwatch_metric_alarm" "redis-current-connections" {
metric_query {
id = "e1"
-expression = "ANOMALY_DETECTION_BAND(m1)"
+expression = "ANOMALY_DETECTION_BAND(m1,20)"
label = "Current Connections (Expected)"
return_data = "true"
}
@@ -297,7 +299,7 @@ resource "aws_cloudwatch_metric_alarm" "redis-current-connections" {
metric {
metric_name = "CurrConnections"
namespace = "AWS/ElastiCache"
-period = "300"
+period = "600"
stat = "Average"
unit = "Count"


@@ -1,23 +1,7 @@
resource "aws_route53_zone" "dogfood_fleetctl_com" {
name = var.domain_fleetctl
}
resource "aws_route53_zone" "dogfood_fleetdm_com" {
name = var.domain_fleetdm
}
resource "aws_route53_record" "dogfood_fleetctl_com" {
zone_id = aws_route53_zone.dogfood_fleetctl_com.zone_id
name = var.domain_fleetctl
type = "A"
alias {
name = aws_alb.main.dns_name
zone_id = aws_alb.main.zone_id
evaluate_target_health = false
}
}
resource "aws_route53_record" "dogfood_fleetdm_com" {
zone_id = aws_route53_zone.dogfood_fleetdm_com.zone_id
name = var.domain_fleetdm
@@ -30,15 +14,6 @@ resource "aws_route53_record" "dogfood_fleetdm_com" {
}
}
resource "aws_acm_certificate" "dogfood_fleetctl_com" {
domain_name = var.domain_fleetctl
validation_method = "DNS"
lifecycle {
create_before_destroy = true
}
}
resource "aws_acm_certificate" "dogfood_fleetdm_com" {
domain_name = var.domain_fleetdm
validation_method = "DNS"
@@ -48,23 +23,6 @@ resource "aws_acm_certificate" "dogfood_fleetdm_com" {
}
}
resource "aws_route53_record" "dogfood_fleetctl_com_validation" {
for_each = {
for dvo in aws_acm_certificate.dogfood_fleetctl_com.domain_validation_options : dvo.domain_name => {
name = dvo.resource_record_name
record = dvo.resource_record_value
type = dvo.resource_record_type
}
}
allow_overwrite = true
name = each.value.name
records = [each.value.record]
ttl = 60
type = each.value.type
zone_id = aws_route53_zone.dogfood_fleetctl_com.zone_id
}
resource "aws_route53_record" "dogfood_fleetdm_com_validation" {
for_each = {
for dvo in aws_acm_certificate.dogfood_fleetdm_com.domain_validation_options : dvo.domain_name => {
@@ -82,11 +40,6 @@ resource "aws_route53_record" "dogfood_fleetdm_com_validation" {
zone_id = aws_route53_zone.dogfood_fleetdm_com.zone_id
}
resource "aws_acm_certificate_validation" "dogfood_fleetctl_com" {
certificate_arn = aws_acm_certificate.dogfood_fleetctl_com.arn
validation_record_fqdns = [for record in aws_route53_record.dogfood_fleetctl_com_validation : record.fqdn]
}
resource "aws_acm_certificate_validation" "dogfood_fleetdm_com" {
certificate_arn = aws_acm_certificate.dogfood_fleetdm_com.arn
validation_record_fqdns = [for record in aws_route53_record.dogfood_fleetdm_com_validation : record.fqdn]


@@ -66,21 +66,11 @@ Replace `cert_arn` with the **certificate ARN** that applies to your environment
### Migrating the DB
-After applying terraform run the following to migrate the database:
+After applying terraform, run the following to migrate the database (`<private_subnet_id>` and `<desired_security_group>` can be obtained from the terraform output after applying; any value will suffice):
```
aws ecs run-task --cluster fleet-backend --task-definition fleet-migrate:<latest_version> --launch-type FARGATE --network-configuration "awsvpcConfiguration={subnets=[<private_subnet_id>],securityGroups=[<desired_security_group>]}"
```
-### Conecting a host
+### Connecting a Host
Build orbit:
```
fleetctl package --type=msi --fleet-url=<alb_dns> --enroll-secret=<secret>
```
Run orbit:
```
"C:\Program Files\Orbit\bin\orbit\orbit.exe" --root-dir "C:\Program Files\Orbit\." --log-file "C:\Program Files\Orbit\orbit-log.txt" --fleet-url "http://<alb_dns>" --enroll-secret-path "C:\Program Files\Orbit\secret.txt" --update-url "https://tuf.fleetctl.com" --orbit-channel "stable" --osqueryd-channel "stable"
```
Use your Route53 entry as your `fleet-url` [following these details.](https://fleetdm.com/docs/using-fleet/adding-hosts)


@@ -60,7 +60,7 @@ variable "database_name" {
variable "fleet_image" {
description = "the name of the container image to run"
-default = "fleetdm/fleet"
+default = "fleetdm/fleet:v4.8.0"
}
variable "software_inventory" {