
Deployment FAQ

How do I get support for working with Fleet?

For bug reports, please use the GitHub issue tracker.

For questions and discussion, please join us in the #fleet channel of the osquery Slack.

Can multiple instances of the Fleet server be run behind a load balancer?

Yes. Fleet scales horizontally out of the box as long as all of the Fleet servers are connected to the same MySQL and Redis instances.

Note that osquery logs will be distributed across the Fleet servers.

Read the performance documentation for more information.

Why aren't my osquery agents connecting to Fleet?

This can be caused by a variety of problems. The best way to debug is usually to add --verbose --tls_dump to the arguments provided to osqueryd and inspect the logs of the communication with the server.
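For example, on a host where osqueryd is launched directly, the debugging flags can be appended like this (the flagfile path is illustrative; adjust it to your deployment):

sudo osqueryd --flagfile=/etc/osquery/osquery.flags --verbose --tls_dump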

Common problems

  • Connection refused: The server is not running, or is not listening on the address specified. Is the server listening on an address that is reachable from the host running osquery? Do you have a load balancer that might be blocking connections? Try testing with curl (see the reachability check after this list).
  • No node key returned: This typically indicates that the osquery client sent an incorrect enroll secret that was rejected by the server. Check what osquery is sending by looking in the logs near this error.
  • certificate verify failed: See "How do I fix 'certificate verify failed' errors from osqueryd?" below.
  • bad record MAC: When generating the certificate for your Fleet server, ensure you set the hostname to the FQDN or the IP of the server. This error is common when setting up Fleet servers and accepting the defaults while generating certificates with openssl.
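For the Connection refused case, a quick reachability check from the affected host might look like the following; the hostname and port are assumptions, so substitute your own server's address. Fleet serves a basic health check at /healthz:

curl -v https://fleet.example.com:8080/healthz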

How do I fix "certificate verify failed" errors from osqueryd?

Osquery requires that all communication between the agent and Fleet is over a secure TLS connection. For the safety of osquery deployments, there is no (convenient) way to circumvent this check.

  • Try specifying the path to the full certificate chain used by the server using the --tls_server_certs flag in osqueryd. This is often unnecessary when using a certificate signed by an authority trusted by the system, but is mandatory when working with self-signed certificates. In all cases it can be a useful debugging step (see the sketch after this list).
  • Ensure that the CN or one of the Subject Alternative Names (SANs) on the certificate matches the address at which the server is being accessed. If osquery connects via https://localhost:443, but the certificate is for https://fleet.example.com, the verification will fail.
  • Is Fleet behind a load balancer? If the load balancer is terminating TLS, ensure that its certificate is the one provided to osquery.
  • Does the certificate verify with curl? Try curl -v -X POST https://fleetserver:port/api/v1/osquery/enroll.
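As a sketch of the first step above, pointing osquery at the server's full certificate chain might look like this (the hostname and paths are assumptions):

sudo osqueryd --tls_hostname=fleet.example.com:8080 --tls_server_certs=/etc/osquery/fleet.pem --verbose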

What do I need to do to change the Fleet server TLS certificate?

If both the existing and new certificates verify with osquery's default root certificates (such as a certificate issued by a well-known Certificate Authority), and no certificate chain was deployed with osquery, there is no need to deploy a new certificate chain.

If osquery has been deployed with the full certificate chain (using --tls_server_certs), deploying a new certificate chain is necessary to allow for verification of the new certificate.

Deploying a certificate chain cannot be done centrally from Fleet.
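To see exactly which certificate chain the server presents, for example after rotating the certificate, openssl can be used from any host; the hostname and port are assumptions:

openssl s_client -connect fleet.example.com:8080 -showcerts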

How do I use a proxy server with Fleet?

Seeing your proxy's requests fail with an error like DEPTH_ZERO_SELF_SIGNED_CERT? To get your proxy server's HTTP client to work with a local Fleet when using a self-signed cert, disable SSL / self-signed verification in the client.

The exact solution to this depends on the request client you are using. For example, when using Node.js + Sails.js, you can work around this in the requests you're sending with await sails.helpers.http.get() by lifting your app with the NODE_TLS_REJECT_UNAUTHORIZED environment variable set to 0:

NODE_TLS_REJECT_UNAUTHORIZED=0 sails console

When do I need to deploy a new enroll secret to my hosts?

Osquery provides the enroll secret only during the enrollment process. Once a host is enrolled, the node key it receives remains valid for authentication, independent of the enroll secret.

Hosts that are already enrolled do not need the new enroll secret; their existing enrollment remains valid as long as the host is not deleted from Fleet and the osquery store on the host remains valid. Any newly enrolling hosts must have the new secret.

Deploying a new enroll secret cannot be done centrally from Fleet.
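On hosts, the enroll secret is typically supplied to osquery through a file referenced by the --enroll_secret_path flag, so deploying a new secret means replacing the contents of that file. A sketch, with an illustrative path:

echo '<new_enroll_secret>' | sudo tee /etc/osquery/enroll_secret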

How do I migrate hosts from one Fleet server to another (e.g., testing to production)?

Primarily, this would be done by changing the --tls_hostname and enroll secret to the values for the new server. In some circumstances (see What do I need to do to change the Fleet server TLS certificate?) it may be necessary to deploy a new certificate chain configured with --tls_server_certs.

These configurations cannot be managed centrally from Fleet.
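As a sketch, the relevant lines in each host's osquery flagfile might change to something like the following; the hostname and paths are assumptions:

--tls_hostname=fleet.production.example.com:8080
--enroll_secret_path=/etc/osquery/enroll_secret
--tls_server_certs=/etc/osquery/fleet.pem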

What do I do about "too many open files" errors?

This error usually indicates that the Fleet server has run out of file descriptors. Fix this by increasing the ulimit on the Fleet process. See the LimitNOFILE setting in the example systemd unit file for an example of how to do this with systemd.
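With systemd, that means raising the limit in the unit file's [Service] section; the value below is illustrative:

[Service]
LimitNOFILE=8192

After editing the unit file, reload systemd and restart the service (assuming it is named fleet):

sudo systemctl daemon-reload
sudo systemctl restart fleet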

Some deployments may benefit from setting the --server_keepalive flag to false.

This can also be a symptom of a different issue: on AWS burstable (T-type) instances, periods of increased activity are absorbed by bursting, which consumes CPU credits. If an instance runs out of credits, it stops processing, leaving the file descriptors open.

I upgraded my database, but Fleet is still running slowly. What could be going on?

This could be caused by a mismatched connection limit between the Fleet server and the MySQL server that prevents Fleet from fully utilizing the database. First, determine how many open connections your MySQL server supports. Then set the --mysql_max_open_conns and --mysql_max_idle_conns flags appropriately.
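For example, you might check the MySQL limit and then pass values that keep Fleet safely under it. The numbers below are illustrative; if multiple Fleet servers share the database, the sum of their open connections should stay under the MySQL limit:

mysql -u<username> -p -e "SHOW VARIABLES LIKE 'max_connections';"

fleet serve --mysql_max_open_conns=140 --mysql_max_idle_conns=140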

Why am I receiving a database connection error when attempting to "prepare" the database?

First, check that the installed MySQL version is at least 5.7. Then make sure that a MySQL server is currently running.
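A quick way to check the server version, using the same placeholder connection parameters as the command below:

mysql -u<username> -h<hostname_or_ip> -P<port> -p -e "SELECT VERSION();"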

The next step is to make sure the credentials for the database match what is expected. Test your ability to connect to the database with mysql -u<username> -h<hostname_or_ip> -P<port> -D<database_name> -p.

If you can connect to the database successfully but still receive a database connection error, you may need to specify your database credentials when running fleet prepare db. We encourage putting your database credentials in environment variables or a config file.

fleet prepare db \
    --mysql_address=<database_address> \
    --mysql_database=<database_name> \
    --mysql_username=<username> \
    --mysql_password=<database_password>
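Alternatively, the same settings can be supplied through environment variables. FLEET_MYSQL_USERNAME is mentioned below; the other MySQL settings follow the same FLEET_MYSQL_* pattern:

FLEET_MYSQL_ADDRESS=<database_address> \
FLEET_MYSQL_DATABASE=<database_name> \
FLEET_MYSQL_USERNAME=<username> \
FLEET_MYSQL_PASSWORD=<database_password> \
fleet prepare db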

Is Fleet available as a SaaS product?

No. Currently, Fleet is only available for self-hosting on premises or in the cloud.

Is Fleet compatible with X flavor of MySQL?

Fleet is built to run on MySQL 5.7 or above. However, particularly with AWS Aurora, we recommend 2.10.0 and above, as we've seen issues with anything below that.

What are the MySQL user requirements?

The user that fleet prepare db uses to interact with the database (set via the FLEET_MYSQL_USERNAME environment variable or the --mysql_username=<username> command line flag) needs to be able to create, alter, and drop tables, as well as create temporary tables.
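A minimal sketch of granting those privileges, assuming the database is named fleet and the user is 'fleet'@'%' (normal operation additionally requires ordinary read/write privileges such as SELECT, INSERT, UPDATE, and DELETE):

mysql -uroot -p -e "GRANT CREATE, ALTER, DROP, CREATE TEMPORARY TABLES ON fleet.* TO 'fleet'@'%';"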

What is duplicate enrollment and how do I fix it?

Duplicate host enrollment is when more than one host enrolls in Fleet using the same identifier (hardware UUID or osquery-generated UUID). This can be caused by cloning a VM image with an already enrolled osquery client. To resolve the issue, it's advised to configure --osquery_host_identifier to uuid, and then delete the single host record for that whole set of hosts in the Fleet UI. You can find more information about host identifiers here.
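As a sketch, the relevant server flag looks like this (it can also be set through the corresponding FLEET_OSQUERY_HOST_IDENTIFIER environment variable):

fleet serve --osquery_host_identifier=uuid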