# Fleetctl trigger

## Goal

As a user, I would like to trigger a set of async jobs using `fleetctl`. For example, I'd like to trigger a vulnerability scan or an MDM DEP sync.

The proposed solution to accomplish this goal enables a new CLI command: `fleetctl trigger --name <NAME>`.

## Background

Currently, the Fleet server uses the `schedule` package to create sets of defined jobs that are run serially at defined intervals. The initial schedule interval must be specified at the point the schedule is instantiated via `schedule.New`. Optionally, `schedule.WithConfigReloadInterval` accepts a reload interval function. If specified, the reload interval function is periodically called and its return value becomes the new schedule interval. This mechanism allows the user to modify the schedule interval, for example, by changing the app config; however, there is currently no mechanism to trigger async jobs on an ad hoc basis.

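To make the current mechanism concrete, here is a minimal, self-contained sketch of the pattern (a toy, not the actual `schedule` package): jobs run serially on each tick, and a reload function, playing the role of the one passed to `schedule.WithConfigReloadInterval`, is polled so that its return value becomes the new interval.

```go
package schedule

import (
	"context"
	"time"
)

// reloadIntervalFn mirrors the role of the function passed to
// schedule.WithConfigReloadInterval: called periodically, its return value
// becomes the new schedule interval.
type reloadIntervalFn func(ctx context.Context) (time.Duration, error)

// run is a toy version of the current mechanism: jobs run serially on each
// tick, and the interval is refreshed via the reload function so the user
// can change it, e.g., through app config.
func run(ctx context.Context, interval time.Duration, reload reloadIntervalFn, jobs ...func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, job := range jobs {
				_ = job(ctx) // serial execution; real code would log errors
			}
			if next, err := reload(ctx); err == nil && next != interval {
				interval = next
				ticker.Reset(interval) // adopt the reloaded interval
			}
		}
	}
}
```
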
## Proposal

### New CLI command `fleetctl trigger --name <NAME>`
- Upon this command, the CLI client makes a request to a new authenticated endpoint (see below) to trigger an ad hoc run of the named schedule.

### New `schedule` option `WithTrigger`
- This option adds a `trigger` channel on the `schedule` struct that will trigger an ad hoc run of the scheduled jobs (see the run-loop sketch below).
- The trigger channel for each `schedule` is exposed via a new `schedules` map on the `Service` struct.

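A minimal sketch of how the proposed trigger channel could sit alongside the interval ticker in the schedule's run loop. The struct shape and method names here are assumptions for illustration, not the actual implementation:

```go
package schedule

import (
	"context"
	"time"
)

// Schedule is a pared-down stand-in for the real schedule struct, with the
// proposed trigger channel added.
type Schedule struct {
	interval time.Duration
	trigger  chan struct{}
	jobs     []func(context.Context) error
}

// WithTrigger mirrors the proposed option: it creates the trigger channel
// that the new endpoint sends on.
func (s *Schedule) WithTrigger() *Schedule {
	s.trigger = make(chan struct{})
	return s
}

func (s *Schedule) run(ctx context.Context) {
	ticker := time.NewTicker(s.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C: // regular interval run
		case <-s.trigger: // ad hoc run requested via the new endpoint
			// Open question (see UX considerations below): should the
			// ticker be reset here, e.g., ticker.Reset(s.interval)?
		}
		for _, job := range s.jobs {
			_ = job(ctx) // jobs run serially, as they do today
		}
	}
}
```
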
### New endpoint `GET /trigger?name={:name}`
- The request handler first calls `ds.Lock` to check if the named schedule is locked.
- If the named schedule is unlocked, the request handler sends a trigger signal on the schedule's trigger channel and the server responds with status `202 Accepted`.
- If the named schedule is locked (presumably because the schedule is currently running), the server responds with status `409 Conflict` and includes a message indicating that the schedule is currently locked. It is up to the user to retry. To facilitate retries, the response message could be expanded to include additional status information, such as the expiration time of the current lock. A sketch of this handler flow follows the list.

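In the sketch below, the `Locker` interface, the `Service` fields, and the exact `ds.Lock` signature are assumptions for illustration; the point is only the lock check, the trigger signal, and the two response codes:

```go
package service

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// Locker stands in for the datastore's lock primitive: Lock reports whether
// the named lock was acquired by owner for the given expiration.
type Locker interface {
	Lock(ctx context.Context, name, owner string, expiration time.Duration) (bool, error)
}

type Service struct {
	ds         Locker
	instanceID string
	schedules  map[string]chan struct{} // proposed: schedule name -> trigger channel
}

func (svc *Service) handleTrigger(w http.ResponseWriter, r *http.Request) {
	name := r.URL.Query().Get("name")
	trigger, ok := svc.schedules[name]
	if !ok {
		http.Error(w, fmt.Sprintf("unknown schedule %q", name), http.StatusNotFound)
		return
	}
	// ds.Lock doubles as the lock check: acquisition fails if another
	// instance currently holds the named lock.
	acquired, err := svc.ds.Lock(r.Context(), name, svc.instanceID, 5*time.Minute)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	if !acquired {
		// Could also report the lock's expiration time to help the user retry.
		http.Error(w, fmt.Sprintf("schedule %q is currently locked", name), http.StatusConflict)
		return
	}
	select {
	case trigger <- struct{}{}: // signal an ad hoc run
		w.WriteHeader(http.StatusAccepted)
	default: // a run is already pending on this channel
		http.Error(w, fmt.Sprintf("schedule %q already has a pending trigger", name), http.StatusConflict)
	}
}
```
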
### Schedule locks
- Currently, lock duration is based on the schedule interval.
  - Once an instance takes the lock, it holds the lock for the duration of the interval, even after it has completed the jobs in the schedule.
  - For long-running jobs, the lock may expire before the current instance completes its run, meaning that it is currently possible for another instance to start an overlapping job.
  - If the lockholder instance is terminated or killed, locks are not released, which may frustrate a user's attempt to configure a shorter schedule interval before the lock held by the dead instance expires.

- Under this proposal, locks become more dynamic (a sketch of the fallback mechanism follows this list).
  - The current lockholder releases its lock once it finishes running the schedule.
  - A graceful shutdown process handles the release of locks upon termination signals. Jobs are preemptible, and an interrupt function must be specified for each job, e.g., `schedule.New(...).WithJob("job_name", jobFn, interruptFn)`.
  - As a fallback for cases that can't be handled via graceful shutdown (e.g., `SIGKILL`), the expiration for a lock is initially set to a relatively short default duration (e.g., 5 minutes). The expiration is then periodically extended by the current instance so long as scheduled jobs are running. If the current instance dies without graceful shutdown, the lock will only be held by the dead instance for a short period.

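A sketch of the short-expiration fallback, under assumed lock semantics (the same owner can re-acquire the lock to extend its expiration, and can release it explicitly); none of this is the actual implementation:

```go
package schedule

import (
	"context"
	"time"
)

const defaultLockDuration = 5 * time.Minute // short default, per the proposal

// Locker assumes a lock that the same owner can re-acquire to extend its
// expiration, and that can be released explicitly on graceful shutdown.
type Locker interface {
	Lock(ctx context.Context, name, owner string, expiration time.Duration) (bool, error)
	Unlock(ctx context.Context, name, owner string) error
}

// runWithLock holds the named lock while runJobs executes, extending the
// expiration on a heartbeat. If this instance dies without graceful
// shutdown, the lock expires within defaultLockDuration rather than being
// held for a full schedule interval.
func runWithLock(ctx context.Context, ds Locker, name, owner string, runJobs func(context.Context) error) error {
	acquired, err := ds.Lock(ctx, name, owner, defaultLockDuration)
	if err != nil || !acquired {
		return err
	}
	// Release the lock as soon as the run finishes (or on graceful shutdown).
	defer ds.Unlock(context.Background(), name, owner)

	done := make(chan error, 1)
	go func() { done <- runJobs(ctx) }()

	ticker := time.NewTicker(defaultLockDuration / 2)
	defer ticker.Stop()
	for {
		select {
		case err := <-done:
			return err
		case <-ticker.C:
			// Heartbeat: extend the expiration while jobs are still running.
			if _, err := ds.Lock(ctx, name, owner, defaultLockDuration); err != nil {
				return err
			}
		case <-ctx.Done():
			// Termination signal: runJobs is expected to be preempted via
			// its interrupt function; the deferred Unlock releases the lock.
			return ctx.Err()
		}
	}
}
```
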
### Additional UX considerations
- What are some potential options that would be useful for the `fleetctl trigger` command?
  - Request the current status of the named schedule without triggering a new run. For example, `--status` could provide the user with the running time of the schedule (this would require that we expand the `locks` table to include additional timestamp information, such as lock start time and lock release time).
  - Other useful options?

- What rules should determine when the interval ticker resets? Consider the following cases where `s.scheduleInterval = 1*time.Hour`:
  - The schedule is triggered at 55 minutes into the 1-hour interval and takes 1 minute to complete. When should the schedule run again? (a) after 4 minutes; (b) after 1 hour; (c) after 1 hour plus 4 minutes; (d) other.
  - The schedule is triggered at 55 minutes into the 1-hour interval and takes 11 minutes to complete. When should the schedule run again? (a) immediately; (b) after 1 hour; (c) after 54 minutes; (d) other.

- What should be logged?
  - A debug log if schedule runtime exceeds the schedule interval, to aid detection and troubleshooting of long-running jobs.
  - Other useful logs?