fleet/proposals/Fleetctl-trigger.md
2022-12-02 10:05:16 -06:00

4.7 KiB
Raw Blame History

Fleetctl trigger

Goal

As a user, I would like to trigger a set of async jobs using fleetctl. For example, Id like to trigger a vuln scan, or an MDM dep sync.

The proposed solution to accomplish this goal enables a new CLI command: fleetctl trigger --name <NAME>.

Background

Currently, the Fleet server uses the schedule package to create sets of defined jobs that are run serially at defined intervals. The initial schedule interval must be specified at the point the schedule is instantiated via schedule.New. Optionally, schedule. WithConfigReloadInterval accepts a reload interval function. If specified, the reload interval function is periodically called and its return value becomes the new the schedule interval. This mechanism allows the schedule interval to be modified by user, for example, by changing the app config; however, there no mechanism currently to trigger async jobs on an ad hoc basis.

Proposal

New CLI command fleetctl trigger --name <NAME>

  • Upon this command, the CLI client makes a request to a new authenticated endpoint (see below) to trigger an ad hoc run of the named schedule.

New schedule option WithTrigger

  • This option adds a trigger channel on the schedule struct that will trigger an ad hoc run of the scheduled jobs.
  • The trigger channel for each schedule is exposed via a new schedules map on the Service struct.

New endpoint GET /trigger?name={:name}

  • The request handler first calls ds.Lock to check if the named schedule is locked.
  • If the named schedule is unlocked, request handler sends a trigger signal on the schedule's trigger channel and the server responds with status 202 Accepted.
  • If the named schedule is locked (presumably because the schedule is currently running), the server responds with status 409 Conflict and includes a message indicating the schedule is currently locked. It is up to the user to retry. To facilitate retries, the response message could be expanded to include additional status information, such as the expiration time of the current lock.

Schedule locks

  • Currently, lock duration is based on the schedule interval.

    • Once an instance takes the lock, it will hold the lock for the duration of the interval, even after it has completed the jobs in the schedule.
    • For long-running jobs, the lock may expire before the current instance completes its run, meaning that it is currently possible for another instance to start an overlapping job.
    • If the lockholder instance is terminated or killed, locks are not released, which may frustrate a user's attempt to configure a shorter schedule interval before the lock held by the dead instance expires.
  • Under this proposal, locks become more dynamic.

    • Current lockholder releases its lock once it finishes running the schedule.
    • Graceful shutdown process handles release of locks upon termination signals. Jobs are preemptable and an interrupt function must be specified for each job, e.g., schedule.New(...).WithJob("job_name", jobFn, interruptFn).
    • As a fallback for cases that can't be handled via graceful shutdown (e.g., SIGKILL), the expiration for a lock is initially set to a relatively short default duration (e.g., 5 minutes). The expiration is then periodically extended by the current instance so long as scheduled jobs are running. If the current instance dies without graceful shutdown, the lock will only be held by the dead instance for a short period.

Additional UX considerations

  • What are some potential options that would be useful for the fleetctl trigger command?

    • Request the current status of the the named schedule without triggering a new run. For example, --status could provide the user with the running time of the schedule (this would require that we expand the locks table to include additional timestamp information, such as lock start time and lock release time).
    • Other useful options?
  • What rules should determine when the interval ticker resets? Consider the following cases where s.scheduleInterval = 1*time.Hour:

    • The schedule is triggered at 55 minutes into the 1-hour interval and takes 1 minute to complete. When should the schedule run again? (a) after 4 minutes; (b) after 1 hour; (c) after 1 hour plus 4 minutes; (d) other
    • The schedule is triggered at 55 minutes into the 1-hour interval and takes 11 minutes to complete. When should the schedule run again? (a) immediately; (b) after 1 hour; (c) after 54 minutes; (d) other
  • What should be logged?

    • Debug log if schedule runtime exceeds schedule interval to aid detection/troubleshooting of long-running jobs.
    • Other useful logs?