Merge pull request #25856 from jfindlay/intro_scale
expand minion reauth scalability documentation
===================
Using Salt at scale
===================

The focus of this tutorial will be building a Salt infrastructure for handling
large numbers of minions. This will include tuning, topology, and best
practices.

For instructions on installing the Salt Master, see
`Installing saltstack <http://docs.saltstack.com/topics/installation/index.html>`_.

.. note::

    When used with minions, the term 'many' refers to at least a thousand
    and 'a few' always means 500.

For simplicity, this tutorial will default to the standard ports used by Salt.

The Master
==========

The most common problems on the Salt Master are:

1. too many minions authing at once
2. too many minions re-authing at once
3. too many minions re-connecting at once
4. too many minion returns at once
5. too few resources (CPU/HDD)

The first three are all "thundering herd" problems. To mitigate these issues
we must configure the minions to back off appropriately when the Master is
under heavy load.

The fourth is caused by masters with too few hardware resources in combination
with a possible bug in ZeroMQ; at least that's what it looks like so far
(`Issue 5948 <https://github.com/saltstack/salt/issues/5948>`_,
`Mail thread <https://groups.google.com/forum/#!searchin/salt-users/lots$20of$20minions/salt-users/WxothArv2Do/t12MigMQDFAJ>`_).

To fully understand each problem, it is important to understand how Salt works.

Very briefly, the Salt Master offers two services to the minions:

- a job publisher on port 4505
- an open port 4506 to receive the minions' returns

All minions are always connected to the publisher on port 4505 and only connect
to the open return port 4506 if necessary. On an idle Master, there will only
be connections on port 4505.

Too many minions authing
------------------------

When the Minion service is first started up, it will connect to its Master's
publisher on port 4505. If too many minions are started at once, this can
cause a "thundering herd". This can be avoided by not starting too many
minions at once; one way to stagger the starts is sketched below.
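
A minimal sketch of staggered starts (hostnames, host count, and the delay
are illustrative; any orchestration tool would do):

.. code-block:: bash

    #!/bin/sh
    # Start salt-minion on each host with a small pause in between,
    # so the minions do not all hit the Master's publisher at once.
    for host in minion001 minion002 minion003; do
        ssh "$host" 'systemctl start salt-minion'
        sleep 2
    done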

The connection itself usually isn't the culprit; the more likely cause of
master-side issues is the authentication that the Minion must do with the
Master. If the Master is too heavily loaded to handle the auth request, it
will time it out. The Minion will then wait `acceptance_wait_time` before
retrying. If `acceptance_wait_time_max` is set, the Minion will increase its
wait time by `acceptance_wait_time` on each subsequent retry until reaching
`acceptance_wait_time_max`.
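
A minimal sketch of that back-off in the minion configuration file (the
values are illustrative, not recommendations):

.. code-block:: yaml

    # wait 10 seconds after a timed-out auth attempt, growing by 10 seconds
    # per retry until the ceiling of 300 seconds is reached
    acceptance_wait_time: 10
    acceptance_wait_time_max: 300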

Too many minions re-authing
---------------------------

This is most likely to happen in the testing phase of a Salt deployment, when
all Minion keys have already been accepted, but the framework is being tested
and parameters are frequently changed in the Salt Master's configuration
file(s).

The Salt Master generates a new AES key to encrypt its publications at certain
events, such as a Master restart or the removal of a Minion key. If you are
encountering this problem of too many minions re-authing against the Master,
you will need to recalibrate your setup to reduce the rate of events like a
Master restart or Minion key removal (``salt-key -d``).

When the Master generates a new AES key, the minions aren't notified of this
but will discover it on the next pub job they receive. When the Minion
receives such a job it will then re-auth with the Master. Since Salt does
minion-side filtering, this means that all the minions will re-auth on the
next command published on the Master, causing another "thundering herd". This
can be avoided by setting the

.. code-block:: yaml

    random_reauth_delay: 60

in the minion's configuration file to a higher value, staggering the re-auth
attempts. Increasing this value will of course increase the time it takes
until all minions are reachable via Salt commands.

Too many minions re-connecting
------------------------------

By default the zmq socket will re-connect every 100ms, which for some larger
installations may be too quick. This will control how quickly the TCP session
is re-established, but has no bearing on the auth load.

To tune the minions' socket reconnect attempts, there are a few values in the
sample configuration file (default values).

To tune these values to an existing environment, a few decisions have to be made:

1. How long can one wait before the minions should be online and reachable via Salt?

2. How many reconnects can the Master handle without a SYN flood?

These questions cannot be answered generally. Their answers depend on the
hardware and the administrator's requirements.

Here is an example scenario with the goal of having all minions reconnect
within a 60-second time frame on a Salt Master service restart:

.. code-block:: yaml

    recon_default: 1000
    recon_max: 59000
    recon_randomize: True

Each Minion will have a randomized reconnect value between 'recon_default'
and 'recon_default + recon_max', which in this example means between 1000ms
and 60000ms (or between 1 and 60 seconds). The generated random value will
be doubled after each attempt to reconnect (ZeroMQ default behavior).
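
To illustrate the doubling (the drawn value here is hypothetical): if a Minion
draws 11000ms, its reconnect attempts wait roughly 11, then 22, then 44
seconds, and so on, so the reconnect load from many minions is spread out over
the window instead of arriving all at once.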

These values should be altered to values that match your environment. Keep in
mind, though, that your environment may grow over time and that more minions
might raise the problem again.

Too many minions returning at once
----------------------------------

This can also happen during the testing phase, if all minions are addressed
at once with

.. code-block:: bash

    $ salt * test.ping

it may cause thousands of minions to try to return their data to the Salt
Master's open port 4506 at once, causing a SYN flood if the Master can't
handle that many returns at once.

This can be easily avoided with Salt's batch mode:

.. code-block:: bash

    $ salt * test.ping -b 50

This will only address 50 minions at once while looping through all addressed
minions.
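
The batch size can also be given as a percentage of the targeted minions (the
value here is illustrative):

.. code-block:: bash

    $ salt * test.ping -b 10%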

Too few resources
=================

The Master's resources always have to match the environment. There is no way
to give good advice without knowing the environment the Master is supposed to
run in, but here are some general tuning tips for different situations:

The Master is CPU bound
-----------------------

Salt uses RSA key pairs on the Master's and the minions' ends. Both generate
4096-bit key pairs on first start. While the key size for the Master is
currently not configurable, the Minion's key size can be configured. For
example, with a 2048-bit key:

.. code-block:: yaml

    keysize: 2048

The time that can be saved on the Master's end should not be neglected; see
`Pull Request 9235 <https://github.com/saltstack/salt/pull/9235>`_ for how
much influence the key size can have.

Downsizing the Salt Master's key is not that important, because the minions
do not encrypt as many messages as the Master does.

The Master is disk IO bound
---------------------------

By default, the Master saves every Minion's return for every job in its
job cache. The cache can then be used later to look up results for previous
jobs. The default directory for this is:

.. code-block:: yaml

    cachedir: /var/cache/salt

and then in the ``/proc`` directory.
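
To get a sense of how much disk the cache currently uses, something like the
following can help (the path assumes the default ``cachedir`` shown above and
may differ in your layout):

.. code-block:: bash

    $ du -sh /var/cache/salt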
Each job return for every Minion is saved in a single file. Over time this
directory can grow quite large, depending on the number of published jobs. The
amount of files and directories will scale with the number of jobs published and
the retention time defined by

.. code-block:: yaml

    keep_jobs: 24

If no job history is needed, the job cache can be disabled:

.. code-block:: yaml

    job_cache: False

If the job cache is necessary, there are (currently) two options:

- ext_job_cache: this will have the minions store their return data directly
  into a returner (not sent through the Master)
- master_job_cache (New in `2014.7.0`): this will make the Master store the
  job data using a returner (instead of the local job cache on disk).
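
A minimal sketch of the master_job_cache option (assuming a reachable Redis
instance and the Redis returner's dependencies installed; host, port, and db
are illustrative):

.. code-block:: yaml

    # Master configuration: store job data via the redis returner
    # instead of the local job cache on disk
    master_job_cache: redis
    redis.db: '0'
    redis.host: redis.example.com
    redis.port: 6379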