Merge pull request #25856 from jfindlay/intro_scale

expand minion reauth scalability documentation
Thomas S Hatch 2015-07-30 09:33:17 -06:00
commit b6805b068a


===================
Using Salt at scale
===================
The focus of this tutorial will be building a Salt infrastructure for handling
large numbers of minions. This will include tuning, topology, and best practices.
For how to install the Salt Master, see
`Installing saltstack <http://docs.saltstack.com/topics/installation/index.html>`_.
.. note::

    For the purposes of this tutorial, 'many' minions means at least a
    thousand and 'a few' always means 500.

For simplicity, this tutorial will default to the standard ports used by Salt.
The Master
==========
The most common problems on the Salt Master are:
1. too many minions authing at once
2. too many minions re-authing at once
3. too many minions re-connecting
4. too many minions returning at once
5. too few resources (CPU/HDD)
The first three are all "thundering herd" problems. To mitigate these issues
we must configure the minions to back-off appropriately when the Master is
under heavy load.
The fourth is caused by masters with little hardware resources in combination
with a possible bug in ZeroMQ. At least that's what it looks like till today
(`Issue 5948 <https://github.com/saltstack/salt/issues/5948>`_,
`Mail thread <https://groups.google.com/forum/#!searchin/salt-users/lots$20of$20minions/salt-users/WxothArv2Do/t12MigMQDFAJ>`_).
To fully understand each problem, it is important to understand how Salt works.
Very briefly, the Salt Master offers two services to the minions.
- a job publisher on port 4505
- an open port 4506 to receive the minions' returns
All minions are always connected to the publisher on port 4505 and only connect
to the open return port 4506 if necessary. On an idle Master, there will only
be connections on port 4505.
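
To get a rough idea of both in practice, the connections can be inspected
directly on the Master. A quick sketch using ``ss`` (assuming the ``iproute2``
tools are installed):

.. code-block:: bash

    # count established publisher connections (roughly one per connected minion)
    ss -tn state established '( sport = :4505 )' | tail -n +2 | wc -l

    # connections on the return port only appear while minions deliver results
    ss -tn state established '( sport = :4506 )'
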
Too many minions authing
------------------------
When the Minion service is first started up, it will connect to its Master's publisher
on port 4505. If too many minions are started at once, this can cause a "thundering herd".
This can be avoided by not starting too many minions at once.
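
As a rough sketch of one way to do this (assuming the minion hosts are listed
in a file and run systemd), the minion services can be started in small,
staggered batches:

.. code-block:: bash

    # start minions one at a time so they do not all auth simultaneously
    while read -r host; do
        ssh -n "$host" 'sudo systemctl start salt-minion'
        sleep 2   # spread the initial auth load on the Master
    done < minion-hosts.txt
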
The connection itself usually isn't the culprit; the more likely cause of
Master-side issues is the authentication that the Minion must do with the
Master. If the Master is too heavily loaded to handle the auth request it will
time it out. The Minion
will then wait `acceptance_wait_time` to retry. If `acceptance_wait_time_max` is
set then the Minion will increase its wait time by the `acceptance_wait_time` each
subsequent retry until reaching `acceptance_wait_time_max`.
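
For example, a minion configuration along these lines (the values are
illustrative, not recommendations) makes the back-off explicit:

.. code-block:: yaml

    # seconds to wait before retrying a timed-out auth attempt
    acceptance_wait_time: 10
    # upper bound the wait time grows towards on repeated retries
    acceptance_wait_time_max: 60
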
Too many minions re-authing
---------------------------
This is most likely to happen in the testing phase of a Salt deployment, when
all Minion keys have already been accepted, but the framework is being tested
and parameters are frequently changed in the Salt Master's configuration
file(s).

The Salt Master generates a new AES key to encrypt its publications at certain
events such as a Master restart or the removal of a Minion key. If you are
encountering this problem of too many minions re-authing against the Master,
you will need to recalibrate your setup to reduce the rate of events like a
Master restart or Minion key removal (``salt-key -d``).

When the Master generates a new AES key, the minions aren't notified of this
but will discover it on the next pub job they receive. When the Minion
receives such a job it will then re-auth with the Master. Since Salt does
minion-side filtering this means that all the minions will re-auth on the next
command published on the Master, causing another "thundering herd". This can
be avoided by setting the
.. code-block:: yaml

    # stagger re-auth attempts by a random delay of up to this many seconds
    random_reauth_delay: 60

in the Minion's configuration file to a higher value to stagger the re-auth
attempts. Increasing this value will of course increase the time
it takes until all minions are reachable via Salt commands.
Too many minions re-connecting
------------------------------
By default the zmq socket will re-connect every 100ms which for some larger
installations may be too quick. This will control how quickly the TCP session is
re-established, but has no bearing on the auth load.
To tune the reconnect values (``recon_default``, ``recon_max``,
``recon_randomize``) for an existing environment, a few decisions have to be
made:

1. How long can one wait before the minions should be online and reachable via Salt?
2. How many reconnects can the Master handle without a syn flood?
These questions cannot be answered generally. Their answers depend on the
hardware and the administrator's requirements.
Here is an example scenario with the goal of having all minions reconnect
within a 60-second time-frame on a Salt Master service restart.

.. code-block:: yaml

    recon_default: 1000
    recon_max: 59000
    recon_randomize: True

Each Minion will have a randomized reconnect value between 'recon_default'
and 'recon_default + recon_max', which in this example means between 1000ms
and 60000ms (or between 1 and 60 seconds). The generated random-value will
be doubled after each attempt to reconnect (ZeroMQ default behavior).
With about 1000 minions reconnecting within the 60-second window, this works
out to roughly 16 connection attempts a second. These values should be altered
to values that match your environment. Keep in mind, though, that the
environment may grow over time and that more minions might raise the problem
again.
Too many minions returning at once
----------------------------------
If all minions are addressed at once with

.. code-block:: bash

    $ salt * test.ping

it may cause thousands of minions to try to return their data to the Salt
Master's open port 4506, causing a syn flood if the Master can't handle that
many returns at once.
This can be easily avoided with Salt's batch mode:
.. code-block:: bash

    $ salt * test.ping -b 50

This will only address 50 minions at once while looping through all addressed
minions.
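
The batch size can also be given as a percentage of the targeted minions,
which keeps the command proportional as the environment grows; the 10% figure
below is only an illustration:

.. code-block:: bash

    $ salt * test.ping -b 10%
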
Too few resources
=================
The Master's resources always have to match the environment. There is no way
to give good advice without knowing the environment the Master is supposed to
run in. But here are some general tuning tips for different situations:
The Master is CPU bound
-----------------------
Salt uses RSA key-pairs on the Master's and minions' end. Both generate
4096-bit key-pairs on first start. While the key-size for the Master is
currently not configurable, the Minion's key-size can be set via the
``keysize`` option. For example, the following minion configuration snippet
sets a 2048-bit key:
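
.. code-block:: yaml

    keysize: 2048
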
The impact on the Master's end should not be neglected, though. See
`Pull Request 9235 <https://github.com/saltstack/salt/pull/9235>`_ for a
reference on how much influence the key-size can have.
Downsizing the Salt Master's key is not that important, because the minions
do not encrypt as many messages as the Master does.
The Master is disk IO bound
---------------------------
By default, the Master saves every Minion's return for every job in its
job-cache. The cache can then be used later to look up results for previous
jobs. The default location for this is under the Master's cache directory (the
``cachedir`` setting), and then in the ``/proc`` directory.
Each job return for every Minion is saved in a single file. Over time this
directory can grow quite large, depending on the number of published jobs. The
amount of files and directories will scale with the number of jobs published and
the retention time defined by the Master's ``keep_jobs`` option.

If no job history is needed, the job cache can be disabled altogether.
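
Both of these knobs live in the master configuration file; a minimal sketch
with illustrative values:

.. code-block:: yaml

    # keep job data in the job cache for this many hours
    keep_jobs: 24

    # or, if no job history is needed, disable the job cache entirely
    #job_cache: False
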
If the job cache is necessary, there are (currently) two options:
- ext_job_cache: this will have the minions store their return data directly
  into a returner (not sent through the Master)
- master_job_cache (New in `2014.7.0`): this will make the Master store the job
  data using a returner (instead of the local job cache on disk).
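
As a sketch (assuming a Redis returner is already configured and reachable
from the Master and minions), either option is a single line in the master
configuration file:

.. code-block:: yaml

    # minions write their returns straight to the returner
    ext_job_cache: redis

    # or have the Master store job data via the returner (2014.7.0 and later)
    #master_job_cache: redis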