On Wed, Sep 07, 2022 at 09:33:07PM +0300, Julian Anastasov wrote:
> On Mon, 5 Sep 2022, Jiri Wiesner wrote:
>
> > I believe allowing the kthread the possibility to block in each iteration -
> > for each estimator - introduces some problems:
> > 1. Non-preemptive kernels should be optimized for throughput not for latency
> > Using the figure reported earlier (50,000 services required 200 ms of
> > processing time) it takes roughly 3 ms to process one chain (32 * 1024 / 50
> > services). The processing time per chain will vary with the number of NUMA
> > nodes and CPUs. Nevertheless, this number is comparable with the processing
> > time limit of __do_softirq() - 2 ms, which gets converted to 1 jiffy. In
> > terms of latency of non-preemptive kernels, it is entirely reasonable to let
> > one chain be processed without rescheduling the kthread.
>
> My worry is that IPVS_EST_MAX_COUNT is not a hard limit,
> we allow this limit to be exceeded when ip_vs_est_add_kthread()
> fails to create kthread (mostly on ENOMEM)
A design decision has been taken to allow more than IPVS_EST_MAX_COUNT
estimators in a kthread. The reason is to be able to add estimators even if
there is no memory to create a new kthread. Under such memory conditions, user
space would not be able to fork new processes - the shell could not execute
commands. The system is practically unusable until the OOM killer frees some
memory. No production system should be expected to function in the long term
when there is not enough memory to fork new processes. The solutions are: add
more RAM, decrease the workload or fix memory leaks. Failure to create a
kthread can be expected to occur only rarely, and it would have to occur while
services or destinations are being reconfigured. So, why bother with allowing
more than IPVS_EST_MAX_COUNT estimators in a kthread? It is common to return
-ENOMEM to user space under such conditions.
> or if we add a limit
> for kthreads, so in such cases we can not refuse to add more
> estimators to existing kthreads.
As for limiting the number of kthreads, the limit could be implemented as a
sysctl variable. The default should be large enough for most deployments. I
guess it could accommodate more than 900,000 estimators, which means 30
kthreads. The maximum number of destinations I have seen on a system was
110,000, with 7,000 services (but I have not seen many production systems
running IPVS).
The current design does not take into account estimators that have been
removed. As a result, kthreads that are not the last kthread could have fewer
than IPVS_EST_MAX_COUNT estimators. This could be prevented by changing
ipvs->est_add_ktid in ip_vs_stop_estimator() if the value of est->ktid is
lower than the current ipvs->est_add_ktid. ip_vs_start_estimator() would then
have to increment ipvs->est_add_ktid until it finds a kthread with fewer than
IPVS_EST_MAX_COUNT estimators or it has to create a new kthread.
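Roughly like this (just a sketch to illustrate the idea; est_kt_arr[],
est_kt_count and kd->est_count are names I made up, the real fields in the
patchset may differ):

	/* in ip_vs_stop_estimator(): the removal opened a slot in an
	 * earlier kthread, so restart the search for room from there
	 */
	if (est->ktid < ipvs->est_add_ktid)
		ipvs->est_add_ktid = est->ktid;

	/* in ip_vs_start_estimator(): advance est_add_ktid until a
	 * kthread with room is found, otherwise create a new kthread
	 */
	while (ipvs->est_add_ktid < ipvs->est_kt_count) {
		kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
		if (kd && kd->est_count < IPVS_EST_MAX_COUNT)
			break;
		ipvs->est_add_ktid++;
	}
	if (ipvs->est_add_ktid >= ipvs->est_kt_count) {
		ret = ip_vs_est_add_kthread(ipvs);
		if (ret < 0)
			return ret;	/* e.g. -ENOMEM */
	}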
> > 3. Blocking while in ip_vs_estimation_chain() will result in wrong
> > estimates for the remaining estimators in a chain. The 2 second interval
> > would not be kept, rate estimates would be overestimated in that interval
> > and underestimated in one of the future intervals.
> > In my opinion, any of the above reasons is sufficient to remove
> > ip_vs_est_cond_resched_rcu(), ip_vs_est_wait_resched() and kd->mutex.
>
> The chains can become too long. I'm not sure it is
> good to avoid cond_resched() for long time. Another solution
> would be to allocate more chains for a tick and apply some
> hard limit for these chains. cond_resched() will be called
> after such chain is processed. But it is difficult to
> calculate hard limit for these chains, it depends on the
> cycles we spend (CPU speed and the number of possible CPUs
> we walk in the loop). For example, we can have again 50 ticks
> but 16 chains per tick, so total 800 chains per kthread (50*16).
> 32*1024/800 means 40 estimators per chain before calling
> cond_resched().
OK, this seems workable. All the chains could be in one array:

	struct hlist_head chains[IPVS_EST_NCHAINS * 16];

and you would increment row by 16 in ip_vs_estimation_kthread().
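To make sure we mean the same thing, roughly (a sketch only; declarations of
e, i and the per-estimator work are omitted):

	/* kd->chains[] is the flat array above; with 50 * 16 = 800
	 * chains per kthread this means roughly 40 estimators between
	 * two cond_resched_rcu() calls
	 */
	struct hlist_head *chain = &kd->chains[row];

	rcu_read_lock();
	for (i = 0; i < 16; i++, chain++) {
		hlist_for_each_entry_rcu(e, chain, list) {
			if (kthread_should_stop())
				goto unlock;
			/* per-estimator estimation goes here */
		}
		cond_resched_rcu();
	}
unlock:
	rcu_read_unlock();
	row = (row + 16) % (IPVS_EST_NCHAINS * 16);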
> And a tick can attach more than 16 chains
> if the est_max_count is exceeded. It will cost some memory
> to provide more chains but it will avoid the kd->mutex
> games.
>
> struct hlist_head *tick_chain = &kd->tick_chains[row];
>
> rcu_read_lock();
> /* tick_chains has no limit of chains */
> hlist_for_each_entry_rcu(chain, tick_chains, list) {
> /* This list below is limited */
> hlist_for_each_entry_rcu(e, &chain->ests, list) {
> ...
> if (kthread_should_stop())
> goto end;
> ...
> }
> cond_resched_rcu();
> }
So, you set a limit - IPVS_EST_MAX_COUNT - but you also allow this limit to be
exceeded because of a corner case, and as a result tick_chains must have no
limit on the number of chains. I lean towards keeping the maximum of
IPVS_EST_MAX_COUNT estimators per kthread.
There is an alternative design where you could increase kd->est_max_count for
all kthreads once all of the available kthreads have kd->est_max_count
estimators. Nevertheless, there would also have to be a limit to the value of
kd->est_max_count. Imagine the estimation during a single tick taking so long
that the gap variable in ip_vs_estimation_kthread() became negative. That
would require circa 250,000 estimators per kthread. Since you are already
measuring the timeout you need for schedule_timeout() in
ip_vs_estimation_kthread(), it should be possible to set the kd->est_max_count
limit based on the maximum processing time per chain. It could be half an
IPVS_EST_TICK, for example.
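As a sketch (the variable "spent" for the measured per-tick processing time in
jiffies and kd->est_count are placeholder names; assumes estimators are spread
evenly over the ticks):

	/* after processing the chain(s) of one tick: if the tick took
	 * more than half of IPVS_EST_TICK, scale est_max_count down so
	 * that future ticks stay under that budget
	 */
	if (spent > IPVS_EST_TICK / 2 && kd->est_count)
		kd->est_max_count = kd->est_count *
				    (IPVS_EST_TICK / 2) / spent;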
But it seems to me that the alternative design - increasing kd->est_max_count -
should have some support in what is used in production. Are there servers with
more than 983,040 estimators (which would be IPVS_EST_MAX_COUNT * 30 kthreads)
or even one third of that?
> If we change ip_vs_start_estimator() to return int err
> we can safely allocate new chains and track for ENOMEM.
> I think, this is possible, with some reordering of the
> ip_vs_start_estimator() calls.
I think ip_vs_start_estimator() should return an int.
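That would also make error handling at the call sites natural, e.g. (just the
pattern, the exact call sites in ip_vs_ctl.c will differ):

	ret = ip_vs_start_estimator(ipvs, &svc->stats);
	if (ret < 0)
		goto out_err;	/* e.g. -ENOMEM, undo previous steps */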
> > > Kthread data:
> > >
> > > - every kthread works over its own data structure and all
> > > such structures are attached to array
> >
> > It seems to me there is no upper bound on the number of kthreads that could
> > be forked. Therefore, it should be possible to devise an attack that would
> > force the system to run out of PIDs:
> > 1. Start adding services so that all chains of kthread A would be used.
> > 2. Add one more service to force the forking of kthread B, thus advancing
> > ipvs->est_add_ktid.
> > 3. Remove all but one service from kthread A.
> > 4. Repeat steps 1-3 but with kthread B.
> > I think I could come up with a reproducer if need be.
>
> Agreed, the chains management can be made more robust.
> This patchset is initial version that can serve as basis.
> We should consider such things:
>
> - we do not know how many estimators will be added, if we
> limit the number of kthreads, then they will be overloaded
Not all kthreads would be overloaded. The current design would add all the
excess estimators into the last kthread.
> - add/del can be made to allocate memory for chains, if needed.
> We should not spend many cycles in adding/deleting estimators.
The fewer allocations the better. I would not use a linked list as described
above - kd->tick_chains.
> - try more hard to spread estimators to chains, even with
> the risk of applying initially a smaller timer for the newly
> added estimator.
Yes.
> > Unbalanced chains would not be fatal to the system but there is risk of
> > introducing scheduling latencies tens or even hundreds of milliseconds
> > long. There are patterns of adding and removing chains that would result
> > in chain imbalance getting so severe that a handful of chains would have
> > estimators in them while the rest would be empty or almost empty. Some
> > examples:
> > 1. Adding a new service each second after sleeping for 1 second. This would
> > use the add_row value at the time of adding the estimator, which would
> > result in 2 chains holding all the estimators.
> > 2. Repeated addition and removal of services. There would always be more
> > services added than removed. The additions would be carried out in bursts.
> > The forking of a new kthread would not be triggered because the number of
> > services would stay under IPVS_EST_MAX_COUNT (32 * 1024).
> >
> > The problem is that the code does not guarantee that the length of chains
> > always stays under IPVS_EST_MAX_COUNT / IPVS_EST_NCHAINS (32 * 1024 / 50)
> > estimators. A check and a cycle iterating over available rows could be
> > added to ip_vs_start_estimator() to find the rows that still have fewer
> > estimators than IPVS_EST_MAX_COUNT / IPVS_EST_NCHAINS. This would come at
> > the expense of having inaccurate estimates for a few intervals but I think
> > the trade-off is worth it. Also, the estimates will be inaccurate when
> > estimators are added in bursts. If, depending on how services are added and
> > removed, the code could introduce scheduling latencies there would be
> > someone running into this sooner or later. The probability of severe chain
> > imbalance being low is not good enough; there should be a guarantee.
>
> About balancing the chains: not sure if it is possible
> while serving some row for current tick, to look ahead and
> advance the current tick work with estimators from the next
> tick. Such decisions can be made every tick. I.e. we do not
> move entries between the chains but we can walk one chain and
> possibly part of the other chain. If we have more chains per
> tick, it will be more easy, for example, if current tick
> processes 16 chains by default, it can process some from the
> next 16 chains, say 16+8 in total. After 2 seconds, this tick can
> process 16+16+8, for example. It will depend on the currently
> added estimators per tick and its chains. Every 2 seconds
> we can advance with one tick, so that an estimator is
> served every 2 seconds or at least after 1960ms if such
> chain stealing happens. Such logic will allow the
> estimators to be spread in time even while they are attached
> to chains and ticks with different lengths. As a result, we will
> process equal number of estimators per tick by slowly
> adjusting to their current number and chain lengths.
I am definitely not suggesting implementing balancing. I propose that each
chain have a hard limit on the number of estimators, along the lines of the
sketch below.
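Something like this in ip_vs_start_estimator(), assuming one chain per row as
in the current patchset (a sketch; add_row and the per-chain counter
chain_len[] are names of my choosing):

	/* start from the row of the current tick but skip rows whose
	 * chain already holds the per-chain maximum of estimators
	 */
	int limit = IPVS_EST_MAX_COUNT / IPVS_EST_NCHAINS;	/* ~655 */
	int i, row = kd->add_row;

	for (i = 0; i < IPVS_EST_NCHAINS; i++) {
		if (kd->chain_len[row] < limit)
			break;
		row = (row + 1) % IPVS_EST_NCHAINS;
	}
	kd->chain_len[row]++;
	hlist_add_head_rcu(&est->list, &kd->chains[row]);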
--
Jiri Wiesner
SUSE Labs