Re: [RFC PATCH 0/4] Use kthreads for stats

To: Julian Anastasov <ja@xxxxxx>
Subject: Re: [RFC PATCH 0/4] Use kthreads for stats
Cc: Simon Horman <horms@xxxxxxxxxxxx>, lvs-devel@xxxxxxxxxxxxxxx, yunhong-cgl jiang <xintian1976@xxxxxxxxx>, yunhjiang@xxxxxxxx,, tangyang@xxxxxxxxx
From: Jiri Wiesner <jwiesner@xxxxxxx>
Date: Mon, 5 Sep 2022 10:26:42 +0200
On Sat, Aug 27, 2022 at 08:41:50PM +0300, Julian Anastasov wrote:
>       This patchset implements stats estimation in
> kthread context. Simple tests do not show any problem.
> Please review, comment, test, etc.

Thank you for this, Julian.

> RCU Locking:
> - when RCU preemption is enabled the kthreads use just RCU
> lock for walking the chains and we do not need to reschedule.
> May be this is the common case for distribution kernels.
> In this case ip_vs_stop_estimator() is completely lockless.

Yes, it is the case for SUSE since SLES 15.4, which is a recent release.

> - when RCU preemption is not enabled, we reschedule by using
> refcnt for every estimator to track if the currently removed
> estimator is used at the same time by kthread for estimation.
> As RCU lock is unlocked during rescheduling, the deletion
> should wait kd->mutex, so that a new RCU lock is applied
> before the estimator is freed with RCU callback.

I believe allowing the kthread the possibility to block in each iteration - for 
each estimator - introduces some problems:
1. Non-preemptive kernels should be optimized for throughput not for latency
Using the figure reported earlier (50,000 services required 200 ms of 
processing time) it takes roughly 3 ms to process one chain (32 * 1024 / 50 
services). The processing time per chain will vary with the number of NUMA 
nodes and CPUs. Nevertheless, this number comparable with the processing time 
limit of __do_softirq() - 2 ms, which gets converted to 1 jiffy. In term of 
latency of non-preemptive kernels, it is entirely resonable to let one chain be 
processed without rescheduling the kthread.
2. The priority of the kthreads could be set to lower values than the default 
priority for SCHED_OTHER. If a user space process was trying to stop an 
estimator the pointer to which is held by a currently sleeping kthread this 
would constitute priority inversion. The kd->mutex would not be released until 
the lower priority thread, the kthread, has started running again. AFAIK, the 
mutex() locking primitive does not implement priority inheritance on 
non-preemptive kernels.
3. Blocking while in ip_vs_estimation_chain() will results in wrong estimates 
for the remaining estimators in a chain. The 2 second interval would not be 
kept, rate estimates would be overestimated in that interval and underestimated 
in one of the future intervals.
In my opinion, any of the above reasons is sufficient to remove 
ip_vs_est_cond_resched_rcu(), ip_vs_est_wait_resched() and kd->mutex.

> - As stats are now RCU-locked, tot_stats, svc and dest which
> hold estimator structures are now always freed from RCU
> callback. This ensures RCU grace period after the
> ip_vs_stop_estimator() call.

I think this is sound.

> Kthread data:
> - every kthread works over its own data structure and all
> such structures are attached to array

It seems to me there is no upper bound on the number of kthreads that could be 
forked. Therefore, it should be possible to devise an attack that would force 
the system to run out of PIDs:
1. Start adding services so that all chains of kthread A would be used.
2. Add one more service to force the forking of kthread B, thus advancing 
3. Remove all but one service from kthread A.
4. Repeat steps 1-3 but with kthread B.
I think I could come up with a reproducer if need be.

> - to add new estimators we use the last added kthread
> context (est_add_ktid). The new estimators are linked to
> the chain just before the estimated one, based on add_row.
> This ensures their estimation will start after 2 seconds.
> If estimators are added in bursts, common case if all
> services and dests are initially configured, we may
> spread the estimators to more chains. This will reduce
> the chain imbalance.
> - the chain imbalance is not so fatal when we use
> kthreads. We design each kthread for part of the
> possible CPU usage, so even if some chain exceeds its
> time slot it would happen all the time or sporadic
> depending on the scheduling but still keeping the
> 2-second interval. The cpulist isolation can make
> the things more stable as a 2-second time interval
> per estimator.

Unbalanced chains would not be fatal to the system but there is risk of 
introducing scheduling latencies tens or even hundreds of milliseconds long. 
There are patterns of adding and removing chains that would results in chain 
imbalance getting so severe that a handful of chains would have estimators in 
them while the rest would be empty or almost empty. Some examples:
1. Adding a new service each second after sleeping for 1 second. This would use 
the add_row value at the time of adding the estimator, which would result in 2 
chains holding all the estimators.
2. Repeated addition and removal of services. There would always be more 
services added than removed. The additions would be carried out in bursts. The 
forking of a new kthread would not be triggered because the number of services 
would stay under IPVS_EST_MAX_COUNT (32 * 1024).

The problem is that the code does not guarantee that the length of chains 
always stays under IPVS_EST_MAX_COUNT / IPVS_EST_NCHAINS (32 * 1024 / 50) 
estimators. A check and a cycle iterating over available rows could be added to 
ip_vs_start_estimator() to find the rows that still have fewer estimators than 
IPVS_EST_MAX_COUNT / IPVS_EST_NCHAINS. This would come at the expense of having 
inaccurate estimates for a few intervals but I think the trade-off is worth it. 
Also, the estimates will be inaccurate when estimators are added in bursts. If, 
depending on how services are added and removed, the code could introduce 
scheduling latencies there would be someone running into this sooner or later. 
The probability of severe chain imbalance being low is not good enough there 
should be a guarantee.
Jiri Wiesner

<Prev in Thread] Current Thread [Next in Thread>