Hello,
On Fri, 9 Sep 2022, Jiri Wiesner wrote:
> On Fri, Sep 09, 2022 at 01:21:05AM +0300, Julian Anastasov wrote:
> > It is interesting to know what value for
> > IPVS_EST_TICK_CHAINS to use, it is used for the
> > IPVS_EST_MAX_COUNT calculation. We should determine
> > it from tests once the loops are in final form.
> > Now the limit increased a little bit to 38400.
> > Tomorrow I'll check again the patches for possible
> > problems.
>
> I couldn't wait so I have run tests on various machines and used the
> sched_switch tracepoint to measure the time needed to process one chain. The
> table contains a median time for processing one chain, the maximum time
> measured, the median divided by the number of CPUs and the time needed to
> process one chain if there were 1024 CPUs of that type in a machine:
> > NR CPU Time(ms) Max(ms) Time/CPU(ms) 1024
> > CPUs(ms)
> > 48 Intel Xeon CPU E52670 v3, 2 nodes 1.220 1.343 0.025
> > 26.027
> > 64 Intel Xeon Gold 6326, 2 nodes 0.920 1.494 0.014
> > 14.720
> > 192 Intel Xeon Gold 6330H, 4 nodes 3.957 4.153 0.021
> > 21.104
> > 256 AMD EPYC 7713, 2 NUMA nodes 3.927 5.464 0.015
> > 15.708
> > 80 ARM NeoverseN1, 1 NUMA node 1.833 2.502 0.023
> > 23.462
> > 128 ARM Kunpeng 920, 4 NUMA nodes 3.822 4.635 0.030
> > 30.576
> I have to admit I was hoping the current IPVS_EST_CHAIN_DEPTH would work on
> machines with more than 1024 CPUs. If the max time values are used the time
> needed to process one chain on a 1024 CPU machine gets even closer to 40 ms,
> which it must not reach lest the estimates become inaccurate. I also have
> profiling data so I intend to look at the disassembly of
> ip_vs_estimation_kthread() to see which instructions take the most time. I
> will take a look at the v2 of the code on Monday.
v2 uses find_next_bit in for_each_set_bit which has
cost. But we should not be surprised, if 268ms are for 50000
estimators on 104 CPUs (I guess this is also the number of
possible CPUs we actually use), one estimator reads from
104 CPUs for 5.36 microsecs, we can conclude for 1024 CPUs
the following:
Num Est 104 CPU 1024 CPU
========================================
1 5.36us 53us
4 21us 211us
16 86us 845us
The v2 algorithm allows IPVS_EST_CHAIN_DEPTH to
be changed to var which we can determine based on the CPU
count, more CPUs will need more threads and we have CPUs
for them:
kd>chain_depth = max(1800 / num_possible_cpus(), 2);
Goals:
 chain time: sub100 usec cond_resched rate
 tick time: 10% of max 40ms
CPUs Depth est_max_count Chain Time Tick Time
=========================================================
4 450 1080000 93us 4453us
16 112 268800 92us 4433us
104 17 40800 91us 4374us
1024 2 4800 106us 5066us
4096 2 4800 422us 20265us
Summary:
 For 4096 CPUs we can start 208 kthreads for 1000000 ests,
crazy :)
 4096 CPUs need to be fast to go below these 20ms or we
should use chain with 1 estimator for 2048+ CPUs
If we track somehow when a stats is updated,
may be we can skip estimators that are idle for
some time, this can save CPU cycles for estimating
unused dests.
Also, I'm investigating the idea to use
task_rlimit(current, RLIMIT_NPROC) as kthread limit when
first service is added and to save it into
ipvs>est_max_threads.
Another idea is ip_vs_estimation_kthread not to
change add_row but ip_vs_start_estimator to consider instead
est_row for the same purpose but only when kd>est_count
becomes large, say 2 * IPVS_EST_TICK_CHAINS * kd>chain_depth.
The idea is to fill 2 ticks completely when small number
of estimators are added and to prefer est_row when
we exceed this threshold and prefer to spread the
estimators to more ticks by honouring the 2second
initial timer.
For example:
if (kd>est_count >= 2 * IPVS_EST_TICK_CHAINS *
kd>chain_depth)
crow = READ_ONCE(kd>est_row);
else
crow = READ_ONCE(kd>add_row);
crow;
...
Regards

Julian Anastasov <ja@xxxxxx>
