Hi Yunhong & Julian, any updates?
We've encountered the same problem. With lots of ipvs
services plus many CPUs, it's easy to reproduce this issue.
I have a simple script to reproduce:
First, add many ipvs services by running this for increasing $i:
ipvsadm -A -t 10.10.10.10:$((2000+$i))
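The loop around that command is not shown above. A minimal sketch of it might look like the following; NUM_SERVICES, DRY_RUN, the function name add_services, and the starting index are assumptions for illustration, not part of the original script. It defaults to a dry run that only prints the commands.

```shell
# Hypothetical knobs, not from the original message.
NUM_SERVICES=${NUM_SERVICES:-50000}  # keep 2000 + i below 65536
DRY_RUN=${DRY_RUN:-1}                # 1: only print the commands

add_services() {
    i=1
    while [ "$i" -le "$NUM_SERVICES" ]; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "ipvsadm -A -t 10.10.10.10:$((2000 + i))"
        else
            # Needs root and the ipvsadm tool installed.
            ipvsadm -A -t "10.10.10.10:$((2000 + i))"
        fi
        i=$((i + 1))
    done
}
```

Running `DRY_RUN=0 add_services` as root would actually create the services; `ipvsadm -C` clears the whole table again afterwards.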
Then, check the latency of estimation_timer() using bpftrace:
kprobe:estimation_timer { @enter = nsecs; }
kretprobe:estimation_timer {
        $exit = nsecs;
        printf("latency: %ld us\n", ($exit - @enter)/1000);
}
I observed a delay of about 268 ms on my 104-CPU test server.
Attaching 2 probes...
latency: 268807 us
latency: 268519 us
latency: 269263 us
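To put these samples in perspective, here is a rough calculation of how much of one CPU the timer occupies. It assumes the three samples are representative and that the estimation runs once every 2 seconds; the variable names are only illustrative.

```shell
# Latency samples in microseconds, copied from the bpftrace output above.
samples="268807 268519 269263"
period_us=2000000   # assumption: one estimation pass every 2 s

sum=0
count=0
for s in $samples; do
    sum=$((sum + s))
    count=$((count + 1))
done

avg_us=$((sum / count))
# Share of one CPU in tenths of a percent, using integer arithmetic only.
busy_permille=$((avg_us * 1000 / period_us))
echo "average: ${avg_us} us, about $((busy_permille / 10)).$((busy_permille % 10))% of one CPU"
```

Under those assumptions the timer keeps one CPU busy for roughly 13% of the time, in uninterrupted bursts of about 269 ms.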
I also tried moving estimation_timer() into a delayed
workqueue, and this does make things better. But since the
estimation won't give up the CPU, it can run for a pretty
long time without scheduling on a server that doesn't have
preemption enabled, so other tasks on that CPU can't run
during that period.
Since the estimation repeats every 2s, we can't call
cond_resched() to give up the CPU in the middle of iterating
the est_list, or the estimation will be quite inaccurate.
Besides, the est_list needs to be protected.
I haven't found an ideal solution yet. Currently, we have just
moved the estimation into a kworker and added a sysctl to allow
us to disable the estimation, since we don't need the
Our patches are pretty simple now; if you think they're useful,
I can paste them.
Do you guys have any suggestions or solutions?
Thanks a lot!
On 4/18/20 12:56 AM, yunhong-cgl jiang wrote:
Thanks for the reply.
Yes, our patch changes the est_list to an RCU list. We will do more testing
and send out the patch.
On Apr 17, 2020, at 12:47 AM, Julian Anastasov <ja@xxxxxx> wrote:
On Thu, 16 Apr 2020, yunhong-cgl jiang wrote:
Hi, Simon & Julian,
We noticed that on our Kubernetes nodes using IPVS, estimation_timer()
takes very long (>200ms, as shown below). Such a long delay in the timer
softirq causes long packet latency.
<idle>-0  dNH. 25652945.670814: softirq_raise: vec=1
<idle>-0  .Ns. 25652945.992273: softirq_exit: vec=1
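The time between the two trace events can be worked out from their timestamps. This is a small sketch, assuming the timestamps are in seconds and that the two lines belong to the same timer softirq (vec=1) run; the variable names are illustrative.

```shell
# Timestamps copied from the softirq_raise and softirq_exit lines above.
raise_ts=25652945.670814
exit_ts=25652945.992273

# awk does the floating-point subtraction; shell arithmetic is integer-only.
softirq_ms=$(awk -v a="$raise_ts" -v b="$exit_ts" \
    'BEGIN { printf "%.0f", (b - a) * 1000 }')
echo "raise to exit: about ${softirq_ms} ms"
```

That gives about 321 ms from raise to exit, consistent with the ">200ms" figure quoted above.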
The long latency is caused by the large number of services (>50k) and the
large number of CPUs (>80).
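A back-of-the-envelope count shows why these two numbers multiply. The sketch below assumes that each estimation pass sums the per-CPU counters of every estimator, that each service contributes one estimator (destinations are ignored), and that a pass runs every 2 seconds; the figures are the lower bounds from this message.

```shell
# Lower bounds taken from the message (">50k" services, ">80" CPUs).
services=50000
cpus=80
period_s=2   # assumption: the estimation timer fires every 2 seconds

reads_per_pass=$((services * cpus))
reads_per_sec=$((reads_per_pass / period_s))
echo "at least ${reads_per_pass} per-CPU counter reads per pass (${reads_per_sec}/s)"
```

With these inputs, a single pass touches at least 4 million per-CPU counter sets, all within one softirq invocation.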
We tried moving the timer function into a kernel thread so that it
will not block the system, and this seems to solve our problem. Is this the
right direction? If yes, we will do more testing and send out the RFC patch.
If not, can you give us some suggestions?
Using a kernel thread is a good idea. For this to work, we can
also remove the est_lock and use RCU for est_list.
The writers ip_vs_start_estimator() and ip_vs_stop_estimator() already
run under the common mutex __ip_vs_mutex, so they do not need any
additional synchronization. We need _bh lock usage in estimation_timer().
Let me know if you need any help with the patch.
Julian Anastasov <ja@xxxxxx>