Hello,
On Sat, 12 Nov 2022, Jiri Wiesner wrote:
> On Sat, Nov 12, 2022 at 10:00:01AM +0100, Jiri Wiesner wrote:
> > On Mon, Oct 31, 2022 at 04:56:43PM +0200, Julian Anastasov wrote:
> > > Use the provided u64_stats_t type to avoid
> > > load/store tearing.
>
> I think only 32-bit machines have a problem storing a 64-bit counter
> atomically. u64_stats_fetch_begin() and u64_stats_fetch_retry() should take
> care of that.
Some compilers targeting 64-bit can split a 64-bit load into two
32-bit loads (load tearing), while we require an atomic 64-bit read
in case the reader is interrupted by the updater. That is why
READ_ONCE() is used now. I don't know which compilers do this, which
is why I switched to u64_stats_t, to use the u64_stats interface
correctly. And our data is 8-byte aligned. 32-bit builds use
seqcount_t, so they are protected against this problem.
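To illustrate the tearing concern, here is a standalone sketch
(plain C, not kernel code; READ_ONCE_U64() here just stands in for
the kernel's READ_ONCE()):

/* Standalone illustration, not kernel code: a plain 64-bit load may
 * legally be split ("torn") by the compiler, a volatile 8-byte access
 * may not.  READ_ONCE_U64() mimics the kernel's READ_ONCE() here.
 */
#include <stdint.h>

#define READ_ONCE_U64(x)	(*(const volatile uint64_t *)&(x))

uint64_t plain_read(const uint64_t *counter)
{
	return *counter;		/* may become two 32-bit loads */
}

uint64_t tear_free_read(const uint64_t *counter)
{
	return READ_ONCE_U64(*counter);	/* emitted as one 64-bit load */
}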
> > Converting the per cpu stat to u64_stats_t means that the compiler cannot
> > optimize the memory access and addition on x86_64. Previously, the
> > summation of per cpu counters in ip_vs_chain_estimation() looked like:
> > 15b65: add (%rsi),%r14
> > 15b68: add 0x8(%rsi),%r15
> > 15b6c: add 0x10(%rsi),%r13
> > 15b70: add 0x18(%rsi),%r12
> > 15b74: add 0x20(%rsi),%rbp
> > The u64_stats_read() calls in ip_vs_chain_estimation() turned it into:
> > 159d5: mov (%rcx),%r11
> > 159d8: mov 0x8(%rcx),%r10
> > 159dc: mov 0x10(%rcx),%r9
> > 159e0: mov 0x18(%rcx),%rdi
> > 159e4: mov 0x20(%rcx),%rdx
> > 159e8: add %r11,%r14
> > 159eb: add %r10,%r13
> > 159ee: add %r9,%r12
> > 159f1: add %rdi,%rbp
> > 159f4: add %rdx,%rbx
> > I guess that is not a big deal because the mov should be the instruction
> > taking the most time on account of accessing per cpu regions of other CPUs.
> > The add will be fast.
>
> Another concern is the number of registers needed for the summation.
> Previously, 5 registers were needed. Now, 10 registers are needed. This would
> matter mostly if the size of the stack frame of ip_vs_chain_estimation() and
> the number of callee-saved registers stored to the stack changed, but
> introducing u64_stats_read() in ip_vs_chain_estimation() did not change that.
It probably happens this way because the compiler must not reorder
two or more successive instances of READ_ONCE() (rwonce.h), which
hurts u64_stats_read(). As a result, up to 5 times the temporary
storage (registers or stack slots) is needed.
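For reference, a standalone sketch (plain C, not the actual IPVS
code) of the pattern the compiler sees here: five volatile reads
that must keep their relative order, with the additions that, in the
build you quoted, were all scheduled after the loads:

/* Standalone sketch, not the IPVS code itself: five READ_ONCE()-style
 * volatile reads that keep their relative order, then five adds.
 */
#include <stdint.h>

#define READ_ONCE_U64(x)	(*(const volatile uint64_t *)&(x))

struct cpu_cnt {
	uint64_t conns, inpkts, outpkts, inbytes, outbytes;
};

void sum_cpu(const struct cpu_cnt *c, uint64_t sum[5])
{
	/* volatile accesses may not be reordered against each other */
	uint64_t conns    = READ_ONCE_U64(c->conns);
	uint64_t inpkts   = READ_ONCE_U64(c->inpkts);
	uint64_t outpkts  = READ_ONCE_U64(c->outpkts);
	uint64_t inbytes  = READ_ONCE_U64(c->inbytes);
	uint64_t outbytes = READ_ONCE_U64(c->outbytes);

	/* in the disassembly quoted above the adds come after all the
	 * loads, which is what consumes the extra temporaries
	 */
	sum[0] += conns;
	sum[1] += inpkts;
	sum[2] += outpkts;
	sum[3] += inbytes;
	sum[4] += outbytes;
}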
Another option would be to use something like the change below,
to be tried on top of v6. It is crude to add ifdefs here, but we
should avoid unneeded code on 64-bit. On 32-bit the code is larger
but the operations are the same. On 64-bit it should issue the
read+add pairs one by one, which can run in parallel. In my compile
tests it emits movq (READ_ONCE) + addq (into the sum) three times
with the same temporary register, which defeats fully parallel
operation, while for the last 2 counters it uses movq, movq, addq,
addq with different registers, which can run in parallel. It seems
there are not enough free registers to do this for all 5 counters.
With this patch the benefit is that the compiler does not have
to hold all 5 values before adding them, but on x86 the patch shows
a problem: it does not allocate 5 fresh temporary registers to read
and add the values in parallel. Without this patch, as you showed,
it somehow allocates 5+5 registers, which allows the operations to
run in parallel.
If you think we are better off with such a change, we can include
it in patch 4. It would pay off if one day we add more counters and
more registers are available. You can probably notice the difference
by comparing the chain_max value in performance mode.
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index 4489a3dbad1e..52d77715f7ea 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -83,7 +83,9 @@ static void ip_vs_chain_estimation(struct hlist_head *chain)
 	u64 rate;
 
 	hlist_for_each_entry_rcu(e, chain, list) {
+#if BITS_PER_LONG == 32
 		u64 conns, inpkts, outpkts, inbytes, outbytes;
+#endif
 		u64 kconns = 0, kinpkts = 0, koutpkts = 0;
 		u64 kinbytes = 0, koutbytes = 0;
 		unsigned int start;
@@ -95,19 +97,30 @@ static void ip_vs_chain_estimation(struct hlist_head *chain)
 		s = container_of(e, struct ip_vs_stats, est);
 		for_each_possible_cpu(i) {
 			c = per_cpu_ptr(s->cpustats, i);
-			do {
-				start = u64_stats_fetch_begin(&c->syncp);
-				conns = u64_stats_read(&c->cnt.conns);
-				inpkts = u64_stats_read(&c->cnt.inpkts);
-				outpkts = u64_stats_read(&c->cnt.outpkts);
-				inbytes = u64_stats_read(&c->cnt.inbytes);
-				outbytes = u64_stats_read(&c->cnt.outbytes);
-			} while (u64_stats_fetch_retry(&c->syncp, start));
-			kconns += conns;
-			kinpkts += inpkts;
-			koutpkts += outpkts;
-			kinbytes += inbytes;
-			koutbytes += outbytes;
+#if BITS_PER_LONG == 32
+			conns = kconns;
+			inpkts = kinpkts;
+			outpkts = koutpkts;
+			inbytes = kinbytes;
+			outbytes = koutbytes;
+#endif
+			for (;;) {
+				start = u64_stats_fetch_begin(&c->syncp);
+				kconns += u64_stats_read(&c->cnt.conns);
+				kinpkts += u64_stats_read(&c->cnt.inpkts);
+				koutpkts += u64_stats_read(&c->cnt.outpkts);
+				kinbytes += u64_stats_read(&c->cnt.inbytes);
+				koutbytes += u64_stats_read(&c->cnt.outbytes);
+				if (!u64_stats_fetch_retry(&c->syncp, start))
+					break;
+#if BITS_PER_LONG == 32
+				kconns = conns;
+				kinpkts = inpkts;
+				koutpkts = outpkts;
+				kinbytes = inbytes;
+				koutbytes = outbytes;
+#endif
+			}
 		}
 
 		spin_lock(&s->lock);
Regards
--
Julian Anastasov <ja@xxxxxx>