Hello,
On Tue, 15 Nov 2022, Jiri Wiesner wrote:
> On Sat, Nov 12, 2022 at 06:01:36PM +0200, Julian Anastasov wrote:
> >
> > 	Some compilers for 64-bit can use two 32-bit loads (load
> > tearing), while we require an atomic 64-bit read in case the reader
> > is interrupted by the updater. That is why READ_ONCE is used now.
> > I don't know which compilers can do this. That is why I switched to
> > u64_stats_t, to correctly use the u64_stats interface. And our data
> > is 8-byte aligned. 32-bit builds are using seqcount_t, so they are
> > protected from this problem.
>
> All right. It is counter-intuitive, but I guess those compilers have
> their reasons. I think it would not hurt to expand the description of
> the patch with the explanation above.
OK. I relied on the mentioned commit to explain the problem
as other similar patches do.
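Just to illustrate the point (a minimal sketch with made-up names,
not code from the patchset or from u64_stats_sync.h):

#include <linux/compiler.h>	/* READ_ONCE() */

/* Hedged sketch, hypothetical read_counter(): why the 64-bit reader
 * needs READ_ONCE() (or u64_stats_read(), which uses local64_read()).
 */
static u64 read_counter(const u64 *counter)
{
	/*
	 * A plain "return *counter;" allows the compiler to emit two
	 * 32-bit loads (load tearing); if the updater runs between
	 * them, the reader sees a mix of old and new halves.
	 *
	 * READ_ONCE() forces a single access, and because the counter
	 * is naturally 8-byte aligned, the CPU performs it atomically.
	 */
	return READ_ONCE(*counter);
}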
> > If you think we are better with such change, we
> > can include it in patch 4. It will be better in case one day
> > we add more counters and there are more available registers.
> > You can notice the difference probably by comparing the
> > chain_max value in performance mode.
>
> I have tested the change to ip_vs_chain_estimation() and ran profiling
> with bus-cycles, which are proportional to the time spent executing
> code. The number of estimators per kthread was the same in both
> tests - 88800. It turns out the change made the code slower:
> > #        Util1          Util2                 Diff  Command     Shared Object      Symbol  CPU
> > 1,124,932,796  1,143,867,951  18,935,155 (  1.7%)  ipvs-e:0:0  [ip_vs]            all     all
	chain_max is probably 37 here, and 1.7% is low enough
(below 1/37 ~= 2.7%) not to be reflected in chain_max.
> >    16,480,274     16,853,752     373,478 (  2.3%)  ipvs-e:0:1  [ip_vs]            all     all
> >             0         21,987      21,987 (100.0%)  ipvs-e:0:0  [kvm]              all     all
> >     4,622,992      4,366,150    -256,842 (  5.6%)  ipvs-e:0:1  [kernel.kallsyms]  all     all
> >   180,226,857    164,665,271 -15,561,586 (  8.6%)  ipvs-e:0:0  [kernel.kallsyms]  all     all
> The most significant component was the time spent in ip_vs_chain_estimation():
> > #        Util1          Util2                 Diff  Command     Shared Object      Symbol                    CPU
> > 1,124,414,110  1,143,352,233  18,938,123 (  1.7%)  ipvs-e:0:0  [ip_vs]            ip_vs_chain_estimation    all
> >    16,291,044     16,574,498     283,454 (  1.7%)  ipvs-e:0:1  [ip_vs]            ip_vs_chain_estimation    all
> >       189,230        279,254      90,024 ( 47.6%)  ipvs-e:0:1  [ip_vs]            ip_vs_estimation_kthread  all
> >       518,686        515,718      -2,968 (  0.6%)  ipvs-e:0:0  [ip_vs]            ip_vs_estimation_kthread  all
> >     1,562,512      1,377,609    -184,903 ( 11.8%)  ipvs-e:0:0  [kernel.kallsyms]  kthread_should_stop       all
> >     2,435,803      2,138,404    -297,399 ( 12.2%)  ipvs-e:0:1  [kernel.kallsyms]  _find_next_bit            all
> >    16,304,577     15,786,553    -518,024 (  3.2%)  ipvs-e:0:0  [kernel.kallsyms]  _raw_spin_lock            all
> >   151,110,969    137,172,160 -13,938,809 (  9.2%)  ipvs-e:0:0  [kernel.kallsyms]  _find_next_bit            all
>
> The disassembly shows it is the mov instructions (there is a skid of 1
> instruction) that take the most time:
> > Percent | Source code & Disassembly of kcore for bus-cycles (55354 samples, percent: local period)
> >         : ffffffffc0bad980 <ip_vs_chain_estimation>:
> >             v6 patchset                              v6 patchset + ip_vs_chain_estimation() change
> > -------------------------------------------------------------------------------------------------------
> >  0.00 : add -0x72ead560(,%rdx,8),%rcx        0.00 : add -0x5ccad560(,%rdx,8),%rcx
> >  4.07 : mov (%rcx),%r11                      3.84 : mov (%rcx),%rdx
> > 34.28 : mov 0x8(%rcx),%r10                  34.17 : add %rdx,%r14
> > 10.35 : mov 0x10(%rcx),%r9                   2.62 : mov 0x8(%rcx),%rdx
> >  7.28 : mov 0x18(%rcx),%rdi                  9.33 : add %rdx,%r13
> >  7.70 : mov 0x20(%rcx),%rdx                  1.82 : mov 0x10(%rcx),%rdx
> >  6.80 : add %r11,%r14                        6.11 : add %rdx,%r12
> >  0.27 : add %r10,%r13                        1.26 : mov 0x18(%rcx),%rdx
> >  0.36 : add %r9,%r12                         6.00 : add %rdx,%rbp
> >  2.24 : add %rdi,%rbp                        2.97 : mov 0x20(%rcx),%rdx
	Bad, the same register RDX is reused. I guess this is a
CONFIG_GENERIC_CPU=y kernel, which is suitable for different CPUs;
my example was tested with CONFIG_MCORE2=y. But the optimizations are
the compiler's job, and there can be a cost to freeing registers for
our operations. Who knows, the change could be faster on other
platforms with fewer registers available to hold the 5 counters and
their sums. Probably, we can force it to add mem8 into reg by using
something like this:
/* 64-bit: accumulate the counter directly into *sum */
static inline void u64_stats_read_add(u64_stats_t *p, u64 *sum)
{
	/* local64_add_to() is assumed to forward to local_add_to():
	 * on 64-bit, u64_stats_t wraps local64_t, which wraps local_t
	 */
	local64_add_to(&p->v, (long *) sum);
}

/* x86: force an "add mem8 into reg" instead of a separate load */
static inline void local_add_to(local_t *l, long *sum)
{
	asm(_ASM_ADD "%1,%0"
	    : "+r" (*sum)
	    : "m" (l->a.counter));
}
But it deserves more testing.
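For context, with such a helper the per-CPU summing in
ip_vs_chain_estimation() could look roughly like this (just a sketch,
struct/field names approximated, not a tested change; the
u64_stats_fetch_begin/retry pair is omitted because it compiles away
on 64-bit):

	u64 conns = 0, inpkts = 0, outpkts = 0, inbytes = 0, outbytes = 0;
	int i;

	for_each_possible_cpu(i) {
		struct ip_vs_cpu_stats *c = per_cpu_ptr(s->cpustats, i);

		/* reg += mem for every counter, no temporary register */
		u64_stats_read_add(&c->cnt.conns, &conns);
		u64_stats_read_add(&c->cnt.inpkts, &inpkts);
		u64_stats_read_add(&c->cnt.outpkts, &outpkts);
		u64_stats_read_add(&c->cnt.inbytes, &inbytes);
		u64_stats_read_add(&c->cnt.outbytes, &outbytes);
	}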
> The results indicate that the extra add instructions should not matter -
> memory accesses (the mov instructions) are the bottleneck.
	Yep, then let's work without this change for now; no need to
deviate much from the general u64_stats usage if the difference is
negligible.
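I.e. the per-CPU reader stays with the usual pattern, roughly
(sketch only, approximating the v6 patchset code):

	struct ip_vs_cpu_stats *c = per_cpu_ptr(s->cpustats, i);
	unsigned int start;
	u64 conns, inpkts, outpkts, inbytes, outbytes;

	do {
		start = u64_stats_fetch_begin(&c->syncp);
		conns = u64_stats_read(&c->cnt.conns);
		inpkts = u64_stats_read(&c->cnt.inpkts);
		outpkts = u64_stats_read(&c->cnt.outpkts);
		inbytes = u64_stats_read(&c->cnt.inbytes);
		outbytes = u64_stats_read(&c->cnt.outbytes);
	} while (u64_stats_fetch_retry(&c->syncp, start));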
Regards
--
Julian Anastasov <ja@xxxxxx>