Le dimanche 06 mars 2011 à 14:18 +0200, Julian Anastasov a écrit :
> Hello,
>
> On Sun, 6 Mar 2011, Eric Dumazet wrote:
>
> >> Zero the new percpu stats because we copy from there.
> >>
> >> Signed-off-by: Julian Anastasov <ja@xxxxxx>
> >> Signed-off-by: Simon Horman <horms@xxxxxxxxxxxx>
> >> ---
> >> net/netfilter/ipvs/ip_vs_ctl.c | 17 +++++++++++++++++
> >> 1 files changed, 17 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c
> >> b/net/netfilter/ipvs/ip_vs_ctl.c
> >> index a2a67ad..fd74527 100644
> >> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> >> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> >> @@ -715,8 +715,25 @@ static void ip_vs_trash_cleanup(struct net *net)
> >> static void
> >> ip_vs_zero_stats(struct ip_vs_stats *stats)
> >> {
> >> + struct ip_vs_cpu_stats *cpustats = stats->cpustats;
> >> + int i;
> >> +
> >> spin_lock_bh(&stats->lock);
> >>
> >> + for_each_possible_cpu(i) {
> >> + struct ip_vs_cpu_stats *u = per_cpu_ptr(cpustats, i);
> >> + unsigned int start;
> >> +
> >> + /* Do not pretend to be writer, it is enough to
> >> + * sync with writers that modify the u64 counters
> >> + * because under stats->lock we are the only reader.
> >> + */
> >> + do {
> >> + start = u64_stats_fetch_begin(&u->syncp);
> >> + memset(&u->ustats, 0, sizeof(u->ustats));
> >> + } while (u64_stats_fetch_retry(&u->syncp, start));
> >
> >
> > Sorry this makes no sense to me.
>
> Hm, yes, the comment is a little bit misleading.
> I fixed it below...
>
> > This code _is_ a writer, and hardly a hot path.
>
> Yes, the picture is as follows:
>
> - in 2.6.38-rc we remove the global spin lock (stats->lock)
> from packet processing which is a hot path, adding percpu
> counters instead
>
> - we need protection for percpu counters and for the sum
>
> - the chain is: interrupts increment percpu counters, the
> estimation timer reads them and creates sum every 2 seconds,
> then user context can read the sum or even to show the percpu
> counters, not to forget the zeroing of sum and counters
>
> The players in detail:
>
> - packet processing:
> - softirq context, hot path
> - increments counters by using u64_stats_update_begin and
> u64_stats_update_end, does not wait readers or zeroing
> - sum not touched, stats->lock usage removed in 2.6.38-rc
>
> - 2-second estimation timer:
> - funcs: estimation_timer()
> - timer context, softirq
> - reads percpu counters with u64_stats_fetch_begin and
> u64_stats_fetch_retry to sync with counter incrementing
> - uses spin_lock (stats->lock) to protect the written sum
> which is later read by user context: provides
> at least u64 atomicity but additionally the relation
> between packets and bytes
>
> - sum readers:
> - funcs: ip_vs_stats_show(), ip_vs_stats_percpu_show(),
> ip_vs_copy_stats(), ip_vs_genl_fill_stats()
> - user context, not a hot path
> - uses spin_lock_bh (stats->lock) for atomic reading of
> the sum created by estimation_timer()
>
> - show percpu counters:
> - funcs: ip_vs_stats_percpu_show()
> - user context, not a hot path
> - uses u64_stats_fetch_begin_bh and u64_stats_fetch_retry_bh
> to synchronize with counter incrementing
> - still missing: should use spin_lock_bh (stats->lock)
> to synchronize with ip_vs_zero_stats() that modifies
> percpu counters.
>
> - zero stats and percpu counters
> - funcs: ip_vs_zero_stats()
> - user context, not a hot path
> - uses spin_lock_bh (stats->lock) while modifying
> sum but also while zeroing percpu counters because
> we are a hidden writer which does not allow other
> percpu counter readers at the same time but we are
> still synchronized with percpu counter incrementing
> without delaying it
>
> To summarize, I see 2 solutions, in order of preference:
>
> 1. all players except packet processing should use stats->lock
> when reading/writing sum or when reading/zeroing percpu
> counters. Use u64_stats to avoid delays in incrementing.
>
> 2. Use seqlock instead of u64_stats if we want to treat the
> percpu counters zeroing as writer. This returns us before
> 2.6.38-rc where we used global stats->lock even for counter
> incrementing. Except that now we can use percpu seqlock
> just to register the zeroing as writer.
>
> > Why try to pretend its a reader and confuse people ?
> >
> > Either :
> >
> > - Another writer can modify the counters in same time, and we must
> > synchronize with them (we are a writer after all)
>
> Global mutex allows only one zeroing at a time.
> But zeroing runs in parallel with incrementing, so we
> have 2 writers for a per-CPU state. This sounds like
> above solution 2 with percpu seqlock? But it adds extra
> spin_lock in hot path, even if it is percpu. It only
> saves the spin_lock_bh while reading percpu counters in
> ip_vs_stats_percpu_show(). That is why a prefer solution 1.
>
> > - Another reader can read the counters in same time, and we must let
> > them catch we mihjt have cleared half of their values.
>
> Yes, zeroing can run in parallel with /proc reading,
> that is why I now try to serialize all readers with the
> stats spin lock to guarantee u64 atomicity.
>
> > - No reader or writer can access data, no synch is needed, a pure
> > memset() is OK.
>
> Packet processing can damage the counters while we
> do memset, so we need at least u64_stats_fetch_* to sync
> with incrementing.
>
OK I now understand what you wanted to do.
Problem is you do synchronize your memset() with a concurrent writer but
one way only. (You detect a writer did some changes on the counters
while you memset() them), but a writer has no way to detect your writes
(could be partially committed to main memory) : It could read a
corrupted value.
I feel memory barriers are wrong and not really fixable without slowing
down the hot path.
As implied in include/linux/u64_stats_sync.h file, a "writer" should be
alone :)
One other way to handle that (and let hot path packet processing without
extra locking) would be to never memset() this data, but use a separate
"summed" value as a relative point, and substract this sum to the
current one (all this in slow path, so not a problem)
--
To unsubscribe from this list: send the line "unsubscribe lvs-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
|