From nobody@FreeBSD.org Sat Sep 24 20:11:52 2011
Return-Path:
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 259DF1065670
	for ; Sat, 24 Sep 2011 20:11:52 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from red.freebsd.org (red.freebsd.org [IPv6:2001:4f8:fff6::22])
	by mx1.freebsd.org (Postfix) with ESMTP id 15AD58FC14
	for ; Sat, 24 Sep 2011 20:11:52 +0000 (UTC)
Received: from red.freebsd.org (localhost [127.0.0.1])
	by red.freebsd.org (8.14.4/8.14.4) with ESMTP id p8OKBpvE069059
	for ; Sat, 24 Sep 2011 20:11:51 GMT
	(envelope-from nobody@red.freebsd.org)
Received: (from nobody@localhost)
	by red.freebsd.org (8.14.4/8.14.4/Submit) id p8OKBpfo069058;
	Sat, 24 Sep 2011 20:11:51 GMT
	(envelope-from nobody)
Message-Id: <201109242011.p8OKBpfo069058@red.freebsd.org>
Date: Sat, 24 Sep 2011 20:11:51 GMT
From: Arnaud Lacombe
To: freebsd-gnats-submit@FreeBSD.org
Subject: buf_ring(9) statistics accounting not MPSAFE
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         160992
>Category:       kern
>Synopsis:       buf_ring(9) statistics accounting not MPSAFE
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Sep 24 20:20:10 UTC 2011
>Closed-Date:
>Last-Modified:  Wed Oct 05 05:47:58 UTC 2011
>Originator:     Arnaud Lacombe
>Release:        9-CURRENT
>Organization:
n/a
>Environment:
>Description:
The following block of code, in `sys/sys/buf_ring.h':

        /*
         * If there are other enqueues in progress
         * that preceeded us, we need to wait for them
         * to complete
         */
        while (br->br_prod_tail != prod_head)
                cpu_spinwait();
        br->br_prod_bufs++;
        br->br_prod_bytes += nbytes;
        br->br_prod_tail = prod_next;
        critical_exit();

can be seen at runtime, memory-wise as:

        while (br->br_prod_tail != prod_head)
                cpu_spinwait();
        br->br_prod_tail = prod_next;
        br->br_prod_bufs++;
        br->br_prod_bytes += nbytes;
        critical_exit();

That is, there is no memory barrier to enforce completion of the
load/increment/store/load/load/addition/store operations before
updating what other threads spin on.

Even if `br_prod_tail' is marked `volatile', there is no guarantee
that it will not be re-ordered wrt. non-volatile writes (to `br_prod_bufs'
and `br_prod_bytes').

Confirmed by Kip Macy (kmacy@) in
http://lists.freebsd.org/pipermail/freebsd-hackers/2011-September/036454.html.
>How-To-Repeat:
code review.
>Fix:
>Release-Note:
>Audit-Trail:

From: Bruce Evans
To: Arnaud Lacombe
Cc: freebsd-gnats-submit@freebsd.org, freebsd-bugs@freebsd.org
Subject: Re: kern/160992: buf_ring(9) statistics accounting not MPSAFE
Date: Sun, 25 Sep 2011 13:01:04 +1000 (EST)

On Sat, 24 Sep 2011, Arnaud Lacombe wrote:

>> Description:
> The following block of code, in `sys/sys/buf_ring.h':
>
>
>        /*
>         * If there are other enqueues in progress
>         * that preceeded us, we need to wait for them
>         * to complete
>         */
>        while (br->br_prod_tail != prod_head)
>                cpu_spinwait();
>        br->br_prod_bufs++;
>        br->br_prod_bytes += nbytes;
>        br->br_prod_tail = prod_next;
>        critical_exit();
>
>
> can be seen at runtime, memory-wise as:
>
>        while (br->br_prod_tail != prod_head)
>                cpu_spinwait();
>        br->br_prod_tail = prod_next;
>        br->br_prod_bufs++;
>        br->br_prod_bytes += nbytes;
>        critical_exit();
>
> That is, there is no memory barrier to enforce completion of the
> load/increment/store/load/load/addition/store operations before
> updating what other threads spin on.

The counters are 64 bits, so it also does non-atomic increments of them
on 32-bit arches.

> Even if `br_prod_tail' is marked `volatile', there is no guarantee
> that it will not be re-ordered wrt. non-volatile writes (to `br_prod_bufs'
> and `br_prod_bytes').

Using volatile is generally bogus. Here it seems to mainly give
pessimizations and more opportunities for bad memory orders.
The i386 code for incrementing a 64-bit volatile x is:

	movl	x, %eax
	movl	x+4, %edx
	addl	$1, %eax
	adcl	$0, %edx
	movl	%eax, x
	movl	%edx, x+4

while for a 64-bit non-volatile it is:

	addl	$1, x
	adcl	$0, x+4

so volatile gives more caching in registers instead of less.

The following are some of the bad memory orders possible:

with volatile:

	lo = br->br_prod_bytes.lo;
	hi = br->br_prod_bytes.hi;
	br->br_prod_tail = prod_next;
	br->br_prod_bufs++;
	lo += nbytes;
	hi += carry;
	br->br_prod_bytes.hi = hi;
	br->br_prod_bytes.lo = lo;

without volatile:

	br->br_prod_bytes.lo += nbytes;
	br->br_prod_tail = prod_next;
	br->br_prod_bufs++;
	br->br_prod_bytes.hi += carry;

I think the token method would make the nonatomic accesses to the
counters sufficiently atomic if it worked. The necessary memory
barriers would probably have a memory clobber which effectively makes
all memory variables transiently volatile (where volatile actually
means non-volatile with respect to them changing -- holding the token
prevents them changing -- but actually means volatile with respect to
their memory accesses) so the effect of declaring the counters
permanently volatile would be reduced to a pessimization. Even reads
of them in sysctls must hold the token to get a consistent snapshot.

Bruce
>Unformatted:
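
The ordering requirement discussed above can be illustrated with a
minimal userland sketch. It is not the committed fix and makes several
assumptions: it uses C11 <stdatomic.h> where the kernel would
presumably use an atomic(9) release store such as
atomic_store_rel_int(), and `struct sketch_ring' and
`sketch_enqueue_commit()' are hypothetical names that only mirror the
producer-side fields quoted in the report. The idea is that the tail
update becomes a release store, so the counter updates cannot be
reordered after it, and the spin becomes an acquire load, so a waiting
producer sees the previous holder's counter updates before making its
own.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the producer fields of struct buf_ring. */
struct sketch_ring {
	_Atomic uint32_t prod_tail;	/* what waiting producers spin on */
	uint64_t	 prod_bufs;	/* plain counters, guarded by the "token" */
	uint64_t	 prod_bytes;
};

/*
 * Finish an enqueue that reserved slots [prod_head, prod_next).
 * The acquire load pairs with the previous producer's release store,
 * so that producer's counter updates are visible before ours begin.
 * The release store keeps our counter updates from being reordered
 * after the tail update that hands the token to the next producer.
 */
static void
sketch_enqueue_commit(struct sketch_ring *br, uint32_t prod_head,
    uint32_t prod_next, uint64_t nbytes)
{
	assert(prod_next != prod_head);	/* a commit must advance the tail */

	while (atomic_load_explicit(&br->prod_tail,
	    memory_order_acquire) != prod_head)
		;	/* the kernel code calls cpu_spinwait() here */

	br->prod_bufs++;
	br->prod_bytes += nbytes;
	atomic_store_explicit(&br->prod_tail, prod_next,
	    memory_order_release);
}
```

This only orders producers against each other. A reader of the
counters (for example a sysctl handler) is not covered by the sketch
and would still need to be serialized against producers to get a
consistent snapshot, as noted in the follow-up above.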