From mi@aldan.algebra.com Thu Dec 14 21:46:42 2006 Return-Path: Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A8A9916A407 for ; Thu, 14 Dec 2006 21:46:42 +0000 (UTC) (envelope-from mi@aldan.algebra.com) Received: from aldan.algebra.com (aldan.algebra.com [216.254.65.224]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5E4C043D5E for ; Thu, 14 Dec 2006 21:44:55 +0000 (GMT) (envelope-from mi@aldan.algebra.com) Received: from aldan.algebra.com (aldan [127.0.0.1]) by aldan.algebra.com (8.13.8/8.13.7) with ESMTP id kBELkUeK024276 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Thu, 14 Dec 2006 16:46:31 -0500 (EST) (envelope-from mi@aldan.algebra.com) Received: (from mi@localhost) by aldan.algebra.com (8.13.8/8.13.7/Submit) id kBELkUxO024275; Thu, 14 Dec 2006 16:46:30 -0500 (EST) (envelope-from mi) Message-Id: <200612142146.kBELkUxO024275@aldan.algebra.com> Date: Thu, 14 Dec 2006 16:46:30 -0500 (EST) From: "Mikhail T." To: FreeBSD-gnats-submit@freebsd.org Subject: SSE2 optimization for bzip2/libbz2 X-Send-Pr-Version: 3.113 X-GNATS-Notify: >Number: 106734 >Category: bin >Synopsis: [patch] [request] bzip2(1): SSE2 optimization for bzip2/libbz2 >Confidential: no >Severity: non-critical >Priority: medium >Responsible: freebsd-bugs >State: open >Quarter: >Keywords: >Date-Required: >Class: change-request >Submitter-Id: current-users >Arrival-Date: Thu Dec 14 21:50:06 GMT 2006 >Closed-Date: >Last-Modified: Sun Jan 27 10:00:28 UTC 2008 >Originator: Mikhail T. >Release: FreeBSD 6.2-PRERELEASE amd64 >Organization: Virtual Estates, Inc. >Environment: Intel's and AMD chips with SSE2 instructions. >Description: The patch below makes bzip2's blocksort routines use SSE2-registers to compare 16 bytes at a time. On both i386 and AMD chips I tested, the performance improvement ranges from 5% for the already compressed (.gz) files to 20% for the highly compressible system logs. The compressed files are byte-for-byte identical with those produced by the original bzip2. The changes are ifdef-ed by __SSE2__ and relies on the intrinsics available in GNU, Intel's, and Microsoft's compilers. No changes to Makefile(s) are necessary -- when targeting an SSE2-capable CPU (i.e. ``-march=opteron'' or ``-march=pentium4''), the __SSE2__ is set by the compiler. >How-To-Repeat: >Fix: The patch is available from http://aldan.algebra.com/~mi/bz/ The patch is not FreeBSD-specific, but was developed, tested, and timed on FreeBSD-6.x using both i386 and amd64. Feedback welcome. >Release-Note: >Audit-Trail: From: Julian Seward To: bug-followup@freebsd.org, mi@aldan.algebra.com Cc: Subject: Re: bin/106734: [patch] SSE2 optimization for bzip2/libbz2 Date: Sat, 6 Jan 2007 08:16:28 +0000 --Boundary-00=_dr1nFvPs1zvnDlW Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Content-Disposition: inline According to Valgrind on amd64-linux, this patch causes bzip2 to do comparisons based on uninitialised memory when compressing, for the attached file (PLIST). ==16140== Conditional jump or move depends on uninitialised value(s) ==16140== at 0x414ECF: mainGtU (blocksort.c:538) ==16140== by 0x414B08: mainSimpleSort (blocksort.c:690) ==16140== by 0x4150E1: mainQSort3 (blocksort.c:805) ==16140== by 0x415FAD: mainSort (blocksort.c:1063) ==16140== by 0x4163F8: BZ2_blockSort (blocksort.c:1244) ==16140== by 0x40DE46: BZ2_compressBlock (compress.c:660) ==16140== by 0x4054B6: handle_compress (bzlib.c:431) ==16140== by 0x40571A: BZ2_bzCompress (bzlib.c:501) ==16140== by 0x407B0D: BZ2_bzWriteClose64 (bzlib.c:1093) ==16140== by 0x4014E1: compressStream (bzip2.c:451) ==16140== by 0x402E20: compress (bzip2.c:1368) ==16140== by 0x40455C: main (bzip2.c:2034) --Boundary-00=_dr1nFvPs1zvnDlW Content-Type: text/plain; charset="us-ascii"; name="PLIST" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="PLIST" share/sgml/docbook/dsssl/modular/README share/sgml/docbook/dsssl/modular/VERSION share/sgml/docbook/dsssl/modular/BUGS share/sgml/docbook/dsssl/modular/TODO [... most of the attached PLIST cleaned-out for brevety -mi ...] @unexec rmdir %D/share/sgml/docbook/dsssl 2>/dev/null || true --Boundary-00=_dr1nFvPs1zvnDlW-- From: Mikhail Teterin To: jseward@acm.org (Julian Seward) Cc: bug-followup@freebsd.org Subject: Re: bin/106734: [patch] SSE2 optimization for bzip2/libbz2 Date: Sat, 6 Jan 2007 21:37:20 -0500 (EST) > According to Valgrind on amd64-linux, this patch causes bzip2 to > do comparisons based on uninitialised memory when compressing, for the > attached file (PLIST). > > ==16140== Conditional jump or move depends on uninitialised value(s) > ==16140== at 0x414ECF: mainGtU (blocksort.c:538) > ==16140== by 0x414B08: mainSimpleSort (blocksort.c:690) > ==16140== by 0x4150E1: mainQSort3 (blocksort.c:805) Julian! You informed me of this issue in a direct e-mail from Dec 19. I responded the next day on Dec 20: ------------------------------------------------------------ = From my 5-minute investigation, I *think* this is because the loads = /* Load the bytes: */ = n1 = (__m128i)_mm_loadu_pd((double *)(block + i1)); = n2 = (__m128i)_mm_loadu_pd((double *)(block + i2)); I doubt it -- Valgrind's diagnostics says "Conditional jump or move depends on uninitialised value(s)". There are no jumps in the above code, there are two machine instructions (executed in parallel). ------------------------------------------------------------ Have you not received the e-mail with above text? (Message-Id: <200612200615.22819@Misha>). Yours, -mi From: Julian Seward To: Mikhail Teterin Cc: bug-followup@freebsd.org Subject: Re: bin/106734: [patch] SSE2 optimization for bzip2/libbz2 Date: Sun, 7 Jan 2007 05:08:43 +0000 I believe this analysis is correct: > /* Load the bytes: */ > n1 = (__m128i)_mm_loadu_pd((double *)(block + i1)); > n2 = (__m128i)_mm_loadu_pd((double *)(block + i2)); > > read beyond the end of the defined area of block. block is > defined for [0 .. nblock + BZ_N_OVERSHOOT - 1], but I think > you are doing a SSE load at &block[nblock + BZ_N_OVERSHOOT - 2], > hence loading 15 bytes of garbage. Valgrind doesn't complain about the out-of-range access, because you are still accessing inside a valid malloc-allocated block. But it does know that the read data is uninitialised, hence it complains when you do a comparison with that data followed by a conditional branch (or move) based on the result of the comparison. > This is possible... You think, the loop should exit earlier and test > the last (up to) 15 bytes one-by-one? Certainly the loop-end stuff needs to be fixed up somehow to reflect the 16 byte loads, but without further investigation I'm not sure how. From: Mikhail Teterin To: Julian Seward Cc: bug-followup@freebsd.org Subject: Re: bin/106734: [patch] SSE2 optimization for bzip2/libbz2 Date: Tue, 9 Jan 2007 18:34:36 -0500 On Sunday 07 January 2007 00:08, Julian Seward wrote: = > /* Load the bytes: */ = > n1 = (__m128i)_mm_loadu_pd((double *)(block + i1)); = > n2 = (__m128i)_mm_loadu_pd((double *)(block + i2)); = > read beyond the end of the defined area of block. block is = > defined for [0 .. nblock + BZ_N_OVERSHOOT - 1], but I think = > you are doing a SSE load at &block[nblock + BZ_N_OVERSHOOT - 2], = > hence loading 15 bytes of garbage. I don't think, that's quite right... Instead of processing 8 bytes at a time, as the non-SSE code is doing, I'm comparing 16 at a time. Thus it is possible for me to be over by exactly 8 sometimes... Anyway, the problem was stemming from my bumping i1 and i2 by 16 instead of 8 after the _initial check_ (which, in the quadrant-less case should not need to be separate at all, actually). Sometimes _that_ would bring them over... I think, the solution is to either bump up BZ_N_OVERSHOOT even further or check and adjust i1 and i2: if (i1 >= nblock) i1 -= nblock; if (i2 >= nblock) i2 -= nblock; at the beginning, rather than the end of the loop. Having done that, I no longer peek beyond the end of the block (according to gdb's conditional breakpoints, at least). Please, check the new http://aldan.algebra.com/~mi/bz/blocksort-SSE2-patch-2 Yours, -mi P.S. The following gdb-script is what I used. Run as: gdb -x x.txt bzip2 x.txt: break blocksort.c:516 cond 1 (i1 > nblock) || (i2 > nblock) run -9 < /tmp/PLIST > /dev/null andjust the compression level, the input's location, and be sure to have blocksort.o compiled with debug information, of course... >Unformatted: