From jdc@koitsu.dyndns.org Wed Sep 20 16:06:54 2006 Return-Path: Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A826216A416 for ; Wed, 20 Sep 2006 16:06:54 +0000 (UTC) (envelope-from jdc@koitsu.dyndns.org) Received: from sccrmhc11.comcast.net (sccrmhc11.comcast.net [204.127.200.81]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4769243D8C for ; Wed, 20 Sep 2006 16:06:53 +0000 (GMT) (envelope-from jdc@koitsu.dyndns.org) Received: from icarus.home.lan (c-24-6-182-130.hsd1.ca.comcast.net[24.6.182.130]) by comcast.net (sccrmhc11) with ESMTP id <2006092016065201100acn30e>; Wed, 20 Sep 2006 16:06:52 +0000 Received: by icarus.home.lan (Postfix, from userid 1000) id C79AC1FA035; Wed, 20 Sep 2006 09:06:51 -0700 (PDT) Message-Id: <20060920160651.C79AC1FA035@icarus.home.lan> Date: Wed, 20 Sep 2006 09:06:51 -0700 (PDT) From: Jeremy Chadwick Reply-To: Jeremy Chadwick To: FreeBSD-gnats-submit@freebsd.org Cc: Subject: Kernel appears somewhat deadlocked during heavy ATA I/O (post-August 4th) X-Send-Pr-Version: 3.113 X-GNATS-Notify: >Number: 103435 >Category: i386 >Synopsis: Kernel appears somewhat deadlocked during heavy ATA I/O (post-August 4th) >Confidential: no >Severity: serious >Priority: medium >Responsible: freebsd-i386 >State: closed >Quarter: >Keywords: >Date-Required: >Class: sw-bug >Submitter-Id: current-users >Arrival-Date: Wed Sep 20 16:10:30 GMT 2006 >Closed-Date: Sat Nov 11 09:19:22 GMT 2006 >Last-Modified: Sat Nov 11 09:19:22 GMT 2006 >Originator: Jeremy Chadwick >Release: FreeBSD 6.2-PRERELEASE i386 >Organization: Parodius Networking >Environment: System: FreeBSD icarus.home.lan 6.2-PRERELEASE FreeBSD 6.2-PRERELEASE #0: Mon Sep 18 03:38:31 PDT 2006 root@icarus.home.lan:/usr/obj/usr/src/sys/ICARUS i386 >Description: Sometime between August 4th and September 12th, someone changed something in the FreeBSD code which is breaking things badly. Particularly the following: ad12: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=41171803 ad12: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=51392291 ad12: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=31011999 em0: watchdog timeout -- resetting em0: link state changed to DOWN em0: link state changed to UP ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly ad12: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=84946719 The em0 timeouts happen at the same time (but not always!) as the ATA timeouts: Sep 20 08:47:42 icarus kernel: em0: watchdog timeout -- resetting Sep 20 08:47:42 icarus kernel: em0: link state changed to DOWN Sep 20 08:47:51 icarus kernel: em0: link state changed to UP Sep 20 08:47:51 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 08:47:51 icarus kernel: ad12: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=84946719 The hardware in this box hasn't changed -- but because of the ATA errors I was seeing, I decided to swap the disk out on ad12 with a completely different (and brand new) disk. It does the same thing you see above. I also have a disk on ad14 (SATA port #1), which can induce the same thing. The controller being used is an ICH5-based controller. There is no RAID being used (all pure JBOD). The motherboard is an Intel D865GLC, running the previous-to- latest BIOS. (The latest version only fixes some VGA adapter issues). Hyperthreading is enabled in the BIOS, but the kernel itself is NOT using SMP. (But DOES have the apic device enabled) As far as IRQs go, it looks as if the ICH5 and the em0 are sharing an IRQ. This is bizzare, as I would expect the APIC to pick separate IRQs for these devices: atapci2: port 0xe800-0xe807,0xe400-0xe403,0xe000-0xe007,0xdc00-0xdc03,0xd800-0xd80f irq 18 at device 31.2 on pci0 em0: port 0xac00-0xac1f mem 0xff800000-0xff81ffff irq 18 at device 1.0 on pci1 I've also built a kernel as of the 18th (you can see the above uname output), and it has the same problem. >How-To-Repeat: I can reproduce this problem easily: during heavy disk activity, the system will "stall" as if the kernel is spending too much time doing something (deadlocked). The best way I've found to do this is to pick a FreeBSD port that relies on a lot of dependancies and do a 'make clean' over and over: cd /usr/ports/databases/phpmyadmin make clean & ; make clean & ; make clean & {watch above problem occur} Control-C to intercept applications doesn't work when this is going on. >Fix: I haven't tried a different motherboard (I won't deny there's a chance the MB is going bad -- hardware goes bad all the time in this day and age), but I didn't have this problem until I built the September 12th kernel. I also have not tried booting without ACPI. >Release-Note: >Audit-Trail: From: Jeremy Chadwick To: FreeBSD-gnats-submit@FreeBSD.org, freebsd-i386@FreeBSD.org Cc: Subject: Re: i386/103435: Kernel appears somewhat deadlocked during heavy ATA I/O (post-August 4th) Date: Wed, 20 Sep 2006 09:32:07 -0700 I'd like to note that this may be similar to the issue reported last year, which only just got grabbed on the 11th: http://www.freebsd.org/cgi/query-pr.cgi?pr=i386/86364 Worth cross-referencing -- although in my case I'm using the native ICH5 SATA controller and not a Promise controller. (Though I do have a Promise controller in the box, it has no drives connected to it at all). -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB | From: Jeremy Chadwick To: FreeBSD-gnats-submit@FreeBSD.org, freebsd-i386@FreeBSD.org Cc: Subject: Re: i386/103435: Kernel appears somewhat deadlocked during heavy ATA I/O (post-August 4th) Date: Wed, 20 Sep 2006 14:57:01 -0700 Follow-up: The PR I referenced has indeed now happened to me. This ultimately resulted in a hard system lock -- then the kernel panic'd in some UFS2 code. When I came home from work, I was able to play around a bit on the console of the box. Naturally I did "panic" to make a memory dump -- but it couldn't. It tried to write 1MB of the 1014MB (or so), and simply sat there doing nothing; disk LED wasn't even lit. When I brought the machine back up, I saw the following: Sep 20 14:48:19 icarus kernel: Trying to mount root from ufs:/dev/ad12s1a Sep 20 14:48:19 icarus kernel: WARNING: / was not properly dismounted Sep 20 14:48:19 icarus kernel: WARNING: /var was not properly dismounted Sep 20 14:48:19 icarus kernel: /var: mount pending error: blocks 4 files 1 Sep 20 14:48:19 icarus kernel: WARNING: /tmp was not properly dismounted Sep 20 14:48:19 icarus kernel: WARNING: /usr was not properly dismounted Sep 20 14:48:19 icarus kernel: WARNING: /home was not properly dismounted Sep 20 14:48:19 icarus kernel: /home: mount pending error: blocks 4 files 1 Sep 20 14:48:19 icarus kernel: WARNING: /storage/ad14 was not properly dismounted Sep 20 14:48:19 icarus savecore: no dumps found I was able to go through /var/log/messages to find all of this -- none of which bodes well for 6.2-RELEASE, if you ask me. Has anyone contacted Soren and/or Kris about this? Sep 20 14:48:19 icarus syslogd: kernel boot file is /boot/kernel/kernel Sep 20 14:48:19 icarus kernel: anged to UP Sep 20 14:48:19 icarus kernel: g_vfs_done():ad12s1g[WRITE(offset=56459116544, length=16384)]error = 5 Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=287 Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=524639 Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=287 Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=524639 Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: FAILURE - WRITE_DMA timed out LBA=287 Sep 20 14:48:19 icarus kernel: g_vfs_done():ad12s1a[WRITE(offset=114688, length=16384)]error = 5 Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: FAILURE - WRITE_DMA timed out LBA=524639 Sep 20 14:48:19 icarus kernel: g_vfs_done():ad12s1a[WRITE(offset=268582912, length=16384)]error = 5 {...and so on...} Sep 20 14:48:19 icarus kernel: em0: watchdog timeout -- resetting Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=30998367 Sep 20 14:48:19 icarus kernel: em0: 2 link states coalesced Sep 20 14:48:19 icarus kernel: em0: link state changed to UP Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=17837855 Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: em0: watchdog timeout -- resetting Sep 20 14:48:19 icarus kernel: ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=30998367 Sep 20 14:48:19 icarus kernel: em0: link state changed to DOWN Sep 20 14:48:19 icarus kernel: em0: link state changed to UP Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: FAILURE - WRITE_DMA timed out LBA=17837855 Sep 20 14:48:19 icarus kernel: g_vfs_done():ad12s1d[WRITE(offset=6144000, length=2048)]error = 5 Sep 20 14:48:19 icarus kernel: em0: watchdog timeout -- resetting {...lots more...} Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: WARNING - SET_MULTI taskqueue timeout - completing request directly Sep 20 14:48:19 icarus kernel: ad12: FAILURE - WRITE_DMA timed out LBA=41160159 Sep 20 14:48:19 icarus kernel: g_vfs_done():ad12s1d[WRITE(offset=11947163648, length=16384)]error = 5 Sep 20 14:48:19 icarus kernel: panic: initiate_write_inodeblock_ufs2: already started Sep 20 14:48:19 icarus kernel: KDB: enter: panic Sep 20 14:48:19 icarus kernel: panic: from debugger Sep 20 14:48:19 icarus kernel: Uptime: 2d5h35m17s Sep 20 14:48:19 icarus kernel: Dumping 1014 MB (2 chunks) Sep 20 14:48:19 icarus kernel: chunk 0: 1MB (159 pages) (CTRL-C to abort) (CTRL-C to abort) (CTRL-C to abort) (CTRL-C to abort) (CTRL-C to abort) Sep 20 14:48:19 icarus kernel: Dump aborted Sep 20 14:48:19 icarus kernel: Automatic reboot in 15 seconds - press a key on the console to abort Sep 20 14:48:19 icarus kernel: --> Press a key on the console to reboot, Sep 20 14:48:19 icarus kernel: --> or switch off the system now. Sep 20 14:48:19 icarus kernel: Rebooting... -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB | From: Jeremy Chadwick To: FreeBSD-gnats-submit@FreeBSD.org, freebsd-i386@FreeBSD.org Cc: Subject: Re: i386/103435: Kernel appears somewhat deadlocked during heavy ATA I/O (post-August 4th) Date: Fri, 22 Sep 2006 17:22:01 -0700 Follow-up to this PR (requesting closure): The system this PR was based on was an Intel D865GLC with a P4 CPU; no issues up until my PR was filed. I am now using an Asus A8N-E board with an AMD X2 CPU, and am not experiencing any issues what-so-ever. I've managed to make -j2 buildkernel as well, which is a pretty good test. I'm left believing there's either some recent FreeBSD ATA code weirdness going on which affects the ICH5 on the Intel board (no idea) that isn't affecting the nVidia MCP9 SATA controller, or that the ICH5 on the Intel board has simply gone bad. Since I'm not the only one reporting this problem (but no one recently has reported it), I'm left to believe there's some general kernel oddness going on -- which I do not have the capability/knowledge to debug. If someone who is willing to work on this problem wishes to receive the board + CPU + RAM for it, free of charge (shipped within the USA only), I will gladly donate it to whomever wishes to debug the problem. Otherwise, please close this PR. -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB | From: Rich Wales To: bug-followup@FreeBSD.org Cc: freebsd@jdc.parodius.com Subject: Re: i386/103435: Kernel appears somewhat deadlocked during heavy ATA I/O (post-August 4th) Date: Sun, 01 Oct 2006 21:10:09 -0700 I'm seeing a similar problem on an old (800-MHz original "Slot A" Athlon) system with a Promise SATA300 TX4 card (PDC40718) and two Seagate 300GB SATA drives (ST3300831AS), running 6.1-RELEASE-p9. This experimental system locked up twice with the same kinds of messages that Jeremy Chadwick reported -- once when I did a "sync" after copying the /usr file system from an old PATA disk to one of the new SATA disks, and again as I was trying to synchronize the second SATA drive to the first via "gmirror". This is admittedly an old CPU, motherboard, and RAM, but the system has been running quite reliably until now. Rich Wales, Palo Alto, CA, USA, richw@richw.org, http://www.richw.org From: Rich Wales To: bug-followup@FreeBSD.org Cc: freebsd@jdc.parodius.com Subject: Re: i386/103435: Kernel appears somewhat deadlocked during heavy ATA I/O (post-August 4th) Date: Sun, 01 Oct 2006 22:01:50 -0700 Following up on my earlier comments: I seem to be able to reproduce the problem reliably by running two concurrent "dd" commands (copying data from each of my two SATA drives to /dev/null simultaneously), and waiting a few minutes. Reading off just one of the drives wasn't enough to generate any timeout warnings, but reading from both at the same time apparently did the trick. Rich Wales richw@richw.org http://www.richw.org From: Rich Wales To: Rich Wales Cc: bug-followup@FreeBSD.org, freebsd@jdc.parodius.com Subject: Re: i386/103435: Kernel appears somewhat deadlocked during heavy ATA I/O (post-August 4th) Date: Sat, 07 Oct 2006 18:11:06 -0700 Per a suggestion from Søren Schmidt, I rechecked my hardware and my BIOS settings. After a bunch of experimentation, I found that if I disabled PCI master burst mode in the BIOS setup, the timeout problems all went away. Disk I/O speed was -- not surprisingly -- noticeably slower with burst mode turned off, but the system was stable and reliable. I'm not sure if the primary culprit here is my Promise card or my motherboard, but it does suggest that some systems with iffy PCI bus tolerances might require suboptimal BIOS settings in order to work properly with some Promise controllers. As for whether any effective steps can be taken in the driver to make it more robust against this problem, I'll have to leave that to people who are more familiar with the hardware and the code. Rich Wales richw@richw.org http://www.richw.org State-Changed-From-To: open->closed State-Changed-By: remko State-Changed-When: Sat Nov 11 09:18:44 UTC 2006 State-Changed-Why: Close the PR as per request of the submitter, thanks for the large audit trail, if needed we can reopen this anduse your information which might be very valueable. Closed at: EuroBSDCon 2006 http://www.freebsd.org/cgi/query-pr.cgi?pr=103435 >Unformatted: