From root@osprey.grizzly.com Sat Jan 4 08:44:52 1997 Received: from scruz.net (nic.scruz.net [165.227.1.2]) by freefall.freebsd.org (8.8.4/8.8.4) with ESMTP id IAA01315 for ; Sat, 4 Jan 1997 08:44:50 -0800 (PST) Received: from osprey.grizzly.com by scruz.net (8.7.3/1.34) id IAA05485; Sat, 4 Jan 1997 08:44:45 -0800 (PST) Received: (from root@localhost) by osprey.grizzly.com (8.8.4/8.7.3) id IAA00351; Sat, 4 Jan 1997 08:45:17 -0800 (PST) Message-Id: <199701041645.IAA00351@osprey.grizzly.com> Date: Sat, 4 Jan 1997 08:45:17 -0800 (PST) From: markd@grizzly.com Reply-To: markd@grizzly.com To: FreeBSD-gnats-submit@freebsd.org Subject: SCSI disk corruption X-Send-Pr-Version: 3.2 >Number: 2371 >Category: kern >Synopsis: SCSI disk corruption >Confidential: no >Severity: critical >Priority: high >Responsible: gibbs >State: closed >Quarter: >Keywords: >Date-Required: >Class: sw-bug >Submitter-Id: current-users >Arrival-Date: Sat Jan 4 08:50:01 PST 1997 >Closed-Date: Sun Mar 22 09:03:01 PST 1998 >Last-Modified: Sun Mar 22 09:04:47 PST 1998 >Originator: Mark Diekhans >Release: FreeBSD 2.2-BETA_A i386 >Organization: >Environment: FreeBSD 2.2-BETA_A #0: Sun Dec 29 20:13:01 PST 1996 markd@osprey.grizzly.com:/usr/src/sys/compile/OSPREY Calibrating clock(s) relative to mc146818A clock ... i586 clock: 166188714 Hz, i8254 clock: 1193152 Hz CPU: Pentium (166.19-MHz 586-class CPU) Origin = "GenuineIntel" Id = 0x52c Stepping=12 Features=0x1bf real memory = 67108864 (65536K bytes) avail memory = 64073728 (62572K bytes) Probing for devices on PCI bus 0: chip0 rev 1 on pci0:0 chip1 rev 0 on pci0:7:0 chip2 rev 0 on pci0:7:1 vga0 rev 1 int a irq 10 on pci0:9 ahc0 rev 0 int a irq 11 on pci0:12 ahc0: aic7870 Single Channel, SCSI Id=7, 16/255 SCBs ahc0 waiting for scsi devices to settle ahc0: target 0 Tagged Queuing Device (ahc0:0:0): "HP C3725S 6039" type 0 fixed SCSI 2 sd0(ahc0:0:0): Direct-Access 2047MB (4194058 512 byte sectors) ahc0: target 1 Tagged Queuing Device (ahc0:1:0): "SEAGATE ST32550N 0015" type 0 fixed SCSI 2 sd1(ahc0:1:0): Direct-Access 2047MB (4194058 512 byte sectors) ahc0:A:5: refuses syncronous negotiation. Using asyncronous transfers (ahc0:5:0): "IOMEGA ZIP 100 D.09" type 0 removable SCSI 2 sd2(ahc0:5:0): Direct-Access sd2(ahc0:5:0): ILLEGAL REQUEST asc:24,0 Invalid field in CDB sd2 could not mode sense (4). Using ficticious geometry sd2(ahc0:5:0): NOT READY asc:3a,0 Medium not present sd2: could not get size 0MB (0 512 byte sectors) Probing for devices on the ISA bus: sc0 at 0x60-0x6f irq 1 on motherboard sc0: VGA color <4 virtual consoles, flags=0x0> sio0 at 0x3f8-0x3ff irq 4 on isa sio0: type 16550A sio1 at 0x2f8-0x2ff irq 3 on isa sio1: type 16550A sio2 not found at 0x3e8 lpt0 at 0x378-0x37f irq 7 on isa lpt0: Interrupt-driven port lp0: TCP/IP capable interface psm0 at 0x60-0x64 irq 12 on motherboard psm0: device ID 0, 2 buttons? fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa fdc0: NEC 72065B fd0: 1.44MB 3.5in npx0 on motherboard npx0: INT 16 interface AHC options: controller ahc0 options AHC_TAGENABLE options AHC_SCBPAGING_ENABLE options AHC_ALLOW_MEMIO >Description: As with many reports of disk corruption, this might not be very useful due to lack of solid data... System was discovered crashed, on reboot, the second scsi disk was corrupt. Fsck reported a large number of errors and it was eventually run with -y. When remounted, all files where in lost+found. The last major activity before the crash was formatting of sd2, an IOMega Zip drive. SCSI format, partition and newfs had generated the following message: Jan 3 22:47:13 osprey /kernel: sd2(ahc0:5:0): UNIT ATTENTION asc:28,0 Jan 3 22:47:13 osprey /kernel: sd2(ahc0:5:0): Not ready to ready transition, medium may have changed Jan 3 22:47:13 osprey /kernel: Debugger("sdopen: no slices") called. Jan 3 22:47:13 osprey /kernel: sd2(ahc0:5:0): ILLEGAL REQUEST asc:24,0 Invalid field in CDB Jan 3 22:47:13 osprey /kernel: sd2 could not mode sense (4). Using ficticious geometry Jan 3 22:47:13 osprey /kernel: sd2: invalid primary partition table: no magic The system was alive for two more hours, but only doing routine tasks. The day befofe the following errors where reported in the log: Jan 2 02:34:51 osprey /kernel: st0(ahc0:2:0): timed out in command phase, SCSISIGI == 0x84 Jan 2 02:34:51 osprey /kernel: SEQADDR == 0x5a Jan 2 02:34:51 osprey /kernel: st0(ahc0:2:0): abort message in message buffer Jan 2 02:34:53 osprey /kernel: st0(ahc0:2:0): timed out in dataout phase, SCSISIGI == 0x0 Jan 2 02:34:53 osprey /kernel: SEQADDR == 0xe Jan 2 02:34:53 osprey /kernel: ahc0: Issued Channel A Bus Reset. 1 SCBs aborted Jan 2 02:34:53 osprey /kernel: st0(ahc0:2:0): Target Busy Jan 2 02:34:53 osprey /kernel: st0(ahc0:2:0): Target Busy Jan 2 02:34:53 osprey /kernel: sd0(ahc0:0:0): UNIT ATTENTION info?:4020040 asc:29,0 Jan 2 02:34:53 osprey /kernel: sd0(ahc0:0:0): Power on, reset, or bus device reset occurred Jan 2 02:34:53 osprey /kernel: , retries:4 Jan 2 02:34:53 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1 Jan 2 02:34:53 osprey /kernel: sd0(ahc0:0:0): Logical unit is in process of becoming ready Jan 2 02:34:53 osprey /kernel: , retries:3 Jan 2 02:34:53 osprey /kernel: sd0(ahc0:0:0): ABORTED COMMAND info?:4020040 asc:4e,0 Overlapped commands attempted Jan 2 02:34:53 osprey /kernel: , retries:2 Jan 2 02:34:53 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1 Jan 2 02:34:53 osprey /kernel: sd0(ahc0:0:0): Logical unit is in process of becoming ready Jan 2 02:34:53 osprey /kernel: , retries:1 Jan 2 02:35:03 osprey /kernel: sd0(ahc0:0:0): timed out in dataout phase, SCSISIGI == 0x0 Jan 2 02:35:03 osprey /kernel: SEQADDR == 0xb Jan 2 02:35:17 osprey /kernel: SEQADDR == 0xf Jan 2 02:35:17 osprey /kernel: ahc0: Issued Channel A Bus Reset. 1 SCBs aborted Jan 2 02:35:17 osprey /kernel: spec_getpages: I/O read error Jan 2 02:35:17 osprey /kernel: vm_fault: pager input (probably hardware) error, PID 4026 failure Jan 2 02:35:17 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1 Jan 2 02:35:17 osprey /kernel: sd0(ahc0:0:0): Logical unit is in process of becoming ready Jan 2 02:35:17 osprey /kernel: , retries:4 Jan 2 02:35:17 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1 Jan 2 02:35:17 osprey /kernel: sd0(ahc0:0:0): Logical unit is in process of becoming ready Jan 2 02:35:17 osprey /kernel: , retries:3 Jan 2 02:35:17 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1 Jan 2 02:35:17 osprey /kernel: sd0(ahc0:0:0): Logical unit is in process of becoming ready Jan 2 02:35:17 osprey /kernel: , retries:2 Jan 2 02:35:17 osprey /kernel: pid 4026 (bash), uid 0: exited on signal 11 (core dumped) Jan 2 02:35:17 osprey /kernel: sd0(ahc0:0:0): timed out in dataout phase, SCSISIGI == 0x0 Jan 2 02:35:17 osprey /kernel: SEQADDR == 0xf Jan 2 02:35:17 osprey /kernel: sd0(ahc0:0:0): timed out in dataout phase, SCSISIGI == 0x0 Jan 2 02:35:17 osprey /kernel: SEQADDR == 0xe Jan 2 02:35:17 osprey /kernel: ahc0: Issued Channel A Bus Reset. 1 SCBs aborted Jan 2 02:35:17 osprey /kernel: sd0(ahc0:0:0): UNIT ATTENTION info?:4020040 asc:29,0 Jan 2 02:35:17 osprey /kernel: sd0(ahc0:0:0): Power on, reset, or bus device reset occurred Jan 2 02:35:17 osprey /kernel: , retries:3 Jan 2 02:35:18 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1 Jan 2 02:35:18 osprey /kernel: sd0(ahc0:0:0): Logical unit is in process of becoming ready Jan 2 02:35:18 osprey /kernel: , retries:2 Jan 2 02:35:18 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1 Jan 2 02:35:18 osprey /kernel: sd0(ahc0:0:0): Logical unit is in process of becoming ready Jan 2 02:35:18 osprey /kernel: , retries:1 Jan 2 02:35:18 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1 Jan 2 02:35:18 osprey /kernel: sd0(ahc0:0:0): Logical unit is in process of becoming ready Jan 2 02:35:18 osprey /kernel: , FAILURE Jan 2 02:36:42 osprey /kernel: sd1(ahc0:1:0): UNIT ATTENTION asc:29,0 Jan 2 02:36:42 osprey /kernel: sd1(ahc0:1:0): Power on, reset, or bus device reset occurred field replaceable unit: 1 Jan 2 02:36:42 osprey /kernel: , retries:4 Note: AHC_TAGENABLE and AHC_SCBPAGING_ENABLE and AHC_ALLOW_MEMIO where enabled. Since I gave up driver writing before SCSI was invented, I can't very useful in the analysis. >How-To-Repeat: >Fix: >Release-Note: >Audit-Trail: From: j@uriah.heep.sax.de (J Wunsch) To: markd@grizzly.com Cc: FreeBSD-gnats-submit@freebsd.org Subject: Re: kern/2371: SCSI disk corruption Date: Sat, 4 Jan 1997 23:43:21 +0100 As markd@grizzly.com wrote: > AHC options: > controller ahc0 > options AHC_TAGENABLE > options AHC_SCBPAGING_ENABLE ^^^^^^^^^^^^^^^^^^^^ Turn this off, and see if your problem persists. I think SCB paging is not yet ready for prime-time. > options AHC_ALLOW_MEMIO I think problems with AHC_ALLOW_MEMIO are rather `binary' in nature: it either works well, or starts to go afoul very early in the game. Problems with tagged commands usually end up in ``Overlapped commands attempted'' or something like this. Eeek, no: > ahc0: target 0 Tagged Queuing Device > (ahc0:0:0): "HP C3725S 6039" type 0 fixed SCSI 2 > sd0(ahc0:0:0): Direct-Access 2047MB (4194058 512 byte sectors) This one is known to choke on tagged commands! > The last major activity before the crash was formatting of sd2, an IOMega > Zip drive. SCSI format, partition and newfs had generated the following > message: > > Jan 3 22:47:13 osprey /kernel: sd2(ahc0:5:0): UNIT ATTENTION asc:28,0 > Jan 3 22:47:13 osprey /kernel: sd2(ahc0:5:0): Not ready to ready transition, medium may have changed > Jan 3 22:47:13 osprey /kernel: Debugger("sdopen: no slices") called. > Jan 3 22:47:13 osprey /kernel: sd2(ahc0:5:0): ILLEGAL REQUEST asc:24,0 Invalid field in CDB > Jan 3 22:47:13 osprey /kernel: sd2 could not mode sense (4). Using ficticious geometry > Jan 3 22:47:13 osprey /kernel: sd2: invalid primary partition table: no magic That's something different, and basically a known problem. You might get around this by creating a `quirk' record for this device in /sys/scsi/scsiconf.c, and assign it to the `od' driver. This one has a better understanding of removable media devices. -- cheers, J"org joerg_wunsch@uriah.heep.sax.de -- http://www.sax.de/~joerg/ -- NIC: JW11-RIPE Never trust an operating system you don't have sources for. ;-) From: Mark Diekhans To: joerg_wunsch@uriah.heep.sax.de Cc: FreeBSD-gnats-submit@freebsd.org Subject: Re: kern/2371: SCSI disk corruption Date: Sat, 4 Jan 1997 20:19:39 -0800 (PST) >From: j@uriah.heep.sax.de (J Wunsch) Hey mate, thanks for all the info. >Eeek, no: > >> ahc0: target 0 Tagged Queuing Device >> (ahc0:0:0): "HP C3725S 6039" type 0 fixed SCSI 2 >> sd0(ahc0:0:0): Direct-Access 2047MB (4194058 512 byte sectors) > >This one is known to choke on tagged commands! Ack.... Perhaps there should be a warning in LINT about this. Not booting after adding a option is no big deal, but destorying a disk after many days of use is rather painful. Thanks, I owe you a beer, Mark State-Changed-From-To: open->closed State-Changed-By: joerg State-Changed-When: Sun Jan 5 17:13:54 MET 1997 State-Changed-Why: I've added a few warnings about the dangerousness of some SCSI-related options to the LINT file in the 2.2 release candidate. State-Changed-From-To: closed->open State-Changed-By: gibbs State-Changed-When: Sun Jan 5 15:55:45 PST 1997 State-Changed-Why: There are two problems here: 1) The Adaptec driver in current and in 2.2 has a few bugs in it that cause the controller to wedge. Although data is lost, data that successfully reaches the disk is just fine. 2) The aic7xxx driver uses simple queue tags when tagged queueing is turned on. This means that I/O transactions queued to the target can be re-ordered at the targets discretion (except that a read following a write must return the data from the last queued write transaction). Unfortunately, this can cause problems when syncronous write requests are issued. There is nothing preventing a syncronous write from returning prior to some of the transactions queued before it. I believe that this situation, when combined with a hang (meaning there are multiple outstanding transactions) is the cause of your "corruption" since meta-data state can be unstable. The first problem is being actively worked on by me, and should be addressed in the next two weeks. There is already a preliminary framework for fixing problem two in the 'SCSI' branch. The basic approach is to take syncronous requests and use ordered tagged queueing on them so that synchronous requests work as they are intended. Responsible-Changed-From-To: freebsd-bugs->gibbs Responsible-Changed-By: gibbs Responsible-Changed-When: Sun Jan 5 15:55:45 PST 1997 Responsible-Changed-Why: I will be addressing these issues. State-Changed-From-To: open->feedback State-Changed-By: gibbs State-Changed-When: Sat Apr 19 08:17:29 PDT 1997 State-Changed-Why: Please retest using the latest (as of 4/18/97) FreeBSD-2.2 or current driver and see if the problem persists. I believe it has been corrected. From: Mark Diekhans To: freebsd-gnats-submit@freebsd.org Cc: Subject: Re: kern/2371 SCSI disk corruption Date: Sun, 22 Mar 1998 01:29:02 -0800 (PST) This problem was caused by enabled the AHC_TAGENABLE. It turns out that I have a HP C3725S, which bogusly claims to support tagged cmds. This problem was addressed by adding a warning to LINT. The PR should probably be closed. State-Changed-From-To: feedback->closed State-Changed-By: steve State-Changed-When: Sun Mar 22 09:03:01 PST 1998 State-Changed-Why: Originator states this was caused by a drive that claimed to features which it did not and was corrected by a wardning added to LINT. >Unformatted: