From cjohnson@camelot.com Sun Mar 19 22:31:51 2000
Return-Path:
Received: from galahad.camelot.com (Galahad.Camelot.com [192.55.203.135])
	by hub.freebsd.org (Postfix) with ESMTP id 0B1DD37B882
	for ; Sun, 19 Mar 2000 22:31:37 -0800 (PST)
	(envelope-from cjohnson@camelot.com)
Received: by galahad.camelot.com (Postfix, from userid 2455)
	id 3D74212C30; Mon, 20 Mar 2000 01:31:32 -0500 (EST)
Message-Id: <20000320063132.3D74212C30@galahad.camelot.com>
Date: Mon, 20 Mar 2000 01:31:32 -0500 (EST)
From: cjohnson@camelot.com
Reply-To: cjohnson@camelot.com
To: FreeBSD-gnats-submit@freebsd.org
Subject: Can't revive VINUM RAID5
X-Send-Pr-Version: 3.2

>Number:         17499
>Category:       kern
>Synopsis:       Can't revive VINUM RAID5
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    grog
>State:          closed
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun Mar 19 22:40:01 PST 2000
>Closed-Date:    Thu May 3 01:48:06 PDT 2001
>Last-Modified:  Thu May 03 01:50:35 PDT 2001
>Originator:     Christopher T. Johnson
>Release:        FreeBSD 4.0-RELEASE i386
>Organization:   Paladin Software
>Environment:
SMP FreeBSD 4.0-RELEASE
128 Mbytes of memory
Dual 233MMX Pentiums

su-2.03# camcontrol devlist
                  at scbus0 target 0 lun 0 (pass0,da0)
                  at scbus0 target 4 lun 0 (pass1,sa0)
                  at scbus0 target 5 lun 0 (pass2,cd0)
                  at scbus0 target 6 lun 0 (pass3,cd1)
                  at scbus1 target 0 lun 0 (pass4,da1)
                  at scbus1 target 1 lun 0 (pass5,da2)
                  at scbus1 target 2 lun 0 (pass6,da3)
                  at scbus1 target 3 lun 0 (pass7,da4)
                  at scbus1 target 5 lun 0 (pass8,da5)
                  at scbus1 target 6 lun 0 (pass9,da6)

su-2.03# vinum list
5 drives:
D drv3                State: up       Device /dev/da1s1e    Avail: 1372/4149 MB (33%)
D drv1                State: up       Device /dev/da3s1a    Avail: 0/2777 MB (0%)
D drv2                State: up       Device /dev/da4s1a    Avail: 0/2777 MB (0%)
D drv4                State: up       Device /dev/da5s1e    Avail: 0/2014 MB (0%)
D drv5                State: up       Device /dev/da6s1e    Avail: 0/2776 MB (0%)

2 volumes:
V quick               State: up       Plexes: 1             Size: 2013 MB
V myraid              State: up       Plexes: 1             Size: 8329 MB

2 plexes:
P quick.p0          C State: up       Subdisks: 1           Size: 2013 MB
P myraid.p0        R5 State: degraded Subdisks: 4           Size: 8329 MB

5 subdisks:
S quick.p0.s0         State: up       PO:   0  B            Size: 2013 MB
S myraid.p0.s0        State: up       PO:   0  B            Size: 2776 MB
S myraid.p0.s1        State: up       PO: 128 kB            Size: 2776 MB
S myraid.p0.s2        State: R 57%    PO: 256 kB            Size: 2776 MB
                      *** Revive process for myraid.p0.s2 has died ***
S myraid.p0.s3        State: up       PO: 384 kB            Size: 2776 MB

>Description:
After replacing a bad drive in my RAID5, I have been unable to get the
subdisk to revive.

After reinstalling the disk drive I used "fdisk -I da1" to put a slice map
on the drive, and when "disklabel -w -r da1s1 auto" failed I used sysinstall
to put a disklabel on the drive.  Then I used "disklabel -e da1s1" to add an
"e" partition of type vinum (da1s1e).

With that done I unmounted all my vinum volumes and did "vinum stop"
followed by "vinum start".  Once vinum discovered the new drive, I used
"vinum create" to set "drive name drv3 device /dev/da1s1e".  At this point
drv3 came up and I did "start myraid.p0.s2" to start the revive.

After about 30 minutes vinum reported:

	can't revive myraid.p0.s2: Invalid argument

When I tried to restart the revive with another "start myraid.p0.s2", the
start locked up trying to get a "vlock" or "vrlock" (according to ctrl-T).
I was unable to interrupt vinum with ctrl-C, to suspend it with ctrl-Z, or
to kill it with a SIGKILL.  After a "shutdown -r" plus pressing the reset
button, I did a low-level format on the drive and reinstalled the slice and
partition maps.
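For reference, the replace-and-revive sequence above boils down to roughly
the following (the disklabel itself was written with sysinstall after the
"auto" write failed and was then edited interactively; the mount points and
the config file name here are only illustrative, and the file contained just
the single drive line quoted above):

	# fdisk -I da1
	# disklabel -e da1s1              (add an "e" partition with fstype "vinum")
	# umount ...                      (every filesystem on a vinum volume)
	# vinum stop
	# vinum start
	# vinum create /tmp/drv3.conf     (one line: drive drv3 device /dev/da1s1e)
	# vinum start myraid.p0.s2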
Started vinum again and had the same failure mode.  This time a "vinum stop"
followed by "vinum start" got things running after the failure.  The drive
was then tested with:

	dd if=/dev/zero of=/dev/da1 bs=1024k

This generated NO errors.  I reinstalled the slice/partitions, restarted
vinum, and started the revive again.  Same failure at about the same point.
Tried again with a different block size and had the same failure at about
the same point.

After reading the source for a while I checked for debug support, and when I
found it was turned on I enabled all the debug flags except for "big drive"
and "drop into the debugger" ("debug 635(?)"), then ran
"start -w myraid.p0.s2".  After verifying that it was logging I aborted the
revive, set debug 0, and started reviving again.  When we were close to the
failure point I stopped the revive, set the debug flags again, and started
the revive once more.

This looks correct: dev 91,1 is /dev/vinum/myraid, the offset looks good,
and the length is correct as well.  The 13,131104 looks "ok" but I don't
know if it is "right".  Offset looks good, sd numbers are good.  We read
three "good" drives and write to the "reviving" drive.  Everything cool
here.

Mar 19 19:18:21 galahad /kernel: Read dev 91.1, offset 0x96c780, length 65536
Mar 19 19:18:21 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324280, devoffset 0x324389, length 65536
Mar 19 19:18:21 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324280, devoffset 0x324389, length 65536
Mar 19 19:18:21 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324280, devoffset 0x324389, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324280, devoffset 0x324389, length 65536

The next block looks good too.  The offset goes up by 64K, which is right.

Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ca00, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324300, devoffset 0x324409, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324300, devoffset 0x324409, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324300, devoffset 0x324409, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324300, devoffset 0x324409, length 65536

And again, everything looks good.

Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ca80, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324380, devoffset 0x324489, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324380, devoffset 0x324489, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324380, devoffset 0x324489, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324380, devoffset 0x324489, length 65536

And again.

Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ce00, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324400, devoffset 0x324509, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324400, devoffset 0x324509, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324400, devoffset 0x324509, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324400, devoffset 0x324509, length 65536

One more good block.
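Before the trace goes wrong below, a quick sanity check on "the offset goes
up by 64K", assuming the offsets in these messages are 512-byte sectors:

	65536 bytes / 512 bytes per sector = 0x80 sectors
	0x324280, 0x324300, 0x324380, 0x324400, 0x324480   (step = 0x80)

so each good request advances the subdisk offset by exactly one 64 kB revive
block.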
Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ce80, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324480, devoffset 0x324589, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324480, devoffset 0x324589, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324480, devoffset 0x324589, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324480, devoffset 0x324589, length 65536

Where is the "Request: .*" log message?  Where is the 91.1 read request?
0x96cf00 looks like the next block to fix in the PLEX, but why are we
reading it for the SUBDISK?  And where did this LENGTH come from!

Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x96cf00, devoffset 0x96d009, length 2146172928
Mar 19 19:18:22 galahad /kernel: Read dev 13.131084, sd 3, offset 0x96cf00, devoffset 0x96d009, length 2146172928
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x96cf00, devoffset 0x96d009, length 2146172928

Mar 19 19:17:53 galahad vinum[963]: reviving myraid.p0.s2
Mar 19 19:18:15 galahad vinum[963]: can't revive myraid.p0.s2: Invalid argument

>How-To-Repeat:
I don't know if this will repeat anywhere else, but I can offer ssh access
to a vouched-for player who wants to test on galahad.
>Fix:
>Release-Note:
>Audit-Trail:

Responsible-Changed-From-To: freebsd-bugs->grog
Responsible-Changed-By: sheldonh
Responsible-Changed-When: Mon Mar 20 02:15:31 PST 2000
Responsible-Changed-Why:
Over to the vinum maintainer.

From: "Christopher T. Johnson"
To: freebsd-gnats-submit@FreeBSD.org, cjohnson@camelot.com
Cc:
Subject: Re: kern/17499: Can't revive VINUM RAID5
Date: Sun, 02 Apr 2000 15:19:14 -0400

 Because of delays I tried something a bit different.  I dumped the disk to
 tape and then ran "init myraid.p0".  The initialize has run to completion
 with no errors.  I'll know whether the filesystem reloads in 10 or so
 hours....

	Chris

Greg Lehey, 3 May 2001:

 Multiple problems relating to RAID-5 revive have been fixed since this PR
 was entered.  No further problems are known.  My apologies for not checking
 this PR to find out which problem this was.  PR closed.

State-Changed-From-To: open->closed
State-Changed-By: grog
State-Changed-When: Thu May 3 01:48:06 PDT 2001
State-Changed-Why:
Problem fixed some time in the past.

http://www.freebsd.org/cgi/query-pr.cgi?pr=17499
>Unformatted: