From nobody@FreeBSD.org Fri Jun 22 21:05:16 2007
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id B194C16A468;
	Fri, 22 Jun 2007 21:05:16 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [69.147.83.33])
	by mx1.freebsd.org (Postfix) with ESMTP id A4C8E13C44B;
	Fri, 22 Jun 2007 21:05:16 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.13.1/8.13.1) with ESMTP id l5ML5GUX076353;
	Fri, 22 Jun 2007 21:05:16 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.13.1/8.13.1/Submit) id l5ML5GP9076352;
	Fri, 22 Jun 2007 21:05:16 GMT
	(envelope-from nobody)
Message-Id: <200706222105.l5ML5GP9076352@www.freebsd.org>
Date: Fri, 22 Jun 2007 21:05:16 GMT
From: Andrew Yochum
To: freebsd-gnats-submit@FreeBSD.org
Subject: gmirror is intermittently reporting a degraded mirror array upon reboot.
X-Send-Pr-Version: www-3.0

>Number:         113957
>Category:       kern
>Synopsis:       [gmirror] gmirror is intermittently reporting a degraded mirror array upon reboot.
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    freebsd-geom
>State:          feedback
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Jun 22 21:10:02 GMT 2007
>Closed-Date:
>Last-Modified:  Tue Mar 23 19:20:01 UTC 2010
>Originator:     Andrew Yochum
>Release:        6.2-RELEASE-p1
>Organization:
pair Networks
>Environment:
FreeBSD 6.2-RELEASE-p1 FreeBSD 6.2-RELEASE-p1 #1: Mon May 7 14:30:09 EDT 2007 support@pair.com:/usr/obj/usr/src/sys/6PAIRc-SMP i386
>Description:
Overview:

It has been observed that gmirror intermittently (roughly 20% of the time)
reports the first volume of a mirror as degraded upon reboot. We were able
to reproduce the behavior predictably on two different FreeBSD 6.2
machines using different types of hard disks (SCSI and SATA).

Files:

Attached to this report are two dmesg samples (one from each test machine)
that show the full startup output for one of the degraded reboot attempts.

dmesg_machine1.txt - 10-disk SCSI mirror + stripe array
dmesg_machine2.txt - 2-disk SATA mirror array
>How-To-Repeat:
Tests:

On the first machine, we originally set up a stripe + mirror array
composed of 10 SCSI hard disks. This was the setup that first brought the
odd behavior to our attention. When a degraded array occurred, it was
always the same stripe device that needed to be rebuilt (in this case,
stripe1). This suggested that either we had some kind of hardware failure
on that stripe, or something else was causing the device to be unstable.
This was the setup procedure used to build the stripe + mirror array:
-----------------------------------------------------------------------------
gstripe label -v -s 1024 stripe1 /dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4
gstripe label -v -s 1024 stripe2 /dev/da5 /dev/da6 /dev/da7 /dev/da8 /dev/da9
gmirror label -v -b load raid1 /dev/stripe/stripe1 /dev/stripe/stripe2
gmirror load
newfs /dev/mirror/raid1
mkdir /mirror
mount /dev/mirror/raid1 /mirror

# Add the following to /boot/loader.conf
geom_stripe_load="YES"
geom_mirror_load="YES"

# And the mount to /etc/fstab
/dev/mirror/raid1 /mirror ufs rw 2 2
-----------------------------------------------------------------------------

Though it does not indicate the reason, here is the GEOM portion of the
dmesg dump reporting that stripe1 is being rebuilt:
--------------------------------------------------------------
GEOM_STRIPE: Device stripe1 created (id=127607522).
GEOM_STRIPE: Disk da0 attached to stripe1.
GEOM_STRIPE: Disk da1 attached to stripe1.
GEOM_STRIPE: Disk da2 attached to stripe1.
GEOM_STRIPE: Disk da3 attached to stripe1.
GEOM_STRIPE: Disk da4 attached to stripe1.
GEOM_STRIPE: Device stripe1 activated.
GEOM_STRIPE: Device stripe2 created (id=2058765235).
GEOM_STRIPE: Disk da5 attached to stripe2.
GEOM_STRIPE: Disk da6 attached to stripe2.
GEOM_STRIPE: Disk da7 attached to stripe2.
GEOM_STRIPE: Disk da8 attached to stripe2.
GEOM_STRIPE: Disk da9 attached to stripe2.
GEOM_STRIPE: Device stripe2 activated.
GEOM_MIRROR: Device raid1 created (id=702664467).
GEOM_MIRROR: Device raid1: provider stripe/stripe1 detected.
GEOM_MIRROR: Device raid1: provider stripe/stripe2 detected.
GEOM_MIRROR: Device raid1: provider stripe/stripe2 activated.
GEOM_MIRROR: Device raid1: provider mirror/raid1 launched.
GEOM_MIRROR: Device raid1: rebuilding provider stripe/stripe1.
--------------------------------------------------------------

On a successful reboot, dmesg reports 'provider stripe/stripe1 activated'
right after stripe2 is activated. In the case above, that line is missing;
instead, gmirror reports a rebuild of stripe1.
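At this point the degraded state is also visible from userland. For anyone
reproducing this, a minimal illustrative check (not part of the original
report; it assumes the raid1 mirror configured above, and the output and
percentage are abridged to the general format gmirror(8) prints):
-----------------------------------------------------------------------------
# gmirror status raid1
        Name    Status  Components
mirror/raid1  DEGRADED  stripe/stripe2 stripe/stripe1 (4%)

# gmirror list raid1 | grep -E 'State|Flags'
# During a rebuild, the affected component reports State: SYNCHRONIZING
# and the DIRTY flag; the percentage above is the rebuild progress.
-----------------------------------------------------------------------------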
Our first test was to see whether we had some sort of hardware failure. We
tore down the array and swapped the devices that defined the two stripes.
If a hard disk were failing, stripe2 would now have exhibited the symptoms
rather than stripe1.

Setup procedure with swapped stripe devices:
-----------------------------------------------------------------------------
gstripe label -v -s 1024 stripe2 /dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4
gstripe label -v -s 1024 stripe1 /dev/da5 /dev/da6 /dev/da7 /dev/da8 /dev/da9
gmirror label -v -b load raid1 /dev/stripe/stripe1 /dev/stripe/stripe2
-----------------------------------------------------------------------------

Unfortunately, stripe1 continued to display the symptoms, signifying that
the problem was not tied to a particular set of devices and was probably
not a bad hard disk. The dmesg dump from this attempt:
--------------------------------------------------------------
GEOM_STRIPE: Device stripe2 created (id=3454907582).
GEOM_STRIPE: Disk da0 attached to stripe2.
GEOM_STRIPE: Disk da1 attached to stripe2.
GEOM_STRIPE: Disk da2 attached to stripe2.
GEOM_STRIPE: Disk da3 attached to stripe2.
GEOM_STRIPE: Disk da4 attached to stripe2.
GEOM_STRIPE: Device stripe2 activated.
GEOM_STRIPE: Device stripe1 created (id=3886319334).
GEOM_STRIPE: Disk da5 attached to stripe1.
GEOM_STRIPE: Disk da6 attached to stripe1.
GEOM_STRIPE: Disk da7 attached to stripe1.
GEOM_STRIPE: Disk da8 attached to stripe1.
GEOM_STRIPE: Disk da9 attached to stripe1.
GEOM_STRIPE: Device stripe1 activated.
GEOM_MIRROR: Device raid1 created (id=2591439382).
GEOM_MIRROR: Device raid1: provider stripe/stripe2 detected.
GEOM_MIRROR: Device raid1: provider stripe/stripe1 detected.
GEOM_MIRROR: Device raid1: provider stripe/stripe2 activated.
GEOM_MIRROR: Device raid1: provider mirror/raid1 launched.
GEOM_MIRROR: Device raid1: rebuilding provider stripe/stripe1.
--------------------------------------------------------------

An interesting point to note: the order in which gmirror detects the
stripe devices resembles the order in which they are built by gstripe, but
the order in which they are 'activated' by gmirror remains the same as in
the first test. This prompted a second test case.

Our second test swapped the order in which we listed the stripes when
setting up the mirror. Keeping the same stripe configuration as in the
first test, we tried:
-----------------------------------------------------------------------------
gstripe label -v -s 1024 stripe2 /dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4
gstripe label -v -s 1024 stripe1 /dev/da5 /dev/da6 /dev/da7 /dev/da8 /dev/da9
gmirror label -v -b load raid1 /dev/stripe/stripe2 /dev/stripe/stripe1
-----------------------------------------------------------------------------

This time, stripe2 began to exhibit the symptoms instead of stripe1,
indicating that the problem is in some way related to the order in which
the mirror components are listed at setup time. The dmesg dump shows that
gmirror now activates stripe1 first instead of stripe2:
--------------------------------------------------------------
GEOM_STRIPE: Device stripe2 created (id=1700277137).
GEOM_STRIPE: Disk da0 attached to stripe2.
GEOM_STRIPE: Disk da1 attached to stripe2.
GEOM_STRIPE: Disk da2 attached to stripe2.
GEOM_STRIPE: Disk da3 attached to stripe2.
GEOM_STRIPE: Disk da4 attached to stripe2.
GEOM_STRIPE: Device stripe2 activated.
GEOM_STRIPE: Device stripe1 created (id=1012042920).
GEOM_STRIPE: Disk da5 attached to stripe1.
GEOM_STRIPE: Disk da6 attached to stripe1.
GEOM_STRIPE: Disk da7 attached to stripe1.
GEOM_STRIPE: Disk da8 attached to stripe1.
GEOM_STRIPE: Disk da9 attached to stripe1.
GEOM_STRIPE: Device stripe1 activated.
GEOM_MIRROR: Device raid1 created (id=3479269816).
GEOM_MIRROR: Device raid1: provider stripe/stripe2 detected.
GEOM_MIRROR: Device raid1: provider stripe/stripe1 detected.
GEOM_MIRROR: Device raid1: provider stripe/stripe1 activated.
GEOM_MIRROR: Device raid1: provider mirror/raid1 launched.
GEOM_MIRROR: Device raid1: rebuilding provider stripe/stripe2.
--------------------------------------------------------------

For our third test, we set up a mirror array from plain hard drive devices
instead of stripes, to see whether the stripes themselves were causing the
problem. Setup procedure:
-----------------------------------------------------------------------------
gmirror label -v -b load raid1 /dev/da0 /dev/da1
gmirror load
newfs /dev/mirror/raid1
mkdir /mirror
mount /dev/mirror/raid1 /mirror
-----------------------------------------------------------------------------
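Because the failure is intermittent (roughly one reboot in five on this
hardware), it helps to record the outcome of every boot rather than watch
the console. A small sketch of such a check (not part of the original
report; it assumes the raid1 mirror above, runs from /etc/rc.local, and
the log path is arbitrary):
-----------------------------------------------------------------------------
#!/bin/sh
# Log one line per boot noting whether raid1 came up degraded.
if gmirror list raid1 | grep -q SYNCHRONIZING; then
    echo "$(date): raid1 degraded, rebuild in progress" >> /var/log/raid1-boot.log
else
    echo "$(date): raid1 clean" >> /var/log/raid1-boot.log
fi
-----------------------------------------------------------------------------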
Unfortunately, the stripes were not the cause: the mirror continued to
intermittently degrade across reboot tests. dmesg dump:
----------------------------------------------------------
GEOM_MIRROR: Device raid1 created (id=2593903422).
GEOM_MIRROR: Device raid1: provider da0 detected.
GEOM_MIRROR: Device raid1: provider da1 detected.
GEOM_MIRROR: Device raid1: provider da1 activated.
GEOM_MIRROR: Device raid1: provider mirror/raid1 launched.
GEOM_MIRROR: Device raid1: rebuilding provider da0.
----------------------------------------------------------

As a final test, to rule out hardware architecture problems, we set up a
completely new server using the same FreeBSD kernel and GEOM versions as
before, only this time with SATA hard drives instead of SCSI. The setup
procedure:
---------------------------------------------------
gmirror label -v -b load mirror1 /dev/ad5 /dev/ad7
gmirror load
mkdir /mirror
mount /dev/mirror/mirror1 /mirror

# /boot/loader.conf:
geom_mirror_load="YES"

# /etc/fstab:
/dev/mirror/mirror1 /mirror ufs rw 2 2
---------------------------------------------------

As on the previous machine, the first device listed in the mirror at setup
time (/dev/ad5) caused the degraded array:
--------------------------------------------------------------
GEOM_MIRROR: Device mirror1 created (id=79998820).
GEOM_MIRROR: Device mirror1: provider ad5 detected.
GEOM_MIRROR: Device mirror1: provider ad7 detected.
GEOM_MIRROR: Device mirror1: provider ad7 activated.
GEOM_MIRROR: Device mirror1: provider mirror/mirror1 launched.
GEOM_MIRROR: Device mirror1: rebuilding provider ad5.
--------------------------------------------------------------

As before, we reversed the order of the devices in the mirror to confirm
the behavior:
---------------------------------------------------
gmirror label -v -b load mirror1 /dev/ad7 /dev/ad5
---------------------------------------------------

As expected, /dev/ad7 caused the degraded array this time. dmesg:
--------------------------------------------------------------
GEOM_MIRROR: Device mirror1 created (id=2912844912).
GEOM_MIRROR: Device mirror1: provider ad5 detected.
GEOM_MIRROR: Device mirror1: provider ad7 detected.
GEOM_MIRROR: Device mirror1: provider ad5 activated.
GEOM_MIRROR: Device mirror1: provider mirror/mirror1 launched.
GEOM_MIRROR: Device mirror1: rebuilding provider ad7.
--------------------------------------------------------------
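The pattern above (whichever component was listed first at label time is
the one rebuilt) suggests the first component is being left marked dirty
at shutdown rather than actually losing data. The on-disk gmirror metadata
of the two components can be compared directly with the dump subcommand;
an illustrative check (not part of the original report; device names
follow the SATA test above):
---------------------------------------------------
# After a clean shutdown, both components should carry the same
# syncid and no dirty flag; a component left dirty at shutdown is
# resynchronized on the next boot.
gmirror dump /dev/ad5
gmirror dump /dev/ad7
---------------------------------------------------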
>Fix:
Patch attached with submission follows:

dmesg dump from one of the failed reboot attempts on the first machine.
10 SCSI drives over a RAID 0+1 stripe+mirror array. This should be read
alongside the findings in geom_writeup1.txt.

-------------------------------------------------------------------
Copyright (c) 1992-2007 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 6.2-RELEASE-p1 #1: Mon May 7 14:30:09 EDT 2007
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(TM) CPU 3.06GHz (3065.81-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0xf25  Stepping = 5
  Features=0xbfebfbff
  Features2=0x4400
  Logical CPUs per core: 2
real memory  = 4160225280 (3967 MB)
avail memory = 4072763392 (3884 MB)
ACPI APIC Table:
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  6
 cpu3 (AP): APIC ID:  7
ioapic0 irqs 0-23 on motherboard
ioapic1 irqs 24-47 on motherboard
ioapic2 irqs 48-71 on motherboard
ioapic3 irqs 72-95 on motherboard
ioapic4 irqs 96-119 on motherboard
kbd1 at kbdmux0
acpi0: on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1008-0x100b on acpi0
cpu0: on acpi0
cpu1: on acpi0
cpu2: on acpi0
cpu3: on acpi0
pcib0: port 0xcf8-0xcff on acpi0
pci0: on pcib0
pci0: at device 0.1 (no driver attached)
pcib1: at device 2.0 on pci0
pci1: on pcib1
pci1: at device 28.0 (no driver attached)
pcib2: at device 29.0 on pci1
pci2: on pcib2
pci1: at device 30.0 (no driver attached)
pcib3: at device 31.0 on pci1
pci3: on pcib3
em0: port 0x3000-0x303f mem 0xfc200000-0xfc21ffff irq 28 at device 2.0 on pci3
em0: Ethernet address: 00:30:48:2a:4a:74
em1: port 0x3040-0x307f mem 0xfc220000-0xfc23ffff irq 29 at device 2.1 on pci3
em1: Ethernet address: 00:30:48:2a:4a:75
pcib4: at device 3.0 on pci0
pci4: on pcib4
pci4: at device 28.0 (no driver attached)
pcib5: at device 29.0 on pci4
pci5: on pcib5
fxp0: port 0x4000-0x403f mem 0xfc400000-0xfc400fff,0xfc420000-0xfc43ffff irq 100 at device 2.0 on pci5
miibus0: on fxp0
inphy0: on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp0: Ethernet address: 00:02:b3:ea:eb:cd
pci4: at device 30.0 (no driver attached)
pcib6: at device 31.0 on pci4
pci6: on pcib6
ahd0: port 0x5400-0x54ff,0x5000-0x50ff mem 0xfc500000-0xfc501fff irq 76 at device 2.0 on pci6
ahd0: [GIANT-LOCKED]
aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs
ahd1: port 0x5c00-0x5cff,0x5800-0x58ff mem 0xfc502000-0xfc503fff irq 77 at device 2.1 on pci6
ahd1: [GIANT-LOCKED]
aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs
pci0: at device 29.0 (no driver attached)
pci0: at device 29.1 (no driver attached)
pci0: at device 29.2 (no driver attached)
pcib7: at device 30.0 on pci0
pci7: on pcib7
pci7: at device 1.0 (no driver attached)
isab0: at device 31.0 on pci0
isa0: on isab0
atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x2060-0x206f at device 31.1 on pci0
ata0: on atapci0
ata1: on atapci0
pci0: at device 31.3 (no driver attached)
acpi_button0: on acpi0
atkbdc0: port 0x60,0x64 irq 1 on acpi0
atkbd0: irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
fdc0: port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: [FAST]
pmtimer0 on isa0
orm0: at iomem 0xc0000-0xc7fff,0xc8000-0xc8fff,0xd2000-0xd37ff,0xe0000-0xe3fff on isa0
sc0: at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 1.000 msec
ipfw2 (+ipv6) initialized, divert loadable, rule-based forwarding enabled, default to accept, logging limited to 100 packets/entry by default
Waiting 5 seconds for SCSI devices to settle
ad0: 114473MB at ata0-master UDMA100
ad2: 114473MB at ata1-master UDMA100
ses0 at ahd0 bus 0 target 6 lun 0
ses0: Fixed Processor SCSI-2 device
ses0: 3.300MB/s transfers
ses0: SAF-TE Compliant Device
ses1 at ahd1 bus 0 target 6 lun 0
ses1: Fixed Processor SCSI-2 device
ses1: 3.300MB/s transfers
ses1: SAF-TE Compliant Device
da5 at ahd1 bus 0 target 0 lun 0
da5: Fixed Direct Access SCSI-3 device
da5: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
da5: 70007MB (143374744 512 byte sectors: 255H 63S/T 8924C)
da6 at ahd1 bus 0 target 1 lun 0
da6: Fixed Direct Access SCSI-3 device
da6: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
da6: 70007MB (143374744 512 byte sectors: 255H 63S/T 8924C)
da7 at ahd1 bus 0 target 2 lun 0
da7: Fixed Direct Access SCSI-3 device
da7: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
da7: 70007MB (143374744 512 byte sectors: 255H 63S/T 8924C)
da8 at ahd1 bus 0 target 3 lun 0
da8: Fixed Direct Access SCSI-3 device
da8: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
da8: 70007MB (143374744 512 byte sectors: 255H 63S/T 8924C)
da9 at ahd1 bus 0 target 4 lun 0
da9: Fixed Direct Access SCSI-3 device
da9: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
da9: 70007MB (143374744 512 byte sectors: 255H 63S/T 8924C)
da0 at ahd0 bus 0 target 0 lun 0
da0: Fixed Direct Access SCSI-3 device
da0: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
da0: 70007MB (143374744 512 byte sectors: 255H 63S/T 8924C)
da1 at ahd0 bus 0 target 1 lun 0
da1: Fixed Direct Access SCSI-3 device
da1: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
da1: 70007MB (143374744 512 byte sectors: 255H 63S/T 8924C)
da2 at ahd0 bus 0 target 2 lun 0
da2: Fixed Direct Access SCSI-3 device
da2: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
da2: 70007MB (143374744 512 byte sectors: 255H 63S/T 8924C)
da3 at ahd0 bus 0 target 3 lun 0
da3: Fixed Direct Access SCSI-3 device
da3: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
da3: 70007MB (143374744 512 byte sectors: 255H 63S/T 8924C)
da4 at ahd0 bus 0 target 4 lun 0
da4: Fixed Direct Access SCSI-3 device
da4: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
da4: 70007MB (143374744 512 byte sectors: 255H 63S/T 8924C)
SMP: AP CPU #1 Launched!
SMP: AP CPU #2 Launched!
SMP: AP CPU #3 Launched!
GEOM_STRIPE: Device stripe1 created (id=127607522).
GEOM_STRIPE: Disk da0 attached to stripe1.
GEOM_STRIPE: Disk da1 attached to stripe1.
GEOM_STRIPE: Disk da2 attached to stripe1.
GEOM_STRIPE: Disk da3 attached to stripe1.
GEOM_STRIPE: Disk da4 attached to stripe1.
GEOM_STRIPE: Device stripe1 activated.
GEOM_STRIPE: Device stripe2 created (id=2058765235).
GEOM_STRIPE: Disk da5 attached to stripe2.
GEOM_STRIPE: Disk da6 attached to stripe2.
GEOM_STRIPE: Disk da7 attached to stripe2.
GEOM_STRIPE: Disk da8 attached to stripe2.
GEOM_STRIPE: Disk da9 attached to stripe2.
GEOM_STRIPE: Device stripe2 activated.
GEOM_MIRROR: Device raid1 created (id=702664467).
GEOM_MIRROR: Device raid1: provider stripe/stripe1 detected.
GEOM_MIRROR: Device raid1: provider stripe/stripe2 detected.
GEOM_MIRROR: Device raid1: provider stripe/stripe2 activated.
GEOM_MIRROR: Device raid1: provider mirror/raid1 launched.
GEOM_MIRROR: Device raid1: rebuilding provider stripe/stripe1.
Trying to mount root from ufs:/dev/ad0s1a
WARNING: / was not properly dismounted
/: mount pending error: blocks 8 files 1
WARNING: /usr was not properly dismounted
WARNING: /var was not properly dismounted
WARNING: /hot/ad2 was not properly dismounted
WARNING: /hot/ad2/usr was not properly dismounted
WARNING: /hot/ad2/var was not properly dismounted
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->freebsd-geom
Responsible-Changed-By: linimon
Responsible-Changed-When: Mon Jun 25 07:48:11 UTC 2007
Responsible-Changed-Why:
Over to maintainer(s).

http://www.freebsd.org/cgi/query-pr.cgi?pr=113957

From: Arne "Wörner"
To: bug-followup@FreeBSD.org
Cc: ayochum@pair.com, freebsd-geom@FreeBSD.org
Subject: Re: kern/113957: [gmirror] gmirror is intermittently reporting a degraded mirror array upon reboot.
Date: Mon, 25 Jun 2007 01:04:24 -0700 (PDT)

How did you reboot? By pulling the power plug? ;-))

I mean: the dmesg shows that all file systems are dirty, so I think your
reboot procedure is the problem and not gmirror...

-Arne

From: Kyle Lanclos
To: bug-followup@FreeBSD.org
Cc:
Subject: Re: kern/113957: [gmirror] gmirror is intermittently reporting a degraded mirror array upon reboot.
Date: Thu, 10 Jan 2008 16:55:06 -0800

I see the same behavior on a Sun X2100 M2 system with two 250GB SATA
drives. If I reboot the machine, it comes up with the first drive in the
mirror (ad4) degraded, and commences a resync. The percentage varies
somewhat, but it generally starts syncing at 0% complete.

I am running FreeBSD 7.0-RC1, 32-bit.

Similar to the behavior noted in the original submission, only the ad4
device (the one used to 'seed' this gm0 mirror) comes up dirty on reboot.
Unlike the original submission, I see this behavior every time I reboot,
not 20% of the time.

In each case, I reboot the machine by issuing the 'reboot' command, as
opposed to yanking the power cord from the back of the server. The console
messages at boot time indicate that all filesystems are clean.

Here is the relevant section from dmesg:

ad4: 238471MB at ata2-master SATA300
ad6: 238471MB at ata3-master SATA300
SMP: AP CPU #1 Launched!
GEOM_MIRROR: Device mirror/gm0 launched (1/2).
GEOM_MIRROR: Device gm0: rebuilding provider ad4.
Trying to mount root from ufs:/dev/mirror/gm0s1a

Sample gmirror output:

# gmirror status
      Name    Status  Components
mirror/gm0  DEGRADED  ad4 (26%) ad6

# gmirror list
Geom name: gm0
State: DEGRADED
Components: 2
Balance: load
Slice: 4096
Flags: NONE
GenID: 0
SyncID: 1
ID: 1395794316
Providers:
1. Name: mirror/gm0
   Mediasize: 250055999488 (233G)
   Sectorsize: 512
   Mode: r8w7e8
Consumers:
1. Name: ad4
   Mediasize: 250056000000 (233G)
   Sectorsize: 512
   Mode: r1w1e1
   State: SYNCHRONIZING
   Priority: 0
   Flags: DIRTY, SYNCHRONIZING
   GenID: 0
   SyncID: 1
   Synchronized: 26%
   ID: 1222153146
2. Name: ad6
   Mediasize: 250056000000 (233G)
   Sectorsize: 512
   Mode: r1w1e1
   State: ACTIVE
   Priority: 0
   Flags: NONE
   GenID: 0
   SyncID: 1
   ID: 1772037181

The mirror was established via the following set of operations:

gmirror label -v -b load gm0 /dev/ad4
echo geom_mirror_load=YES >> /boot/loader.conf
# Revise all instances of /dev/ad4 to /dev/mirror/gm0 in /etc/fstab.
reboot
gmirror insert gm0 /dev/ad6

Please let me know if there is any additional information I can provide.

--Kyle
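One caveat with the procedure above: after 'gmirror insert' the new
component synchronizes in the background, and rebooting before that
completes leaves it inconsistent. An illustrative way to wait for
synchronization to finish (not part of the original message; assumes the
gm0 name above):
---------------------------------------------------
# Poll until no component of gm0 is still synchronizing.
while gmirror list gm0 | grep -q SYNCHRONIZING; do
    sleep 60
done
gmirror status gm0    # should now report COMPLETE
---------------------------------------------------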
From: Guido van Rooij
To: bug-followup@FreeBSD.org, ayochum@pair.com
Cc:
Subject: Re: kern/113957: [gmirror] gmirror is intermittently reporting a degraded mirror array upon reboot.
Date: Tue, 10 Mar 2009 16:27:55 +0100

Does
http://www.freebsd.org/cgi/cvsweb.cgi/src/etc/rc.d/swap1.diff?r1=1.11;r2=1.12
fix the problem you reported?
(see http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/113957)

-Guido

State-Changed-From-To: open->feedback
State-Changed-By: jh
State-Changed-When: Fri Mar 12 07:41:22 UTC 2010
State-Changed-Why:
Note that submitter has been asked for feedback.

http://www.freebsd.org/cgi/query-pr.cgi?pr=113957

From: Christopher Key
To: bug-followup@FreeBSD.org, ayochum@pair.com
Cc:
Subject: Re: kern/113957: [gmirror] gmirror is intermittently reporting a degraded mirror array upon reboot.
Date: Tue, 23 Mar 2010 19:17:59 +0000

I am suffering very similar symptoms. I have two mirrors, gm0 (ad4+ad6)
and gm1 (ad8+ad10). On every restart, gm1 gets rebuilt, but never gm0.

The problem appears to stem from gm1 not being closed cleanly at shutdown.
I consistently see:

GEOM_MIRROR: Device gm0: provider mirror/gm0 destroyed.
GEOM_MIRROR: Device gm0 destroyed.

but never a corresponding message for gm1. I think this is a problem with
reference counts, and that something is keeping gm1 open.

Immediately before shutdown, I have (trimmed):

# gmirror list
Geom name: gm0
State: COMPLETE
Components: 2
Providers:
1. Name: mirror/gm0
   Mode: r5w5e9

Geom name: gm1
State: COMPLETE
Components: 2
Balance: round-robin
Providers:
1. Name: mirror/gm1
   Mode: r3w3e4

gm0 has 6 GPT partitions. Four have mounted UFS filesystems, one is used
for swap, and one is a boot partition. I assume that the mounted
partitions contribute r1w1e2 each to the reference count, and the swap
r1w1e1, leading to r5w5e9. When shutting down with
kern.geom.mirror.debug=2, I get:

GEOM_MIRROR[2]: Access request for mirror/gm0: r-1w-1e-2
GEOM_MIRROR[2]: Access request for mirror/gm0: r-1w-1e-2
GEOM_MIRROR[2]: Access request for mirror/gm0: r-1w-1e-2
GEOM_MIRROR[2]: Access request for mirror/gm0: r-1w-1e-2
GEOM_MIRROR[2]: Access request for mirror/gm0: r-1w-1e-1

which appears to confirm this. After these releases, the refcount becomes
r0w0e0, and the device gets destroyed.

gm1 has 3 MBR slices. Two have mounted UFS filesystems, and the third
slice is used for a ZFS pool. Here, I assume that the mounted partitions
contribute r1w1e1 each, and the partition used by ZFS contributes r1w1e2.
I am not sure why UFS mounted from a GPT partition should contribute
r1w1e2 but from an MBR slice only r1w1e1; testing with memory disks does,
however, appear to confirm that this is the case. When shutting down with
kern.geom.mirror.debug=2, I get:

GEOM_MIRROR[2]: Access request for mirror/gm1: r-1w-1e-1
GEOM_MIRROR[2]: Access request for mirror/gm1: r-1w-1e-1

This releases only r2w2e2 of r3w3e4, leaving r1w1e2 held, and suggests
that the problem is ZFS not releasing gm1 at shutdown. Testing with memory
disk based mirrors shows the same result: ZFS does not release the mirror,
and the mirror is not cleanly destroyed.
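For completeness, the memory-disk experiment referred to above can be
reproduced along the following lines. This is an illustrative sketch only,
not Christopher's exact procedure; the device sizes, file paths, and pool
name are arbitrary:
-----------------------------------------------------------------------------
# Create two backing files and attach them as memory disks.
truncate -s 128m /tmp/disk0 /tmp/disk1
mdconfig -a -t vnode -f /tmp/disk0    # => md0
mdconfig -a -t vnode -f /tmp/disk1    # => md1

# Mirror them and put a ZFS pool on the mirror.
gmirror label -v -b round-robin gmtest /dev/md0 /dev/md1
zpool create testpool /dev/mirror/gmtest

# The pool holds the mirror open; the Mode field should show a
# non-zero exclusive count (e.g. r1w1e2) for mirror/gmtest.
gmirror list gmtest | grep Mode

# Exporting the pool should drop the counts back to r0w0e0; a
# consumer that never releases the mirror prevents it from being
# destroyed cleanly at shutdown, so it comes up dirty on next boot.
zpool export testpool
gmirror list gmtest | grep Mode
-----------------------------------------------------------------------------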