LSI  LOGIC ©
Engineering Release Notice
Component: SAS_FW_Image
Release Date: 03-28-2007
OEM: LSI
Version: SAS_FW_Image_APP-1.03.20-0225_MPT-01.18.74.00-IT_MPT2-01.18.74.00-IT_BB-R.2.3.12_BIOS-MT30_WEBBIOS-1.03-08_02_CTRLR-1.04-017A_2007_03_28
Package: 5.1.1-0047
FW_MPT_1068 01.18.74.00-IT
FW_SAS 1.03.20-0225
FW_MPT_1068_b1 01.18.74.00-IT


FW_MPT_1068
Component: FW_MPT_1068
Stream: FW_MPT_1068_Proj_Integration
Version: 01.18.74.00-IT
Baseline From: FW_MPT_1068_Release-MPTFW-01.18.73.00-IT-2007_01_26
Baseline To: FW_MPT_1068_Release-MPTFW-01.18.74.00-IT-2007_03_26
CHANGE SUMMARY:
LSID100066810 (TASK) Release MPT 01.18.74
LSID100066174 (DFCT) Unable to reliably flash daisy-chained enclosures.
DEFECT RECORDS (Total Defects=1, Number Duplicate=0):
FW_MPT_1068 DEFECTS
DFCT ID: LSID100066174
Headline: Unable to reliably flash daisy-chained enclosures.
Description: Flashing enclosures through an 8480E SAS card with varied configurations is unreliable.

Flashing one enclosure works reliably.
Flashing four or five may or may not pass.
Whether the enclosures are populated or unpopulated does not seem to matter.
Version of Bug Reported: 211
Steps to Reproduce: The flash utility was provided by the OEM; the ISO image is attached to this defect.

The enclosures need to be set to a particular OEM.
Create a CD from the image, then boot from the CD and flash up and down.
Child Tasks: LSID100066810
UCM ACTIVITY / TASK RECORDS (1):
FW_MPT_1068 UCM TASKS
Task ID: LSID100066810
Headline: Release MPT 01.18.74
Description: MPT 01.18.74 code release
State: Open
Change Set Files: 0
References:   LSID100066174(DFCT)    


FW_SAS
Component: FW_SAS
Stream: SAS_1.0_Dev
Version: 1.03.20-0225
Baseline From: FW_SAS_Release_Dobson-1.03.20-0220_2007_03_06
Baseline To: FW_SAS_Release_Dobson-1.03.20-0225_2007_03_16
CHANGE SUMMARY:
LSID100066361 (TASK) update version.c
LSID100066139 (TASK) Limit Spinupdelay to maximum 15 for MPT
LSID100066154 (TASK) Intermittent link failure causes HDD marked dead
LSID100066364 (TASK) FW_SAS Release Version: 1.03.20-0225
LSID100065823 (TASK) FW_SAS Release Version: 1.03.21-0221
LSID100066143 (TASK) Flush Cache before making a Rebuilt drive online
LSID100056527 (DFCT) HDD Spin-up setting values are not possible.
LSID100065371 (DFCT) Stop error after HotRebuild
LSID100065366 (DFCT) Intermittent link failure causes HDD marked dead
DEFECT RECORDS (Total Defects=3, Number Duplicate=0):
FW_SAS DEFECTS
DFCT ID: LSID100056527
Customer DFCT No: 12319699
Headline: HDD Spin-up setting values are not possible.
Description: 06/27/06: still open with latest FW
6/16/06: SCM 34550 is closed. Need to reopen if still an issue.
2/16/06: retest failed? Setting appears to be don't care.
12/22/05: according to notes in SCM 34550, this issue was verified fixed and closed. Retest with FW 71.
12/15/05: retest with FW 69 failed – setting in WebBIOS seems to be don't care.
11/03/05: retest with .66
10/14/05: Please add support for spinup delay. Number of drives per spinup is working. Fixed in next FW.
10/3/05: duplicated. Issue is with the delay setting.
9/20/05: I can modify the HDD spin-up setting values through WebBIOS and GAM, but the settings do not take effect.
When I change these parameters, spin-up proceeds as follows:
test 1: check the behaviour with the controller default settings
setting:
- disks per spin: 2
- delay between spins: 6 sec
behaviour:
- spin-up of targets 1, 2, 4 and 5 --> spin-up of targets 0 and 3
- spin-up of all 6 disks completes within 12 seconds
test 2: check the behaviour with the settings changed
setting:
- disks per spin: 1
- delay between spins: 30 sec
behaviour:
- spin-up of targets 1, 2, 4 and 5 --> spin-up of targets 0 and 3
- spin-up of all 6 disks completes within 12 seconds
This behaviour is the same as in test 1.
The adapter's behaviour should depend on the settings.
Note: FW drop .0055 does not have this problem.
Version of Bug Reported: 96
Version of Bug Fixed: 1.03.20-0225
Steps to Reproduce: I can modify the HDD spin-up setting values through WebBIOS and GAM, but the settings do not take effect.
When I change these parameters, spin-up proceeds as follows:
test 1: check the behaviour with the controller default settings
setting:
- disks per spin: 2
- delay between spins: 6 sec
behaviour:
- spin-up of targets 1, 2, 4 and 5 --> spin-up of targets 0 and 3
- spin-up of all 6 disks completes within 12 seconds
test 2: check the behaviour with the settings changed
setting:
- disks per spin: 1
- delay between spins: 30 sec
behaviour:
- spin-up of targets 1, 2, 4 and 5 --> spin-up of targets 0 and 3
- spin-up of all 6 disks completes within 12 seconds
This behaviour is the same as in test 1.
The adapter's behaviour should depend on the settings.
Note: FW drop .0055 does not have this problem.
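As a rough sanity check on the observations above, the sketch below computes when the last spin-up group should start under a simple staggering model (groups of "disks per spin" drives started every "delay between spins" seconds); the model is an assumption for illustration only, not taken from the firmware. It gives 12 s for test 1 but 150 s for test 2, so completion of all six disks within 12 seconds in test 2 indicates the delay setting is being ignored.

    /* Rough expected spin-up stagger under the reported settings. The
     * grouping model (disks_per_spin drives started every delay_sec
     * seconds) is an assumption for illustration only. */
    #include <stdio.h>

    static int last_group_start(int num_disks, int disks_per_spin, int delay_sec)
    {
        int groups = (num_disks + disks_per_spin - 1) / disks_per_spin;
        return (groups - 1) * delay_sec;  /* when the last group begins spinning */
    }

    int main(void)
    {
        printf("test 1 (2 per spin, 6 s delay): last group at %d s\n",
               last_group_start(6, 2, 6));    /* 12 s  */
        printf("test 2 (1 per spin, 30 s delay): last group at %d s\n",
               last_group_start(6, 1, 30));   /* 150 s */
        return 0;
    }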
Resolution: Fixed Indirectly
Resolution Description: The maximum value allowed for Spinup Delay is 15, as MPT allows only 4 bits in its spinup delay field. Previously, the original 8-bit value from the utility was simply truncated and used in the 4-bit field. Now, if the value from the utility exceeds 15, it is set to 15, the maximum 4-bit value.
Customer Defect Track No: 12319699
Customer List: FSC -- FSC
Child Tasks: LSID100066139
FW_SAS DEFECTS
DFCT ID: LSID100065371
Customer DFCT No: 520PR067
Headline: Stop error after HotRebuild
Description: A Stop error occurs when rebooting the system after performing a hot rebuild using PCP.
Version of Bug Reported: 1N41
Steps to Reproduce: [Configuration]
H/W: MegaRAID SCSI
FW: 1N41
RAID: RAID 5 with 3 HDD
OS: W2K
DB: Oracle10G

[Step]
1. "Offline" HDD at ID0 by PCP, then run HotRebuild
2. "Offline" HDD at ID1 by PCP, then run HotRebuild
3. "Offline" HDD at ID2 by PCP, then run HotRebuild
4. Reboot the system
5. Stop error happens at OS boot

*No application is running.
Customer Defect Track No: 520PR067
Customer List: NEC -- NEC
Child Tasks: LSID100066143
FW_SAS DEFECTS
DFCT ID: LSID100065366
Customer DFCT No: HSA0086
Headline: Intermittent link failure causes HDD marked dead
Description: Intermittent link failure causes HDD marked dead
Version of Bug Reported: 1.03.00-0177
Steps to Reproduce:
1. Make a RAID 1 array using SATA HDDs on Port 0 and Port 1.
2. Remove the HDD on Port 1.
3. Re-insert the removed drive very slowly to simulate an intermittent link failure.
4. Link failures are detected intermittently, and a single Device Removed (Link Failure) event causes the drive to be marked dead.

[System]
Platform (OS): Windows Server 2003 R2 x86
Processors: Xeon 3.60 GHz
BIOS: Phoenix BIOS 9IVDTH-E16
Memory: 1 GB

Driver Name & Version Number: Msas2k3.sys 1.21.0.32
Utility Name & Version Number: MegaRAID Storage Manager 1.18-00

RAID Adapter-1: MegaRAID SAS 8308ELP & ROMB
Series #: MegaRAID SAS 8308ELP & ROMB
Channels: 8 Port
BIOS: MT28
FIRMWARE: 1.03.00-0177

[PD]
A   C   T   Manufacturer   Model      FW Rev.   Size
1       0   HGST           KUROFUNE             500GB
1       1   HGST           KUROFUNE             500GB

[LD]
Adapter     Logical     Array     Size     FST     Vol. Name     RAID     SS     WP     RP     CP     VS     E
1     RAID 1          500GB     NTFS     -     1     -     WT     RA     DIO     500GB     



Resolution: Fixed
Resolution Description: Increase the device missing delay to 15 seconds.
Customer Defect Track No: HSA0086
Customer List: Hitachi -- Hitachi
Child Tasks: LSID100066154
UCM ACTIVITY / TASK RECORDS (6):
FW_SAS UCM TASKS
Task ID: LSID100066361
Headline: update version.c
Description: VER_MAINTENANCE_BOARD 3
State: Open
Change Set Files: 0
References:  
FW_SAS UCM TASKS
Task ID: LSID100066139
Headline: Limit Spinupdelay to maximum 15 for MPT
Description: Defect 56527: HDD Spin-up setting values are not possible.

Analysis:
=======
The spinup delay is handled by the MPT chip, and in the MPT interface the spinup delay field is 4 bits wide, so the maximum possible value is 15. The FW, however, assigned the 8-bit value from the utility directly to the 4-bit field, silently truncating it: a value of 30 truncates to 14 and a value of 40 truncates to 8.

Fix:
===
Limit the spinup delay to 15, the maximum possible value, so that out-of-range requests behave sensibly. The user still has to know the maximum supported value in order to set it appropriately.
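A minimal sketch of the clamp described above, assuming an 8-bit value arrives from the utility; the function and constant names are illustrative, not the actual firmware symbols.

    /* Illustrative clamp of the 8-bit spinup delay from the utility to
     * the 4-bit field MPT accepts; names are hypothetical. */
    #include <stdint.h>

    #define MPT_SPINUP_DELAY_MAX 15u  /* 4-bit field: values 0..15 */

    static uint8_t mpt_spinup_delay(uint8_t requested)
    {
        /* Previously the value was truncated to its low 4 bits
         * (30 -> 14, 40 -> 8); now anything above 15 is clamped. */
        return (requested > MPT_SPINUP_DELAY_MAX) ? MPT_SPINUP_DELAY_MAX
                                                  : requested;
    }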
State: Completed
Change Set Files: 0
References:   LSID100056527(DFCT)    
FW_SAS UCM TASKS
Task ID: LSID100066154
Headline: Intermittent link failure causes HDD marked dead
Description: Fix DF65366:
For the Hitachi OEM build, MPT_DEVICE_PORT_MISSING_DELAY and MPT_IO_DEVICE_MISSING_DELAY are both set to 15 seconds.
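A minimal sketch of the OEM-conditional change described above; only the two delay names come from the task description, while the configuration structure and helper are hypothetical.

    /* Illustrative sketch of the Hitachi-specific missing-delay change;
     * the structure and helper are hypothetical, only the two delay
     * names come from the task description. */
    #define HITACHI_MISSING_DELAY_SEC 15

    struct mpt_missing_delay_cfg {
        unsigned int device_port_missing_delay;  /* MPT_DEVICE_PORT_MISSING_DELAY */
        unsigned int io_device_missing_delay;    /* MPT_IO_DEVICE_MISSING_DELAY   */
    };

    static void apply_oem_missing_delay(struct mpt_missing_delay_cfg *cfg,
                                        int is_hitachi_oem)
    {
        if (is_hitachi_oem) {
            /* Tolerate intermittent link loss for up to 15 s before the
             * device is reported missing and the drive is marked dead. */
            cfg->device_port_missing_delay = HITACHI_MISSING_DELAY_SEC;
            cfg->io_device_missing_delay   = HITACHI_MISSING_DELAY_SEC;
        }
    }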
State: Open
Change Set Files: 0
References:   LSID100065366(DFCT)    
FW_SAS UCM TASKS
Task ID: LSID100066364
Headline: FW_SAS Release Version: 1.03.20-0225
Description: FW_SAS Release Version: 1.03.20-0225
State: Open
Change Set Files: 0
References:  
FW_SAS UCM TASKS
Task ID: LSID100065823
Headline: FW_SAS Release Version: 1.03.21-0221
Description: FW_SAS Release Version: 1.03.21-0221
State: Completed
Change Set Files: 0
References:  
FW_SAS UCM TASKS
Task ID: LSID100066143
Headline: Flush Cache before making a Rebuilt drive online
Description: Issue Title:
=========
Potential for data corruption after 2nd drive failure following completion of drive rebuild on RAID 5 Logical Drives configured with Write-Back caching

Products Affected:
===============
All LSI Megaraid Adapters, excluding Megaraid SAS 1.1 FW

Background on RAID 5 Rebuilds
===========================
A RAID 5 Logical Drive (LD) can survive single drive failures by maintaining redundant parity data across all drive members of the LD. When any single drive fails, missing data for the failed drive can be reconstructed as needed from the surviving drive members. The RAID adapter can later return the LD to full redundancy by performing a complete “rebuild” operation, which entails reconstructing the entire data set of the failed/missing drive onto a replacement drive. Returning the LD to full redundancy allows it to survive another (subsequent) single-drive failure.
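For illustration only, the XOR reconstruction mentioned above can be sketched as follows; the block layout and naming are simplified and not taken from the MegaRAID firmware.

    /* Minimal illustration of RAID 5 reconstruction: a missing drive's
     * block is the XOR of the corresponding blocks (data and parity)
     * on all surviving drives. Simplified for illustration. */
    #include <stddef.h>
    #include <stdint.h>

    static void raid5_reconstruct_block(uint8_t *out,
                                        const uint8_t *const surviving[],
                                        size_t n_surviving, size_t block_len)
    {
        for (size_t i = 0; i < block_len; i++) {
            uint8_t x = 0;
            for (size_t d = 0; d < n_surviving; d++)
                x ^= surviving[d][i];
            out[i] = x;
        }
    }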

Issue Description
==============
Megaraid performs rebuilds sequentially from start to finish, on a per-stripe-row basis. For a given row, missing data (or parity) for the rebuilding drive is reconstructed using a bitwise XOR operation of data read from the surviving drives. If none of the rebuilding drive's data is dirty in cache for a given row, the reconstructed data is immediately written to the rebuilding disk, making the data and parity consistent on all disks for that row.

If dirty host data exists in cache for the rebuilding strip on a given row, special care is taken to only regenerate data that is not dirty, since dirty data represents newer data written by the host that supersedes existing (regenerated) data. After the rebuild logic has reconstructed all non-dirty sectors within a dirty strip, FW marks the entire strip as dirty, AND DOES NOT WRITE THE DATA IMMEDIATELY TO DISK. FW instead relies on the write-back flusher to later write the data to disk. Deferral of this write is necessary because the dirty host data for the rebuilding strip must be made consistent with the parity drive for that row, an operation implemented only in our write-back flusher.

When the rebuild operation completes, the rebuilding drive is marked ONLINE, signaling to the user that the LD is fully redundant and ready to survive another single-drive failure. Even though the rebuild is complete and the LD is marked consistent, Megaraid's cache may still contain a certain number of dirty rebuilt-lines that have yet to be flushed to disk. Until these lines are flushed, the data on the rebuilt disk for these dirty rows is undefined (not yet written) and inconsistent with parity. If the LD suffers a 2nd drive failure before these lines have been flushed, data on the 2nd failed drive for these rows will be unrecoverable because the rebuilt disk has not yet been made consistent with the parity, thus making reconstruction of the failed drive's data impossible for the affected dirty rows.

Resolution
========
The fix for this issue is to flush the entire contents of the write-back cache following the completion of a rebuild, waiting for this flush to complete before allowing the drive's state to be changed from REBUILD to ONLINE.
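A minimal sketch of the ordering described in this resolution; all function and state names are hypothetical stand-ins, not actual MegaRAID firmware symbols.

    /* Sketch only: all names are hypothetical stand-ins for firmware
     * services, not actual MegaRAID symbols. */
    enum pd_state { PD_REBUILD, PD_ONLINE };

    /* Hypothetical firmware services assumed by this sketch. */
    extern void wb_cache_flush_ld(int ld_id);      /* flush all dirty lines of the LD   */
    extern int  wb_cache_flush_pending(int ld_id); /* non-zero while dirty lines remain */
    extern void set_pd_state(int pd_id, enum pd_state new_state);

    /* Promote the rebuilt drive to ONLINE only after the write-back cache
     * has been fully flushed, so the rebuilt disk is consistent with
     * parity before the LD is declared fully redundant. */
    void on_rebuild_complete(int ld_id, int pd_id)
    {
        wb_cache_flush_ld(ld_id);
        while (wb_cache_flush_pending(ld_id))
            ;                                   /* wait for the flush to finish */
        set_pd_state(pd_id, PD_ONLINE);         /* only now REBUILD -> ONLINE   */
    }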

Workaround
==========
In lieu of corrected FW there is an alternate procedure available to users that will ensure the flushing of all dirty data following completion of a rebuild. After a rebuild has completed, the user should switch the write caching mode of the LD from Write Back to Write Through, which will trigger flushing of all dirty data for the LD. The user can then immediately switch the write mode back to Write Back.

Probability of Data Loss
===================
This issue affects only RAID 5 LDs configured with Write-Back caching, since Write-Through LDs never contain dirty cache data. The probability of data loss resulting from this issue correlates to the volume of dirty REBUILT data remaining in cache after completion of the rebuild, in relation to the probability of experiencing a 2nd drive failure before the dirty data has been flushed to disk. There are specific factors which affect both of these probabilities.

Probability of Dirty Data
===================
The likelihood of the rebuild process encountering dirty data for any given row depends on the level of write activity from the host, the amount of adapter memory available for write caching on the specific LD, the user-configured write flush time, and the speed at which the adapter is able to flush dirty data to disk.

Aside from these factors, there is a specific I/O load scenario which produces the highest probability of dirty data. In situations in which the host is continually (re)writing to primarily a small subset of blocks within the LD, the cache lines associated with those blocks will remain perpetually dirty in cache. The reasons are twofold: 1) if the amount of unique data written by the host fits entirely in cache, the cache will never experience the forced flushing needed to make room for new, unique data, and 2) whenever a write request is received from the host, Megaraid resets the interval flush timer, thus the typical periodic “sweep-flush” will not be triggered. In this scenario we have observed dirty data remaining in cache for very long durations, sometimes exceeding 10 minutes.
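The timer-reset behaviour described above can be sketched as follows; the interval value and all names are hypothetical, and the sketch only illustrates why a continuously rewritten working set never triggers the periodic sweep-flush.

    /* Sketch of the interval flush timer reset described above; names
     * and the interval value are hypothetical. */
    #define FLUSH_INTERVAL_SEC 4              /* hypothetical configured flush time */

    static unsigned int flush_timer_sec;

    void on_host_write(void)
    {
        flush_timer_sec = 0;                  /* every host write resets the timer */
        /* ... cache the data and mark the affected lines dirty ... */
    }

    void on_one_second_tick(void)
    {
        if (++flush_timer_sec >= FLUSH_INTERVAL_SEC) {
            /* Periodic sweep-flush of dirty lines; never reached while host
             * writes keep arriving more often than FLUSH_INTERVAL_SEC. */
            flush_timer_sec = 0;
        }
    }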

Probability of 2nd drive failure
=======================
The likelihood of a 2nd drive failure in the interim between a completed rebuild and the flushing of dirty rebuilt cache data is very small and is dependent on the number of disks in the LD in relation to the MTBF of those disks. A more likely scenario leading to 2nd drive unavailability is the prospect of a user-initiated manual “copy back” operation. In Megaraid configurations containing hotspares, the failed drive's data is automatically rebuilt onto an available hotspare drive. Once this rebuild is complete, some users are inclined to relocate the rebuilt data back into the failing drive's “slot” within the enclosure. This can be accomplished by shutting down the system and moving the hotspare into the slot occupied by the failed drive. To avoid the need for a shutdown, some users will replace the failed drive with a fresh drive, then manually fail the hotspare drive after its rebuild has completed, triggering a rebuild operation onto the fresh drive placed in the original failed slot.
State: Completed
Change Set Files: 0
References:   LSID100065371(DFCT)    


FW_MPT_1068_b1
Component: FW_MPT_1068_b1
Stream: FW_MPT_1068_B1_Integration
Version: 01.18.74.00-IT
Baseline From: FW_MPT_1068_b1_Release-MPTFW-01.18.73.00-IT-2007_01_26
Baseline To: FW_MPT_1068_b1_Release-MPTFW-01.18.74.00-IT-2007_03_26
CHANGE SUMMARY:
LSID100066818 (TASK) Release MPT 01.18.74 for B1
UCM ACTIVITY / TASK RECORDS (1):
FW_MPT_1068_b1 UCM TASKS
Task ID: LSID100066818
Headline: Release MPT 01.18.74 for B1
Description: MPT 01.18.74 code release
State: Open
Change Set Files: 0
References: