Date:      Thu, 07 Nov 2013 12:56:16 -0500
From:      Charles Owens <cowens@greatbaysoftware.com>
To:        Mark Johnston <markj@freebsd.org>
Cc:        Jason Damron <jdamron@greatbaysoftware.com>, freebsd-scsi@freebsd.org, Steve McCoy <smccoy@greatbaysoftware.com>
Subject:   Re: adding BBU relearn support to mfiutil
Message-ID:  <527BD440.8010701@greatbaysoftware.com>
In-Reply-To: <20131106230356.GA86666@charmander.sandvine.com>
References:  <20130304033836.GA33631@oddish> <1365196956.17311.13.camel@localhost> <20130406000809.GA96223@raichu> <527A7603.7090303@greatbaysoftware.com> <20131106230356.GA86666@charmander.sandvine.com>

On 11/6/13 6:03 PM, Mark Johnston wrote:
> On Wed, Nov 06, 2013 at 12:01:55PM -0500, Charles Owens wrote:
>> Hi, we've been playing with this patch in the context of 8.4-RELEASE-p4
>> (we extracted r250483 and r250497 from stable/8 and applied to
>> releng/8.4).  I'm seeing some results that make me question whether or
>> not caching is really working correctly after a BBU relearn operation
>> has completed -- or maybe whether or not the new BBU patch is talking to
>> LSI controller properly.
>>
>> Our test system had a BBU in the failed state (relearn needed).  We used
>> the "start learn" command and it seemed to go well.  Strangely, though,
>> now that the process seems to have completed, and several days later, the
>> status is still LEARN_CYCLE_REQUESTED (as seen with "mfiutil show battery").
>> This may be entirely normal -- maybe it says that because the autolearn
>> feature is now enabled?
> I suspect that the status is bogus and that the battery is in fact dead.
> There seem to be a few firmware bugs in the BBU status reporting, at
> least with iBBU07. In your output below, I see:
>
>          Design Capacity: 1215 mAh
>     Full Charge Capacity: 65262 mAh
>         Current Capacity: 61543 mAh
>
> which clearly isn't right. I've seen this problem before as well: over
> time, the full charge capacity decreases, and eventually it seems to
> wrap around to 65535. MegaCli (LSI's binary RAID management tool) reports
> exactly the same thing, so it's a problem with the controller firmware.
> If you look at MegaCli output you get things like "Absolute charge: 6000%".
> So I suspect that the status is incorrect as well; when I've run into
> this problem, I still see "status: normal".
>
>> The output of the "cache" status command is also a bit strange. Here is
>> the raw output of these status commands:
>>
>> # mfiutil cache mfid0
>> mfi0 volume mfid0 cache settings:
>>                I/O caching: disabled
>>              write caching: write-back
>> write cache with bad BBU: disabled
>>                 read ahead: adaptive
>>          drive write cache: enabled
>> Cache disabled due to dead battery or ongoing battery relearn
>>
>>
>> # ./mfiutil show battery
>> mfi0: Battery State:
>>        Manufacture Date: 3/18/2010
>>           Serial Number: 77
>>            Manufacturer: LS1111001A
>>                   Model: 3598501
>>               Chemistry: LION
>>         Design Capacity: 1215 mAh
>>    Full Charge Capacity: 65262 mAh
>>        Current Capacity: 61543 mAh
>>           Charge Cycles: 120
>>          Current Charge: 94%
>>          Design Voltage: 3700 mV
>>         Current Voltage: 4081 mV
>>             Temperature: 23 C
>>        Autolearn period: 30 days
>>         Next learn time: Tue Nov 26 20:06:40 2013
>>    Learn delay interval: 0 hours
>>          Autolearn mode: enabled
>>                  Status: LEARN_CYCLE_REQUESTED
>>
>>
>> /Why does cache status now say  "Cache disabled due to dead battery or
>> ongoing battery relearn"/?  Shouldn't this no longer be the case since
>> I've run the "learn" operation?  Does this indicate that the I/O caching
>> is really disabled?
> I believe so. You can try changing the write caching policy to write-back
> with bad BBU and see if that re-enables the cache. If it does, that's
> more evidence that the BBU is dead and needs to be replaced.
>
>> I'd appreciate any and all assistance.  Here's a bit of other info that
>> might be of interest:
>>
>> # mfiutil show adapter
>> mfi0 Adapter:
>>       Product Name: Integrated Intel(R) RAID Controller SROMBSASMP2
>>      Serial Number:
>>           Firmware: 11.0.1-0036
>>        RAID Levels: JBOD, RAID0, RAID1, RAID5, RAID6, RAID10, RAID50
>>     Battery Backup: present
>>              NVRAM: 32K
>>     Onboard Memory: 512M
>>     Minimum Stripe: 8k
>>     Maximum Stripe: 1M
>>
>> # mfiutil show drives
>> mfi0 Physical Drives:
>>    1 (  136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005JE> SAS E1:S0
>>    2 (  136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005JV> SAS E1:S1
>>    3 (  136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005KD> SAS E1:S4
>>    4 (  136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005BQ> SAS E1:S2
>>    5 (  136G) HOT SPARE <SEAGATE ST9146852SS 0005 serial=6TB005FJ> SAS E1:S3
>>
>> The storage volume is a 4-drive RAID10.  The system has 16GB RAM and dual
>> Xeon E5530 CPUs on an Intel S5520UR motherboard.
> It might be useful to check the output of "mfiutil show events -c info".
>
>

This is good info, thank you.
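
As a quick sanity check on the wrap-around theory, here is a small Python
sketch that redoes the arithmetic on the capacity figures from "mfiutil show
battery".  It assumes the firmware stores the capacities in 16-bit fields
(my assumption, suggested by the values sitting just below 65535); the helper
names are made up for illustration.

```python
# Values copied from the "mfiutil show battery" output quoted above.
DESIGN_MAH = 1215
FULL_CHARGE_MAH = 65262
CURRENT_MAH = 61543


def as_signed16(value):
    """Reinterpret an unsigned 16-bit value as a signed 16-bit integer.

    Assumption: the capacity field is 16 bits wide, so a counter that
    drops below zero would show up as a number just under 65536.
    """
    return value - 0x10000 if value >= 0x8000 else value


def percent(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


if __name__ == "__main__":
    print("full charge as int16 :", as_signed16(FULL_CHARGE_MAH), "mAh")
    print("current as int16     :", as_signed16(CURRENT_MAH), "mAh")
    print("current / full charge:", percent(CURRENT_MAH, FULL_CHARGE_MAH), "%")
    print("full charge / design :", percent(FULL_CHARGE_MAH, DESIGN_MAH), "%")
```

The ratio of current to full charge capacity matches the reported "Current
Charge: 94%", so that percentage is probably derived from the same bogus
numbers, and a full charge capacity of more than 50 times the design
capacity is clearly not physical.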

The "show events" command tells us when the battery first was detected 
as "failed":

49336 (Sun Mar  3 21:53:40 UTC 2013/BATTERY/info) - Battery charge complete
49340 (boot + 4s/BATTERY/info) - Battery Present
49341 (boot + 4s/BATTERY/FATAL) - Battery has failed and cannot support data retention. Please replace the battery
49365 (boot + 45s/BATTERY/WARN) - BBU disabled; changing WB virtual disks to WT
49367 (Mon Mar  4 05:13:09 UTC 2013/BATTERY/info) - Battery temperature is normal
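
For anyone who wants to pick such entries out of a longer log, here is a
rough Python sketch that parses lines in the format shown above and keeps
only the WARN and FATAL ones.  The regular expression is inferred from these
five sample lines only, not from any mfiutil documentation, so treat it as
an assumption; the function names are hypothetical.

```python
import re

# Line layout inferred from the sample: "<seq> (<when>/<class>/<severity>) - <text>"
EVENT_RE = re.compile(r"^(\d+) \((.+)/([A-Z]+)/([A-Za-z]+)\) - (.*)$")

SAMPLE = """\
49336 (Sun Mar  3 21:53:40 UTC 2013/BATTERY/info) - Battery charge complete
49340 (boot + 4s/BATTERY/info) - Battery Present
49341 (boot + 4s/BATTERY/FATAL) - Battery has failed and cannot support data retention. Please replace the battery
49365 (boot + 45s/BATTERY/WARN) - BBU disabled; changing WB virtual disks to WT
49367 (Mon Mar  4 05:13:09 UTC 2013/BATTERY/info) - Battery temperature is normal
"""


def parse_events(text):
    """Turn event log text into a list of dicts; unparseable lines are skipped."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            seq, when, cls, severity, message = match.groups()
            events.append({
                "seq": int(seq),
                "when": when,
                "class": cls,
                "severity": severity,
                "message": message,
            })
    return events


def serious(events):
    """Keep only the severities that look alarming in the sample (WARN, FATAL)."""
    return [e for e in events if e["severity"].upper() in ("WARN", "FATAL")]


if __name__ == "__main__":
    for event in serious(parse_events(SAMPLE)):
        print(event["seq"], event["severity"], event["message"])
```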



So, given this strong indication that the BBU is really dead, and that 
I'd still like to test the effects of write-caching, I used this 
command:   mfiutil cache mfid0 bad-bbu-write-cache enable

Now the "cache disabled" message is gone:

# mfiutil cache mfid0
mfi0 volume mfid0 cache settings:
              I/O caching: writes
            write caching: write-back
write cache with bad BBU: enabled
               read ahead: adaptive
        drive write cache: enabled


The obvious interpretation is that write-caching is now operational (in 
the preferred write-back mode).  Strangely, though, my performance tests 
(with both pgbench and bonnie) still showed no meaningful effect from 
having the cache operational!  I toggled between caching / no-caching 
with these commands:

# mfiutil cache mfid0 writes
Setting write cache policy to write-back

# mfiutil cache mfid0 disable
Disabling caching of I/O writes


Again, no difference in performance was seen.

On a whim, I also tried write-through mode, and to my surprise, bonnie 
showed significantly reduced performance (consistent over multiple 
samples).  This is really confusing.  To me it suggests a disconnect 
between the caching status reported by mfiutil and the caching status 
in reality.  The chief evidence is that write caching appears to have 
still been happening:

  * after the "cache mfid0 disable" command was issued, and
  * earlier, before the "cache mfid0 bad-bbu-write-cache enable" command
    was issued (when "mfiutil cache mfid0" still showed "Cache disabled
    due to dead battery or ongoing battery relearn").

** If this is the case then it suggests that the system before today was 
in a dangerous state... actively doing write-back caching with a bad BBU 
(despite what mfiutil claimed about the cache being disabled)! **

Your thoughts?  Is there any other way to explain this?


Here is the data from bonnie:

*****  write-through caching (2 samples)

# bonnie -s 2000
File './Bonnie.1351', size: 2097152000
...
               -------Sequential Output-------- ---Sequential Input-- --Random--
               -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          2000 61515 21.3 46388  4.3 57432 16.0 247823 99.9 1629696 100.0 55687.0 212.4

Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          2000 60001 20.7 51828  4.9 51666 13.9 247501 100.0 1657454 100.0 53136.4 251.0

*****  write-back caching (2 samples)

               -------Sequential Output-------- ---Sequential Input-- --Random--
               -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          2000 128564 44.6 90065  8.7 245325 47.8 248492 100.0 1558747 99.7 61967.5 179.1

Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          2000 184059 64.0 141360 13.8 129801 22.2 246222 99.2 1556723 100.0 51728.4 159.7

(And again, the same performance is seen after issuing the "cache mfid0 
disable" command.)
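
To put a number on the difference, here is a small Python sketch that
averages the two samples per mode for the three sequential-output columns
and prints the write-back to write-through ratio.  The figures are copied
from the bonnie tables above; the dictionary layout and function name are
just for illustration.  With only two samples per mode (and a wide spread
between the write-back runs), the ratios are rough indications at best.

```python
# Sequential output throughput in K/sec, copied from the bonnie runs above.
# Each list holds the two samples for that mode.
WRITE_THROUGH = {
    "per_char": [61515, 60001],
    "block": [46388, 51828],
    "rewrite": [57432, 51666],
}
WRITE_BACK = {
    "per_char": [128564, 184059],
    "block": [90065, 141360],
    "rewrite": [245325, 129801],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def speedup(metric):
    """Ratio of mean write-back throughput to mean write-through throughput."""
    return mean(WRITE_BACK[metric]) / mean(WRITE_THROUGH[metric])


if __name__ == "__main__":
    for metric in ("per_char", "block", "rewrite"):
        print(
            f"{metric:9s} WT {mean(WRITE_THROUGH[metric]):10.1f} K/sec  "
            f"WB {mean(WRITE_BACK[metric]):10.1f} K/sec  "
            f"ratio {speedup(metric):.2f}x"
        )
```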


Thanks much,

Charles Owens
Great Bay Software



