Date:      Thu, 7 Nov 2013 14:44:03 -0500
From:      Mark Johnston <markj@freebsd.org>
To:        Charles Owens <cowens@greatbaysoftware.com>
Cc:        Jason Damron <jdamron@greatbaysoftware.com>, freebsd-scsi@freebsd.org, Steve McCoy <smccoy@greatbaysoftware.com>
Subject:   Re: adding BBU relearn support to mfiutil
Message-ID:  <20131107194402.GA1695@charmander.sandvine.com>
In-Reply-To: <527BD440.8010701@greatbaysoftware.com>
References:  <20130304033836.GA33631@oddish> <1365196956.17311.13.camel@localhost> <20130406000809.GA96223@raichu> <527A7603.7090303@greatbaysoftware.com> <20131106230356.GA86666@charmander.sandvine.com> <527BD440.8010701@greatbaysoftware.com>

On Thu, Nov 07, 2013 at 12:56:16PM -0500, Charles Owens wrote:
> On 11/6/13 6:03 PM, Mark Johnston wrote:
> > On Wed, Nov 06, 2013 at 12:01:55PM -0500, Charles Owens wrote:
> >> Hi, we've been playing with this patch in the context of 8.4-RELEASE-p4
> >> (we extracted r250483 and r250497 from stable/8 and applied to
> >> releng/8.4).  I'm seeing some results that make me question whether or
> >> not caching is really working correctly after a BBU relearn operation
> >> has completed -- or maybe whether or not the new BBU patch is talking to
> >> LSI controller properly.
> >>
> >> Our test system had a BBU in the failed state (relearn needed).  We used
> >> the "start learn" command and it seemed to go well, but strangely,
> >> although the process seems to have completed, several days later the
> >> status is still LEARN_CYCLE_REQUESTED (as seen with "mfiutil show battery").
> >> This may be entirely normal -- maybe it says that because the autolearn
> >> feature is now enabled?
> > I suspect that the status is bogus and that the battery is in fact dead.
> > There seem to be a few firmware bugs in the BBU status reporting, at
> > least with iBBU07. In your output below, I see:
> >
> >          Design Capacity: 1215 mAh
> >     Full Charge Capacity: 65262 mAh
> >         Current Capacity: 61543 mAh
> >
> > which clearly isn't right. I've seen this problem before as well: over
> > time, the full charge capacity decreases, and eventually it seems to
> > wrap around to 65535. MegaCli (LSI's binary RAID management tool) reports
> > exactly the same thing, so it's a problem with the controller firmware.
> > If you look at MegaCli output you get things like "Absolute charge: 6000%".
> > So I suspect that the status is incorrect as well; when I've run into
> > this problem, I still see "status: normal".
> >
> >> The output of the "cache" status command is also a bit strange. Here is
> >> the raw output of these status commands:
> >>
> >> # mfiutil cache mfid0
> >> mfi0 volume mfid0 cache settings:
> >>                I/O caching: disabled
> >>              write caching: write-back
> >> write cache with bad BBU: disabled
> >>                 read ahead: adaptive
> >>          drive write cache: enabled
> >> Cache disabled due to dead battery or ongoing battery relearn
> >>
> >>
> >> # ./mfiutil show battery
> >> mfi0: Battery State:
> >>        Manufacture Date: 3/18/2010
> >>           Serial Number: 77
> >>            Manufacturer: LS1111001A
> >>                   Model: 3598501
> >>               Chemistry: LION
> >>         Design Capacity: 1215 mAh
> >>    Full Charge Capacity: 65262 mAh
> >>        Current Capacity: 61543 mAh
> >>           Charge Cycles: 120
> >>          Current Charge: 94%
> >>          Design Voltage: 3700 mV
> >>         Current Voltage: 4081 mV
> >>             Temperature: 23 C
> >>        Autolearn period: 30 days
> >>         Next learn time: Tue Nov 26 20:06:40 2013
> >>    Learn delay interval: 0 hours
> >>          Autolearn mode: enabled
> >>                  Status: LEARN_CYCLE_REQUESTED
> >>
> >>
> >> /Why does cache status now say  "Cache disabled due to dead battery or
> >> ongoing battery relearn"/?  Shouldn't this no longer be the case since
> >> I've run the "learn" operation?  Does this indicate that the I/O caching
> >> is really disabled?
> > I believe so. You can try changing the write caching policy to write-back
> > with bad BBU and see if that re-enables the cache. If it does, that's
> > more evidence that the BBU is dead and needs to be replaced.
> >
> >> I'd appreciate any and all assistance.  Here's a bit of other info that
> >> might be of interest:
> >>
> >> # mfiutil show adapter
> >> mfi0 Adapter:
> >>       Product Name: Integrated Intel(R) RAID Controller SROMBSASMP2
> >>      Serial Number:
> >>           Firmware: 11.0.1-0036
> >>        RAID Levels: JBOD, RAID0, RAID1, RAID5, RAID6, RAID10, RAID50
> >>     Battery Backup: present
> >>              NVRAM: 32K
> >>     Onboard Memory: 512M
> >>     Minimum Stripe: 8k
> >>     Maximum Stripe: 1M
> >>
> >> # mfiutil show drives
> >> mfi0 Physical Drives:
> >>    1 (  136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005JE> SAS E1:S0
> >>    2 (  136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005JV> SAS E1:S1
> >>    3 (  136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005KD> SAS E1:S4
> >>    4 (  136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005BQ> SAS E1:S2
> >>    5 (  136G) HOT SPARE <SEAGATE ST9146852SS 0005 serial=6TB005FJ> SAS E1:S3
> >>
> >> The storage volume is a 4-drive RAID10.  System has 16GB RAM, dual Xeon
> >> E5530 CPUs, on an Intel S5520UR motherboard.
> > It might be useful to check the output of "mfiutil show events -c info".
> >
> >
> 
> This is good info, thank you.
> 
> The "show events" command tells us when the battery first was detected 
> as "failed":
> 
> 49336 (Sun Mar  3 21:53:40 UTC 2013/BATTERY/info) - Battery charge complete
> 49340 (boot + 4s/BATTERY/info) - Battery Present
> 49341 (boot + 4s/BATTERY/FATAL) - Battery has failed and cannot support data retention. Please replace the battery
> 49365 (boot + 45s/BATTERY/WARN) - BBU disabled; changing WB virtual disks to WT
> 49367 (Mon Mar  4 05:13:09 UTC 2013/BATTERY/info) - Battery temperature is normal
> 
> 
> 
> So, given this strong indication that the BBU is really dead, and that 
> I'd still like to test the effects of write-caching, I used this 
> command:   mfiutil cache mfid0 bad-bbu-write-cache enable
> 
> Now the "Cache disabled" message is gone:
> 
> # mfiutil cache mfid0
> mfi0 volume mfid0 cache settings:
>               I/O caching: writes
>             write caching: write-back
> write cache with bad BBU: enabled
>                read ahead: adaptive
>         drive write cache: enabled
> 
> 
> The obvious interpretation is that write-caching is now operational (in 
> the preferred write-back mode).  Strangely, though, my performance tests 
> (with both pgbench and bonnie) still showed no meaningful effect from 
> having the cache operational!  I toggled between caching / no-caching 
> with these commands:
> 
> # mfiutil cache mfid0 writes
> Setting write cache policy to write-back
> 
> # mfiutil cache mfid0 disable
> Disabling caching of I/O writes
> 
> 
> Again, no difference in performance was seen.
> 
> On a whim, I also tried write-through mode, and to my surprise, bonnie 
> showed significantly reduced performance! (consistent over multiple 
> samples)  This is really confusing.  To me it suggests that there's some 
> kind of disconnect between caching-status as seen with mfiutil and 
> caching-status in reality.  The chief evidence is that write-caching 
> appears to have still been happening even:
> 
>   * after the "cache mfid0 disable" command was issued, and
>   * earlier, before the "cache mfid0 bad-bbu-write-cache enable" command
>     was issued (when "mfiutil cache mfid0" still showed "Cache disabled
>     due to dead battery or ongoing battery relearn").
> 
> ** If this is the case then it suggests that the system before today was 
> in a dangerous state... actively doing write-back caching with a bad BBU 
> (despite what mfiutil claimed about the cache being disabled)! **

Yup. That's rather frightening. :(

> 
> Your thoughts?  Is there any other way to explain this?

Nothing that comes to mind. I did some work to improve LSI BBU reporting
because we were noticing intermittent performance problems that turned
out to be caused by the controller flipping to write-through mode during
BBU relearn cycles.

However, I've never bothered verifying that the cache is actually in
write-through mode when the battery is dead. I think there's a machine
in my lab which shows similar problems, so I will try to take a look at
it soon, do some write perf testing and see what MegaCli reports. It'll
take me a few days at least to get to this though.

I'm not sure how this might be fixed in the case that it turns out to be
another firmware bug.
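
For reference, the capacity figures from "mfiutil show battery" quoted
earlier are consistent with a 16-bit counter wrapping around. Below is a
minimal Python sketch of a sanity check. It assumes the firmware reports
capacity as an unsigned 16-bit value (the thread only observes that the
value seems to wrap around to 65535). The function names and the 1.5x
slack threshold are hypothetical and are not part of mfiutil.

```python
def as_signed16(value):
    """Reinterpret an unsigned 16-bit reading as a signed 16-bit value."""
    value &= 0xFFFF
    return value - 0x10000 if value >= 0x8000 else value


def capacity_looks_wrapped(design_mah, full_charge_mah, slack=1.5):
    """Flag a full charge capacity far above the design capacity.

    The slack factor is an arbitrary, hypothetical threshold.
    """
    return full_charge_mah > design_mah * slack


# Values copied from the "mfiutil show battery" output in this thread.
design, full, current = 1215, 65262, 61543

print(capacity_looks_wrapped(design, full))  # True
print(as_signed16(full))                     # -274
print(as_signed16(current))                  # -3993
print(round(100 * current / full))           # 94
```

Read as a signed value, the reported full charge capacity is -274 mAh,
and 61543/65262 matches the reported 94% current charge, so the
percentage appears to be derived from the same wrapped counters.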

-Mark

> 
> 
> Here is the data from bonnie:
> 
> *****  write-through caching (2 samples)
> 
> # bonnie -s 2000
> File './Bonnie.1351', size: 2097152000
> ...
>                -------Sequential Output-------- ---Sequential Input-- --Random--
>                -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
>           2000 61515 21.3 46388  4.3 57432 16.0 247823 99.9 1629696 100.0 55687.0 212.4
> 
> Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
>           2000 60001 20.7 51828  4.9 51666 13.9 247501 100.0 1657454 100.0 53136.4 251.0
> 
> *****  write-back caching (2 samples)
> 
>                -------Sequential Output-------- ---Sequential Input-- --Random--
>                -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
>           2000 128564 44.6 90065  8.7 245325 47.8 248492 100.0 1558747 99.7 61967.5 179.1
> 
> Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
>           2000 184059 64.0 141360 13.8 129801 22.2 246222 99.2 1556723 100.0 51728.4 159.7
> 
> (and, again... same performance is seen after issuing "cache disable" 
> command)
> 
> 
> Thanks much,
> 
> Charles Owens
> Great Bay Software
> 
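
The bonnie samples quoted above can be compared numerically. This is a
minimal Python sketch, not a benchmarking method: it only averages the
two block sequential output samples for each mode, copied from the
tables above, and takes their ratio. The helper name "speedup" is
hypothetical.

```python
from statistics import mean

# Block sequential output (K/sec), copied from the bonnie runs above.
write_through = [46388, 51828]
write_back = [90065, 141360]


def speedup(baseline, candidate):
    """Ratio of mean throughput of candidate relative to baseline."""
    return mean(candidate) / mean(baseline)


ratio = speedup(write_through, write_back)
print(f"write-back is about {ratio:.1f}x write-through for block output")
```

With only two samples per mode, and a wide spread between the two
write-back samples (90065 vs. 141360 K/sec), the ratio is a rough
indication at best.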


