Date:      Sun, 28 Jul 2013 13:23:43 +0300
From:      Alexander Motin <mav@FreeBSD.org>
To:        Dieter BSD <dieterbsd@gmail.com>
Cc:        freebsd-hardware@freebsd.org
Subject:   Re: Reset Problem with SATA Port Multiplier
Message-ID:  <51F4F12F.80003@FreeBSD.org>
In-Reply-To: <CAA3ZYrBWuztq9z0AddeJP0dnod40yAe%2BVQSZBwO330AKazf3eg@mail.gmail.com>
References:  <CAA3ZYrBWuztq9z0AddeJP0dnod40yAe%2BVQSZBwO330AKazf3eg@mail.gmail.com>

On 28.07.2013 03:08, Dieter BSD wrote:
> Bob writes:
>> After a few hours of a database-like workload
>
> A faster way to trigger the problem would be useful.
>
>> We're actually more interested in archive type workloads than this
>> database workload and we have not observed the problem with an archive
>> workload.
>
> So perhaps something about the timing triggers the bug?
>
> Sam writes:
>> if you have a script or a way to build a kernel to help debug this I will
>> run it if you post it here... I have the same issue on a 3 port multiplier
>> using -HEAD
>
> Can you share the make and model of this 3 port multiplier?
> If it is happening with more than one model of pm, it is more likely
> some generic problem, rather than triggering some model-specific quirk/bug.
> Has anyone seen this problem with an older OS release? (say 7.x or 8.x?)
> If the problem was introduced recently, we might be able to find it
> by looking at what changed in the source code. I haven't seen the
> problem with 8.2 or earlier.
>
> Looks like a verbose boot will give a little more info.
> But I suspect that adding more log(9) statements will be needed.
> Unless mav has a better idea?

There are two sides to this problem: the original issue and imperfect
error recovery. The first is an open question: I can't say what is
actually going on there that causes the problem. Just recently I made
one more attempt to get documentation on SATA controllers from Marvell.
But even after I signed an NDA, the process stalled again, since I am
not a vendor buying thousands of their chips and they do not support
end users. The situation is similar with other vendors.

As for recovery, the problem is that neither CAM nor the mvs driver
currently tracks the faulty status of devices. So if some disk's
firmware gets stuck and starts causing endless timeouts, that
substantially disrupts the operation of the other devices sharing that
SATA port. The mechanism for dropping a faulty device could probably be
improved.
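
To illustrate the kind of per-device tracking meant here, below is a
minimal userland sketch in C. It is not CAM or mvs code: every name in
it (pmp_health, FAULT_THRESHOLD and so on) is hypothetical, and the
threshold of three consecutive timeouts is an arbitrary assumption. It
only shows the idea of counting timeouts per port multiplier target and
refusing further commands to a target once it is marked faulty, so that
it stops stalling its neighbours.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical names throughout; none of this is CAM or mvs(4) code. */
#define PMP_MAX_TARGETS 15 /* device ports behind one port multiplier */
#define FAULT_THRESHOLD 3  /* assumed: consecutive timeouts before dropping */

struct pmp_health {
	unsigned consecutive_timeouts[PMP_MAX_TARGETS];
	bool     faulty[PMP_MAX_TARGETS];
};

/* Start with every target healthy and all counters at zero. */
static void
pmp_health_init(struct pmp_health *h)
{
	memset(h, 0, sizeof(*h));
}

/*
 * Record a command timeout on a target.  Returns true only on the call
 * that moves the target into the faulty state, so the caller can drop
 * the device exactly once.
 */
static bool
pmp_health_timeout(struct pmp_health *h, unsigned target)
{
	assert(target < PMP_MAX_TARGETS);
	if (h->faulty[target])
		return (false);
	if (++h->consecutive_timeouts[target] >= FAULT_THRESHOLD) {
		h->faulty[target] = true;
		return (true);
	}
	return (false);
}

/* A completed command clears the streak for a still-healthy target. */
static void
pmp_health_success(struct pmp_health *h, unsigned target)
{
	assert(target < PMP_MAX_TARGETS);
	if (!h->faulty[target])
		h->consecutive_timeouts[target] = 0;
}

/* Commands to a faulty target should be failed instead of issued. */
static bool
pmp_health_should_issue(const struct pmp_health *h, unsigned target)
{
	assert(target < PMP_MAX_TARGETS);
	return (!h->faulty[target]);
}
```

A real implementation would also need some way to clear the faulty
state, for example after the disk is replaced or the bus is rescanned;
that part is left out of the sketch.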

As for SAS, mentioned here: that is a quite different, more expensive
market. And even though the protocols are much more sophisticated and
the hardware, firmware and software are much better tested, situations
still happen there where a single misbehaving device can bring down the
whole fabric.

-- 
Alexander Motin


