Date:      Wed, 10 Dec 2008 13:18:00 +0100
From:      Arnaud Houdelette <arnaud.houdelette@tzim.net>
To:        Victor Balada Diaz <victor@bsdes.net>
Cc:        freebsd-stable@freebsd.org, freebsd-amd64@freebsd.org
Subject:   Re: [ATA] and re(4) stability issues
Message-ID:  <493FB378.5030106@tzim.net>
In-Reply-To: <20081209185236.GA1320@alf.bsdes.net>
References:  <20081209185236.GA1320@alf.bsdes.net>

Victor Balada Diaz wrote:
> Hello,
>
> I have several machines[1] at hetzner.de and I've been having problems
> with interrupts on FreeBSD 7.0 and now FreeBSD 7.1-BETA2 on amd64. I've
> been trying to narrow the problem down so that someone more knowledgeable
> than me can fix it. This mail is another attempt to ask a question
> about the ATA code, to see if this time I've found something.
>
> For those who don't know what happened:
>
> With FreeBSD 7.0-RELEASE for amd64 and the default kernel,
> the system shared the re0 interrupt with OHCI, and this caused
> re(4) to corrupt packets and create interrupt storms. I tried
> updating to 7.1-BETA2 and still had some problems with it.
>
> I opened PR kern/128287[2] and Remko quickly answered
> with a workaround: removing USB support from
> my kernel. I did that, and re(4) was no longer sharing interrupts
> and the interrupt storms were gone. Now, some time later, the interface
> goes up and down from time to time, but less often. Also, sometimes
> the machine loses the network interface but continues to work.
>
> I know it continues to work because some days later I can see that
> it tried to deliver the status reports but was unable to resolve the
> aliases' hostnames. I can't ping the machine, and I know the network
> is OK. If I reboot the machine, everything works again.
>
> When I switched from 7.0 to 7.1-BETA2 I also found that, under load,
> after some hours the machine created interrupt storms on the ATA disks.
>
> Digging through the Linux source code, I found that it does some special
> things for this chipset that I've been unable to find in our code. This is
> the Linux code for my chipset:
>
> 371                 AHCI_HFLAGS     (AHCI_HFLAG_IGN_SERR_INTERNAL |
> 372                                  AHCI_HFLAG_32BIT_ONLY | AHCI_HFLAG_NO_MSI |
> 373                                  AHCI_HFLAG_SECT255),
>
> The file and the rest of the code are here[3].
>
> As I saw AHCI_HFLAG_NO_MSI, I tried the easiest thing I could
> think of, switching MSI and MSI-X off for the whole system, so
> I added these tunables to /boot/loader.conf:
>
> hw.pci.enable_msix="0"
> hw.pci.enable_msi="0"
>
> And then rebooted the machine. After several hours of doing almost nothing,
> I found that the machine answered ping but was unable to answer any
> request (e.g. ssh, nagios nrpe, etc.). The machine recovered by itself after
> some minutes, and when I was able to ssh in I saw the following in dmesg:
>
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=1463123158
>
> and a lot more errors like that. I didn't get these errors with MSI enabled.
> I see WRITE_DMA48, and in the Linux code I saw AHCI_HFLAG_32BIT_ONLY, which
> is later used for DMA-related things. Could someone who is more knowledgeable
> check if we're doing the right thing?
>
> I've attached the verbose dmesg of a machine that's like this one with
> 7.1-BETA2, MSI enabled, and a GENERIC kernel minus USB and FireWire.
>
> Also, please, could someone give me a hand with how I could continue debugging
> these interrupt issues? I'm a bit lost, and digging through code and posting
> each time I think I've found something is not going to get anywhere.
>
> I would also like to say that I've seen reports of this kind of problem
> on amd64 machines on the lists for several years, so I don't think
> this is just a problem with this BIOS/motherboard (MSI K9AG Neo2 Digital).
>
>
> Thanks in advance for any help.
> Regards.
>
>
> [1]: http://www.hetzner.de/hosting/produkte_rootserver/ds7000/
> [2]: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/128287
> [3]: http://fxr.watson.org/fxr/source/drivers/ata/ahci.c?v=linux-2.6#L369
>   

Sorry, I didn't take the time to read the whole thread, but I had a similar
problem with the same IXP600 chipset.
Only it wasn't with a Realtek NIC (re) but with a Ralink wireless one.
The symptoms were similar: interrupt 22 was shared between the SATA
controller and the wireless card, and I got interrupt storms at random
times when using the wireless network.
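
One way to spot this kind of sharing is to look at the per-interrupt
device list printed by `vmstat -i`. The Python sketch below is only an
illustration: it assumes lines of the form `irqN: dev1 dev2 ... total rate`,
the sample output and its counters are made up, and `SAMPLE` and
`shared_irqs` are hypothetical names, not part of any FreeBSD tool.

```python
import re

# Illustrative `vmstat -i` style output; device names and counters are made up.
SAMPLE = """\
interrupt                          total       rate
irq1: atkbd0                          10          0
irq16: re0                        482113         35
irq22: atapci1 ral0              9120344        660
cpu0: timer                     27648000       2000
Total                           37250467       2695
"""

# Assumed line shape: "irqN: <device list> <total> <rate>".
LINE = re.compile(r"^irq(\d+):\s+(.*?)\s+(\d+)\s+(\d+)\s*$")


def shared_irqs(text):
    """Return {irq_number: [devices]} for IRQ lines that list more than one device."""
    shared = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # header, per-CPU timers, and the Total line are skipped
        devices = m.group(2).split()
        if len(devices) > 1:
            shared[int(m.group(1))] = devices
    return shared


if __name__ == "__main__":
    for irq, devs in shared_irqs(SAMPLE).items():
        print(f"irq{irq} is shared by: {', '.join(devs)}")
```

To use it on a real machine, the output of `vmstat -i` could be fed in
instead of `SAMPLE`; the actual column layout may differ from what is
assumed here.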

No problems since I removed the ral(4) NIC (I have a real access point now).
You might not want to point the finger at the re(4) driver too quickly.

Arnaud Houdelette




