Date:      Fri, 26 Sep 2008 13:12:14 +0300
From:      Anton - Valqk <lists@lozenetz.org>
To:        stable@freebsd.org
Subject:   HELP DEBUG: FreeBSD 6.3-RELEASE-p3 TIMEOUT - WRITE_DMA + other strange behaviour!
Message-ID:  <48DCB57E.8000001@lozenetz.org>

Hello,
I have a FreeBSD 6.3-RELEASE-p3 machine that behaves VERY strangely: I get DMA
timeouts, and the network cards are 'dropping traffic'.
Below is a description of the hardware and of what is happening.
The machine is a Dell OptiPlex, Pentium II 300MHz, with 512MB RAM.
It has 3 NICs:
fxp0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        options=8<VLAN_MTU>
        inet 7.8.9.10 netmask 0xfffff000 broadcast 7.8.9.255
        ether 00:91:21:16:14:bf
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
rl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        options=8<VLAN_MTU>
        inet 8.9.10.11 netmask 0xffffffe0 broadcast 8.9.10.255
        ether 00:02:44:73:2a:fa
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
xl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        options=9<RXCSUM,VLAN_MTU>
        inet 192.168.123.2 netmask 0xffffff00 broadcast 192.168.123.255
        inet 192.168.123.5 netmask 0xffffff00 broadcast 192.168.123.255
        inet 192.168.123.6 netmask 0xffffff00 broadcast 192.168.123.255
        ether 00:c0:4f:20:66:a3
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
fxp0 and rl0 are the external links to the world and are plugged into PCI
slots; xl0 is the internal interface and is integrated on the motherboard.
The machine also has one Promise Ultra133 ATA PCI IDE controller plugged into
a PCI slot.
It has 5 disks: 4 connected to the Promise card and 1 to the motherboard
IDE.

They are as follows:
ad0 and ad6 are two identical Hitachi disks in a gmirror that holds the
system and a partition I keep backups on.

ad4, ad5 and ad7 are storage disks (Seagate 500GB, 8MB cache) that I keep
ISOs and other files on. These are the problematic ones (maybe because they
see much heavier traffic than the other two?).

What is the problem?
Actually there are two problems:
1. I get a lot of DMA timeouts, mostly on ad5 and ad7, where I keep files
larger than 4-5MB and read/write them very often at 3-8MB/s. I don't use
ad4, so I cannot tell whether it would time out too, but I suppose it would
(it currently has Linux partitions on it and is not mounted). I get these
errors:
dmesg.today:ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=5554848
dmesg.today:ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=5914112
dmesg.today:ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=14924096
dmesg.today:ad7: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=374303456
dmesg.today:ad7: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=374303456
dmesg.today:g_vfs_done():ad7[WRITE(offset=191643369472, length=131072)]error = 5
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50757760
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50760192
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=12032
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50769792
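
A small Python sketch for summarising lines like these, assuming 512-byte
sectors; the helper names (parse_timeouts, timeouts_per_disk, byte_offset,
needs_48bit) are hypothetical and not part of any FreeBSD tool. It counts
TIMEOUT lines per disk and converts each LBA to a byte offset. With the lines
above it finds 4 timeouts each on ad7 and ad5, and LBA=374303456 times 512 is
191643369472, the same offset as in the g_vfs_done() error, so the failed
WRITE_DMA48 and the g_vfs_done() error appear to be the same request. It also
flags LBAs at or above 2^28, the limit of 28-bit ATA addressing, which is
consistent with only that one request using WRITE_DMA48 (the exact cutoff the
driver uses is an assumption here).

```python
import re
from collections import Counter

# Hypothetical helper (not from the original post): summarise ATA TIMEOUT
# lines from dmesg and relate their LBAs to byte offsets.
SECTOR_SIZE = 512        # assumption: 512-byte sectors on these disks
LBA28_LIMIT = 1 << 28    # first LBA not reachable by 28-bit ATA commands

# Matches e.g. "ad7: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=374303456"
TIMEOUT_RE = re.compile(r"(ad\d+): TIMEOUT - (WRITE_DMA\d*) .*LBA=(\d+)")

SAMPLE = """\
dmesg.today:ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=5554848
dmesg.today:ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=5914112
dmesg.today:ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=14924096
dmesg.today:ad7: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=374303456
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50757760
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50760192
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=12032
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50769792
"""

def parse_timeouts(log):
    """Return (disk, command, lba) for every TIMEOUT line in the log."""
    return [(m.group(1), m.group(2), int(m.group(3)))
            for m in TIMEOUT_RE.finditer(log)]

def timeouts_per_disk(log):
    """Count TIMEOUT lines per disk."""
    return Counter(disk for disk, _, _ in parse_timeouts(log))

def byte_offset(lba):
    """Byte offset of an LBA, the unit g_vfs_done() reports."""
    return lba * SECTOR_SIZE

def needs_48bit(lba):
    """True if the LBA lies beyond the 28-bit addressing range."""
    return lba >= LBA28_LIMIT

if __name__ == "__main__":
    print(dict(timeouts_per_disk(SAMPLE)))
    for disk, cmd, lba in parse_timeouts(SAMPLE):
        print(disk, cmd, lba, byte_offset(lba),
              "48-bit" if needs_48bit(lba) else "28-bit")
```

Feeding it the output of `grep TIMEOUT /var/log/dmesg.today` instead of the
sample would show whether the timeouts cluster on particular disks or regions.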

The strange thing is that I have only recently started seeing the
g_vfs_done error, while the timeouts have been there since the very start of
this hardware setup.
The machine used to work perfectly with two Hitachi disks connected as ad0
and ad1 (integrated IDE) and only one NIC, xl0.
The problems started when I plugged in the Promise controller and the other
NICs and started using the machine as a router, file server and backup
server (each in a separate jail, except the pf firewall).
2. The other strange issue is that *sometimes*, not every time (I guess when
the disks start timing out), I lose connectivity on xl0 or fxp0 (sometimes
rl0 still works and accepts connections from the outside, sometimes not).
When I go to the machine and plug in a monitor, there are no kernel messages
and nothing in /var/log/messages or the debug log. The strange thing is that
pings to the host from the local network time out, yet ifconfig shows the
interface connected at 100Mbit full duplex and everything seems OK. I've
tried 'ifconfig xl0 down' and 'ifconfig xl0 up', but that doesn't help. I
also tried unplugging and replugging the cable: the link came back up, but no
packets passed and the pings timed out again!
Only after a reboot did the NIC come back. These 'drops' have become more and
more common recently. Last night I wasn't able to log in for about an hour,
and after that the machine came back by itself!!! That was on the LAN side;
from the outside it wasn't accessible at all. The strange thing is that it
replied to ping, but I couldn't even open a connection to the SSH port, and
NAT wasn't working?! Afterwards I remembered that at that time a cron job
runs for about an hour, fetching an online radio stream into a file...
weird!!! The machine also runs rtorrent, apache22 and samba (in a jail).

Some output from the machine can be found here:
http://valqk.ath.cx/tmp/dmesg
http://valqk.ath.cx/tmp/vmstat
http://valqk.ath.cx/tmp/smartctl


Please send any ideas, hints or solutions!

Thanks a lot to everyone!
cheers,
valqk.


