Date:      Mon, 24 Jan 2005 15:26:42 +0300
From:      "Artem Kuchin" <matrix@itlegion.ru>
To:        "Robert Watson" <rwatson@freebsd.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Lock up problems with 5.3-STABLE  (was: Cannot build kernel with options WITNESS)
Message-ID:  <00ba01c50211$23389ee0$0c00a8c0@artem>
References:  <Pine.NEB.3.96L.1050122222459.19903T-100000@fledge.watson.org>


> On Sun, 23 Jan 2005, Artem Kuchin wrote:
> 
>> > On Sat, 22 Jan 2005, Artem Kuchin wrote:
>> > 
>> >> I cvsupped 5.3-STABLE just an hour ago and cannot build a kernel with
>> >> WITNESS. It complains: 
>> > 
>> > This occurs when building WITNESS without DDB in the kernel, which was not
>> > a tested build case when I added "show alllocks", and apparently is a
>> > relatively uncommon configuration as you're the first person to bump into
>> > it.  I've just committed the fix as subr_witness.c:1.187 in HEAD, and
>> > subr_witness.c:1.178.2.4 in RELENG_5.  Please let me know if this doesn't
>> > fix the problem for you.
>> 
>> It fixed the problem. I am actually struggling with unpredictable, weird
>> lock-ups, when the host can be pinged but I cannot connect via
>> ssh/telnet or httpd or anything else. It happens without any visible reason.
>> I am running several jails with mysql and apache in each and cannot make
>> the whole system stable yet. 
> 
> This is typically a sign of one of two problems:
> 
> - The system is live locked due to very high load, so the ithread,
>  netisrs, etc, in the kernel run fine, but user processes don't get a
>  chance to run. 
> 
> - The system is dead locked due to user space processes getting wedged on
>  common locks, but the kernel ithreads and netisrs can keep on
>  responding. 
> 
> I generally assume that it's a deadlock as opposed to a live lock.  I'd
> compile a kernel with DDB, KDB, WITNESS, and BREAK_TO_DEBUGGER.  When the
> system appears to wedge, break into the debugger using a console or serial
> break (FYI: serial break is more reliable, and you get the benefit of
> being able to easily copy and paste debugging output using the serial
> console for DDB).  Use "show alllocks" and "show lockedvnods" to examine
> most of the system's locking state.  Chances are, all the
> interesting processes are stacked up on VFS or VM locks, since those kinds
> of deadlocks produce the exact symptoms you describe: ping works fine
> because it only hits the netisr, but when you open TCP connections, the
> sshd (etc) block on VM or VFS locks attempting to fork new children or
> access a file in the file system name space.  At first, the TCP
> connections will establish but there will be no application data; after a
> bit, they will not even return a SYN/ACK because the listen queue for the
> listen socket has filled.
> 
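
The three stages described in the quoted text (service answers normally, the TCP handshake completes but no application data arrives, no SYN/ACK at all) can be told apart from a remote machine. Below is a minimal sketch in Python, assuming the probed service sends a banner as soon as a client connects (sshd, SMTP on port 25 and POP3 on port 110 do; HTTP does not). The function name `classify` and the returned strings are made up for illustration.

```python
import socket


def classify(host, port, timeout=5.0):
    """Probe one TCP service and report how far the connection gets."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        # The kernel answered with a reset: nothing is listening on the port.
        return "refused"
    except OSError:
        # Timeout or unreachable: no SYN/ACK came back, which is what a
        # full listen queue (or a completely dead host) looks like.
        return "no-handshake"
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(64)
        except OSError:
            # Handshake was done by the kernel, but the user-space daemon
            # never wrote its banner: the "connects but no data" stage.
            return "connected-no-data"
    return "banner" if data else "closed-without-data"


if __name__ == "__main__":
    # Hypothetical usage: the host name is a placeholder.
    for port in (22, 25, 110):
        print(port, classify("server.example.org", port))
```

Running it from cron on another machine and logging the result would give a rough timestamp of when the host moved from one stage to the next.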

Well, I cvsupped and recompiled the kernel with WITNESS and INVARIANTS, turned off
adaptive Giant, and got a lock-up today at 7 am. Since the server is remote and I cannot
connect a serial console to it, I took my digital camera and went to the server.
I expected to see some special message about something going wrong, then to break
into the debugger (CTRL+ALT+ESC) and take some pictures of the console output.
But I saw nothing. The last message on the screen was about an ssh login last evening,
and the last message in /var/log/all.log was about an entropy save from cron.
I could not break into the debugger using CTRL+ALT+ESC; it did nothing. So
it looked like a hard lock.

At this point I would like to tell the whole story.
We bought this server in May 2004 and decided to test the hardware extensively
while 5.3 was not yet out. We actually expected it around August. So we installed
5-CURRENT and ran high-load tests (CPU, memory, disk storage, network) from
/usr/ports/benchmarks, both at the same time and one by one, for several weeks. There was
not a glitch. After that we turned it off and waited for the RELEASE. The RELEASE
came and we began to set up the server for its real job. As the server's
primary mission is to host a bunch of sites, we decided to set up a jail for each site.
We did so in December and put the server in the provider's co-location, several
kilometers away from the office. The next day the server locked up. We were surprised
but just rebooted it. It locked up again the next day. We cvsupped and rebuilt the
system and the jails. The server locked up the next day. During the New Year break
I figured out that if more than one jail is running, the server locks up within
24 hours with very high probability and within 48 hours with 100% probability. I
wrote to freebsd-stable about it. You asked for debugger dumps (pcpu, list of
locks, etc.). I could not provide them at that time, so I did not reply and just cvsupped at
the beginning of January and rebuilt the system and the jails again. Magically, after
that I could run 5 jails (I did not try more) for over a week, and I had already decided that
the bug was fixed and I could host the site. Alas, the next glitch did not wait too long.
After a few more days I saw a strange situation: I could not connect to the server using
SSH. SSH replied with something about an auth key. I rebooted the system and
ssh worked fine. I still have no idea what that was, but I set up IPFIREWALL and a telnet
server that accepts connections from only one IP address, so if ssh fails I can use telnet.
After that I moved a real site with Perl scripts, a 1 GB database, and mail accounts (using qmail+vpopmail)
into one of the jails, and the next day I got the next problem: I could ping the server but could not
connect using ssh, www, or telnet (to ports 110, 25, 23). I tried to recompile the kernel with INVARIANTS
and WITNESS and to disable adaptive Giant. I could not, so I wrote to you about it. You fixed
the source, I recompiled the kernel, and today I got a lock-up again with all those
options enabled, and this time I could not ping the server.
I could think that there is something wrong with the hardware, but it passed
many days of testing. Anyway, my current ideas are:

1) Something wrong with jail code 
2) Something wrong with SMP code
3) Something wrong with HYPERTHREADING code
4) Something wrong with Memory disk code (md device, which i use)
5) Something wrong with the hardware

So today I went into the BIOS and turned off hyperthreading, fast string operations, and
all other 'more advanced' features. I also turned off the IDE controller on the motherboard.
This rules out a problem in the HYPERTHREADING code and, to some extent, a hardware problem.

I turned off MD usage (no more memory disks, although I actually need them very badly).
So I rule out the md code problem.

Now I will run a web access test (simulated browsing) for a week. If the
server does not lock up, I will consider that I have found a workaround for some
hidden bug and that the bug is somewhere in the md code, the HT code, or the hardware.

If it locks up again, then I will give up jails and try for one more week. If it does not
lock up, the jail code is the problem.

If it locks up without jails, then I will turn off SMP and try again.

If it locks up with everything turned off, then the hardware is faulty and I will have the further
choice of hanging myself or shooting myself in the head.
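
The elimination plan above can be written down as a small lookup, which makes it easier to see which conclusion each outcome leads to. This is only a sketch of the reasoning in Python: the names `STEPS` and `suspect` are invented, and the "SMP code" conclusion for the third step is my reading of the plan rather than something stated outright.

```python
# Ordered elimination steps. Each step disables more features than the one
# before it. The first step that survives its test period without a lock-up
# points at whatever was switched off in that step.
STEPS = [
    ("HT, BIOS extras and md off", "md code, HT code, or hardware"),
    ("jails off as well", "jail code"),
    ("SMP off as well", "SMP code"),  # assumed conclusion, implied by the plan
]


def suspect(locked_up):
    """Return the suspected culprit, or None if more testing is needed.

    locked_up holds one boolean per step already run, in order:
    True means the server locked up during that step.
    """
    for (_config, culprit), locked in zip(STEPS, locked_up):
        if not locked:
            return culprit
    if len(locked_up) >= len(STEPS):
        # Locked up even with everything turned off.
        return "hardware"
    return None


if __name__ == "__main__":
    print(suspect([True, False]))       # jail code
    print(suspect([True, True, True]))  # hardware
```

One limitation of this scheme is that a step which survives only a week does not strictly prove anything, since an earlier configuration once ran for over a week before failing.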

I would like to see your and others' comments on the story, and I have one
more question: what does options         _KPOSIX_PRIORITY_SCHEDULING 
do? Could it be somehow related to the problem?


The hardware is:

      MB dual xeon Supermicro X5DPE-G2  
      CPU P4 XEON 2,667Ghz 512Kb cache 533mhz socket 604 
      2 Gb 266Mhz, DDR, ECC, Reg, 1GB dimm 
      4 HDDs 120Gb (seagate baracuda 7200.7) 
      3Ware Escalade 8506-4LP 
      Case Supermicro SC822T-550LP  
      Slim DVD/CD-RW Toshiba SD-R2412B IDE (OEM)
     

Today's kernel CONFIG, which locked up:

machine         i386
cpu             I486_CPU
cpu             I586_CPU
cpu             I686_CPU
ident           OMNI2

options         SMP

options         QUOTA

options         SCHED_4BSD              # 4BSD scheduler
options         INET                    # InterNETworking
options         INET6                   # IPv6 communications protocols
options         FFS                     # Berkeley Fast Filesystem
options         SOFTUPDATES             # Enable FFS soft updates support
options         UFS_ACL                 # Support for access control lists
options         UFS_DIRHASH             # Improve performance on big directories
#options        MD_ROOT                 # MD is a potential root device
#options        NFSCLIENT               # Network Filesystem Client
#options        NFSSERVER               # Network Filesystem Server
#options        NFS_ROOT                # NFS usable as /, requires NFSCLIENT
options         MSDOSFS                 # MSDOS Filesystem
options         CD9660                  # ISO 9660 Filesystem
options         PROCFS                  # Process filesystem (requires PSEUDOFS)
options         PSEUDOFS                # Pseudo-filesystem framework
options         GEOM_GPT                # GUID Partition Tables.
options         COMPAT_43               # Compatible with BSD 4.3 [KEEP THIS!]
options         COMPAT_FREEBSD4         # Compatible with FreeBSD4
#options        SCSI_DELAY=15000        # Delay (in ms) before probing SCSI
options         KTRACE                  # ktrace(1) support
options         SYSVSHM                 # SYSV-style shared memory
options         SYSVMSG                 # SYSV-style message queues
options         SYSVSEM                 # SYSV-style semaphores
options         _KPOSIX_PRIORITY_SCHEDULING # POSIX P1003_1B real-time extensions
#options        KBD_INSTALL_CDEV        # install a CDEV entry in /dev

device          apic            # I/O APIC

# Bus support.  Do not remove isa, even if you have no isa slots
device          isa
device          pci

# Floppy drives
device          fdc

# ATA and ATAPI devices
device          ata
device          atadisk         # ATA disk drives
device          ataraid         # ATA RAID drives
device          atapicd         # ATAPI CDROM drives
#device         atapifd         # ATAPI floppy drives
#device         atapist         # ATAPI tape drives
options         ATA_STATIC_ID   # Static device numbering

# SCSI peripherals
device          scbus           # SCSI bus (required for SCSI)
device          da              # Direct Access (disks)
device          pass            # Passthrough device (direct SCSI access)
device          twe             # 3ware ATA RAID

# atkbdc0 controls both the keyboard and the PS/2 mouse
device          atkbdc          # AT keyboard controller
device          atkbd           # AT keyboard
device          psm             # PS/2 mouse

device          vga             # VGA video card driver

device          splash          # Splash screen and screen saver support

# syscons is the default console driver, resembling an SCO console
device          sc

device          agp             # support several AGP chipsets

# Floating point support - do not disable.
device          npx

# Power management support (see NOTES for more options)
#device         apm
# Add suspend/resume support for the i8254.
#device         pmtimer

# Serial (COM) ports
device          sio             # 8250, 16[45]50 based serial ports

# Parallel port
device          ppc
device          ppbus           # Parallel port bus (required)
device          lpt             # Printer
device          ppi             # Parallel port interface device
#device         vpo             # Requires scbus and da


device          miibus          # MII bus support
device          fxp             # Intel EtherExpress PRO/100B (82557, 82558)
device          em


device          loop            # Network loopback
device          mem             # Memory and kernel memory devices
device          io              # I/O device
device          random          # Entropy device
device          ether           # Ethernet support
#device         sl              # Kernel SLIP
#device         ppp             # Kernel PPP
device          tun             # Packet tunnel.
device          pty             # Pseudo-ttys (telnet etc)
device          md              # Memory "disks"
#device         gif             # IPv6 and IPv4 tunneling
#device         faith           # IPv6-to-IPv4 relaying (translation)

device          bpf             # Berkeley packet filter
# USB support
device          uhci            # UHCI PCI->USB interface
device          ohci            # OHCI PCI->USB interface
device          usb             # USB Bus (required)
#device         udbp            # USB Double Bulk Pipe devices
device          ugen            # Generic
device          uhid            # "Human Interface Devices"
device          ulpt            # Printer
device          umass           # Disks/Mass storage - Requires scbus and da


# FireWire support
device          firewire        # FireWire bus code
#device         sbp             # SCSI over FireWire (Requires scbus and da)
#device         fwe             # Ethernet over FireWire (non-standard!)

options         IPFIREWALL
options         IPFIREWALL_VERBOSE
options         IPFIREWALL_VERBOSE_LIMIT=10000
options         IPFIREWALL_DEFAULT_TO_ACCEPT

device          snp
device          speaker

options         DDB
options         KDB
options         BREAK_TO_DEBUGGER
options         INVARIANT_SUPPORT
options         INVARIANTS
options         WITNESS
options         WITNESS_KDB
options         WITNESS_SKIPSPIN
#options        ADAPTIVE_GIANT          # Giant mutex is adaptive.


DMESG (the config which got locked):

Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 5.3-STABLE #3: Sun Jan 23 01:04:00 MSK 2005
    matrix@omni2.itlegion.ru:/usr/obj/usr/src/sys/OMNI2
WARNING: WITNESS option enabled, expect reduced performance.
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(TM) CPU 2.66GHz (2665.93-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0xf25  Stepping = 5
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Hyperthreading: 2 logical CPUs
real memory  = 4160225280 (3967 MB)
avail memory = 4077486080 (3888 MB)
ACPI APIC Table: <PTLTD          APIC  >
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  6
 cpu3 (AP): APIC ID:  7
ioapic0 <Version 2.0> irqs 0-23 on motherboard
ioapic1 <Version 2.0> irqs 24-47 on motherboard
ioapic2 <Version 2.0> irqs 48-71 on motherboard
ioapic3 <Version 2.0> irqs 72-95 on motherboard
ioapic4 <Version 2.0> irqs 96-119 on motherboard
npx0: [FAST]
npx0: <math processor> on motherboard
npx0: INT 16 interface
acpi0: <PTLTD   RSDT> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1008-0x100b on acpi0
cpu0: <ACPI CPU (2 Cx states)> on acpi0
cpu1: <ACPI CPU (2 Cx states)> on acpi0
cpu2: <ACPI CPU (2 Cx states)> on acpi0
cpu3: <ACPI CPU (2 Cx states)> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pci0: <unknown> at device 0.1 (no driver attached)
pcib1: <ACPI PCI-PCI bridge> at device 2.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pci1: <base peripheral, interrupt controller> at device 28.0 (no driver attached)
pcib2: <ACPI PCI-PCI bridge> at device 29.0 on pci1
pci2: <ACPI PCI bus> on pcib2
pci1: <base peripheral, interrupt controller> at device 30.0 (no driver attached)
pcib3: <ACPI PCI-PCI bridge> at device 31.0 on pci1
pci3: <ACPI PCI bus> on pcib3
em0: <Intel(R) PRO/1000 Network Connection, Version - 1.7.35> port 0x3000-0x303f mem 0xfc200000-0xfc21ffff irq 28 at device 2.0 on pci3
em0: Ethernet address: 00:30:48:2a:2d:bc
em0:  Speed:N/A  Duplex:N/A
em1: <Intel(R) PRO/1000 Network Connection, Version - 1.7.35> port 0x3040-0x307f mem 0xfc220000-0xfc23ffff irq 29 at device 2.1 on pci3
em1: Ethernet address: 00:30:48:2a:2d:bd
em1:  Speed:N/A  Duplex:N/A
pcib4: <ACPI PCI-PCI bridge> at device 3.0 on pci0
pci4: <ACPI PCI bus> on pcib4
pci4: <base peripheral, interrupt controller> at device 28.0 (no driver attached)
pcib5: <ACPI PCI-PCI bridge> at device 29.0 on pci4
pci5: <ACPI PCI bus> on pcib5
pci4: <base peripheral, interrupt controller> at device 30.0 (no driver attached)
pcib6: <ACPI PCI-PCI bridge> at device 31.0 on pci4
pci6: <ACPI PCI bus> on pcib6
twe0: <3ware Storage Controller. Driver version 1.50.01.002> port 0x4000-0x400f mem 0xfc800000-0xfcffffff irq 72 at device 1.0 on pci6
twe0: [GIANT-LOCKED]
twe0: 4 ports, Firmware FE7S 1.05.00.063, BIOS BE7X 1.08.00.048
uhci0: <Intel 82801CA/CAM (ICH3) USB controller USB-A> port 0x2000-0x201f irq 16 at device 29.0 on pci0
uhci0: [GIANT-LOCKED]
usb0: <Intel 82801CA/CAM (ICH3) USB controller USB-A> on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: <Intel 82801CA/CAM (ICH3) USB controller USB-B> port 0x2020-0x203f irq 19 at device 29.1 on pci0
uhci1: [GIANT-LOCKED]
usb1: <Intel 82801CA/CAM (ICH3) USB controller USB-B> on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: <Intel 82801CA/CAM (ICH3) USB controller USB-C> port 0x2040-0x205f irq 18 at device 29.2 on pci0
uhci2: [GIANT-LOCKED]
usb2: <Intel 82801CA/CAM (ICH3) USB controller USB-C> on uhci2
usb2: USB revision 1.0
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
pcib7: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci7: <ACPI PCI bus> on pcib7
pci7: <display, VGA> at device 1.0 (no driver attached)
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel ICH3 UDMA100 controller> port 0x2060-0x206f,0x3f6,0x1f0-0x1f7 at device 31.1 on pci0
ata0: channel #0 on atapci0
ata2: channel #1 on atapci0
pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
acpi_button0: <Power Button> on acpi0
speaker0: <PC speaker> port 0x61 on acpi0
atkbdc0: <Keyboard controller (i8042)> port 0x64,0x60 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
atkbd0: [GIANT-LOCKED]
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
fdc0: <floppy drive controller> port 0x3f7,0x3f0-0x3f5 irq 6 drq 2 on acpi0
fdc0: [FAST]
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
orm0: <ISA Option ROMs> at iomem 0xe0000-0xe3fff,0xc9000-0xc9fff,0xc8000-0xc8fff,0xc0000-0xc7fff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 10.000 msec
ipfw2 initialized, divert disabled, rule-based forwarding disabled, default to accept, logging limited to 10000 packets/entry by default
acd0: CDRW <TOSHIBA DVD-ROM SD-R2412/1015> at ata0-slave UDMA33
twed0: <Unit 0, RAID5, Normal> on twe0
twed0: 343417MB (703318656 sectors)
SMP: AP CPU #2 Launched!
SMP: AP CPU #1 Launched!
SMP: AP CPU #3 Launched!
Mounting root from ufs:/dev/twed0s1a
em0: Link is up 100 Mbps Full Duplex



