From: Andy Young
To: Ragnar Lonn
Cc: freebsd-hardware@freebsd.org
Date: Mon, 3 Sep 2012 00:14:13 -0400
Subject: Re: Load testing knocks out network
List-Id: General discussion of FreeBSD hardware

Hi Ragnar,

Thank you for the reply. That makes a lot of sense. I think the resources at
risk had to do with the low-level details of the network card. I experimented
tonight with bumping the hw.igb.rxd and hw.igb.txd tunable parameters of the
NIC driver to their maximum value of 4096 (the default was 256). This seems to
have resolved the issue.

Before bumping their values, my load test was crashing the network at about
350 simultaneous connections. The behavior I witnessed was that the
application server (Jetty) would seize up first, and I could see it was no
longer responding through my ssh connections. If I killed it off right away,
all of the connections got closed and everything went back to being fine. If I
left it in that state for 30 seconds or so, the system became unrecoverable.
The connections remained open according to netstat even after I had closed the
server and client processes. Short of rebooting, nothing I did would close the
connections down.

After bumping the rxd and txd parameters as well as kern.ipc.nmbclusters, the
problem seems to have gone away. I can now successfully simulate over 800
simultaneous connections and it hasn't crashed since. To be honest, I don't
know what the rxd and txd parameters do, but raising them seems to have helped.

Andy

On Sep 2, 2012 1:58 AM, "Ragnar Lonn" wrote:

> Hi Andy,
>
> I work for an online load testing service (loadimpact.com) and what we
> see is that the most common cause when a server crashes during a load test,
> is that it runs out of some vital system resource. Usually system memory,
> but network connections (sockets/file descriptors) is also a likely cause.
>
> You should have gotten some kind of error messages in the system log, but
> if the problem is easily repeatable I would set up monitoring of at least
> memory and file descriptors, and see if you are near the limits when the
> machine freezes.
>
> Regards,
>
> /Ragnar

On Sat, Sep 1, 2012 at 10:44 PM, Andy Young wrote:

> I read through the driver man page, which is a great source of
> information. I see I'm using the Intel igb driver and it supports three
> tunables. Could I have exceeded the number of receive descriptors? What
> would the effect of this number being too low be? What about the Adaptive
> Interrupt Moderation?
>
> To clarify, I was simulating about 800 users simultaneously uploading
> files when the crash occurred.
>
> Thanks for any help or insights!!
>
> Andy
>
> NAME
>      igb -- Intel(R) PRO/1000 PCI Express Gigabit Ethernet adapter driver
>
> LOADER TUNABLES
>      Tunables can be set at the loader(8) prompt before booting the
>      kernel or stored in loader.conf(5).
>
>      hw.igb.rxd
>              Number of receive descriptors allocated by the driver. The
>              default value is 256. The minimum is 80, and the maximum is
>              4096.
>
>      hw.igb.txd
>              Number of transmit descriptors allocated by the driver. The
>              default value is 256. The minimum is 80, and the maximum is
>              4096.
>
>      hw.igb.enable_aim
>              If set to 1, enable Adaptive Interrupt Moderation. The
>              default is to enable Adaptive Interrupt Moderation.
>
>
> On Sat, Sep 1, 2012 at 4:14 PM, Andy Young wrote:
>
>> Last night one of our servers went offline while I was load testing it.
>> When I got to the datacenter to check on it, the server seemed perfectly
>> fine. Everything was running on it, there were no panics or any other sign
>> of a hard crash. The only problem is the network was unreachable. I
>> couldn't connect to the box even from a laptop directly attached to the
>> ethernet port. I couldn't connect to anything from the box either. It was
>> as if the network controller had seized up. I restarted netif and it didn't
>> make a difference. Rebooting the machine, however, solved the issue and
>> everything went back to working great. I restarted the load testing and
>> reproduced the problem twice more this morning so at least it's repeatable.
>> It feels like a network controller / driver issue to me for a couple of
>> reasons. First, the problem affects the entire system. We're running
>> FreeBSD 9 with about a half dozen jails. Most of the jails are running
>> Apache but the one I was load testing was running Jetty. However, if it
>> was my application code crashing I would expect the problem to at least
>> be isolated to the jail that hosts it. Instead, the entire machine and
>> all jails in it lose access to the network.
>>
>> Apart from not being able to access the network, I don't see any other
>> signs of problems. This is the first major problem I've had to debug in
>> FreeBSD so I'm not a debugging expert by any means. There are no error
>> messages in /var/log/messages or dmesg apart from syslogd not being able to
>> reach the network. If anyone has ideas on where I can look for more
>> evidence of what is going wrong, I would really appreciate it.
>>
>> We're running FreeBSD 9.0-RELEASE-p3. The network controller is an
>> Intel(R) PRO/1000 Network Connection version - 2.2.5 configured with 6 IPs
>> using aliases, five of which are used for jails.
>>
>> Thank you for the help!!
>>
>> Andy
>
> --
> Andrew Young
> Mosaic Storage Systems, Inc
> http://www.mosaicarchive.com/
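
For reference, a minimal sketch of how the tunable change described in the
top message could be persisted in /boot/loader.conf. The value 4096 comes
from the message and the quoted man page excerpt; the file location follows
loader.conf(5). The thread does not state the value chosen for
kern.ipc.nmbclusters, so it is not shown here.

```conf
# /boot/loader.conf (sketch; loader tunables take effect at the next boot)
# Raise the igb receive and transmit descriptor counts from the default
# of 256 to the documented maximum of 4096.
hw.igb.rxd="4096"
hw.igb.txd="4096"
# kern.ipc.nmbclusters was also raised, but the thread gives no value for it.
```

One way to check that the loader picked the settings up after a reboot is
`kenv hw.igb.rxd`, which prints the kernel environment variable if it was set.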