Date:      Thu, 12 May 2005 17:13:48 -0600 (MDT)
From:      Matt Ruzicka <matt@frii.com>
To:        freebsd-net@freebsd.org
Subject:   Outbound TCP issue, potentially related to 'FreeBSD-SA-05:08.kmem [REVISED]'
Message-ID:  <Pine.BSF.4.58.0505121627400.66727@elara.frii.com>

A couple of days after we patched our systems, we started receiving a number
of reports of MySQL connection errors when our patched FreeBSD 4.9 web
servers tried to connect to our MySQL server, which lives on a
separate FreeBSD machine.

Initially we thought this was a networking error related to our server
load balancer (which has been a troublemaker in the past) or some other
networking device, but testing has proven otherwise.


* Problem description:

Outbound TCP connections are randomly failing to connect.  They receive a
"Can't assign requested address" error from the connect() call.  The error
has been demonstrated against multiple machines on multiple different
ports.  It only impacts outgoing connections from our web servers - no
inbound connections have failed or dropped.  Also, we have not seen this
problem on any of our other servers, which have also been patched.
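
"Can't assign requested address" is the message text for the EADDRNOTAVAIL errno. A minimal sketch of how a test script could log the numeric errno and tell this failure apart from a refused or timed-out connection; the helper name classify_connect_errno and its labels are hypothetical and not part of the scripts used here:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Errno qw(EADDRNOTAVAIL ECONNREFUSED ETIMEDOUT);

# Hypothetical helper: map the numeric errno left by a failed connect()
# to a short label suitable for a log line.
sub classify_connect_errno {
    my ($errno) = @_;
    return 'local-address' if $errno == EADDRNOTAVAIL;  # "Can't assign requested address"
    return 'refused'       if $errno == ECONNREFUSED;
    return 'timeout'       if $errno == ETIMEDOUT;
    return 'other';
}

# Demonstration without touching the network: set errno by hand,
# then print the label next to the system's message text.
{
    local $! = EADDRNOTAVAIL;
    printf "%s: %s\n", classify_connect_errno($! + 0), "$!";
}
```

In a real test loop the helper would be called with `$! + 0` immediately after the failed connect(), before any other system call can overwrite errno.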

The errors are sporadic.  The most frequent pattern we've seen is a 5
to 10 minute period of success, followed by a couple of seconds of
frequent failures.  When we start getting errors connecting to one
port/machine we see concurrent errors to other ports/machines.


* What we've tried:

The impacted machines are in a server-load-balanced environment, so we
spent quite a bit of time convincing ourselves that this was not an
external network error.  We created a Perl test script that tries to
connect to a given machine and port once per second and logs its
success or failure.  (The script is included below.)  We then aimed it at
machines both inside and outside the SLB environment.

We originally tried it against multiple different ports, but after
finding that the failures were not port-specific, we simplified the
methodology to make all connections to port 5666 (a monitoring app).

Reverse tests were also run to see if the failures impacted incoming
connections.  No failures were ever logged in this direction.

The tests established that we reliably saw failures from the two
impacted machines to any other server, including each other. (The two
boxes are separated by a switch, but not the SLB.) It did not matter
if the remote machine was on the same network, or was in front of or
behind the SLB switch.  Connections between other machines behind the
same switch showed no failures.

We next set up tcpdump on one impacted machine and started logging the
test connections.  When a failure occurred, the dumps showed no packets
leaving the box to the target machine.

At that point we felt reasonably confident that the problem was not an
external network issue, so we moved on to systems troubleshooting.

Since this machine was running a few revisions behind, we felt it would be
prudent to upgrade to the latest release of FreeBSD.

Both web servers have since been upgraded to the latest version of 4.11 to
rule out an issue specific to the old versions we were running.
After the upgrade, the errors returned to their previous levels following
a lull of a few hours.

Apache, PHP and related modules were all reinstalled on the boxes after
the FreeBSD upgrade to ensure they were using the correct libraries and
such.

The only error we have found in the logs appeared right after boot.  It is
related to PMAP_SHPGPERPROC and is discussed here:

  http://lists.freebsd.org/pipermail/freebsd-hackers/2003-May/000695.html

If I understand this correctly, we should have plenty of PV entries
available.

-----
Message Queues:
T     ID     KEY        MODE       OWNER    GROUP  CREATOR   CGROUP CBYTES  QNUM QBYTES LSPID LRPID   STIME    RTIME    CTIME

Shared Memory:
T     ID     KEY        MODE       OWNER    GROUP  CREATOR   CGROUP NATTCH  SEGSZ  CPID  LPID   ATIME    DTIME    CTIME
m 262144          0 --rw-------     root    wheel     root    wheel     21 524288  81250  81250 14:03:40 17:02:37 14:03:40
m 458754          0 --rw-------     root    wheel     root    wheel     42 524288  74667  74667 16:06:03 17:02:39 16:06:03

Semaphores:
T     ID     KEY        MODE       OWNER    GROUP  CREATOR   CGROUP NSEMS    OTIME    CTIME

ITEM            SIZE     LIMIT    USED    FREE  REQUESTS
PV ENTRY:         28,  2281326, 545883, 1036172, 589082427
-----
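
As a rough check on the PV entry headroom, the line above can be read against its own column header (ITEM, SIZE, LIMIT, USED, FREE, REQUESTS). The sketch below assumes that column order; the helper name zone_usage is hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse one "ITEM: SIZE, LIMIT, USED, FREE, REQUESTS"
# line and report how much of the limit is in use.
sub zone_usage {
    my ($line) = @_;
    my ($item, $rest) = split /:\s*/, $line, 2;
    my ($size, $limit, $used, $free, $requests) = split /\s*,\s*/, $rest;
    return {
        item  => $item,
        limit => $limit,
        used  => $used,
        pct   => $limit ? 100 * $used / $limit : 0,
    };
}

my $z = zone_usage('PV ENTRY:         28,  2281326, 545883, 1036172, 589082427');
printf "%s: %d of %d used (%.1f%%)\n", @{$z}{qw(item used limit)}, $z->{pct};
# prints "PV ENTRY: 545883 of 2281326 used (23.9%)"
```

Under that reading, roughly a quarter of the PV entry limit is in use, which is consistent with the expectation that plenty of entries remain available.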

* Test script:

Note that we also tried a similar script using raw socket calls, rather
than using IO::Socket.  The results were identical.

-----
#!/usr/bin/perl

use strict;
use warnings;

use Sys::Hostname qw(hostname);
use IO::Socket;

use constant LOG_FILE	=>	'/tmp/';

# host to connect to
my $host = shift(@ARGV) || 'xxx.xxx.xxx.xxx';

# open our log file
my $log_file = LOG_FILE . hostname() . '_to_' . $host . '.nrpe';
open(LOG, '>>', $log_file) or die "Can't open log: $log_file $!";

while(1){

	my $start_time = time();

	# try a connection
	eval {
		my $socket = IO::Socket::INET->new($host . ':5666')
			or die "Can't connect: $!";

		$socket->close();
	};

	my $result = "ok";
	$result = "failed ($@)" if $@;

	print LOG hostname() . ' ' . scalar(localtime($start_time)) . ' ' .  $result . "\n";

	sleep 1;
}
-----
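
The raw-socket variant mentioned in the note above was not included. A minimal sketch of what a single connection attempt with Perl's built-in socket calls might look like; this is an illustration and not the script we actually ran, and the helper name try_connect and its return convention are hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Socket qw(PF_INET SOCK_STREAM IPPROTO_TCP inet_aton pack_sockaddr_in);

# Hypothetical helper: one connection attempt using socket() and connect().
# Returns an empty list on success, or (errno number, message) on failure.
sub try_connect {
    my ($host, $port) = @_;

    my $ip = inet_aton($host)
        or return (-1, "cannot resolve $host");

    socket(my $sock, PF_INET, SOCK_STREAM, IPPROTO_TCP)
        or return ($! + 0, "socket: $!");

    unless (connect($sock, pack_sockaddr_in($port, $ip))) {
        # Save errno before close() has a chance to change it.
        my @err = ($! + 0, "connect: $!");
        close($sock);
        return @err;
    }

    close($sock);
    return;
}

# Example use inside a once-per-second loop (address is a placeholder):
#   my @err = try_connect('xxx.xxx.xxx.xxx', 5666);
#   print @err ? "failed ($err[1])\n" : "ok\n";
```

Returning the numeric errno alongside the message makes it possible to count EADDRNOTAVAIL failures separately from other connect() errors.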


* Summary:

Since this is not affecting any of our other servers, which have been
patched, I do not feel it is a direct result of the patch, but suspect the
patch may have accentuated an existing issue.

Any suggestions as to what could be causing this would be greatly
appreciated.

Please let me know what additional information about the system I can
gather if it will be of assistance.

Thank you very much in advance.


Matthew Ruzicka - Systems Administrator
Front Range Internet, Inc.
matt@frii.net - (970) 212-0728





