Date:      Wed, 15 Feb 2006 21:03:26 -0500 (EST)
From:      "Brian A. Seklecki" <lavalamp@spiritual-machines.org>
To:        freebsd-questions@freebsd.org
Subject:   Re: ng_one2many v.s. AFT (NIC Fault Tolerance/Fail Over/Redundancy Revisited)
Message-ID:  <20060215210248.P47621@arbitor.digitalfreaks.org>

FYI, to bring this thread back to the list

---------- Forwarded message ----------
Date: Wed, 15 Feb 2006 20:53:59 -0500 (EST)
From: Brian A. Seklecki <lavalamp@spiritual-machines.org>
To: Jonathan Donaldson <donaldson@cisco.com>, glebius@freebsd.org,
     glebius@cell.sick.ru
Cc: jks@clickcom.com, Brian J. Creasy <bcreasy@collaborativefusion.com>,
     Chad Ziccardi <cz@digitalfreaks.org>, Danny Howard <dannyman@toldme.com>,
     Brad Bendy <brad@shockwebhost.com>
Subject: Re: ng_one2many v.s. AFT (NIC Fault Tolerance/Fail Over/Redundancy
     Revisited) (fwd)

On Wed, 15 Feb 2006, Jonathan Donaldson wrote:

> Take a look here:
> 
> http://www.freebsd.org/cgi/getmsg.cgi?fetch=607312+0+/usr/local/www/db/text/2004/cvs-all/20041128.cvs-all
>

Yeah, I see it now.  Sorry.  I'm CC'ing the developer who committed the changes 
and the MFC.

The man page needs to be updated, and it should mention your caveat.

I got caught by your caveat about one link being down at boot.

However, the code begins to work after bringing up the down link, just as it 
would if both links had been active at boot, which is good.

Where I got tripped up was this quote: "The node listens to flow 
control message from many hooks, and considers link failed if NGM_LINK_IS_DOWN 
is received."

I interpreted "flow control messages" as something on the wire, like an 
STP (802.1D) BPDU.

Apparently, it's really an in-kernel event related to the new Ethernet 
link-state code in 6.x, or maybe just glorified poll()'ing.
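
To make the difference between the two failAlg settings concrete, here is a 
minimal standalone C model. It is a sketch, not the ng_one2many source: the 
struct, the MODEL_MAX_LINKS constant and every model_* function are 
hypothetical names invented for illustration, and only the two failAlg values 
come from the header excerpt quoted in this thread. It assumes round-robin 
transmit over two links and treats a link-down notification as a plain function 
call standing in for NGM_LINK_IS_DOWN.

```c
#include <assert.h>

/* Values from the ng_one2many.h excerpt quoted in this thread. */
#define FAIL_MANUAL 1	/* NG_ONE2MANY_FAIL_MANUAL: trust enabledLinks[] only */
#define FAIL_NOTIFY 2	/* NG_ONE2MANY_FAIL_NOTIFY: react to link messages */

#define MODEL_MAX_LINKS 2	/* hypothetical: two NICs behind the node */

/* Hypothetical model state; not the kernel's private structure. */
struct model_node {
	int fail_alg;			/* FAIL_MANUAL or FAIL_NOTIFY */
	int enabled[MODEL_MAX_LINKS];	/* administrative setting */
	int up[MODEL_MAX_LINKS];	/* node's view of link state */
	int next;			/* round-robin cursor */
};

/* Start with every link enabled and believed to be up. */
static void
model_init(struct model_node *n, int fail_alg)
{
	int i;

	n->fail_alg = fail_alg;
	n->next = 0;
	for (i = 0; i < MODEL_MAX_LINKS; i++) {
		n->enabled[i] = 1;
		n->up[i] = 1;
	}
}

/*
 * Stand-in for a link up/down message arriving on a "many" hook.
 * In manual mode the node ignores it; in notify mode it updates state.
 */
static void
model_link_event(struct model_node *n, int link, int is_up)
{
	assert(link >= 0 && link < MODEL_MAX_LINKS);
	if (n->fail_alg == FAIL_NOTIFY)
		n->up[link] = is_up;
}

/* Round-robin choice; returns a link index, or -1 if none is usable. */
static int
model_pick_link(struct model_node *n)
{
	int tries;

	for (tries = 0; tries < MODEL_MAX_LINKS; tries++) {
		int link = n->next;

		n->next = (n->next + 1) % MODEL_MAX_LINKS;
		if (n->enabled[link] && n->up[link])
			return (link);
	}
	return (-1);
}

/*
 * Report link `dead` as down, send `count` packets, and count how many
 * were handed to the dead link (i.e. lost on the wire).
 */
static int
model_count_lost(struct model_node *n, int dead, int count)
{
	int i, lost = 0;

	model_link_event(n, dead, 0);
	for (i = 0; i < count; i++)
		if (model_pick_link(n) == dead)
			lost++;
	return (lost);
}
```

In this model, calling model_count_lost() with 100 packets after model_init() 
in FAIL_MANUAL mode hands half of them to the dead link, which matches the 50% 
loss discussed earlier in the thread; in FAIL_NOTIFY mode the dead link is 
skipped and nothing is lost.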

Either way, it works well.  Sorry for jumping the gun.

~lava

P.S.: in 7.0-CURRENT, there appears to be an import of the OpenBSD bridge(4) to 
replace the old-school "options BRIDGE" code.  This one is 802.1D STP aware. 
When 7.x becomes a production release, I suspect I'll end up using that instead, 
since it works so well with NetBSD/OpenBSD for HA Ethernet; plus, I'd rather 
have a PVST+ Cisco switch make the packet-forwarding decisions >:}

~lava


> and then look here:
> 
> http://fxr.watson.org/fxr/source/netgraph/ng_one2many.h?v=RELENG6
> 
> 
> /* Algorithms for detecting link failure (XXX only one so far) */
> #define NG_ONE2MANY_FAIL_MANUAL         1       /* use enabledLinks[] array */
> #define NG_ONE2MANY_FAIL_NOTIFY         2       /* listen to flow control msgs */
> 
> 
> So set your failAlg to 2 (NG_ONE2MANY_FAIL_NOTIFY) and check whether you see
> the messages and the failover...
> 
> 
> 
> On Feb 15, 2006, at 8:11 PM, Brian A. Seklecki wrote:
> 
>> On Thu, 12 Jan 2006, Brian J. Creasy wrote:
>> 
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>> 
>>> Brian A. Seklecki wrote:
>>> |
>>> | Jonathan's comments suggest that we may need to move to 6.x on the
>>> | production cluster.
>>> |
>>> | 6.x has been upgraded from a technology release to stable, and our goal
>>> | is stability.
>>> |
>>> | Brian:  What are you thoughts so far on the 6.x experience?
>>> 
>>> no complaints here.. though, i have it running only on my laptop and
>> 
>> ....Okay.
>>
>>  | <jonathan> As of Freebsd 6_0 (which is at RC1 now), the NG_ONE2MANY does
>>  | support the failure of a link which does not end up with 50% packet
>>  | loss. There is new code in the One2Many module that xmits a layer 2 "I'm
>>  | alive" broadcast out all links, as long as this is picked up on the
>>  | other links, then all interfaces are considered alive. If one of the
>>  | packets is not received, then after 2 x heartbeat duration that link is
>>  | considered "down". I have tested this in the 6.0 code and it works with
>>  | one caveat. When the server is brought up, both interfaces must be
>>  | connected and live, or for some reason, the failure algorithm never
>>  | seems to kick in. I saw exactly what you saw in 5.4 and newer with
>>  | regards to the 50% packet loss.</jonathan>
>> 
>> Jonathan:
>> 
>> I'm not sure where you got the info about this.  According to the
>> NG_ONE2MANY(4) page in CVS -rHEAD (-CURRENT):
>> 
>> "Currently, the valid settings for the xmitAlg field are
>> NG_ONE2MANY_XMIT_ROUNDROBIN (default) or NG_ONE2MANY_XMIT_ALL.  The only
>> valid setting for failAlg is NG_ONE2MANY_FAIL_MANUAL; this is also the
>> default setting."
>> 
>> I have 6.1-BETA1 on a box right now, and I've got my config set up for
>> NG_ONE2MANY_XMIT_ROUNDROBIN + NG_ONE2MANY_FAIL_NOTIFY, but I don't see any
>> layer 2 heartbeat-related traffic (watching via tcpdump(8) on another
>> machine in the same segment).
>> 
>> Can you share what you saw?
>> 
>> ~lava
>> 
>>> |> mission critical environment).
>>> |> - Xmit-All causes twice as much load to be placed on the switch
>>> |> fabric and switch CPU.
>>> |>
>>> |
>>> | <jonathan> As of Freebsd 6_0 (which is at RC1 now), the NG_ONE2MANY does
>>> | support the failure of a link which does not end up with 50% packet
>>> | loss. There is new code in the One2Many module that xmits a layer 2 "I'm
>>> | alive" broadcast out all links, as long as this is picked up on the
>>> | other links, then all interfaces are considered alive. If one of the
>>> | packets is not received, then after 2 x heartbeat duration that link is
>>> | considered "down". I have tested this in the 6.0 code and it works with
>>> | one caveat. When the server is brought up, both interfaces must be
>>> | connected and live, or for some reason, the failure algorithm never
>>> | seems to kick in. I saw exactly what you saw in 5.4 and newer with
>>> | regards to the 50% packet loss.</jonathan>
>>> |
>>> |
>>> |> What ng_one2many needs is an "Active-Standby" XMIT algorithm (STP BOFH's
>>> |> will think BLOCKING/FORWARDING).  It could even be used on top of
>>> |> other NetGraph nodes like ng_fec or possibly (hopefully) ng_802.3ad >:}
>>> |>
>>> |
>>> 
>>> - --
>>> Brian J. Creasy
>>> Collaborative Fusion, Inc.
>>> 412.422.3463 x4020       bcreasy@collaborativefusion.com
>>> 
>>> pgp public key:
>>> ~  http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x5F94E004
>>> 
>>> ****************************************************************
>>> IMPORTANT: This message contains confidential information
>>> and is intended only for the individual named. If the reader of
>>> this message is not an intended recipient (or the individual
>>> responsible for the delivery of this message to an intended
>>> recipient), please be advised that any re-use, dissemination,
>>> distribution or copying of this message is prohibited. Please
>>> notify the sender immediately by e-mail if you have received
>>> this e-mail by mistake and delete this e-mail from your system.
>>> E-mail transmission cannot be guaranteed to be secure or
>>> error-free as information could be intercepted, corrupted, lost,
>>> destroyed, arrive late or incomplete, or contain viruses. The
>>> sender therefore does not accept liability for any errors or
>>> omissions in the contents of this message, which arise as a
>>> result of e-mail transmission.
>>> ****************************************************************
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: GnuPG v1.4.2 (FreeBSD)
>>> Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org
>>> 
>>> iD8DBQFDxmXvDgwDm1+U4AQRAr3GAJ42+HcJFO595aZvljztWCkd+NWgvACeMQiu
>>> ILXLchBGR90TZTZHjn6DVCY=
>>> =68DY
>>> -----END PGP SIGNATURE-----
>>> 
>> 
>> l8*
>>        -lava
>> 
>> x.25 - minix - bitnet - plan9 - 110 bps - ASR 33 - base8
>> 
> 
> Thanks,
> Jonathan
> -------------------------------------------------------------
> Jonathan Donaldson
> Technical Lead
> 
> Cisco Systems - CV2BU
> 4690 E. Fulton St C-210
> Ada, MI    49301
> 
> Office:    +1-972-813-5251
> Cell:       +1-616-301-4277
> eMail:    donaldson@cisco.com
> 
>

l8*
 	-lava

x.25 - minix - bitnet - plan9 - 110 bps - ASR 33 - base8


