Date:      Wed, 19 Jul 2000 19:02:16 -0400 (EDT)
From:      Bosko Milekic <bmilekic@dsuper.net>
To:        Alfred Perlstein <bright@wintelcom.net>
Cc:        net@FreeBSD.ORG
Subject:   Re: kern/19866: The mbuf subsystem refcount stuff.
Message-ID:  <Pine.BSF.4.21.0007191829400.41534-100000@jehovah.technokratis.com>
In-Reply-To: <20000719092114.S13979@fw.wintelcom.net>


On Wed, 19 Jul 2000, Alfred Perlstein wrote:

> > > when you want to make a copy your code looks something like this:
> > > 
> > > /* copying m into n */
> > > 
> > > if (m->rp == NULL && n->rp == NULL) {
> > > 	m->rp = n->rp = mclrefcnt_alloc();
>         \--------------------------------/ 
> This is a race condition. --/
> 
> There's a chance that two cpu threads may issue a m_copym on a mbuf
> chain on separate CPUs, if the mbuf chain's refcounts aren't yet
> allocated and NULL, there's a race where _both_ CPUs will call this
> code.
> 
> The effect will be an incorrect refcount and the leaking of a
> refcount structure.  Pretty fatal.
> 
> the code will have to change to something like this:
> 
>   mclrefcnt_free(n->rp);
>   n->rp = m->rp;
>   atomic_inc(m->rp.mextr_refcnt);
> 
> which is actually simpler. :)

	This seems unnecessary, as far as I can see. If you are "copying m
  into n," the idea is that `m' is already allocated and, if it is an M_EXT
  mbuf, it already has a mextrefcnt structure allocated for it and is
  pointing to it. The way this works is that when you allocate an mbuf and
  assign it external storage, the first mbuf to be assigned this ext_buf
  is the one that has to perform the locking and get the mextrefcnt
  allocated and detached from the list of mextrefcnts. Let's say this
  "first" mbuf is called `m'; later on you can copy `m' into `n' by giving
  `n' the same m_ext struct as `m,' and then having n->m_ext->mext_refcnt
  incremented (++), which doesn't need any locking. You should clearly be
  able to have mbufs that are not associated with a mextrefcnt at all, for
  the simple reason that one would be useless if the mbuf has no external
  storage associated with it.
  	If, when you are copying m into n, n already has a reference pointer,
  then you're in big trouble and what you're doing is incorrect.

> The solution I see is to never have it NULL.. See the comments on
> the free function.

	It's safe to have it NULL when you're not using m_ext (when the mbuf
  doesn't refer to an ext_buf). I don't see why you would check BOTH m's
  and n's refcnt pointers when you should be able to take for granted that
  if you are copying something into n, then n surely isn't referencing
  anything else; if it were, you shouldn't be copying anything into it
  without first freeing what was previously there.
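
  Roughly, the copy path I have in mind would look something like the
  sketch below. This is only an illustration: the struct layout (m_ext
  shared between the two mbufs, an unsigned `counter' field in struct
  mextrefcnt) and the function name are assumptions from this discussion,
  not existing code, and KASSERT()/atomic_add_int() are just there to make
  the point that no splimp() is needed on the copy.

  /*
   * Copy m's external storage reference into n.  Assumes m already owns
   * an ext_buf (M_EXT set) with a mextrefcnt allocated by the "first"
   * mbuf, and that n is freshly allocated with no external storage.
   */
  void
  mext_copy_ref(struct mbuf *m, struct mbuf *n)
  {
  	/* n must not already reference external storage. */
  	KASSERT(n->m_ext == NULL, ("mext_copy_ref: n already has m_ext"));

  	n->m_ext = m->m_ext;		/* share the same m_ext/ext_buf */
  	/* atomic increment stands in for the "++"; no splimp(), no list. */
  	atomic_add_int(&n->m_ext->mext_refcnt->counter, 1);
  	n->m_flags |= M_EXT;
  }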

> 
> > > } else if (m->rp == NULL) {
> > > 	m->rp = n->rp;
> > > } else if (n->rp == NULL) {
> > > 	n->rp = m->rp;
> > > } else {
> > > 	mclrefcnt_free(n->rp);
> > > 	n->rp = m->rp;
> > > }
> > > atomic_inc(m->rp.mextr_refcnt);
> > > 
> > > 
> > > /* freeing m */
> > > 
> > >      /* x must get the value I reduced it to */
> > > x = atomic_dec_and_fetch(m->rp.mextr_refcnt); 
> > > if (x == 0) {
> > > 	/* do extfree callback */
> > > } else {
> > > 	m->rp = NULL;
> 
> This is incorrect, the code should be:
>     m->rp = mclrefcnt_alloc();
> 
> that will ensure that we don't have mbufs where the m->rp isn't
> initialized to a structure, it also allows us to simplify the copy
> code above.
> 
> the mbufs startup code will have to pre-assign them all refcount
> structures.

	Not all mbufs should have an assigned mextrefcnt, because not all
  mbufs refer to external storage, so it would be a waste of memory.
	When you free an mbuf, you'll also have to be under splimp() most
  likely (unless we find those atomic queue manip. constructs). The reason
  is that you'll be placing the mbuf back on the list, which requires
  blocking against interrupts, and also against other CPUs in the SMP case
  (unless we decide to split the list in the future). So what you do in
  this case should be more like this:
  
  struct mextrefcnt *our_ref = mbuf_to_be_freed->m_ext->mext_refcnt;

  int s = splimp();
  our_ref->counter--;
  if (our_ref->counter == 0) {
  	/* call (mbuf_to_be_freed->m_ext->ext_free)(blah, blah) */
  	/* free mextrefcnt structure (our_ref) to the mextrefcnt freelist */
  }
  mbuf_to_be_freed->m_ext->mext_refcnt = NULL;
  /* actually, all of the mbuf's m_ext should be zero'd, but the next
  allocation should take care of that anyway, so we can skip it here and do
  it during the allocation of the mbuf. */
  free_the_mbuf(mbuf_to_be_freed);
  splx(s);

  Clearly, when you are freeing the mbuf, you have to splimp(), but this is
  normal for now. The idea, though, is that you _don't_ have to splimp()
  when you just increment the ref count, unless it's the first increment.
  This is the reason, as I understand it, that your idea makes sense (as
  opposed to my linked list idea, which always needed to be splimp()ed).

  Naturally, MEXTADD() will also have to be modified to not only set up the
  mbuf for the ext_buf type (e.g. set up ext_free and the other m_ext
  fields) but to also splimp(), allocate the mextrefcnt structure, and
  initialize its count to one. This is expected.
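
  Something along these lines, as a sketch only; the function and field
  names here (mext_add(), mextrefcnt_alloc(), the mext_refcnt and counter
  fields) are placeholders from this discussion, not existing interfaces:

  /*
   * Allocation side: the first mbuf to receive an ext_buf sets up m_ext
   * and detaches a mextrefcnt from the freelist under splimp(), with the
   * counter initialized to one.
   */
  void
  mext_add(struct mbuf *m, caddr_t buf, u_int size,
      void (*ext_free)(caddr_t, void *), void *ext_args)
  {
  	int s;

  	m->m_ext->ext_buf = buf;
  	m->m_ext->ext_size = size;
  	m->m_ext->ext_free = ext_free;
  	m->m_ext->ext_args = ext_args;

  	s = splimp();
  	m->m_ext->mext_refcnt = mextrefcnt_alloc();	/* detach from freelist */
  	m->m_ext->mext_refcnt->counter = 1;
  	splx(s);

  	m->m_flags |= M_EXT;
  }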

> The race condition is outlined above and I think the solution ought
> to work.
> 
> The refcount list certainly does not need to remain global, there's
> nothing wrong with per-cpu pools of them and when your own pool is
> exhausted you can try to lock against another CPU's pool and steal
> structures.

	Well, now that I've thought about it more extensively, this does make
  sense, but only if you accept that just because an mbuf was originally
  allocated from CPU#1's freelist doesn't mean that it necessarily has to
  be freed back to CPU#1's freelist, so that it may end up on any of the
  CPUs' freelists eventually; the same goes for the mextrefcnt list
  elements.
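
  Roughly what I picture a per-CPU allocation with stealing to look like;
  everything here (the per-CPU pool array, ncpus, cpuid(), and the
  pool_get/pool_lock/pool_unlock helpers) is hypothetical and only meant
  to illustrate the idea:

  /*
   * Take a mextrefcnt from our own CPU's pool, locking only against our
   * own interrupts; only when the local pool is empty do we pay the cost
   * of locking another CPU's pool and stealing from it.
   */
  struct mextrefcnt *
  mextrefcnt_alloc(void)
  {
  	struct mextrefcnt *ref;
  	int cpu, i, s;

  	cpu = cpuid();
  	s = splimp();
  	ref = pool_get(&mextrefcnt_pool[cpu]);
  	splx(s);
  	if (ref != NULL)
  		return (ref);

  	/* Local pool exhausted: try the other CPUs' pools. */
  	for (i = 0; i < ncpus; i++) {
  		if (i == cpu)
  			continue;
  		pool_lock(&mextrefcnt_pool[i]);
  		ref = pool_get(&mextrefcnt_pool[i]);
  		pool_unlock(&mextrefcnt_pool[i]);
  		if (ref != NULL)
  			return (ref);
  	}
  	return (NULL);		/* everything is exhausted */
  }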

> The CPUs will have to lock their own queues against themselves for
> protection against interrupts, doing so is cheap, when we have a
> shortage we can take the performance hit to lock against another
> CPU in order to steal refcount structures.

	Agreed. But if we do go with this, we'll REALLY be in trouble when it
  comes to freeing physical memory, especially if you consider that each
  CPU's freelists can now contain mbufs from pretty much anywhere. In other
  words, I can now have the 16 mbufs making up a page spread out across 4
  different CPUs' freelists, and I would have to grab a giant lock in order
  to free them, or even to track them all (say I want to move them to a
  less used list, etc.). Poul-Henning, do you see what I mean, given the
  discussions we had regarding this?

> And if that code to do queue/dequeue wasn't just a figment of my
> imagination we may not need to lock at all.

	I think Mike blurted out that there are some atomic queue
  manipulation constructs, although I am personally not familiar with them
  (yet). If we do find them and they do apply, we will see a dramatic
  advantage not only in the SMP area, but overall. It almost seems too good
  to be true. :-)
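
  For what it's worth, the sort of thing I imagine by "atomic queue
  manipulation" is sketched below: pushing a freed mextrefcnt onto a list
  head with a compare-and-swap loop instead of splimp(). The
  atomic_cmpset_ptr() call just stands in for whatever compare-and-swap
  primitive we actually end up with, the `next' field is assumed, and
  popping the same way is trickier (the ABA problem), so this shows the
  push side only:

  /*
   * Push ref onto the freelist headed by *headp without splimp() and
   * without a lock: retry until the compare-and-swap of the list head
   * succeeds.
   */
  void
  mextrefcnt_free_atomic(struct mextrefcnt *ref, struct mextrefcnt **headp)
  {
  	struct mextrefcnt *old;

  	do {
  		old = *headp;
  		ref->next = old;	/* link in front of the current head */
  	} while (atomic_cmpset_ptr(headp, old, ref) == 0);
  }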

> -- 
> -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
> "I have the heart of a child; I keep it in a jar on my desk."

 Cheers,
 Bosko.

--
 Bosko Milekic  *  Voice/Mobile: 514.865.7738  *  Pager: 514.921.0237
    bmilekic@technokratis.com  *  http://www.technokratis.com/



