Date:      Sat, 10 Jul 2004 15:25:48 -0700 (PDT)
From:      Don Lewis <truckman@FreeBSD.org>
To:        dl@leo.org
Cc:        current@FreeBSD.org
Subject:   Re: panic: m_copym, length > size of mbuf chain
Message-ID:  <200407102225.i6AMPmhw015583@gw.catspoiler.org>
In-Reply-To: <20040710105017.GA61243@atrbg11.informatik.tu-muenchen.de>

On 10 Jul, Daniel Lang wrote:
> Hi Robert,
> 
> Robert Watson wrote on Wed, Jul 07, 2004 at 12:24:59PM -0400:
> [..]
>> Just to try ruling out possibilities -- have you run an extensive set of
>> hardware diagnostics?  Most server class hardware ships with a decent
>> diagnostics disk, and I'm sure we can find some for you in the event your
>> hardware didn't come with some.  While it's quite possibly a software
>> problem, tracking hardware problems using software symptoms constitutes
>> undesirable pain and so it wouldn't hurt to give that a spin.  I remember
>> seeing your earlier e-mails about running with WITNESS increasing the
>> chances of pain -- this could be a bug in WITNESS as you suggest, or it
>> could be that WITNESS increases the opportunities for a variety of locking
>> related races by increasing the cost of lock/unlock operations.
> [..]
> 
> So I come back to the issue. As I already wrote, I guess I can
> rule out hardware problems now. I did a very thorough test with
> the Dell diagnosis utilities which showed no problems.
> 
> Also, after John's patch I did not see any WITNESS related
> problems (so far) again. But I had the m_copy panic again
> (see subject). This time I did file a PR and did some more detailed
> gdb analysis. It is all documented at:
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/68889
> 
> I am puzzled, because the stack frame on entering m_copym has
> 0x0 as first argument (m), however in the previous frame
> when m_copy() is called, the struct mbuf* argument is valid.

m_copym() overwrites its first and third arguments as it walks the mbuf
chain.

struct mbuf *
m_copym(struct mbuf *m, int off0, int len, int wait)
{
[snip]
	while (off > 0) {
		KASSERT(m != NULL, ("m_copym, offset > size of mbuf chain"));
		if (off < m->m_len)
			break;
		off -= m->m_len;
		m = m->m_next;
	}
[snip]
	while (len > 0) {
		if (m == NULL) {
			KASSERT(len == M_COPYALL, 
			    ("m_copym, length > size of mbuf chain"));
			break;
		}
[snip]
		if (len != M_COPYALL)
			len -= n->m_len;
		off = 0;
		m = m->m_next;
		np = &n->m_next;
	}
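
To see why frame 10 can show m=0x0 even though frame 11 passed a valid
chain, here is a small userland model of the two loops quoted above.  It
is a sketch, not kernel code: struct fake_mbuf and walk_like_m_copym()
are hypothetical names, nothing is actually copied, and the M_COPYALL
case is left out.  Like m_copym(), it advances m and len in place, so
when the chain runs out first, m ends up NULL and len holds only the
uncopied remainder.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct mbuf: only the fields the walk reads. */
struct fake_mbuf {
	int			 m_len;		/* bytes of data in this mbuf */
	struct fake_mbuf	*m_next;	/* next mbuf in the chain */
};

/*
 * Userland model of the two m_copym() loops quoted above.  Nothing is
 * copied.  Like the kernel code it advances m and len in place; the
 * final m is stored in *m_at_stop and the uncopied remainder of len is
 * returned.  If the chain runs out while len is still positive,
 * *m_at_stop is NULL and the return value is nonzero; in the kernel,
 * that is where one of the KASSERTs fires.
 */
int
walk_like_m_copym(struct fake_mbuf *m, int off0, int len,
    struct fake_mbuf **m_at_stop)
{
	int off = off0;		/* local cursor; off0 itself is never changed */
	int n;

	/* Skip mbufs that lie entirely before the starting offset. */
	while (off > 0 && m != NULL) {
		if (off < m->m_len)
			break;
		off -= m->m_len;
		m = m->m_next;
	}
	/* Consume data one mbuf at a time, overwriting m and len. */
	while (len > 0 && m != NULL) {
		n = m->m_len - off;	/* bytes available in this mbuf */
		if (n > len)
			n = len;
		len -= n;
		off = 0;
		m = m->m_next;
	}
	*m_at_stop = m;
	return (len);
}
```

For instance, with a hypothetical three-mbuf chain of 472 + 472 + 31 =
975 bytes (lengths chosen only so they add up to the sb_cc value shown
further down), walk_like_m_copym(&a, 737, 1460, &stop) leaves stop ==
NULL and returns 1222, the same m/len pair that frame 10 shows.  The 1460
is also just an assumption: it is the value that reproduces that residual
under this model, and the real one still has to be read from frame 11.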


The interesting bits would seem to be in stack frame 11, tcp_output().
Check the arguments being passed to m_copym():

#10 0xc0551805 in m_copym (m=0x0, off0=737, len=1222, wait=1)
    at /usr/src/sys/kern/uipc_mbuf.c:380

We don't know the original value of len that was passed to m_copym(),
because it could have been decremented if m_copym() iterated a few times
before it panicked, but it was at least 1222.  The offset argument off0
still holds its original value, since m_copym() only decrements its
local copy, off.  Adding the two shows that the original mbuf chain
passed to m_copym() should have been at least 1959 bytes long.

Now take a look at the call to m_copy():

#11 0xc059ed5a in tcp_output (tp=0xc3f50000)
    at /usr/src/sys/netinet/tcp_output.c:748
748                             m->m_next = m_copy(so->so_snd.sb_mb, off, (int) len);

It would be interesting to see the value of len in stack frame 11, so
that we know the original value passed to m_copym().

Also, the contents of *so are interesting.

(kgdb) p *so
[snip]
    sb_cc = 975, sb_hiwat = 33580, sb_mbcnt = 1536, sb_mbmax = 262144,

I'm not sure whether sb_cc or sb_mbcnt is the important member, but I
think it is sb_cc: it counts the bytes of data in the buffer, while
sb_mbcnt counts the mbuf storage allocated to hold them.  I think this
means that the mbuf chain contains 975 bytes of data but tcp_output() is
telling m_copy() to copy (at least) 1222 bytes of data starting at
offset 737.

It looks to me like tcp_output() is passing a bogus len value to
m_copy().