Date:      Mon, 06 Jan 2014 10:20:40 -0500
From:      Curtis Villamizar <curtis@ipv6.occnc.com>
To:        pyunyh@gmail.com
Cc:        freebsd-stable@freebsd.org, Curtis Villamizar <curtis@ipv6.occnc.com>
Subject:   Re: regression: msk0 watchdog timeout and interrupt storm
Message-ID:  <201401061520.s06FKeVG009399@maildrop2.v6ds.occnc.com>
In-Reply-To: Your message of "Mon, 06 Jan 2014 14:04:00 +0900." <20140106050400.GA1372@michelle.cdnetworks.com>


In message <20140106050400.GA1372@michelle.cdnetworks.com>
Yonghyeon PYUN writes:
 
> On Sun, Jan 05, 2014 at 11:30:45PM -0500, Curtis Villamizar wrote:
> > 
> > Pyun,
> > 
> > Replying to self since I did not get your reply but saw it on the
> > stable10 mailing list archive.  I pasted in your responses so it's
> > really a reply to you.
> > 
> > Sorry for the delay in replying to your email of Jan 2.  I had some
> > email trouble (self-induced by a DNS change) that should be fixed now.
> > 
>  
> Ok.
>  
> [...]
>  
> > >  
> > > Marvell calls DMA descriptors LEs. The maximum number of status
> > > LEs supported by the controller is 4096, which should be large
> > > enough to hold status LE updates (for dual-port controllers, the
> > > status DMA block is shared between the ports).
> > 
> > Yes.  I am aware of this, but regardless I ran into this bug and
> > forcing MSK_TX_RING_CNT and MSK_RX_RING_CNT removed the symptom.
> > 
>  
> Ok.
>  
> > > > This does seem to me like a regression in 10.0 caused by the change to
> > > > if_mskreg.h (Nov 16).  The workaround so far has been fine for me.
> > >  
> > > If you revert the change made in r258790, does the issue go away?
> > > Are you running amd64?  Because you touched the
> > > #if (BUS_SPACE_MAXADDR > 0xFFFFFFFF) block in if_mskreg.h, I guess
> > > you're running amd64, but I need confirmation. If your system has
> > > more than 4GB of memory on amd64, could you reduce the amount of
> > > available memory to less than 4GB (i.e. set hw.physmem in
> > > loader.conf)?
> > > Also, would you show me the dmesg(8) output (msk(4) and
> > > e1000phy(4) only) so I know the exact Yukon controller model?
> > 
> > Yes it is AMD64.
> > 
> > uname -m
> > amd64
> > 
> > CPU: AMD Athlon(tm) II X2 B24 Processor (2992.58-MHz K8-class CPU)
> >  Origin = "AuthenticAMD" Id = 0x100f63 Family = 0x10 Model = 0x6 Stepping = 3
> >  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
> >  Features2=0x802009<SSE3,MON,CX16,POPCNT>
> >  AMD Features=0xee500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
> >  AMD Features2=0x37ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT>
> >  TSC: P-state invariant
> > 
> > pciconf -lcv
> > [...]
> > mskc0@pci0:2:0:0:       class=0x020000 card=0x305817aa chip=0x438011ab rev=0x10 hdr=0x00
> >     vendor     = 'Marvell Technology Group Ltd.'
> >     device     = '88E8057 PCI-E Gigabit Ethernet Controller'
> >     class      = network
> >     subclass   = ethernet
> >     cap 01[48] = powerspec 3  supports D0 D1 D2 D3  current D0
> >     cap 05[5c] = MSI supports 1 message, 64 bit enabled with 1 message
> >     cap 10[c0] = PCI-Express 2 legacy endpoint max data 128(128) link x1(x1)
> >                  speed 2.5(2.5) ASPM disabled(L0s/L1)
> >     ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
> >     ecap 0003[130] = Serial 1 ef3856ffffdc9cc8
> > 
>  
> dmesg(8) output will show more useful information than pciconf(8)
> in this case.  There are too many Yukon II variants.

Here are some relevant parts of dmesg.  Is there anything else you want?

real memory  = 2147483648 (2048 MB)
avail memory = 2061438976 (1965 MB)
Event timer "LAPIC" quality 400
ACPI APIC Table: <LENOVO TC-9I   >
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1

pcib2: <ACPI PCI-PCI bridge> irq 19 at device 7.0 on pci0
pci2: <ACPI PCI bus> on pcib2
 on pci1
pcib2: <ACPI PCI-PCI bridge> irq 19 at device 7.0 on pci0
pci2: <ACPI PCI bus> on pcib2
mskc0: <Marvell Yukon 88E8057 Gigabit Ethernet> port 0xe800-0xe8ff mem 0xfebfc000-0xfebfffff irq 19 at device 0.0 on pci2
msk0: <Marvell Technology Group Ltd. Yukon Ultra 2 Id 0xba Rev 0x00> on mskc0
msk0: Ethernet address: c8:9c:dc:56:38:ef
miibus0: <MII bus> on msk0
e1000phy0: <Marvell 88E1149 Gigabit PHY> PHY 0 on miibus0
e1000phy0:  none, 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX,
1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master,
auto, auto-flow

The computer is a Lenovo ThinkCentre (small tower), which is not an
uncommon machine, so others are likely to run into this.

> > Please let me know what I could do to help debug this.
> > 
>  
> If you have more than 4GB of memory, try reducing the amount of
> memory (e.g. to 3G) in /boot/loader.conf and let me know whether that
> makes any difference for you.
> Note that in order to test this you have to back out your local
> changes.

I only have 2 GB of memory.

> > > > involved.
> > I did not back out the change entirely (yet).  I only effectively
> > backed out the change to the two constants MSK_TX_RING_CNT and
> > MSK_RX_RING_CNT and that was enough to make the problem go away.
> > 
>  
> I'm under the impression that the controller may have an additional
> DMA addressing limitation where TX/RX and status LEs should have
> the same high DMA address.  Due to the lack of documentation I'm
> not sure about that.  If the issue does not happen with 3GB of memory,
> we have to use 32-bit DMA addressing.

We have 2 GB of memory, so the problem with the original code does
happen with less than 4 GB of memory.  With all of RAM below 4 GB,
every DMA address has the same high 32 bits: zero.
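
To make the high-address point concrete, here is a minimal C sketch. It
is not code from the msk(4) driver: the helper names addr_hi, addr_lo
and same_high_addr are hypothetical, and it assumes RAM is mapped
contiguously from physical address 0, so that with 2147483648 bytes of
real memory no DMA buffer can sit at or above 4 GB.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not taken from if_mskreg.h. */

/* Upper 32 bits of a 64-bit bus address. */
uint32_t addr_hi(uint64_t addr)
{
    return (uint32_t)(addr >> 32);
}

/* Lower 32 bits of a 64-bit bus address. */
uint32_t addr_lo(uint64_t addr)
{
    return (uint32_t)(addr & 0xffffffffULL);
}

/*
 * Returns 1 if the TX ring, RX ring and status block base addresses
 * all share the same upper 32 bits, 0 otherwise.
 */
int same_high_addr(uint64_t tx, uint64_t rx, uint64_t status)
{
    return addr_hi(tx) == addr_hi(rx) && addr_hi(rx) == addr_hi(status);
}
```

Under the stated assumption, the highest RAM address on this machine is
2147483648 - 1 = 0x7fffffff, and addr_hi() of any address up to that
value is 0, so same_high_addr() returns 1 for any three ring addresses
in RAM.  A hypothesised "same high address" restriction therefore
cannot be what is violated here.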

Is there anything else you want me to try?

Curtis

btw - I added someone from Marvell on the Bcc in case he wants to join
in on the conversation or give us a hint in private email.


