From nobody@FreeBSD.org Mon Aug 27 17:52:12 2007
Return-Path:
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id CE79F16A475
	for ; Mon, 27 Aug 2007 17:52:12 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [IPv6:2001:4f8:fff6::21])
	by mx1.freebsd.org (Postfix) with ESMTP id A551113C442
	for ; Mon, 27 Aug 2007 17:52:12 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.14.1/8.14.1) with ESMTP id l7RHqCfQ072413
	for ; Mon, 27 Aug 2007 17:52:12 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.14.1/8.14.1/Submit) id l7RHqCix072412;
	Mon, 27 Aug 2007 17:52:12 GMT
	(envelope-from nobody)
Message-Id: <200708271752.l7RHqCix072412@www.freebsd.org>
Date: Mon, 27 Aug 2007 17:52:12 GMT
From: Hugo
To: freebsd-gnats-submit@FreeBSD.org
Subject: msk driver always fails under moderate network load.
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         115882
>Category:       kern
>Synopsis:       [msk] msk driver always fails under moderate network load.
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    yongari
>State:          closed
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Aug 27 18:00:03 GMT 2007
>Closed-Date:    Tue Mar 18 01:48:03 UTC 2008
>Last-Modified:  Fri May 23 06:30:00 UTC 2008
>Originator:     Hugo
>Release:        7.0-CURRENT
>Organization:
>Environment:
FreeBSD nexus.bsdlan.org 7.0-CURRENT FreeBSD 7.0-CURRENT #0: Sun Aug 26 15:56:22 WEST 2007 klr@nexus.bsdlan.org:/usr/obj/usr/src/sys/NEXUS i386
>Description:
pciconf -lv:
mskc0@pci2:0:0: class=0x020000 card=0x81421043 chip=0x436211ab rev=0x19 hdr=0x00
    vendor = 'Marvell Semiconductor (Was: Galileo Technology Ltd)'
    device = 'Yukon 88E8053 PCI-E Gigabit Ethernet Controller (Copper)'
    class = network
    subclass = ethernet

/boot/loader.conf:
hw.pci.enable_msix=0
hw.pci.enable_msi=0

/var/log/messages:
Aug 27 18:37:48 nexus kernel: msk0: watchdog timeout (missed Tx interrupts) -- recovering
Aug 27 18:38:08 nexus last message repeated 2 times
Aug 27 18:38:41 nexus kernel: msk0: watchdog timeout (missed Tx interrupts) -- recovering
Aug 27 18:39:10 nexus kernel: msk0: watchdog timeout
Aug 27 18:39:10 nexus kernel: msk0: link state changed to DOWN
Aug 27 18:39:12 nexus kernel: msk0: link state changed to UP
Aug 27 18:39:28 nexus kernel: msk0: watchdog timeout (missed Tx interrupts) -- recovering
Aug 27 18:40:08 nexus kernel: msk0: watchdog timeout (missed Tx interrupts) -- recovering
Aug 27 18:40:49 nexus last message repeated 3 times
Aug 27 18:40:54 nexus kernel: msk0: watchdog timeout
Aug 27 18:40:54 nexus kernel: msk0: link state changed to DOWN
Aug 27 18:40:56 nexus kernel: msk0: link state changed to UP
Aug 27 18:41:31 nexus kernel: msk0: watchdog timeout (missed Tx interrupts) -- recovering

During normal usage (browsing, email, instant messaging) the NIC will work fine. However, very rarely during online gaming and *always* during torrent downloads, the interface will go down with the above messages. It is impossible to bring it back without a reboot. The settings described above in loader.conf seem to delay the start of the symptoms, but the problem will always manifest itself. This did *not* happen with 6.1-RELEASE and the msk driver on Marvell's website.
>How-To-Repeat:
Launch ktorrent and let it download for some time (usually less than 30 minutes, less than 10 if hw.pci.enable_msix and hw.pci.enable_msi are still enabled)
>Fix:

>Release-Note:
>Audit-Trail:

Responsible-Changed-From-To: freebsd-bugs->yongari
Responsible-Changed-By: yongari
Responsible-Changed-When: Thu Aug 30 04:24:27 UTC 2007
Responsible-Changed-Why:
Grab.

http://www.freebsd.org/cgi/query-pr.cgi?pr=115882

State-Changed-From-To: open->feedback
State-Changed-By: yongari
State-Changed-When: Thu Aug 30 04:27:07 UTC 2007
State-Changed-Why:
Would you show me the following information to investigate the issue?
- verbosed boot messages related with msk(4)
- vmstat -i
- ifconfig msk0

http://www.freebsd.org/cgi/query-pr.cgi?pr=115882

State-Changed-From-To: feedback->closed
State-Changed-By: rwatson
State-Changed-When: Sun Jan 27 13:02:40 UTC 2008
State-Changed-Why:
Close due to feedback timeout (>2 months). If you have further information to help debug this problem, please follow up on the PR by e-mail and we can re-open it. Thanks for the report.

http://www.freebsd.org/cgi/query-pr.cgi?pr=115882

From: David Schultz
To: Hugo, yongari@FreeBSD.ORG
Cc: freebsd-gnats-submit@FreeBSD.ORG
Subject: Re: kern/115882: msk driver always fails under moderate network load.
Date: Mon, 18 Feb 2008 19:21:05 -0500

FWIW, I'm seeing this issue, too. It generally happens under heavy network load (for a single-user machine), e.g., transferring files over the LAN via scp or downloading an ISO image from a website. I haven't noticed it with NFS, at least not yet. No amount of fiddling in ifconfig fixes things; I'm forced to reboot.
Some info: Boot messages: found-> vendor=0x11ab, dev=0x4362, revid=0x22 domain=0, bus=4, slot=0, func=0 class=02-00-00, hdrtype=0x00, mfdev=0 cmdreg=0x0007, statreg=0x0010, cachelnsz=1 (dwords) lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns) intpin=a, irq=5 powerspec 2 supports D0 D1 D2 D3 current D0 MSI supports 2 messages, 64 bit map[10]: type Memory, range 64, base 0xfddfc000, size 14, enabled pcib4: requested memory range 0xfddfc000-0xfddfffff: good map[18]: type I/O Port, range 32, base 0xbe00, size 8, enabled pcib4: requested I/O range 0xbe00-0xbeff: in range pcib4: matched entry for 4.0.INTA pcib4: slot 0 INTA hardwired to IRQ 17 mskc0: port 0xbe00-0xbeff mem 0xfddfc000-0xfddfffff irq 17 at device 0.0 on pci4 mskc0: Reserved 0x4000 bytes for rid 0x10 type 3 at 0xfddfc000 mskc0: MSI count : 2 mskc0: attempting to allocate 2 MSI vectors (2 supported) msi: routing MSI IRQ 256 to vector 54 msi: routing MSI IRQ 257 to vector 55 mskc0: using IRQs 256-257 for MSI mskc0: RAM buffer size : 48KB mskc0: Port 0 : Rx Queue 32KB(0x00000000:0x00007fff) mskc0: Port 0 : Tx Queue 16KB(0x00008000:0x0000bfff) msk0: on mskc0 msk0: bpf attached msk0: Ethernet address: 00:01:29:a3:3c:a3 miibus0: on msk0 e1000phy0: PHY 0 on miibus0 e1000phy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX-FDX, auto mskc0: [MPSAFE] mskc0: [FILTER] [...] 
ioapic0: Assigning ISA IRQ 1 to local APIC 0 ioapic0: Assigning ISA IRQ 4 to local APIC 1 ioapic0: Assigning ISA IRQ 9 to local APIC 0 ioapic0: Assigning ISA IRQ 12 to local APIC 1 ioapic0: Assigning PCI IRQ 16 to local APIC 0 ioapic0: Assigning PCI IRQ 18 to local APIC 1 ioapic0: Assigning PCI IRQ 19 to local APIC 0 ioapic0: Assigning PCI IRQ 21 to local APIC 1 ioapic0: Assigning PCI IRQ 22 to local APIC 0 ioapic0: Assigning PCI IRQ 23 to local APIC 1 msi: Assigning MSI IRQ 256 to local APIC 0 pciconf: mskc0@pci0:4:0:0: class=0x020000 card=0x110215bd chip=0x436211ab rev=0x22 hdr=0x00 vendor = 'Marvell Semiconductor (Was: Galileo Technology Ltd)' device = '88E8053 Marvell Yukon 88E8053 PCI-E Gigabit Ethernet Controller' class = network subclass = ethernet vmstat -i: interrupt total rate irq1: atkbd0 414 0 irq12: psm0 705928 7 irq16: uhci0+ 595697 5 irq18: ehci0 uhci5 1 0 irq19: uhci2 uhci* 7274884 72 irq21: uhci1 91394 0 irq22: pcm0 1300595 13 irq23: uhci3 ehci1 287 0 cpu0: timer 199843890 1998 irq256: mskc0 1707899 17 cpu1: timer 199262662 1993 Total 410783651 4108 (The interrupt count stops going up when the card stops working.) ifconfig msk0: msk0: flags=8843 metric 0 mtu 1500 options=19a ether [...] inet [...] media: Ethernet autoselect (100baseTX ) status: active From: Pyun YongHyeon To: David Schultz Cc: Hugo , yongari@FreeBSD.ORG, freebsd-gnats-submit@FreeBSD.ORG Subject: Re: kern/115882: msk driver always fails under moderate network load. Date: Tue, 19 Feb 2008 09:55:07 +0900 On Mon, Feb 18, 2008 at 07:21:05PM -0500, David Schultz wrote: > FWIW, I'm seeing this issue, too. It generally happens under heavy > network load (for a single-user machine), e.g., transferring files > over the LAN via scp or downloading an ISO image from a website. I think scping or downloading ISO images are not heavy network loads. > I haven't noticed it with NFS, at least not yet. No amount of fiddling > in ifconfig fixes things; I'm forced to reboot. 
> > Some info: Missing FreeBSD version? > > Boot messages: > > found-> vendor=0x11ab, dev=0x4362, revid=0x22 > domain=0, bus=4, slot=0, func=0 > class=02-00-00, hdrtype=0x00, mfdev=0 > cmdreg=0x0007, statreg=0x0010, cachelnsz=1 (dwords) > lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns) > intpin=a, irq=5 > powerspec 2 supports D0 D1 D2 D3 current D0 > MSI supports 2 messages, 64 bit > map[10]: type Memory, range 64, base 0xfddfc000, size 14, enabled > pcib4: requested memory range 0xfddfc000-0xfddfffff: good > map[18]: type I/O Port, range 32, base 0xbe00, size 8, enabled > pcib4: requested I/O range 0xbe00-0xbeff: in range > pcib4: matched entry for 4.0.INTA > pcib4: slot 0 INTA hardwired to IRQ 17 > mskc0: port 0xbe00-0xbeff mem 0xfddfc000-0xfddfffff irq 17 at device 0.0 on pci4 > mskc0: Reserved 0x4000 bytes for rid 0x10 type 3 at 0xfddfc000 > mskc0: MSI count : 2 > mskc0: attempting to allocate 2 MSI vectors (2 supported) > msi: routing MSI IRQ 256 to vector 54 > msi: routing MSI IRQ 257 to vector 55 > mskc0: using IRQs 256-257 for MSI > mskc0: RAM buffer size : 48KB > mskc0: Port 0 : Rx Queue 32KB(0x00000000:0x00007fff) > mskc0: Port 0 : Tx Queue 16KB(0x00008000:0x0000bfff) > msk0: on mskc0 > msk0: bpf attached > msk0: Ethernet address: 00:01:29:a3:3c:a3 > miibus0: on msk0 > e1000phy0: PHY 0 on miibus0 > e1000phy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX-FDX, auto > mskc0: [MPSAFE] > mskc0: [FILTER] > [...] 
> ioapic0: Assigning ISA IRQ 1 to local APIC 0 > ioapic0: Assigning ISA IRQ 4 to local APIC 1 > ioapic0: Assigning ISA IRQ 9 to local APIC 0 > ioapic0: Assigning ISA IRQ 12 to local APIC 1 > ioapic0: Assigning PCI IRQ 16 to local APIC 0 > ioapic0: Assigning PCI IRQ 18 to local APIC 1 > ioapic0: Assigning PCI IRQ 19 to local APIC 0 > ioapic0: Assigning PCI IRQ 21 to local APIC 1 > ioapic0: Assigning PCI IRQ 22 to local APIC 0 > ioapic0: Assigning PCI IRQ 23 to local APIC 1 > msi: Assigning MSI IRQ 256 to local APIC 0 > > > pciconf: > mskc0@pci0:4:0:0: class=0x020000 card=0x110215bd chip=0x436211ab rev=0x22 hdr=0x00 > vendor = 'Marvell Semiconductor (Was: Galileo Technology Ltd)' > device = '88E8053 Marvell Yukon 88E8053 PCI-E Gigabit Ethernet Controller' > class = network > subclass = ethernet > > > vmstat -i: > interrupt total rate > irq1: atkbd0 414 0 > irq12: psm0 705928 7 > irq16: uhci0+ 595697 5 > irq18: ehci0 uhci5 1 0 > irq19: uhci2 uhci* 7274884 72 > irq21: uhci1 91394 0 > irq22: pcm0 1300595 13 > irq23: uhci3 ehci1 287 0 > cpu0: timer 199843890 1998 > irq256: mskc0 1707899 17 > cpu1: timer 199262662 1993 > Total 410783651 4108 > > (The interrupt count stops going up when the card stops working.) > > > ifconfig msk0: > msk0: flags=8843 metric 0 mtu 1500 > options=19a > ether [...] inet [...] > media: Ethernet autoselect (100baseTX ) > status: active Does the link parner also agree on 100baseTx and full-duplex media configuration? If link partner maintains counters for number of transmitted pause frames to your box would you let me know? -- Regards, Pyun YongHyeon From: David Schultz To: Pyun YongHyeon Cc: Hugo , yongari@FreeBSD.ORG, freebsd-gnats-submit@FreeBSD.ORG Subject: Re: kern/115882: msk driver always fails under moderate network load. Date: Mon, 18 Feb 2008 23:36:27 -0500 On Tue, Feb 19, 2008, Pyun YongHyeon wrote: > On Mon, Feb 18, 2008 at 07:21:05PM -0500, David Schultz wrote: > > FWIW, I'm seeing this issue, too. 
It generally happens under heavy > > network load (for a single-user machine), e.g., transferring files > > over the LAN via scp or downloading an ISO image from a website. > > I think scping or downloading ISO images are not heavy network > loads. Okay, then call them moderate loads if you will, but the point is that the card is still having problems! > > > I haven't noticed it with NFS, at least not yet. No amount of fiddling > > in ifconfig fixes things; I'm forced to reboot. > > > > Some info: > > Missing FreeBSD version? Aah, sorry. It's an amd64 -CURRENT with sources from 2/6. > > ifconfig msk0: > > msk0: flags=8843 metric 0 mtu 1500 > > options=19a > > ether [...] inet [...] > > media: Ethernet autoselect (100baseTX ) > > status: active > > Does the link parner also agree on 100baseTx and full-duplex media > configuration? If link partner maintains counters for number of > transmitted pause frames to your box would you let me know? Currently it's connected to a switch that I do not have access to, but previously I had it connected to another box with a card supported by the em(4) driver. At that time, both ends agreed to 1000baseTX and the msk card still wedged after a few minutes of heavy traffic. As for pause frames, I'm not sure. I can reconnect it to that box and try to reproduce the problem, but I don't know how offhand to get the em(4) driver to give me info on link layer control packets. State-Changed-From-To: closed->open State-Changed-By: linimon State-Changed-When: Tue Feb 19 15:42:42 UTC 2008 State-Changed-Why: Re-open with new data. The following 2 email messages got caught in the spamtrap due to a spamassassin outage. Date: Tue, 19 Feb 2008 14:11:49 +0900 From: Pyun YongHyeon On Mon, Feb 18, 2008 at 11:36:27PM -0500, David Schultz wrote: > On Tue, Feb 19, 2008, Pyun YongHyeon wrote: > > On Mon, Feb 18, 2008 at 07:21:05PM -0500, David Schultz wrote: > > > FWIW, I'm seeing this issue, too. 
It generally happens under heavy > > > network load (for a single-user machine), e.g., transferring files > > > over the LAN via scp or downloading an ISO image from a website. > > > > I think scping or downloading ISO images are not heavy network > > loads. > > Okay, then call them moderate loads if you will, but the point is > that the card is still having problems! > I see. :-) > > > > > I haven't noticed it with NFS, at least not yet. No amount of fiddling > > > in ifconfig fixes things; I'm forced to reboot. > > > > > > Some info: > > > > Missing FreeBSD version? > > Aah, sorry. It's an amd64 -CURRENT with sources from 2/6. > Ok. > > > ifconfig msk0: > > > msk0: flags=8843 metric 0 mtu 1500 > > > options=19a > > > ether [...] inet [...] > > > media: Ethernet autoselect (100baseTX ) > > > status: active > > > > Does the link parner also agree on 100baseTx and full-duplex media > > configuration? If link partner maintains counters for number of > > transmitted pause frames to your box would you let me know? > > Currently it's connected to a switch that I do not have access to, > but previously I had it connected to another box with a card > supported by the em(4) driver. At that time, both ends agreed to > 1000baseTX and the msk card still wedged after a few minutes of > heavy traffic. > It seems that you can reliably reproduce the issue. Would you let me know what commands were used to wedge msk(4)? I can't reproduce it with scping/downloading files or netperf tests. > As for pause frames, I'm not sure. I can reconnect it to that box > and try to reproduce the problem, but I don't know how offhand to > get the em(4) driver to give me info on link layer control > packets. Does interface down and re-up make it work again? 
-- Regards, Pyun YongHyeon Date: Tue, 19 Feb 2008 01:01:54 -0500 From: David Schultz On Tue, Feb 19, 2008, Pyun YongHyeon wrote: > > Currently it's connected to a switch that I do not have access to, > > but previously I had it connected to another box with a card > > supported by the em(4) driver. At that time, both ends agreed to > > 1000baseTX and the msk card still wedged after a few minutes of > > heavy traffic. > > > > It seems that you can reliably reproduce the issue. Would you let > me know what commands were used to wedge msk(4)? > I can't reproduce it with scping/downloading files or netperf tests. Generally anything that tries to transfer a lot of data from a remote host via TCP seems to do it. It does take some time to reproduce, so it's not like I can just type a single command to do it. Most recently it was downloading a 67 MB tarball from a fast HTTP server with firefox, before that it was scping a large file, and before that it was another HTTP connection. > > As for pause frames, I'm not sure. I can reconnect it to that box > > and try to reproduce the problem, but I don't know how offhand to > > get the em(4) driver to give me info on link layer control > > packets. > > Does interface down and re-up make it work again? No, I tried that many times. I also fiddled with all of the options in ifconfig that I could find, but in the end I just had to reboot the machine. Is there any other diagnostic info that would help? As I said, once it wedges, the total interrupt count in vmstat -i stops increasing. Maybe the watchdog timeout code isn't reenabling interrupts on the card properly or something... http://www.freebsd.org/cgi/query-pr.cgi?pr=115882 From: Pyun YongHyeon To: David Schultz Cc: Hugo , yongari@FreeBSD.ORG, freebsd-gnats-submit@FreeBSD.ORG Subject: Re: kern/115882: msk driver always fails under moderate network load. 
Date: Thu, 21 Feb 2008 12:44:38 +0900 --6TrnltStXW4iwmi0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Tue, Feb 19, 2008 at 01:01:54AM -0500, David Schultz wrote: > On Tue, Feb 19, 2008, Pyun YongHyeon wrote: > > > Currently it's connected to a switch that I do not have access to, > > > but previously I had it connected to another box with a card > > > supported by the em(4) driver. At that time, both ends agreed to > > > 1000baseTX and the msk card still wedged after a few minutes of > > > heavy traffic. > > > > > > > It seems that you can reliably reproduce the issue. Would you let > > me know what commands were used to wedge msk(4)? > > I can't reproduce it with scping/downloading files or netperf tests. > > Generally anything that tries to transfer a lot of data from a > remote host via TCP seems to do it. It does take some time to > reproduce, so it's not like I can just type a single command to do > it. Most recently it was downloading a 67 MB tarball from a fast > HTTP server with firefox, before that it was scping a large file, > and before that it was another HTTP connection. > > > > As for pause frames, I'm not sure. I can reconnect it to that box > > > and try to reproduce the problem, but I don't know how offhand to > > > get the em(4) driver to give me info on link layer control > > > packets. > > > > Does interface down and re-up make it work again? > > No, I tried that many times. I also fiddled with all of the > options in ifconfig that I could find, but in the end I just had > to reboot the machine. > Hmm, that looks like a hardware hang and PCI hardware reset recovered the controller. If this is right the issue you're seeing is not related with this PR. > Is there any other diagnostic info that would help? As I said, > once it wedges, the total interrupt count in vmstat -i stops > increasing. Maybe the watchdog timeout code isn't reenabling > interrupts on the card properly or something... 
I can't still reproduce it but would you try attached patch?

--
Regards,
Pyun YongHyeon

--6TrnltStXW4iwmi0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="msk.pause.patch"

--- sys/dev/msk/if_msk.c.orig	2008-02-04 09:59:06.000000000 +0900
+++ sys/dev/msk/if_msk.c	2008-02-21 12:35:51.000000000 +0900
@@ -3658,9 +3658,13 @@
 	CSR_WRITE_4(sc, MR_ADDR(sc_if->msk_port, RX_GMF_FL_MSK),
 	    GMR_FS_ANY_ERR);
 
-	/* Set Rx FIFO flush threshold to 64 bytes. */
+	/*
+	 * Set Rx FIFO flush threshold to 64 bytes.
+	 * Increase the threshold by single unit to work-around
+	 * hardware hang on pause frames.
+	 */
 	CSR_WRITE_4(sc, MR_ADDR(sc_if->msk_port, RX_GMF_FL_THR),
-	    RX_GMF_FL_THR_DEF);
+	    RX_GMF_FL_THR_DEF + 1);
 
 	/* Configure Tx MAC FIFO. */
 	CSR_WRITE_4(sc, MR_ADDR(sc_if->msk_port, TX_GMF_CTRL_T), GMF_RST_SET);
--- sys/dev/msk/if_mskreg.h.orig	2007-12-05 18:41:58.000000000 +0900
+++ sys/dev/msk/if_mskreg.h	2008-02-21 12:31:00.000000000 +0900
@@ -1818,6 +1818,7 @@
 	GMR_FS_LONG_ERR | \
 	GMR_FS_MII_ERR | \
 	GMR_FS_BAD_FC | \
+	GMR_FS_GOOD_FC | \
 	GMR_FS_UN_SIZE | \
 	GMR_FS_JABBER)

--6TrnltStXW4iwmi0--

From: David Schultz
To: Pyun YongHyeon
Cc: Hugo, yongari@FreeBSD.ORG, freebsd-gnats-submit@FreeBSD.ORG
Subject: Re: kern/115882: msk driver always fails under moderate network load.
Date: Sun, 24 Feb 2008 12:57:05 -0500

On Thu, Feb 21, 2008, Pyun YongHyeon wrote:
> > Is there any other diagnostic info that would help? As I said,
> > once it wedges, the total interrupt count in vmstat -i stops
> > increasing. Maybe the watchdog timeout code isn't reenabling
> > interrupts on the card properly or something...
>
> I can't still reproduce it but would you try attached patch?

I've been running with the patch for 2 days now, and it hasn't
hanged yet, so it seems like this fixed the problem! I'll keep
exercising it for a few more days and let you know if there are
any further problems.
From: Pyun YongHyeon To: David Schultz Cc: Hugo , yongari@FreeBSD.ORG, freebsd-gnats-submit@FreeBSD.ORG Subject: Re: kern/115882: msk driver always fails under moderate network load. Date: Mon, 25 Feb 2008 12:40:43 +0900 On Sun, Feb 24, 2008 at 12:57:05PM -0500, David Schultz wrote: > On Thu, Feb 21, 2008, Pyun YongHyeon wrote: > > > Is there any other diagnostic info that would help? As I said, > > > once it wedges, the total interrupt count in vmstat -i stops > > > increasing. Maybe the watchdog timeout code isn't reenabling > > > interrupts on the card properly or something... > > > > I can't still reproduce it but would you try attached patch? > > I've been running with the patch for 2 days now, and it hasn't > hanged yet, so it seems like this fixed the problem! I'll keep > exercising it for a few more days and let you know if there are > any further problems. Thanks for testing. I'll wait one more week and commit the patch. If you encounter the issue again please let me know asap. Thanks. -- Regards, Pyun YongHyeon From: David Schultz To: Pyun YongHyeon Cc: Hugo , yongari@FreeBSD.ORG, freebsd-gnats-submit@FreeBSD.ORG Subject: Re: kern/115882: msk driver always fails under moderate network load. Date: Thu, 28 Feb 2008 18:32:39 -0500 On Mon, Feb 25, 2008, Pyun YongHyeon wrote: > On Sun, Feb 24, 2008 at 12:57:05PM -0500, David Schultz wrote: > > On Thu, Feb 21, 2008, Pyun YongHyeon wrote: > > > > Is there any other diagnostic info that would help? As I said, > > > > once it wedges, the total interrupt count in vmstat -i stops > > > > increasing. Maybe the watchdog timeout code isn't reenabling > > > > interrupts on the card properly or something... > > > > > > I can't still reproduce it but would you try attached patch? > > > > I've been running with the patch for 2 days now, and it hasn't > > hanged yet, so it seems like this fixed the problem! I'll keep > > exercising it for a few more days and let you know if there are > > any further problems. 
>
> Thanks for testing. I'll wait one more week and commit the patch.
> If you encounter the issue again please let me know asap.

Sigh, it happened again. This time the interface was mostly idle, too:

Feb 28 12:10:27 zim kernel: msk0: watchdog timeout
Feb 28 12:10:27 zim kernel: msk0: link state changed to DOWN
Feb 28 12:10:29 zim kernel: msk0: link state changed to UP
Feb 28 12:10:41 zim kernel: msk0: watchdog timeout (missed Tx interrupts) -- recovering
Feb 28 12:11:22 zim last message repeated 4 times
Feb 28 12:12:51 zim last message repeated 6 times
Feb 28 12:13:14 zim kernel: msk0: watchdog timeout
Feb 28 12:13:14 zim kernel: msk0: link state changed to DOWN
Feb 28 12:13:17 zim kernel: msk0: link state changed to UP
Feb 28 12:13:25 zim kernel: msk0: watchdog timeout (missed Tx interrupts) -- recovering
Feb 28 12:13:54 zim last message repeated 3 times
Feb 28 12:15:09 zim last message repeated 5 times

This time I did not compile the msk driver into the kernel.
Unloading and then reloading the module fixed the problem without
rebooting. (Kernel message log below in case it is somehow
useful.)
Feb 28 18:26:55 zim kernel: e1000phy0: detached Feb 28 18:26:55 zim kernel: miibus0: detached Feb 28 18:26:55 zim kernel: msk0: detached Feb 28 18:26:55 zim kernel: mskc0: detached Feb 28 18:27:07 zim kernel: pci0: driver added Feb 28 18:27:07 zim kernel: found-> vendor=0x8086, dev=0x2930, revid=0x02 Feb 28 18:27:07 zim kernel: domain=0, bus=0, slot=31, func=3 Feb 28 18:27:07 zim kernel: class=0c-05-00, hdrtype=0x00, mfdev=0 Feb 28 18:27:07 zim kernel: cmdreg=0x0003, statreg=0x0280, cachelnsz=0 (dwords) Feb 28 18:27:07 zim kernel: lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns) Feb 28 18:27:07 zim kernel: intpin=b, irq=18 Feb 28 18:27:07 zim kernel: pci0:0:31:3: reprobing on driver added Feb 28 18:27:07 zim kernel: pci1: driver added Feb 28 18:27:07 zim kernel: pci2: driver added Feb 28 18:27:07 zim kernel: pci3: driver added Feb 28 18:27:07 zim kernel: pci4: driver added Feb 28 18:27:07 zim kernel: found-> vendor=0x11ab, dev=0x4362, revid=0x22 Feb 28 18:27:07 zim kernel: domain=0, bus=4, slot=0, func=0 Feb 28 18:27:07 zim kernel: class=02-00-00, hdrtype=0x00, mfdev=0 Feb 28 18:27:07 zim kernel: cmdreg=0x0007, statreg=0x0010, cachelnsz=1 (dwords) Feb 28 18:27:07 zim kernel: lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns) Feb 28 18:27:07 zim kernel: intpin=a, irq=17 Feb 28 18:27:07 zim kernel: powerspec 2 supports D0 D1 D2 D3 current D0 Feb 28 18:27:07 zim kernel: MSI supports 2 messages, 64 bit Feb 28 18:27:07 zim kernel: pci0:4:0:0: reprobing on driver added Feb 28 18:27:07 zim kernel: mskc0: port 0xae00-0xaeff mem 0xfdefc000-0xfdefffff irq 17 at device 0.0 on pci4 Feb 28 18:27:07 zim kernel: pcib4: mskc0 requested memory range 0xfdefc000-0xfdefffff: good Feb 28 18:27:07 zim kernel: mskc0: MSI count : 2 Feb 28 18:27:07 zim kernel: mskc0: attempting to allocate 2 MSI vectors (2 supported) Feb 28 18:27:07 zim kernel: msi: routing MSI IRQ 256 to vector 58 Feb 28 18:27:07 zim kernel: msi: routing MSI IRQ 257 to vector 59 Feb 28 18:27:07 zim 
kernel: mskc0: using IRQs 256-257 for MSI Feb 28 18:27:07 zim kernel: mskc0: RAM buffer size : 48KB Feb 28 18:27:07 zim kernel: mskc0: Port 0 : Rx Queue 32KB(0x00000000:0x00007fff) Feb 28 18:27:07 zim kernel: mskc0: Port 0 : Tx Queue 16KB(0x00008000:0x0000bfff) Feb 28 18:27:07 zim kernel: msi: Assigning MSI IRQ 256 to local APIC 0 Feb 28 18:27:07 zim kernel: mskc0: [MPSAFE] Feb 28 18:27:07 zim kernel: mskc0: [FILTER] Feb 28 18:27:07 zim kernel: pci5: driver added Feb 28 18:27:07 zim kernel: msk0: on mskc0 Feb 28 18:27:07 zim kernel: msk0: bpf attached Feb 28 18:27:07 zim kernel: msk0: Ethernet address: 00:01:29:a3:3c:a3 Feb 28 18:27:07 zim kernel: miibus0: on msk0 Feb 28 18:27:07 zim kernel: e1000phy0: PHY 0 on miibus0 Feb 28 18:27:07 zim kernel: e1000phy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX-FDX, auto Feb 28 18:27:07 zim kernel: msk0: link state changed to DOWN Feb 28 18:27:09 zim kernel: msk0: link state changed to UP From: Pyun YongHyeon To: David Schultz Cc: Hugo , yongari@FreeBSD.ORG, freebsd-gnats-submit@FreeBSD.ORG Subject: Re: kern/115882: msk driver always fails under moderate network load. Date: Fri, 29 Feb 2008 12:44:08 +0900 On Thu, Feb 28, 2008 at 06:32:39PM -0500, David Schultz wrote: > On Mon, Feb 25, 2008, Pyun YongHyeon wrote: > > On Sun, Feb 24, 2008 at 12:57:05PM -0500, David Schultz wrote: > > > On Thu, Feb 21, 2008, Pyun YongHyeon wrote: > > > > > Is there any other diagnostic info that would help? As I said, > > > > > once it wedges, the total interrupt count in vmstat -i stops > > > > > increasing. Maybe the watchdog timeout code isn't reenabling > > > > > interrupts on the card properly or something... > > > > > > > > I can't still reproduce it but would you try attached patch? > > > > > > I've been running with the patch for 2 days now, and it hasn't > > > hanged yet, so it seems like this fixed the problem! I'll keep > > > exercising it for a few more days and let you know if there are > > > any further problems. 
> >
> > Thanks for testing. I'll wait one more week and commit the patch.
> > If you encounter the issue again please let me know asap.
>
> Sigh, it happened again. This time the interface was mostly idle, too:
>
> Feb 28 12:10:27 zim kernel: msk0: watchdog timeout
> Feb 28 12:10:27 zim kernel: msk0: link state changed to DOWN
> Feb 28 12:10:29 zim kernel: msk0: link state changed to UP
> Feb 28 12:10:41 zim kernel: msk0: watchdog timeout (missed Tx interrupts) -- recovering
> Feb 28 12:11:22 zim last message repeated 4 times
> Feb 28 12:12:51 zim last message repeated 6 times
> Feb 28 12:13:14 zim kernel: msk0: watchdog timeout
> Feb 28 12:13:14 zim kernel: msk0: link state changed to DOWN
> Feb 28 12:13:17 zim kernel: msk0: link state changed to UP
> Feb 28 12:13:25 zim kernel: msk0: watchdog timeout (missed Tx interrupts) -- recovering
> Feb 28 12:13:54 zim last message repeated 3 times
> Feb 28 12:15:09 zim last message repeated 5 times
>
> This time I did not compile the msk driver into the kernel.
> Unloading and then reloading the module fixed the problem without
> rebooting. (Kernel message log below in case it is somehow
> useful.)
> > Feb 28 18:26:55 zim kernel: e1000phy0: detached > Feb 28 18:26:55 zim kernel: miibus0: detached > Feb 28 18:26:55 zim kernel: msk0: detached > Feb 28 18:26:55 zim kernel: mskc0: detached > Feb 28 18:27:07 zim kernel: pci0: driver added > Feb 28 18:27:07 zim kernel: found-> vendor=0x8086, dev=0x2930, revid=0x02 > Feb 28 18:27:07 zim kernel: domain=0, bus=0, slot=31, func=3 > Feb 28 18:27:07 zim kernel: class=0c-05-00, hdrtype=0x00, mfdev=0 > Feb 28 18:27:07 zim kernel: cmdreg=0x0003, statreg=0x0280, cachelnsz=0 (dwords) > Feb 28 18:27:07 zim kernel: lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns) > Feb 28 18:27:07 zim kernel: intpin=b, irq=18 > Feb 28 18:27:07 zim kernel: pci0:0:31:3: reprobing on driver added > Feb 28 18:27:07 zim kernel: pci1: driver added > Feb 28 18:27:07 zim kernel: pci2: driver added > Feb 28 18:27:07 zim kernel: pci3: driver added > Feb 28 18:27:07 zim kernel: pci4: driver added > Feb 28 18:27:07 zim kernel: found-> vendor=0x11ab, dev=0x4362, revid=0x22 > Feb 28 18:27:07 zim kernel: domain=0, bus=4, slot=0, func=0 > Feb 28 18:27:07 zim kernel: class=02-00-00, hdrtype=0x00, mfdev=0 > Feb 28 18:27:07 zim kernel: cmdreg=0x0007, statreg=0x0010, cachelnsz=1 (dwords) > Feb 28 18:27:07 zim kernel: lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns) > Feb 28 18:27:07 zim kernel: intpin=a, irq=17 > Feb 28 18:27:07 zim kernel: powerspec 2 supports D0 D1 D2 D3 current D0 > Feb 28 18:27:07 zim kernel: MSI supports 2 messages, 64 bit > Feb 28 18:27:07 zim kernel: pci0:4:0:0: reprobing on driver added > Feb 28 18:27:07 zim kernel: mskc0: port 0xae00-0xaeff mem 0xfdefc000-0xfdefffff irq 17 at device 0.0 on pci4 > Feb 28 18:27:07 zim kernel: pcib4: mskc0 requested memory range 0xfdefc000-0xfdefffff: good > Feb 28 18:27:07 zim kernel: mskc0: MSI count : 2 > Feb 28 18:27:07 zim kernel: mskc0: attempting to allocate 2 MSI vectors (2 supported) > Feb 28 18:27:07 zim kernel: msi: routing MSI IRQ 256 to vector 58 > Feb 28 18:27:07 zim 
kernel: msi: routing MSI IRQ 257 to vector 59 > Feb 28 18:27:07 zim kernel: mskc0: using IRQs 256-257 for MSI > Feb 28 18:27:07 zim kernel: mskc0: RAM buffer size : 48KB > Feb 28 18:27:07 zim kernel: mskc0: Port 0 : Rx Queue 32KB(0x00000000:0x00007fff) > Feb 28 18:27:07 zim kernel: mskc0: Port 0 : Tx Queue 16KB(0x00008000:0x0000bfff) > Feb 28 18:27:07 zim kernel: msi: Assigning MSI IRQ 256 to local APIC 0 > Feb 28 18:27:07 zim kernel: mskc0: [MPSAFE] > Feb 28 18:27:07 zim kernel: mskc0: [FILTER] > Feb 28 18:27:07 zim kernel: pci5: driver added > Feb 28 18:27:07 zim kernel: msk0: on mskc0 > Feb 28 18:27:07 zim kernel: msk0: bpf attached > Feb 28 18:27:07 zim kernel: msk0: Ethernet address: 00:01:29:a3:3c:a3 > Feb 28 18:27:07 zim kernel: miibus0: on msk0 > Feb 28 18:27:07 zim kernel: e1000phy0: PHY 0 on miibus0 > Feb 28 18:27:07 zim kernel: e1000phy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX-FDX, auto > Feb 28 18:27:07 zim kernel: msk0: link state changed to DOWN > Feb 28 18:27:09 zim kernel: msk0: link state changed to UP Hmm, I guess this one is not related with your previous bug report. From the above boot messages I think msk(4) is using MSI. How about disabling MSI by setting hw.msk.msi_disable tunable to 1? -- Regards, Pyun YongHyeon From: David Schultz To: Pyun YongHyeon Cc: Hugo , yongari@FreeBSD.ORG, freebsd-gnats-submit@FreeBSD.ORG Subject: Re: kern/115882: msk driver always fails under moderate network load. Date: Mon, 17 Mar 2008 01:45:04 -0400 On Fri, Feb 29, 2008, Pyun YongHyeon wrote: > Hmm, I guess this one is not related with your previous bug > report. From the above boot messages I think msk(4) is using MSI. > How about disabling MSI by setting hw.msk.msi_disable tunable > to 1? I set hw.pci.enable_msi and hw.pci.enable_msix to 0, and no problems for the last two weeks... 
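
For reference, the MSI workarounds discussed in this thread can be collected in /boot/loader.conf. This is only a sketch under the assumption that the settings are applied as loader tunables at boot; the tunable names are the ones quoted in the messages above, and the thread does not establish that they cure the hang on every system.

```
# /boot/loader.conf (illustrative sketch, not a confirmed fix)

# Disable MSI and MSI-X for all PCI devices, as reported by the
# originator and by David Schultz.
hw.pci.enable_msi=0
hw.pci.enable_msix=0

# Alternative suggested by Pyun YongHyeon: disable MSI for msk(4) only.
hw.msk.msi_disable=1
```

The per-driver tunable leaves MSI enabled for other devices, which may be preferable when only msk(4) misbehaves; a reboot (or a module reload for the msk tunable, assuming the driver is built as a module) would be needed for the change to take effect.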
State-Changed-From-To: open->closed
State-Changed-By: yongari
State-Changed-When: Tue Mar 18 01:46:06 UTC 2008
State-Changed-Why:
A workaround for a bug that caused a hardware hang was MFCed to
RELENG_7 and RELENG_6. Close this PR, as a workaround for the
watchdog timeout on MacBook is available as a tunable.
Thanks for testing!

http://www.freebsd.org/cgi/query-pr.cgi?pr=115882

From: "Mars G Miro"
To: bug-followup@FreeBSD.org, Hugo
Cc: "Pyun YongHyeon", "David Schultz"
Subject: Re: kern/115882: [msk] msk driver always fails under moderate network load.
Date: Wed, 21 May 2008 12:15:17 +0800

Greetz,

I'm seeing this problem too, in one of my boxens. It is a Nexgate
NSA1086 (an older
http://www.nexcom.com/ProductModel.aspx?id=d791325b-f791-4fc3-b98b-a482637c72e3)
that has 4 msk and 4 sk NICs. Its docs say the msk is connected to its
PCI-e bus while the sk are connected to its PCI-X bus. The box runs on
RELENG_7 (csup'd 20080409) and even has 1.18.2.11 of if_msk.c. I've
tried the patch at
http://people.freebsd.org/~yongari/msk/msk.pcierr.patch but to no
avail. Even disabling hw.pci.enable_msix and hw.pci.enable_msi before
& after the msk.pcierr.patch doesn't help. pciconf -lvc tells me:

mskc0@pci0:1:0:0: class=0x020000 card=0x522111ab chip=0x436211ab rev=0x15 hdr=0x00
    vendor = 'Marvell Semiconductor (Was: Galileo Technology Ltd)'
    device = '88E8053 Marvell Yukon 88E8053 PCI-E Gigabit Ethernet Controller'
    class = network
    subclass = ethernet
    cap 01[48] = powerspec 2 supports D0 D1 D2 D3 current D0
    cap 03[50] = VPD
    cap 05[5c] = MSI supports 2 messages, 64 bit enabled with 2 messages
    cap 10[e0] = PCI-Express 1 legacy endpoint

The quickest way I could reproduce the problem is to run an iperf
server on this box and from another box, fire up an iperf client
sending 200G of data. In about ~ 2 hours, the NIC becomes unusable and
no amount of ifconfig up/down can help it, forcing me to reboot.
The odd thing is that in my test setup the iperf client is also an msk
(a Gigabyte GA-965P mobo) and doesn't have problems at all. I am
willing to test patches.

Thanks.

cheers
mars

From: "Mars G Miro"
To: bug-followup@freebsd.org, Hugo, "Pyun YongHyeon"
Cc: "David Schultz", yongari@FreeBSD.org, "Kudo Chien"
Subject: Re: kern/115882: [msk] msk driver always fails under moderate network load.
Date: Fri, 23 May 2008 13:56:37 +0800

On Wed, May 21, 2008 at 12:15 PM, Mars G Miro wrote:
> Greetz,
>
> I'm seeing this problem too, in one of my boxens. It is a Nexgate
> NSA1086 (an older
> http://www.nexcom.com/ProductModel.aspx?id=d791325b-f791-4fc3-b98b-a482637c72e3)
> that has 4 msk and 4 sk NICs. Its docs say the msk is connected to its
> PCI-e bus while the sk are connected to its PCI-X bus. The box runs on
> RELENG_7 (csup'd 20080409) and even has 1.18.2.11 of if_msk.c. I've
> tried the patch at
> http://people.freebsd.org/~yongari/msk/msk.pcierr.patch but to no
> avail. Even disabling hw.pci.enable_msix and hw.pci.enable_msi before
> & after the msk.pcierr.patch doesn't help. pciconf -lvc tells me:
>
> mskc0@pci0:1:0:0: class=0x020000 card=0x522111ab
> chip=0x436211ab rev=0x15 hdr=0x00
> vendor = 'Marvell Semiconductor (Was: Galileo Technology Ltd)'
> device = '88E8053 Marvell Yukon 88E8053 PCI-E Gigabit Ethernet
> Controller'
> class = network
> subclass = ethernet
> cap 01[48] = powerspec 2 supports D0 D1 D2 D3 current D0
> cap 03[50] = VPD
> cap 05[5c] = MSI supports 2 messages, 64 bit enabled with 2 messages
> cap 10[e0] = PCI-Express 1 legacy endpoint
>
> The quickest way I could reproduce the problem is to run an iperf
> server on this box and from another box, fire up an iperf client
> sending 200G of data. In about ~ 2 hours, the NIC becomes unusable and
> no amount of ifconfig up/down can help it, forcing me to reboot. The
> odd thing is that in my test setup the iperf client is also an msk (a
> Gigabyte GA-965P mobo) and doesn't have problems at all. I am willing
> to test patches.
>

I guess I came to my conclusions a wee bit early on the GA-965P mobo's
msk NIC. It has the same problems (I was rsync'ing a /usr/obj
(3.1G)) and this problem manifested itself. It's interesting that when
I was using it as an iperf client (as described above) it didn't have
the problem at all. Another guy having a similar mobo filed a PR
#kern/116853. He's CC'd.

Thanks.

> Thanks.
>
> cheers
mars

From: Pyun YongHyeon
To: Mars G Miro
Cc: bug-followup@freebsd.org, Hugo, David Schultz, yongari@freebsd.org, Kudo Chien
Subject: Re: kern/115882: [msk] msk driver always fails under moderate network load.
Date: Fri, 23 May 2008 15:19:54 +0900

On Fri, May 23, 2008 at 01:56:37PM +0800, Mars G Miro wrote:
> On Wed, May 21, 2008 at 12:15 PM, Mars G Miro wrote:
> > Greetz,
> >
> > I'm seeing this problem too, in one of my boxens. It is a Nexgate
> > NSA1086 (an older
> > http://www.nexcom.com/ProductModel.aspx?id=d791325b-f791-4fc3-b98b-a482637c72e3)
> > that has 4 msk and 4 sk NICs. Its docs say the msk is connected to its
> > PCI-e bus while the sk are connected to its PCI-X bus. The box runs on
> > RELENG_7 (csup'd 20080409) and even has 1.18.2.11 of if_msk.c. I've
> > tried the patch at
> > http://people.freebsd.org/~yongari/msk/msk.pcierr.patch but to no
> > avail. Even disabling hw.pci.enable_msix and hw.pci.enable_msi before
> > & after the msk.pcierr.patch doesn't help. pciconf -lvc tells me:
> >
> > mskc0@pci0:1:0:0: class=0x020000 card=0x522111ab
> > chip=0x436211ab rev=0x15 hdr=0x00
> > vendor = 'Marvell Semiconductor (Was: Galileo Technology Ltd)'
> > device = '88E8053 Marvell Yukon 88E8053 PCI-E Gigabit Ethernet
> > Controller'
> > class = network
> > subclass = ethernet
> > cap 01[48] = powerspec 2 supports D0 D1 D2 D3 current D0
> > cap 03[50] = VPD
> > cap 05[5c] = MSI supports 2 messages, 64 bit enabled with 2 messages
> > cap 10[e0] = PCI-Express 1 legacy endpoint
> >
> > The quickest way I could reproduce the problem is to run an iperf
> > server on this box and from another box, fire up an iperf client
> > sending 200G of data. In about ~ 2 hours, the NIC becomes unusable and
> > no amount of ifconfig up/down can help it, forcing me to reboot. The
> > odd thing is that in my test setup the iperf client is also an msk (a
> > Gigabyte GA-965P mobo) and doesn't have problems at all. I am willing
> > to test patches.
> >
>
>
> I guess I came to my conclusions a wee bit early on the GA-965P mobo's
> msk NIC. It has the same problems (I was rsync'ing a /usr/obj
> (3.1G)) and this problem manifested itself. It's interesting that when
> I was using it as an iperf client (as described above) it didn't have
> the problem at all. Another guy having a similar mobo filed a PR
> #kern/116853. He's CC'd.
>

I guess the problem you've encountered has nothing to do with this
PR, so it would be better if you could open another PR for this
issue. I think I fixed the hardware hang issue of the 88E8053, but
your case suggests the problem wasn't covered by the workaround.
In order to verify whether you are seeing the same kind of hardware
bug:

1. Can you check whether msk(4) received flow-control frames from
   the sender? Since msk(4) has no hardware counter support yet, you
   may have to check the statistics of the sender or the switch.
2. When msk(4) is not responding, can you send packets to other
   hosts via msk(4)?
   Another checkpoint would be whether you can still see incoming
   packets with tcpdump on msk(4).
3. Show me the output of 'ifconfig msk0' when msk(4) is not
   responding.
4. When msk(4) is not responding, can you check whether msk(4) is
   still generating interrupts? (Check the output of
   'systat -vmstat 1'.)

PR 116853 is a completely different one. His controller is not an
88E8053.

--
Regards,
Pyun YongHyeon
>Unformatted: