From dillon@flea.best.net Fri Oct 24 19:06:16 1997 Received: from flea.best.net (root@flea.best.net [206.184.139.131]) by hub.freebsd.org (8.8.7/8.8.7) with ESMTP id TAA05344 for ; Fri, 24 Oct 1997 19:06:14 -0700 (PDT) (envelope-from dillon@flea.best.net) Received: (from dillon@localhost) by flea.best.net (8.8.7/8.7.3) id TAA06359; Fri, 24 Oct 1997 19:05:18 -0700 (PDT) Message-Id: <199710250205.TAA06359@flea.best.net> Date: Fri, 24 Oct 1997 19:05:18 -0700 (PDT) From: Matt Dillon Reply-To: dillon@best.net To: FreeBSD-gnats-submit@freebsd.org Subject: VM lookup, endless loop in vm_map_lookup_entry() X-Send-Pr-Version: 3.2 >Number: 4844 >Category: kern >Synopsis: VM lookup, endless loop in vm_map_lookup_entry() >Confidential: no >Severity: critical >Priority: high >Responsible: freebsd-bugs >State: closed >Quarter: >Keywords: >Date-Required: >Class: sw-bug >Submitter-Id: current-users >Arrival-Date: Fri Oct 24 19:10:00 PDT 1997 >Closed-Date: Sun Apr 26 14:01:10 PDT 1998 >Last-Modified: Sun Apr 26 14:01:25 PDT 1998 >Originator: Matt Dillon >Release: FreeBSD 2.2-STABLE i386 >Organization: BEST Internet Communications >Environment: FreeBSD 2.2.5, PPro 200, 2940UW, DEC ether, all pci. >Description: Within a day of upgrading to 2.2.5, we have had to lockups. We were able to escape into the debugger and panic the machines. Both machines locked up in vm_map_lookup_entry() while trying to brelse/bfreekva/vm_map_delete (see below). One locked up from a SCSI interrupt doing the above sequence, the other locked up from system call. The kernel is locking up in vm_map_delete, but I believe it is ACTUALLY locking up in the vm_map_lookup_entry routine that vm_map_delete() is calling and that the Xfastintr4 is losing the vm_map_lookup_entry context. vm_map_lookup_entry is locking up by getting into an endless loop. I have traced and demonstrated the loop below. The cores are available if you need them. We have NEVER had this sort of crash before. We had two the first day operating under 2.2.5 from 2.2.2. I'm hoping it is something obvious. I believe the buffer cache bp's linked lists are getting corrupted. -Matt -------------------------------------- CRASH #1 --------------------------- cur $4 = (struct vm_map_entry *) 0xf2ecdc00 last $8 = (struct vm_map_entry *) 0xf1937f20 starting at cur, linking through cur->next ...... somewhere int the middle, it loops without hitting last $49 = (struct vm_map_entry *) 0xf2ffeec0 (kgdb) print $->next $50 = (struct vm_map_entry *) 0xf2ff5b40 (kgdb) print $->next $51 = (struct vm_map_entry *) 0xf2fdab40 (kgdb) print $->next $52 = (struct vm_map_entry *) 0xf2584840 (kgdb) print $->next $53 = (struct vm_map_entry *) 0xf2f36040 (kgdb) print $->next $54 = (struct vm_map_entry *) 0xf27baa20 (kgdb) print $->next $55 = (struct vm_map_entry *) 0xf2ecc5c0 (kgdb) print $->next $56 = (struct vm_map_entry *) 0xf247b000 (kgdb) print $->next $57 = (struct vm_map_entry *) 0xf2f58040 (kgdb) print $->next $58 = (struct vm_map_entry *) 0xf2f52c40 (kgdb) print $->next $59 = (struct vm_map_entry *) 0xf2fe6440 (kgdb) print $->next $60 = (struct vm_map_entry *) 0xf2affdc0 (kgdb) print $->next $61 = (struct vm_map_entry *) 0xf3022ec0 (kgdb) print $->next $62 = (struct vm_map_entry *) 0xf2f51240 (kgdb) print $->next $63 = (struct vm_map_entry *) 0xf2f50940 (kgdb) print $->next $64 = (struct vm_map_entry *) 0xf2685200 (kgdb) print $->next $65 = (struct vm_map_entry *) 0xf2cd4640 (kgdb) print $->next $66 = (struct vm_map_entry *) 0xf2f4a400 (kgdb) print $->next $67 = (struct vm_map_entry *) 0xf2ffeec0 LOOPS #12 0xf01014e2 in db_command_loop () at ../../ddb/db_command.c:462 #13 0xf0103c28 in db_trap (type=0x3, code=0x0) at ../../ddb/db_trap.c:73 #14 0xf01b4d4b in kdb_trap (type=0x3, code=0x0, regs=0xefbffc14) at ../../i386/i386/db_interface.c:126 #15 0xf01be4f4 in trap (frame={tf_es = 0xf0100010, tf_ds = 0xf4430010, tf_edi = 0x10000, tf_esi = 0xf2347800, tf_ebp = 0xefbffc70, tf_isp = 0xefbffc3c, tf_ebx = 0xf5bc3000, tf_edx = 0x3f9, tf_ecx = 0x3f9, tf_eax = 0x0, tf_trapno = 0x3, tf_err = 0x0, tf_eip = 0xf01c99ad, tf_cs = 0x8, tf_eflags = 0x46, tf_esp = 0xf5bc3000, tf_ss = 0xefbffce4}) at ../../i386/i386/trap.c:403 #16 0xf01c99ad in siointr1 (com=0xf2347800) at ../../i386/isa/sio.c:1401 #17 0xf01c9952 in siointr (unit=0x0) at ../../i386/isa/sio.c:1338 #18 0xf01b61e7 in Xfastintr4 () #19 0xf01a80b5 in vm_map_delete (map=0xf1937f00, start=0xf5bc3000, end=0xf5bc5000) at ../../vm/vm_map.c:1878 #20 0xf012ca4e in bfreekva (bp=0xf4434390) at ../../kern/vfs_bio.c:194 #21 0xf012d350 in brelse (bp=0xf4434390) at ../../kern/vfs_bio.c:603 #22 0xf012dc38 in getnewbuf (slpflag=0x0, slptimeo=0x0, size=0x2000, maxsize=0x2000) at ../../kern/vfs_bio.c:1000 #23 0xf012dfd5 in getblk (vp=0xf27dc380, blkno=0x177, size=0x2000, slpflag=0x0, slptimeo=0x0) at ../../kern/vfs_bio.c:1197 #24 0xf012ff1a in cluster_rbuild (vp=0xf27dc380, filesize=0x9d6048, lbn=0x177, blkno=0x42bc50, size=0x2000, run=0x8) at ../../kern/vfs_cluster.c:296 #25 0xf012fd2c in cluster_read (vp=0xf27dc380, filesize=0x9d6048, lblkno=0x160, size=0x2000, cred=0x0, bpp=0xefbffec0) at ../../kern/vfs_cluster.c:217 #26 0xf019a6e1 in ffs_read (ap=0xefbffee8) at ../../ufs/ufs/ufs_readwrite.c:124 #27 0xf0136df1 in vn_read (fp=0xf2f9c940, uio=0xefbfff34, cred=0xf2fa8180) at vnode_if.h:255 #28 0xf011a1b7 in read (p=0xf2fb3c00, uap=0xefbfff94, retval=0xefbfff84) at ../../kern/sys_generic.c:116 #29 0xf01bef37 in syscall (frame={tf_es = 0xefbf0027, tf_ds = 0xefbf0027, tf_edi = 0x8000, tf_esi = 0x52e000, tf_ebp = 0xefbfc1f4, tf_isp = 0xefbfffe4, tf_ebx = 0x5a7f26, tf_edx = 0x131100, tf_ecx = 0x0, tf_eax = 0x3, tf_trapno = 0x7, tf_err = 0x7, tf_eip = 0x201bafd1, tf_cs = 0x1f, tf_eflags = 0x206, tf_esp = 0xefbfbd10, tf_ss = 0x27}) at ../../i386/i386/trap.c:890 #30 0x201bafd1 in ?? () -------------------------------------- CRASH #2 --------------------------- (same problem showing the loop. This time, it looped from an interrupt) #14 0xf01b4d4b in kdb_trap (type=0x3, code=0x0, regs=0xf01dbdd0) at ../../i386/i386/db_interface.c:126 #15 0xf01be4f4 in trap (frame={tf_es = 0x10, tf_ds = 0xc0030010, tf_edi = 0x10000, tf_esi = 0xf2187800, tf_ebp = 0xf01dbe2c, tf_isp = 0xf01dbdf8, tf_ebx = 0xf6007000, tf_edx = 0x3f9, tf_ecx = 0x3f9, tf_eax = 0x0, tf_trapno = 0x3, tf_err = 0x0, tf_eip = 0xf01c99ad, tf_cs = 0x8, tf_eflags = 0x46, tf_esp = 0xf6007000, tf_ss = 0xf01dbea0}) at ../../i386/i386/trap.c:403 #16 0xf01c99ad in siointr1 (com=0xf2187800) at ../../i386/isa/sio.c:1401 #17 0xf01c9952 in siointr (unit=0x0) at ../../i386/isa/sio.c:1338 #18 0xf01b61e7 in Xfastintr4 () #19 0xf01a80b5 in vm_map_delete (map=0xf1777f00, start=0xf6007000, end=0xf6009000) at ../../vm/vm_map.c:1878 #20 0xf012ca4e in bfreekva (bp=0xf4214670) at ../../kern/vfs_bio.c:194 #21 0xf012d350 in brelse (bp=0xf4214670) at ../../kern/vfs_bio.c:603 #22 0xf012ec7c in biodone (bp=0xf4214670) at ../../kern/vfs_bio.c:1685 #23 0xf018e56c in scsi_done (xs=0xf1777a80) at ../../scsi/scsi_base.c:449 #24 0xf01d5471 in ahc_done (ahc=0xf217f800, scb=0xf217eee0) at ../../i386/scsi/aic7xxx.c:1944 #25 0xf01d2ef7 in ahc_intr (arg=0xf217f800) at ../../i386/scsi/aic7xxx.c:814 (kgdb) >How-To-Repeat: Don't know. These are medium-loaded systems. >Fix: Don't know. >Release-Note: >Audit-Trail: From: "Jordan K. Hubbard" To: dillon@best.net Cc: FreeBSD-gnats-submit@FreeBSD.ORG Subject: Re: kern/4844: VM lookup, endless loop in vm_map_lookup_entry() Date: Fri, 24 Oct 1997 19:56:04 -0700 > Within a day of upgrading to 2.2.5, we have had to lockups. > We were able to escape into the debugger and panic the machines. Bizarre - this never showed up in any of your testing of the 2.2.5-BETAs? That would seem to indicate that the bug entered at the very last minute. Jordan From: Matt Dillon To: "Jordan K. Hubbard" Cc: dima@best.net, FreeBSD-gnats-submit@FreeBSD.ORG Subject: Re: kern/4844: VM lookup, endless loop in vm_map_lookup_entry() Date: Fri, 24 Oct 1997 22:22:29 -0700 (PDT) : :> Within a day of upgrading to 2.2.5, we have had to lockups. :> We were able to escape into the debugger and panic the machines. : :Bizarre - this never showed up in any of your testing of the :2.2.5-BETAs? That would seem to indicate that the bug entered at the :very last minute. : : Jordan Correct. However, I do not know how uptodate our kernel was prior to the upgrade. It's reasonably recent. Dima should know. No other machines have crashed so far, just those two (the two crashes in the bug report are from two different machines), but I think it may only be a matter of time. Apart from that, and a tcp crash which I'm not going to worry about until the VM issue is resolved, 2.2.5 has been running nicely. -Matt Matthew Dillon Engineering, BEST Internet Communications, Inc. [always include a portion of the original email in any response!] From: Bruce Evans To: dillon@best.net, FreeBSD-gnats-submit@FreeBSD.ORG Cc: Subject: Re: kern/4844: VM lookup, endless loop in vm_map_lookup_entry() Date: Sat, 25 Oct 1997 16:40:37 +1000 > Both machines locked up in vm_map_lookup_entry() while trying > to brelse/bfreekva/vm_map_delete (see below). One locked up > from a SCSI interrupt doing the above sequence, the other locked > up from system call. brelse/bfreekva have some problems. See PR4630, especially the followups. Bruce From: dima@best.net (Dima Ruban) To: dillon@best.net (Matt Dillon) Cc: jkh@time.cdrom.com, dima@best.net, FreeBSD-gnats-submit@FreeBSD.ORG Subject: Re: kern/4844: VM lookup, endless loop in vm_map_lookup_entry() Date: Sat, 25 Oct 1997 00:17:25 -0700 (PDT) Matt Dillon writes: > : > :> Within a day of upgrading to 2.2.5, we have had to lockups. > :> We were able to escape into the debugger and panic the machines. > : > :Bizarre - this never showed up in any of your testing of the > :2.2.5-BETAs? That would seem to indicate that the bug entered at the > :very last minute. > : > : Jordan This bug was introduced somewhere in between September 19 and today. > Correct. However, I do not know how uptodate our kernel was prior > to the upgrade. It's reasonably recent. Dima should know. No other > machines have crashed so far, just those two (the two crashes in the > bug report are from two different machines), but I think it may only > be a matter of time. > > Apart from that, and a tcp crash which I'm not going to worry about > until the VM issue is resolved, 2.2.5 has been running nicely. > > -Matt > > Matthew Dillon Engineering, BEST Internet Communications, Inc. > > [always include a portion of the original email in any response!] > -- dima From: Matt Dillon To: Bruce Evans Cc: FreeBSD-gnats-submit@FreeBSD.ORG Subject: Re: kern/4844: VM lookup, endless loop in vm_map_lookup_entry() Date: Sat, 25 Oct 1997 01:15:15 -0700 (PDT) :> up from system call. : :brelse/bfreekva have some problems. See PR4630, especially the followups. : :Bruce Hmm (looking at PR4630). This looks like a rather serious problem considering the core nature of brelse(). This may be responsible for several other crashes we have had involving "biodone: buffer already done" panics. We've had four or five of those. It sounds to me that a slight modification to the PR4630 suggestion would work. Rather then call bfreekva(), brelse() puts the bp on a defered free list, yes, but why not clear out this list from getnewbuf() ? I don't particularly see the need for a high priority kernel process or other complexity. If getnewbuf() (called by getblk()) is not called from an interrupt, we are home free. I don't think anyone else allocates out of the buffer_map so the defered frees would not create a secondary effect anywhere else. -Matt Matthew Dillon Engineering, BEST Internet Communications, Inc. [always include a portion of the original email in any response!] From: Matt Dillon To: Bruce Evans Cc: FreeBSD-gnats-submit@FreeBSD.ORG Subject: Re: kern/4844: VM lookup, endless loop in vm_map_lookup_entry() Date: Sat, 25 Oct 1997 01:42:18 -0700 (PDT) :> Both machines locked up in vm_map_lookup_entry() while trying :> to brelse/bfreekva/vm_map_delete (see below). One locked up :> from a SCSI interrupt doing the above sequence, the other locked :> up from system call. : :brelse/bfreekva have some problems. See PR4630, especially the followups. : :Bruce Assuming getnewbuf() isn't called from an interrupt, I believe I have a working solution. I'm going to test it a bit more. If it works out, i'll send you the diffs. It's really simple, I add another bufqueues[] element called QUEUE_DEFERED. brelse() puts the bp on QUEUE_DEFERED rather then QUEUE_EMPTY and does not call bfreekva(). getnewbuf() then moves everything from QUEUE_DEFERED to QUEUE_EMPTY and calls bfreekva() on it. The code mods are minor, and even pretty clean. I have a question, though, brelse() makes a call to allocbuf(bp, 0) under certain conditions. Do I have to worry about this too? -Matt Matthew Dillon Engineering, BEST Internet Communications, Inc. [always include a portion of the original email in any response!] State-Changed-From-To: open->closed State-Changed-By: phk State-Changed-When: Sun Apr 26 14:01:10 PDT 1998 State-Changed-Why: fixed in -current >Unformatted: