From shalunov@tuzik.lz.att.com Thu Mar 11 08:35:35 1999 Return-Path: Received: from alms1.fw.att.com (alms1.att.com [192.128.167.146]) by hub.freebsd.org (Postfix) with ESMTP id AB69B15347 for ; Thu, 11 Mar 1999 08:35:16 -0800 (PST) (envelope-from shalunov@tuzik.lz.att.com) Received: from tuzik.lz.att.com ([135.25.200.84]) by alms1.fw.att.com (AT&T/IPNS/GW-1.0) with ESMTP id LAA12678 for ; Thu, 11 Mar 1999 11:34:57 -0500 (EST) Received: (from shalunov@localhost) by tuzik.lz.att.com (8.9.2/8.9.2) id KAA00419; Thu, 11 Mar 1999 10:36:35 -0500 (EST) (envelope-from shalunov) Message-Id: <199903111536.KAA00419@tuzik.lz.att.com> Date: Thu, 11 Mar 1999 10:36:35 -0500 (EST) From: shalunov@lynxhub.lz.att.com Sender: shalunov@tuzik.lz.att.com Reply-To: shalunov@lynxhub.lz.att.com To: FreeBSD-gnats-submit@freebsd.org Subject: kernel lock-up with fork/exec stress test X-Send-Pr-Version: 3.2 >Number: 10545 >Category: kern >Synopsis: When a fork/exec stress test is run, the machine locks up >Confidential: no >Severity: critical >Priority: high >Responsible: freebsd-bugs >State: closed >Quarter: >Keywords: >Date-Required: >Class: sw-bug >Submitter-Id: current-users >Arrival-Date: Thu Mar 11 08:40:00 PST 1999 >Closed-Date: Mon Sep 3 11:14:11 PDT 2001 >Last-Modified: Mon Sep 03 11:14:47 PDT 2001 >Originator: stanislav shalunov >Release: FreeBSD 3.1-RELEASE i386 >Organization: AT&T >Environment: Stock 3.1-RELEASE system, with recompiled kernel. Hardware is Dell-assembled, details are hopefully clear from dmesg. Kernel config file with comments removed follows (I did not change /etc/make.conf and nothing unusual is set in the environment): machine "i386" cpu "I686_CPU" ident TUZIK maxusers 512 options INET #InterNETworking options FFS #Berkeley Fast Filesystem options FFS_ROOT #FFS usable as root device [keep this!] options MFS #Memory Filesystem options NFS #Network Filesystem options MSDOSFS #MSDOS Filesystem options "CD9660" #ISO 9660 Filesystem options PROCFS #Process filesystem options "COMPAT_43" #Compatible with BSD 4.3 [KEEP THIS!] options IDE_DELAY=5000 options UCONSOLE #Allow users to grab the console options FAILSAFE #Be conservative options USERCONFIG #boot -c editor options VISUAL_USERCONFIG #visual boot -c editor config kernel root on wd0s2a controller isa0 controller eisa0 controller pci0 controller fdc0 at isa? port "IO_FD1" bio irq 6 drq 2 disk fd0 at fdc0 drive 0 disk fd1 at fdc0 drive 1 options "CMD640" # work around CMD640 chip deficiency controller wdc0 at isa? port "IO_WD1" bio irq 14 disk wd0 at wdc0 drive 0 disk wd1 at wdc0 drive 1 controller wdc1 at isa? port "IO_WD2" bio irq 15 disk wd2 at wdc1 drive 0 disk wd3 at wdc1 drive 1 options ATAPI #Enable ATAPI support for IDE bus options ATAPI_STATIC #Don't do it as an LKM device acd0 #IDE CD-ROM controller atkbdc0 at isa? port IO_KBD tty device atkbd0 at isa? tty irq 1 device psm0 at isa? tty irq 12 device vga0 at isa? port ? conflicts pseudo-device splash device sc0 at isa? tty device npx0 at isa? port IO_NPX irq 13 device sio0 at isa? port "IO_COM1" flags 0x10 tty irq 4 device sio1 at isa? port "IO_COM2" tty irq 3 device ppc0 at isa? port? net irq 7 controller ppbus0 device nlpt0 at ppbus? device plip0 at ppbus? device ppi0 at ppbus? device xl0 pseudo-device loop pseudo-device ether pseudo-device sl 1 pseudo-device ppp 1 pseudo-device tun 1 pseudo-device pty 16 pseudo-device gzip # Exec gzipped a.out's options KTRACE #kernel tracing options SYSVSHM options SYSVMSG pseudo-device bpfilter 4 #Berkeley packet filter Maxusers is set high because I needed the machine to be able to handle about 20000 simultaneous TCP connections (mostly in TIME_WAIT state). These are for Apache. The output of dmesg(1) follows: Copyright (c) 1992-1999 FreeBSD Inc. Copyright (c) 1982, 1986, 1989, 1991, 1993 The Regents of the University of California. All rights reserved. FreeBSD 3.1-RELEASE #0: Wed Mar 10 11:57:33 EST 1999 shalunov@tuzik.lz.att.com:/usr/src/sys/compile/TUZIK Timecounter "i8254" frequency 1193182 Hz Timecounter "TSC" frequency 398776065 Hz CPU: Pentium II/Xeon/Celeron (398.78-MHz 686-class CPU) Origin = "GenuineIntel" Id = 0x652 Stepping=2 Features=0x183f9ff> real memory = 67108864 (65536K bytes) avail memory = 61865984 (60416K bytes) Preloaded elf kernel "kernel" at 0xf0288000. Probing for devices on PCI bus 0: chip0: rev 0x02 on pci0.0.0 chip1: rev 0x02 on pci0.1.0 chip2: rev 0x02 on pci0.7.0 ide_pci0: rev 0x01 on pci0.7.1 chip3: rev 0x02 on pci0.7.3 chip4: rev 0x03 on pci0.15.0 xl0: <3Com 3c905B Fast Etherlink XL 10/100BaseTX> rev 0x24 int a irq 11 on pci0.17.0 xl0: Ethernet address: 00:c0:4f:6e:f9:37 xl0: autoneg complete, link status good (half-duplex, 10Mbps) Probing for devices on PCI bus 1: vga0: rev 0x5c int a irq 9 on pci1.0.0 Probing for devices on PCI bus 2: Probing for devices on the ISA bus: sc0 on isa sc0: VGA color <16 virtual consoles, flags=0x0> atkbdc0 at 0x60-0x6f on motherboard atkbd0 irq 1 on isa psm0 irq 12 on isa psm0: model IntelliMouse, device ID 3 sio0 at 0x3f8-0x3ff irq 4 flags 0x10 on isa sio0: type 16550A sio1 at 0x2f8-0x2ff irq 3 on isa sio1: type 16550A fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa fdc0: FIFO enabled, 8 bytes threshold fd0: 1.44MB 3.5in wdc0 at 0x1f0-0x1f7 irq 14 on isa wdc0: unit 0 (wd0): wd0: 6149MB (12594960 sectors), 12495 cyls, 16 heads, 63 S/T, 512 B/S wdc1 at 0x170-0x177 irq 15 on isa wdc1: unit 0 (atapi): , removable, accel, dma, iordis acd0: drive speed 5512KB/sec, 256KB cache acd0: supported read types: CD-R, CD-RW, CD-DA acd0: Audio: play, 16 volume levels acd0: Mechanism: ejectable tray acd0: Medium: no/blank disc inside, unlocked ppc0 at 0x378 irq 7 on isa ppc0: SMC-like chipset (ECP/EPP/PS2/NIBBLE) in COMPATIBLE mode ppc0: FIFO with 16/16/8 bytes threshold nlpt0: on ppbus 0 nlpt0: Interrupt-driven port ppi0: on ppbus 0 plip0: on ppbus 0 vga0 at 0x3b0-0x3df maddr 0xa0000 msize 131072 on isa npx0 on motherboard npx0: INT 16 interface WARNING: / was not properly dismounted The file /etc/rc.conf follows: # This file now contains just the overrides from /etc/defaults/rc.conf # please make all changes to this file. nfs_client_enable="YES" network_interfaces="xl0 lo0" ifconfig_xl0="inet 135.25.200.84 netmask 255.255.255.0" defaultrouter="135.25.200.1" hostname="tuzik.lz.att.com" linux_enable="YES" accounting_enable="YES" lpd_enable="YES" moused_port="/dev/psm0" moused_enable="YES" saver="logo" blanktime="300" font8x8="koi8-r-8x8" font8x14="koi8-r-8x14" font8x16="koi8-r-8x16" keyrate="fast" keymap="ru.koi8-r" named_enable="YES" sendmail_flags="-bd -q7m" dumpdev="/dev/wd0s2b" The only non-system daemon running is Apache (apache13 from the ports collection). I will be happy to provide any additional enviromental information if you cannot reproduce the problem. >Description: When I run a program to benchmark the system's ability to do fork/exec's, the kernel reproduceably locks up. The program is attached below. The program allows to specify for how long to run the test. If a small number is choosen (up to 5 seconds), everything is OK and the system shows truly unbelievable rates (900--700 fork/exec's per second). If I run the program for the default duration (60 seconds) the console that the program was run from (I do not use X because of unsupported video card) accepts input, but ^C, ^\ and ^Z do nothing (just get printed); the other virtual consoles can be switched to, but the display of programs such as top(1) is not updated and input is ignored completely. Network connections are accepted, but not served (I tried to telnet in to reboot the machine; the connection was accepted, but the login prompt never appeared). The screensaver worked all right: it turned on and then turned off when I pressed a key. The three finger salute resulted in nothing. The machine had to be cold booted. >How-To-Repeat: The file fork-exec.c follows: /* Test program that simply generates lots of fork/exec's. We execute `/bin/sh -c ""'. Written by Stanislav Shalunov. */ #include #include #include #include #include volatile int more_forking; void handler(sig) int sig; { switch (sig) { case SIGINT: case SIGALRM: more_forking = 0; break; default: ; } } int main(argc, argv) int argc; char *argv[]; { int pid; int attempts_to_fork, forks; int start, duration; int time_to_fork; int delay; struct sigaction sa; delay = 0; switch (argc) { case 1: time_to_fork = 60; break; case 2: case 3: time_to_fork = atoi(argv[1]); if (time_to_fork < 10) fprintf(stderr, "Warning: time value of %ds is too " "small to get adequate results.\n", time_to_fork); if (argc > 2) delay = atoi(argv[2]); break; default: fprintf(stderr, "Usage: fork-exec [time to fork in seconds" " [delay in milliseconds]]\n" "Default time is 60 seconds. You can always send an" " interrupt earlier.\nDefault delay is zero.\n"); exit(1); } sa.sa_handler = handler; sa.sa_flags = 0; sigemptyset(&sa.sa_mask); sigaction(SIGINT, &sa, NULL); sigaction(SIGALRM, &sa, NULL); alarm(time_to_fork); attempts_to_fork = forks = 0; more_forking = 1; system("uptime"); start = time(NULL); while (more_forking) { attempts_to_fork++; /* poll(0, 0, 0) should be basically harmless, but we want to eliminate (1) the overhead of a system call (2) the possibility to sleep for a long time because of an overloaded system. */ if (delay) poll(0, 0, delay); pid = fork(); if (pid == 0) { /* XXX: On different systems on can have different shell startup files, etc. */ execl("/bin/sh", "sh", "-c", "", 0); perror("execl"); exit(1); } else if (pid > 0) forks++; } duration = time(NULL) - start; if (! duration) duration = 1; printf("In %d secs made %d attempts to fork, of which %d succeeded.\n", duration, attempts_to_fork, forks); printf("%d%% of forks succeeded; %d forks/second.\n", forks*100/attempts_to_fork, forks/(duration)); fflush(stdout); execlp("uptime", "uptime", 0); fprintf(stderr, "Either ``uptime'' is not in your PATH, or this test\n" "has brought the system to its knees, see what happens:\n"); perror("execlp(uptime)"); exit(1); } In order to reproduce the problem, compile the program (I tried ``gcc -o fork-exec fork-exec.c'' and ``gcc -O6 -o fork-exec fork-exec.c''), and run it with ``./fork-exec''. Have your finger on the RESET button. You might want to play with the arguments, for small values of the first argument the machine doesn't crash. >Fix: Unknown. Workaround might be to set rlimits, but that's not what I am looking for: other Unix systems (such as Solaris, etc.) do not misbehave in any way when I test them this way. >Release-Note: >Audit-Trail: From: Sheldon Hearn To: stanislav shalunov Cc: freebsd-gnats-submit@freebsd.org Subject: Re: kern/10545: When a fork/exec stress test is run, the machine locks up Date: Fri, 19 Mar 1999 16:52:26 +0200 Hi Stanislav, There are known issues surrounding high values of maxusers (search freebbsd-current mailing list archives). I'll bet you a noddy badge that your box won't lock up if you drop maxusers to 64 and try again. Ciao, Sheldon. From: Stanislav Shalunov To: sheldonh@iafrica.com (Sheldon Hearn) Cc: freebsd-gnats-submit@freebsd.org Subject: Re: kern/10545: When a fork/exec stress test is run, the machine locks up Date: Fri, 19 Mar 1999 10:32:29 -0500 (EST) Sheldon Hearn has written: > There are known issues surrounding high values of maxusers (search > freebbsd-current mailing list archives). I'll bet you a noddy badge that > your box won't lock up if you drop maxusers to 64 and try again. The noddy badge can be sent to Stanislav Shalunov 307 Middletown Lincroft Rd, 1M-214 Lincroft, NJ 07738-1526 I have decreased maxusers to 64 (recompiled the kernel and rebooted into it, of course), and the machine still locked up with exactly the same symptoms. -- stanislav shalunov@lynxhub.att.com | 732-576-3252 From: Sheldon Hearn To: shalunov@lynxhub.lz.att.com Cc: freebsd-gnats-submit@freebsd.org Subject: Re: kern/10545: When a fork/exec stress test is run, the machine locks up Date: Fri, 26 Mar 1999 03:17:40 +0200 I've tried your stress test on my 4.0-CURRENT box (built today) and I don't get a lock-up. Instead, I get thousands of these after the first few seconds: Mar 26 03:15:15 axl /kernel: proc: table i Mar 26 03:15:15 axl /kernel: proc: table is full Mar 26 03:15:15 axl last message repeated 43 times Mar 26 03:15:15 axl /kernel: p Mar 26 03:15:15 axl /kernel: proc: table is full [...] I woulda thought this to be expected behaviour. Have you tried your test on a CURRENT box? Ciao, Sheldon. From: stanislav shalunov To: sheldonh@iafrica.com Cc: freebsd-gnats-submit@freebsd.org Subject: Re: kern/10545: When a fork/exec stress test is run, the machine locks up Date: Fri, 26 Mar 1999 09:59:48 -0500 (EST) > From: Sheldon Hearn > I've tried your stress test on my 4.0-CURRENT box (built today) and I > don't get a lock-up. Instead, I get thousands of these after the first > few seconds: What results do you get for the default run (without arguments)? > Mar 26 03:15:15 axl /kernel: proc: table i > Mar 26 03:15:15 axl /kernel: proc: table is full > Mar 26 03:15:15 axl last message repeated 43 times > Mar 26 03:15:15 axl /kernel: p > Mar 26 03:15:15 axl /kernel: proc: table is full > [...] I would expect something like this. (I don't quite understand why syslogd is missing *parts* of messages--I would expect it either to get message or to lose it,--but that's another story.) > I woulda thought this to be expected behaviour. Have you tried your test > on a CURRENT box? I would actually prefer to keep this a -release box. If -current doesn't lock up, then there must be a reason for this. (Other than that your hardware might have some slightly different timings subtly affecting the kernel, or that your other system activity pattern is different.) What kernel patches might be relevant, anyone? -- stanislav shalunov@att.com 732-576-3252 From: Sheldon Hearn To: stanislav shalunov Cc: freebsd-gnats-submit@freebsd.org Subject: Re: kern/10545: When a fork/exec stress test is run, the machine locks up Date: Sat, 27 Mar 1999 14:12:23 +0200 On Fri, 26 Mar 1999 09:59:48 EST, stanislav shalunov wrote: > What results do you get for the default run (without arguments)? $ ./fork-exec 2:09PM up 2 days, 26 secs, 4 users, load averages: 0.03, 0.04, 0.01 In 60 secs made 3532420 attempts to fork, of which 992 succeeded. 0% of forks succeeded; 16 forks/second. 2:10PM up 2 days, 1 min, 4 users, load averages: 5.62, 2.34, 0.92 During the test, any command run from bash gives: bash: fork: Resource temporarily unavailable Ciao, Sheldon. From: stanislav shalunov To: givenc@rmci.net Cc: freebsd-gnats-submit@freebsd.org Subject: Re: kern/10545: When a fork/exec stress test is run, the machine locks up Date: Mon, 19 Jul 1999 09:52:21 -0400 (EDT) I have mentioned setting rlimits as a remedy in my original report. The thing is, other Unices as well as reportedly other versions of FreeBSD don't lock up like that when this test is run. From: Robert Garrett To: freebsd-gnats-submit@freebsd.org, shalunov@lynxhub.lz.att.com Cc: Subject: Re: kern/10545: When a fork/exec stress test is run, the machine locks up Date: Wed, 11 Aug 1999 06:32:13 -0400 Interesting.. Mondays -current drops core running this program from a non root user.. dual PII-350 mhz box 64 meg memory max users is set to 350 but i was under the impression that dg had fixed that. Robg State-Changed-From-To: open->feedback State-Changed-By: will State-Changed-When: Thu May 24 18:43:13 PDT 2001 State-Changed-Why: Sheldon seemed to indicate that this problem might be solved in newer versions of FreeBSD. Confirm, please? http://www.FreeBSD.org/cgi/query-pr.cgi?pr=10545 State-Changed-From-To: feedback->closed State-Changed-By: will State-Changed-When: Thu May 24 18:45:50 PDT 2001 State-Changed-Why: Submitter's email address expired. http://www.FreeBSD.org/cgi/query-pr.cgi?pr=10545 State-Changed-From-To: closed->open State-Changed-By: will State-Changed-When: Thu May 24 18:50:08 PDT 2001 State-Changed-Why: I'll have another look at this regardless... Responsible-Changed-From-To: freebsd-bugs->will Responsible-Changed-By: will Responsible-Changed-When: Thu May 24 18:50:08 PDT 2001 Responsible-Changed-Why: I'm gonna check this out sometime soon. http://www.FreeBSD.org/cgi/query-pr.cgi?pr=10545 Responsible-Changed-From-To: will->freebsd-bugs Responsible-Changed-By: will Responsible-Changed-When: Wed Aug 29 16:24:04 PDT 2001 Responsible-Changed-Why: Let someone else handle this. http://www.FreeBSD.org/cgi/query-pr.cgi?pr=10545 State-Changed-From-To: open->closed State-Changed-By: dd State-Changed-When: Mon Sep 3 11:14:11 PDT 2001 State-Changed-Why: will should've closed this instead of bring it back to bugs since he reopened it for "wanting to check it out sometime soon". http://www.FreeBSD.org/cgi/query-pr.cgi?pr=10545 >Unformatted: