Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 16 Sep 1998 17:55:15 -0700
From:      Mike Smith <mike@smith.net.au>
To:        current@FreeBSD.ORG
Cc:        bde@FreeBSD.ORG
Subject:   Death by SIGXCPU (problems with our clock code)
Message-ID:  <199809170055.RAA01288@dingo.cdrom.com>

next in thread | raw e-mail | index | archive | help
This is a multipart MIME message.

--==_Exmh_-21291286440
Content-Type: text/plain; charset=us-ascii


Many of us have been seeing problems with processes being killed for
absurd CPU time usage recently, and the blame has been waved everywhere 
from APM (it appears to be as powerful as the Good Times virus when it 
comes to getting into new systems) to the quality of pickled herring 
this time of year. 

Since nobody else has taken up my suggestion to instrument the code to 
find out what's going on, I've shouldered the cross.  I forwarded some 
of this to Poul earlier today, but I realise he may not be able to do 
anything about it immediately, so I thought I should open the matter 
for general study.

Attached are some small diffs to kern_clock.c and kern_synch.c which 
check various time values and report on oddities.  The indication at 
the moment would seem to be that there is a problem in the arithmetic in
kern_clock.c:tco_forward(), and I invite our arithmetic pedants to 
study it briefly to see if they can spot any problems, or to suggest 
further tests to bring the culprits to justice.

Meanwhile, here's the output I've seen so far:

mi_switch: switching history 1985.236473 > 1985.-695149134
mi_switch: switching history 1987.693624450 > 1985.241575
calcru: negative time of -1390647919 usec for pid 97 (syslogd)
getmicrotime: 905988347.835090 > 905988347.825089
microtime: 905988936.547000 > 905988936.-694832253
microtime: 905989985.342863 > 905989985.-695036431
microtime: 905991238.525890 > 905991238.-694847527
mi_switch: switching history 6224.460675 > 6224.-694925183
mi_switch: switching history 6226.693847510 > 6224.462841
mi_switch: switching history 6228.691848697 > 6224.464297
mi_switch: switching history 6230.689849221 > 6224.464892
calcru: negative time of -290624527 usec for pid 278 (xterm)
calcru: negative time of -685601360 usec for pid 272 (xterm)
mi_switch: switching history 7182.665240 > 7182.-694720727
mi_switch: switching history 7184.694051835 > 7182.667612
mi_switch: switching history 7186.692054475 > 7182.670740
mi_switch: switching history 7188.690056433 > 7182.696035
mi_switch: switching history 7190.688059242 > 7182.716614
mi_switch: switching history 7192.686061114 > 7182.726335
mi_switch: switching history 7194.684063012 > 7182.735806
mi_switch: switching history 7196.682064876 > 7182.745597
mi_switch: switching history 7198.680066750 > 7182.754845
mi_switch: switching history 7200.678068675 > 7182.764626
mi_switch: switching history 7202.676069640 > 7182.766654
mi_switch: switching history 7204.674071377 > 7182.783861
mi_switch: switching history 7206.672073450 > 7182.793647
mi_switch: switching history 7208.670075397 > 7182.803403
mi_switch: switching history 7210.668077325 > 7182.813224
mi_switch: switching history 7212.666079274 > 7182.822987
mi_switch: switching history 7214.664081179 > 7182.832737
mi_switch: switching history 7216.662083083 > 7182.842469
mi_switch: switching history 7218.660084985 > 7182.852329
mi_switch: switching history 7220.658086912 > 7182.862197
mi_switch: switching history 7222.656088814 > 7182.865549
mi_switch: switching history 7224.654090663 > 7182.874002
mi_switch: switching history 7226.652092595 > 7182.891112
mi_switch: switching history 7228.650094568 > 7182.900933
mi_switch: switching history 7230.648096446 > 7182.910791
mi_switch: switching history 7232.646098330 > 7182.920238
mi_switch: switching history 7234.644100213 > 7182.930085
mi_switch: switching history 7236.642102122 > 7182.939641
mi_switch: switching history 7238.640104005 > 7182.949304
mi_switch: switching history 7240.638105889 > 7182.959102
mi_switch: switching history 7242.636107832 > 7182.965556
mi_switch: switching history 7244.634109633 > 7182.970635
mi_switch: switching history 7246.632111500 > 7182.987897
mi_switch: switching history 7248.630113451 > 7182.997693
mi_switch: switching history 7250.628115337 > 7183.5565
mi_switch: switching history 7252.626116142 > 7183.7926
mi_switch: switching history 7254.624117864 > 7183.11609
mi_switch: switching history 7256.622119701 > 7183.37161
mi_switch: switching history 7258.620121712 > 7183.46838
mi_switch: switching history 7260.618123816 > 7183.56696
mi_switch: switching history 7262.616125696 > 7183.65549
mi_switch: switching history 7264.614127486 > 7183.68279
mi_switch: switching history 7266.612129322 > 7183.85323
mi_switch: switching history 7268.610131323 > 7183.87777
mi_switch: switching history 7270.608135227 > 7183.93281
mi_switch: switching history 7272.606136952 > 7183.117345
mi_switch: switching history 7274.604138977 > 7183.125618
mi_switch: switching history 7276.602139248 > 7183.127485
mi_switch: switching history 7278.600141013 > 7183.129287
mi_switch: switching history 7280.598142789 > 7183.134399
mi_switch: switching history 7282.596144784 > 7183.136835
mi_switch: switching history 7284.594145490 > 7183.139110
mi_switch: switching history 7286.592147226 > 7183.165555
mi_switch: switching history 7288.590149056 > 7183.188125
mi_switch: switching history 7290.588151269 > 7183.205708
mi_switch: switching history 7292.586153236 > 7183.215177
mi_switch: switching history 7294.584155252 > 7183.217624
mi_switch: switching history 7296.582158959 > 7183.222991
mi_switch: switching history 7298.580160682 > 7183.247362
mi_switch: switching history 7300.578162685 > 7183.259122
mi_switch: switching history 7302.576164593 > 7183.265601
mi_switch: switching history 7304.574166441 > 7183.267790
calcru: negative time of -1224705799 usec for pid 239 (XF86_SVGA)

As you can see, this accounted for two xterms and the X server in a 
period of about 30 minutes.

Any input would be gratefully received - we need to fix this before 3.0 
or it will be a total writeoff for laptop users.


--==_Exmh_-21291286440
Content-Type: text/plain ; name="kern_synch.diff"; charset=us-ascii
Content-Description: kern_synch.diff
Content-Disposition: attachment; filename="kern_synch.diff"

Index: kern_synch.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.61
diff -u -r1.61 kern_synch.c
--- kern_synch.c	1998/07/15 02:32:10	1.61
+++ kern_synch.c	1998/09/16 22:20:27
@@ -604,6 +604,7 @@
 	register struct proc *p = curproc;	/* XXX */
 	register struct rlimit *rlim;
 	int x;
+	u_int64_t	runtime;
 
 	/*
 	 * XXX this spl is almost unnecessary.  It is partly to allow for
@@ -633,9 +634,21 @@
 	 * process was running, and add that to its total so far.
 	 */
 	microuptime(&switchtime);
+	if (timevalcmp(&(p->p_switchtime), &switchtime, >))
+	    printf("mi_switch: switching history %d.%d > %d.%d\n",
+		   p->p_switchtime.tv_sec, p->p_switchtime.tv_usec,
+		   switchtime.tv_sec, switchtime.tv_usec);
+	runtime = p->p_runtime;
 	p->p_runtime += (switchtime.tv_usec - p->p_switchtime.tv_usec) +
 	    (switchtime.tv_sec - p->p_switchtime.tv_sec) * (int64_t)1000000;
-
+	timevalsub(&switchtime,&(p->p_switchtime));
+	runtime += ((u_int64_t)switchtime.tv_sec * 1000000) + 
+	    switchtime.tv_usec;
+	
+	if (runtime != p->p_runtime)
+	    printf("mi_switch: runtime botch, is %08x%08x should %08x%08x\n",
+		   (int)(p->p_runtime >> 32), (int)(p->p_runtime & 0xffffffff),
+		   (int)(runtime >> 32), (int)(runtime & 0xffffffff));
 	/*
 	 * Check if the process exceeds its cpu resource allocation.
 	 * If over max, kill it.

--==_Exmh_-21291286440
Content-Type: text/plain ; name="kern_clock.diff"; charset=us-ascii
Content-Description: kern_clock.diff
Content-Disposition: attachment; filename="kern_clock.diff"

Index: kern_clock.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_clock.c,v
retrieving revision 1.78
diff -u -r1.78 kern_clock.c
--- kern_clock.c	1998/08/05 18:06:40	1.78
+++ kern_clock.c	1998/09/16 22:28:17
@@ -497,24 +497,37 @@
 getmicrotime(struct timeval *tvp)
 {
 	struct timecounter *tc;
+	static struct timeval otv;
 
 	tc = timecounter;
 	*tvp = tc->tc_microtime;
+
+	if (timevalcmp(&otv, tvp, >))
+	    printf("getmicrotime: %d.%d > %d.%d",
+		   otv.tv_sec, otv.tv_usec, tvp->tv_sec, tvp->tv_usec);
+	otv = *tvp;
 }
 
 void
 getnanotime(struct timespec *tsp)
 {
 	struct timecounter *tc;
+	static struct timespec ots;
 
 	tc = timecounter;
 	*tsp = tc->tc_nanotime;
+
+	if (timespeccmp(&ots, tsp, >))
+	    printf("getnanotime: %d.%d > %d.%d",
+		   ots.tv_sec, ots.tv_nsec, tsp->tv_sec, tsp->tv_nsec);
+	ots = *tsp;
 }
 
 void
 microtime(struct timeval *tv)
 {
 	struct timecounter *tc;
+	static struct timeval otv;
 
 	tc = (struct timecounter *)timecounter;
 	tv->tv_sec = tc->tc_offset_sec;
@@ -526,6 +539,11 @@
 		tv->tv_usec -= 1000000;
 		tv->tv_sec++;
 	}
+
+	if (timevalcmp(&otv, tv, >))
+	    printf("microtime: %d.%d > %d.%d",
+		   otv.tv_sec, otv.tv_usec, tv->tv_sec, tv->tv_usec);
+	otv = *tv;
 }
 
 void
@@ -534,6 +552,7 @@
 	unsigned count;
 	u_int64_t delta;
 	struct timecounter *tc;
+	static struct timespec ots;
 
 	tc = (struct timecounter *)timecounter;
 	ts->tv_sec = tc->tc_offset_sec;
@@ -549,6 +568,11 @@
 		ts->tv_sec++;
 	}
 	ts->tv_nsec = delta;
+
+	if (timespeccmp(&ots, ts, >))
+	    printf("nanotime: %d.%d > %d.%d",
+		   ots.tv_sec, ots.tv_nsec, ts->tv_sec, ts->tv_nsec);
+	ots = *ts;
 }
 
 void

--==_Exmh_-21291286440
Content-Type: text/plain; charset=us-ascii

\\  Sometimes you're ahead,       \\  Mike Smith
\\  sometimes you're behind.      \\  mike@smith.net.au
\\  The race is long, and in the  \\  msmith@freebsd.org
\\  end it's only with yourself.  \\  msmith@cdrom.com

--==_Exmh_-21291286440--



To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199809170055.RAA01288>