From owner-freebsd-arch@FreeBSD.ORG Sun Jun 15 02:15:21 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 7603237B401 for ; Sun, 15 Jun 2003 02:15:21 -0700 (PDT) Received: from mailout02.sul.t-online.com (mailout02.sul.t-online.com [194.25.134.17]) by mx1.FreeBSD.org (Postfix) with ESMTP id 9303E43F93 for ; Sun, 15 Jun 2003 02:15:20 -0700 (PDT) (envelope-from Alexander@Leidinger.net) Received: from fwd05.aul.t-online.de by mailout02.sul.t-online.com with smtp id 19RTbZ-0005KO-05; Sun, 15 Jun 2003 11:15:17 +0200 Received: from Andro-Beta.Leidinger.net (bLHHceZpwe3upx2cE9GCNRtXTO8FgbbvyCvBildQPTizoMO8T1xe81@[217.83.19.136]) by fmrl05.sul.t-online.com with esmtp id 19RTbL-1ZdjhA0; Sun, 15 Jun 2003 11:15:03 +0200 Received: from Magelan.Leidinger.net (Magelan [192.168.1.1]) h5F9F1oM082662; Sun, 15 Jun 2003 11:15:01 +0200 (CEST) (envelope-from Alexander@Leidinger.net) Received: from Magelan.Leidinger.net (netchild@localhost [127.0.0.1]) by Magelan.Leidinger.net (8.12.9/8.12.9) with SMTP id h5F9F1qC000840; Sun, 15 Jun 2003 11:15:01 +0200 (CEST) (envelope-from Alexander@Leidinger.net) Date: Sun, 15 Jun 2003 11:15:01 +0200 From: Alexander Leidinger To: Dag-Erling Smorgrav Message-Id: <20030615111501.7cd49611.Alexander@Leidinger.net> In-Reply-To: References: <20030614183544.051c7a57.Alexander@Leidinger.net> X-Mailer: Sylpheed version 0.8.10claws (GTK+ 1.2.10; i386-portbld-freebsd5.0) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Seen: false X-ID: bLHHceZpwe3upx2cE9GCNRtXTO8FgbbvyCvBildQPTizoMO8T1xe81@t-dialin.net cc: freebsd-arch@freebsd.org Subject: Re: unbreaking alloca X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 15 Jun 2003 
09:15:21 -0000

On Sun, 15 Jun 2003 00:45:23 +0200
Dag-Erling Smorgrav wrote:

> How's this?

Yes, the comment looks good to me.

I tested if icc also understands __builtin_alloca, and it does. I assume
ecc understands it too, so you could also check for __INTEL_COMPILER (if
someone is interested: I also have icc patches for some of the gcc
specific parts in cdefs.h).

Bye,
Alexander.

-- 
If Bill Gates had a dime for every time a Windows box crashed...
...Oh, wait a minute, he already does.

http://www.Leidinger.net Alexander @ Leidinger.net
GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7

From owner-freebsd-arch@FreeBSD.ORG Sun Jun 15 02:51:47 2003
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id D4BCC37B401
	for ; Sun, 15 Jun 2003 02:51:47 -0700 (PDT)
Received: from flood.ping.uio.no (flood.ping.uio.no [129.240.78.31])
	by mx1.FreeBSD.org (Postfix) with ESMTP id C576243F93
	for ; Sun, 15 Jun 2003 02:51:46 -0700 (PDT)
	(envelope-from des@ofug.org)
Received: by flood.ping.uio.no (Postfix, from userid 2602)
	id 62C7B530F; Sun, 15 Jun 2003 11:51:43 +0200 (CEST)
X-URL: http://www.ofug.org/~des/
X-Disclaimer: The views expressed in this message do not necessarily
	coincide with those of any organisation or company with which I am or
	have been affiliated.
To: Alexander Leidinger
References: <20030614183544.051c7a57.Alexander@Leidinger.net>
	<20030615111501.7cd49611.Alexander@Leidinger.net>
From: Dag-Erling Smorgrav
Date: Sun, 15 Jun 2003 11:51:43 +0200
In-Reply-To: <20030615111501.7cd49611.Alexander@Leidinger.net> (Alexander Leidinger's message of "Sun, 15 Jun 2003 11:15:01 +0200")
Message-ID:
User-Agent: Gnus/5.1001 (Gnus v5.10.1) Emacs/21.3 (berkeley-unix)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
cc: freebsd-arch@freebsd.org
Subject: Re: unbreaking alloca
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 15 Jun 2003 09:51:48 -0000

Alexander Leidinger writes:
> I tested if icc also understands __builtin_alloca, and it does. I assume
> ecc understands it too, so you could also check for __INTEL_COMPILER (if
> someone is interested: I also have icc patches for some of the gcc
> specific parts in cdefs.h).

Does

#if defined(__GNUC__) || defined(__INTEL_COMPILER)

look OK to you?
DES -- Dag-Erling Smorgrav - des@ofug.org From owner-freebsd-arch@FreeBSD.ORG Sun Jun 15 03:48:22 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 9957437B401 for ; Sun, 15 Jun 2003 03:48:22 -0700 (PDT) Received: from mailout03.sul.t-online.com (mailout03.sul.t-online.com [194.25.134.81]) by mx1.FreeBSD.org (Postfix) with ESMTP id A0BE243F75 for ; Sun, 15 Jun 2003 03:48:21 -0700 (PDT) (envelope-from Alexander@Leidinger.net) Received: from fwd04.aul.t-online.de by mailout03.sul.t-online.com with smtp id 19RV3a-0007Kf-0C; Sun, 15 Jun 2003 12:48:18 +0200 Received: from Andro-Beta.Leidinger.net (ZGq3HOZare18xLrK+IdpG2HvHlnPZQPF3BQZdHBMqhuFJ6A46ePhoM@[217.83.19.136]) by fmrl04.sul.t-online.com with esmtp id 19RV3X-1qT1FY0; Sun, 15 Jun 2003 12:48:15 +0200 Received: from Magelan.Leidinger.net (Magelan [192.168.1.1]) h5FAmEoM082934; Sun, 15 Jun 2003 12:48:14 +0200 (CEST) (envelope-from Alexander@Leidinger.net) Received: from Magelan.Leidinger.net (netchild@localhost [127.0.0.1]) by Magelan.Leidinger.net (8.12.9/8.12.9) with SMTP id h5FAmDqC079595; Sun, 15 Jun 2003 12:48:13 +0200 (CEST) (envelope-from Alexander@Leidinger.net) Date: Sun, 15 Jun 2003 12:48:13 +0200 From: Alexander Leidinger To: Dag-Erling Smorgrav Message-Id: <20030615124813.7ecf3863.Alexander@Leidinger.net> In-Reply-To: References: <20030614183544.051c7a57.Alexander@Leidinger.net> <20030615111501.7cd49611.Alexander@Leidinger.net> X-Mailer: Sylpheed version 0.8.10claws (GTK+ 1.2.10; i386-portbld-freebsd5.0) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Seen: false X-ID: ZGq3HOZare18xLrK+IdpG2HvHlnPZQPF3BQZdHBMqhuFJ6A46ePhoM@t-dialin.net cc: freebsd-arch@freebsd.org Subject: Re: unbreaking alloca X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , 
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 15 Jun 2003 10:48:22 -0000

On Sun, 15 Jun 2003 11:51:43 +0200
Dag-Erling Smorgrav wrote:

> does
>
> #if defined(__GNUC__) || defined(__INTEL_COMPILER)
>
> look OK to you?

Yes.

Bye,
Alexander.

-- 
The best things in life are free, but the
expensive ones are still worth a look.

http://www.Leidinger.net Alexander @ Leidinger.net
GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7

From owner-freebsd-arch@FreeBSD.ORG Sun Jun 15 06:06:55 2003
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 5AE4D37B401
	for ; Sun, 15 Jun 2003 06:06:55 -0700 (PDT)
Received: from salmon.maths.tcd.ie (salmon.maths.tcd.ie [134.226.81.11])
	by mx1.FreeBSD.org (Postfix) with SMTP id 1178043FB1
	for ; Sun, 15 Jun 2003 06:06:54 -0700 (PDT)
	(envelope-from iedowse@maths.tcd.ie)
Received: from walton.maths.tcd.ie by salmon.maths.tcd.ie with SMTP
	id ; 15 Jun 2003 14:06:53 +0100 (BST)
To: freebsd-arch@freebsd.org
Date: Sun, 15 Jun 2003 14:06:50 +0100
From: Ian Dowse
Message-ID: <200306151406.aa36218@salmon.maths.tcd.ie>
Subject: Message buffer and printf reentrancy patch
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 15 Jun 2003 13:06:55 -0000

Below is a patch that makes the implementation of the kernel message
buffer mostly reentrant and more generic, and stops printf() ever
calling directly into the tty code. This should fix panics that can
occur via tputchar() when using xconsole, and generally make the
use of printf() in the kernel a bit safer. Many of the ideas here
were suggested by Bruce Evans.

A summary of the changes:
- Use atomic operations to update the message buffer pointers.
- Use a kind of sequence number for the pointers instead of just the offset into the buffer, as this avoids the need for the read code to touch the write pointer or the write code to touch the read pointer. - Change the interface to the message buffer functions so that the internals are not exposed to the callers, and pass in the message buffer structure pointer to all functions. - Put the new message buffer code in a new file subr_msgbuf.c. - Add a new message buffer `consmsgbuf' that is used to pass kernel messages to the TIOCCONS tty. - Allocate the buffer space for this when the console is redirected with TIOCCONS, and free it after the console is detached. - Use a timeout routine to send the messages to the tty. - Disable the virtual tty console while DDB is active. I'd like to commit this in a few days. Any objections, comments or suggestions? Ian Index: sbin/dmesg/dmesg.c =================================================================== RCS file: /dump/FreeBSD-CVS/src/sbin/dmesg/dmesg.c,v retrieving revision 1.19 diff -u -r1.19 dmesg.c --- sbin/dmesg/dmesg.c 2 May 2003 07:08:52 -0000 1.19 +++ sbin/dmesg/dmesg.c 18 May 2003 13:03:16 -0000 @@ -137,9 +137,7 @@ errx(1, "kvm_read: %s", kvm_geterr(kd)); kvm_close(kd); buflen = cur.msg_size; - bufpos = cur.msg_bufx; - if (bufpos >= buflen) - bufpos = 0; + bufpos = MSGBUF_SEQ_TO_POS(&cur, cur.msg_wseq); } /* Index: sys/sys/msgbuf.h =================================================================== RCS file: /dump/FreeBSD-CVS/src/sys/sys/msgbuf.h,v retrieving revision 1.20 diff -u -r1.20 msgbuf.h --- sys/sys/msgbuf.h 28 Mar 2003 02:50:10 -0000 1.20 +++ sys/sys/msgbuf.h 15 Jun 2003 12:00:45 -0000 @@ -41,16 +41,32 @@ #define MSG_MAGIC 0x063062 u_int msg_magic; int msg_size; /* size of buffer area */ - int msg_bufx; /* write pointer */ - int msg_bufr; /* read pointer */ + int msg_wseq; /* write sequence number */ + int msg_rseq; /* read sequence number */ + int msg_seqmod; /* range for sequence numbers */ char 
*msg_ptr; /* pointer to buffer */ u_int msg_cksum; /* checksum of contents */ }; +#define MSGBUF_SEQNORM(mbp, seq) ((seq) % (mbp)->msg_seqmod + ((seq) < 0 ? \ + (mbp)->msg_seqmod : 0)) +#define MSGBUF_SEQ_TO_POS(mbp, seq) ((int)((u_int)(seq) % \ + (u_int)(mbp)->msg_size)) +#define MSGBUF_SEQSUB(mbp, seq1, seq2) (MSGBUF_SEQNORM(mbp, (seq1) - (seq2))) + #ifdef _KERNEL extern int msgbuftrigger; extern struct msgbuf *msgbufp; void msgbufinit(void *ptr, int size); +void msgbuf_addchar(struct msgbuf *mbp, int c); +void msgbuf_clear(struct msgbuf *mbp); +void msgbuf_copy(struct msgbuf *src, struct msgbuf *dst); +int msgbuf_getbytes(struct msgbuf *mbp, char *buf, int buflen); +int msgbuf_getchar(struct msgbuf *mbp); +int msgbuf_getcount(struct msgbuf *mbp); +void msgbuf_init(struct msgbuf *mbp, void *ptr, int size); +void msgbuf_reinit(struct msgbuf *mbp, void *ptr, int size); +int msgbuf_peekbytes(struct msgbuf *mbp, char *buf, int buflen, int *seqp); #if !defined(MSGBUF_SIZE) #define MSGBUF_SIZE 32768 Index: sys/sys/tty.h =================================================================== RCS file: /dump/FreeBSD-CVS/src/sys/sys/tty.h,v retrieving revision 1.71 diff -u -r1.71 tty.h --- sys/sys/tty.h 5 Mar 2003 08:17:10 -0000 1.71 +++ sys/sys/tty.h 15 Jun 2003 01:27:23 -0000 @@ -265,6 +265,7 @@ #ifdef MALLOC_DECLARE MALLOC_DECLARE(M_TTYS); #endif +extern struct msgbuf consmsgbuf; /* Message buffer for constty. */ extern struct tty *constty; /* Temporary virtual console. 
*/ extern long tk_cancc; extern long tk_nin; @@ -275,6 +276,8 @@ void catq(struct clist *from, struct clist *to); void clist_alloc_cblocks(struct clist *q, int ccmax, int ccres); void clist_free_cblocks(struct clist *q); +void constty_set(struct tty *tp); +void constty_clear(void); int getc(struct clist *q); void ndflush(struct clist *q, int cc); char *nextc(struct clist *q, char *cp, int *c); Index: sys/kern/tty.c =================================================================== RCS file: /dump/FreeBSD-CVS/src/sys/kern/tty.c,v retrieving revision 1.202 diff -u -r1.202 tty.c --- sys/kern/tty.c 11 Jun 2003 00:56:58 -0000 1.202 +++ sys/kern/tty.c 15 Jun 2003 10:31:53 -0000 @@ -261,7 +261,7 @@ funsetown(&tp->t_sigio); s = spltty(); if (constty == tp) - constty = NULL; + constty_clear(); ttyflush(tp, FREAD | FWRITE); clist_free_cblocks(&tp->t_canq); @@ -871,9 +871,9 @@ if (error) return (error); - constty = tp; + constty_set(tp); } else if (tp == constty) - constty = NULL; + constty_clear(); break; case TIOCDRAIN: /* wait till output drained */ error = ttywait(tp); Index: sys/kern/tty_cons.c =================================================================== RCS file: /dump/FreeBSD-CVS/src/sys/kern/tty_cons.c,v retrieving revision 1.111 diff -u -r1.111 tty_cons.c --- sys/kern/tty_cons.c 11 Jun 2003 00:56:58 -0000 1.111 +++ sys/kern/tty_cons.c 15 Jun 2003 01:15:23 -0000 @@ -50,6 +50,7 @@ #include #include #include +#include #include #include #include @@ -117,11 +118,16 @@ static int cn_mute; static int openflag; /* how /dev/console was opened */ static int cn_is_open; +static char *consbuf; /* buffer used by `consmsgbuf' */ +static struct callout conscallout; /* callout for outputting to constty */ +struct msgbuf consmsgbuf; /* message buffer for console tty */ static u_char console_pausing; /* pause after each line during probe */ static char *console_pausestr= ""; +struct tty *constty; /* pointer to console "window" tty */ void cndebug(char *); +static void 
constty_timeout(void *arg); CONS_DRIVER(cons, NULL, NULL, NULL, NULL, NULL, NULL, NULL); SET_DECLARE(cons_set, struct consdev); @@ -587,6 +593,70 @@ } if (on) refcount++; +} + +static int consmsgbuf_size = 8192; +SYSCTL_INT(_kern, OID_AUTO, consmsgbuf_size, CTLFLAG_RW, &consmsgbuf_size, 0, + ""); + +/* + * Redirect console output to a tty. + */ +void +constty_set(struct tty *tp) +{ + int size; + + KASSERT(tp != NULL, ("constty_set: NULL tp")); + if (consbuf == NULL) { + size = consmsgbuf_size; + consbuf = malloc(size, M_TTYS, M_WAITOK); + msgbuf_init(&consmsgbuf, consbuf, size); + callout_init(&conscallout, 0); + } + constty = tp; + constty_timeout(NULL); +} + +/* + * Disable console redirection to a tty. + */ +void +constty_clear(void) +{ + int c; + + constty = NULL; + if (consbuf == NULL) + return; + callout_stop(&conscallout); + while ((c = msgbuf_getchar(&consmsgbuf)) != -1) + cnputc(c); + free(consbuf, M_TTYS); + consbuf = NULL; +} + +/* Times per second to check for pending console tty messages. */ +static int constty_wakeups_per_second = 5; +SYSCTL_INT(_kern, OID_AUTO, constty_wakeups_per_second, CTLFLAG_RW, + &constty_wakeups_per_second, 0, ""); + +static void +constty_timeout(void *arg) +{ + int c; + + while (constty != NULL && (c = msgbuf_getchar(&consmsgbuf)) != -1) { + if (tputchar(c, constty) < 0) + constty = NULL; + } + if (constty != NULL) { + callout_reset(&conscallout, hz / constty_wakeups_per_second, + constty_timeout, NULL); + } else { + /* Deallocate the constty buffer memory. */ + constty_clear(); + } } static void Index: sys/kern/subr_msgbuf.c =================================================================== RCS file: sys/kern/subr_msgbuf.c diff -N sys/kern/subr_msgbuf.c --- /dev/null 1 Jan 1970 00:00:00 -0000 +++ sys/kern/subr_msgbuf.c 15 Jun 2003 12:05:37 -0000 @@ -0,0 +1,239 @@ +/* + * Copyright (c) 2003 Ian Dowse. All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * $FreeBSD$ + */ + +/* + * Generic message buffer support routines. + */ + +#include +#include +#include + +/* Read/write sequence numbers are modulo a multiple of the buffer size. */ +#define SEQMOD(size) ((size) * 16) + +static u_int msgbuf_cksum(struct msgbuf *mbp); + +/* + * Initialize a message buffer of the specified size at the specified + * location. This also zeros the buffer area. 
+ */ +void +msgbuf_init(struct msgbuf *mbp, void *ptr, int size) +{ + + mbp->msg_ptr = ptr; + mbp->msg_size = size; + mbp->msg_seqmod = SEQMOD(size); + msgbuf_clear(mbp); + mbp->msg_magic = MSG_MAGIC; +} + +/* + * Reinitialize a message buffer, retaining its previous contents if + * the size and checksum are correct. If the old contents cannot be + * recovered, the message buffer is cleared. + */ +void +msgbuf_reinit(struct msgbuf *mbp, void *ptr, int size) +{ + u_int cksum; + + if (mbp->msg_magic != MSG_MAGIC || mbp->msg_size != size) { + msgbuf_init(mbp, ptr, size); + return; + } + mbp->msg_seqmod = SEQMOD(size); + mbp->msg_wseq = MSGBUF_SEQNORM(mbp, mbp->msg_wseq); + mbp->msg_rseq = MSGBUF_SEQNORM(mbp, mbp->msg_rseq); + mbp->msg_ptr = ptr; + cksum = msgbuf_cksum(mbp); + if (cksum != mbp->msg_cksum) { + printf("msgbuf cksum mismatch (read %x, calc %x)\n", + mbp->msg_cksum, cksum); + msgbuf_clear(mbp); + } +} + +/* + * Clear the message buffer. + */ +void +msgbuf_clear(struct msgbuf *mbp) +{ + + bzero(mbp->msg_ptr, mbp->msg_size); + mbp->msg_wseq = 0; + mbp->msg_rseq = 0; + mbp->msg_cksum = 0; +} + +/* + * Get a count of the number of unread characters in the message buffer. + */ +int +msgbuf_getcount(struct msgbuf *mbp) +{ + int len; + + len = MSGBUF_SEQSUB(mbp, mbp->msg_wseq, mbp->msg_rseq); + if (len < 0 || len > mbp->msg_size) + len = mbp->msg_size; + return (len); +} + +/* + * Append a character to a message buffer. This function can be + * considered fully reentrant so long as the number of concurrent + * callers is less than the number of characters in the buffer. + * However, the message buffer is only guaranteed to be consistent + * for reading when there are no callers in this function. 
+ */ +void +msgbuf_addchar(struct msgbuf *mbp, int c) +{ + int new_seq, pos, seq; + + do { + seq = mbp->msg_wseq; + new_seq = MSGBUF_SEQNORM(mbp, seq + 1); + } while (atomic_cmpset_rel_int(&mbp->msg_wseq, seq, new_seq) == 0); + pos = MSGBUF_SEQ_TO_POS(mbp, seq); + atomic_add_int(&mbp->msg_cksum, (u_int)(u_char)c - + (u_int)(u_char)mbp->msg_ptr[pos]); + mbp->msg_ptr[pos] = c; +} + +/* + * Read and mark as read a character from a message buffer. + * Returns the character, or -1 if no characters are available. + */ +int +msgbuf_getchar(struct msgbuf *mbp) +{ + int c, len, wseq; + + wseq = mbp->msg_wseq; + len = MSGBUF_SEQSUB(mbp, wseq, mbp->msg_rseq); + if (len == 0) + return (-1); + if (len < 0 || len > mbp->msg_size) + mbp->msg_rseq = MSGBUF_SEQNORM(mbp, wseq - mbp->msg_size); + c = (u_char)mbp->msg_ptr[MSGBUF_SEQ_TO_POS(mbp, mbp->msg_rseq)]; + mbp->msg_rseq = MSGBUF_SEQNORM(mbp, mbp->msg_rseq + 1); + return (c); +} + +/* + * Read and mark as read a number of characters from a message buffer. + * Returns the number of characters that were placed in `buf'. + */ +int +msgbuf_getbytes(struct msgbuf *mbp, char *buf, int buflen) +{ + int len, pos, wseq; + + wseq = mbp->msg_wseq; + len = MSGBUF_SEQSUB(mbp, wseq, mbp->msg_rseq); + if (len == 0) + return (0); + if (len < 0 || len > mbp->msg_size) { + mbp->msg_rseq = MSGBUF_SEQNORM(mbp, wseq - mbp->msg_size); + len = mbp->msg_size; + } + pos = MSGBUF_SEQ_TO_POS(mbp, mbp->msg_rseq); + len = imin(len, mbp->msg_size - pos); + len = imin(len, buflen); + + bcopy(&mbp->msg_ptr[pos], buf, len); + mbp->msg_rseq = MSGBUF_SEQNORM(mbp, mbp->msg_rseq + len); + return (len); +} + +/* + * Peek at the full contents of a message buffer without marking any + * data as read. `seqp' should point to an integer that + * msgbuf_peekbytes() can use to retain state between calls so that + * the whole message buffer can be read in multiple short reads. 
+ * To initialise this variable to the start of the message buffer, + * call msgbuf_peekbytes() with a NULL `buf' parameter. + * + * Returns the number of characters that were placed in `buf'. + */ +int +msgbuf_peekbytes(struct msgbuf *mbp, char *buf, int buflen, int *seqp) +{ + int len, pos, wseq; + + if (buf == NULL) { + /* Just initialise *seqp. */ + *seqp = MSGBUF_SEQNORM(mbp, mbp->msg_wseq - mbp->msg_size); + return (0); + } + + wseq = mbp->msg_wseq; + len = MSGBUF_SEQSUB(mbp, wseq, *seqp); + if (len == 0) + return (0); + if (len < 0 || len > mbp->msg_size) { + *seqp = MSGBUF_SEQNORM(mbp, wseq - mbp->msg_size); + len = mbp->msg_size; + } + pos = MSGBUF_SEQ_TO_POS(mbp, *seqp); + len = imin(len, mbp->msg_size - pos); + len = imin(len, buflen); + bcopy(&mbp->msg_ptr[MSGBUF_SEQ_TO_POS(mbp, *seqp)], buf, len); + *seqp = MSGBUF_SEQNORM(mbp, *seqp + len); + return (len); +} + +/* + * Compute the checksum for the complete message buffer contents. + */ +static u_int +msgbuf_cksum(struct msgbuf *mbp) +{ + u_int sum; + int i; + + sum = 0; + for (i = 0; i < mbp->msg_size; i++) + sum += (u_char)mbp->msg_ptr[i]; + return (sum); +} + +/* + * Copy from one message buffer to another. 
+ */ +void +msgbuf_copy(struct msgbuf *src, struct msgbuf *dst) +{ + int c; + + while ((c = msgbuf_getchar(src)) >= 0) + msgbuf_addchar(dst, c); +} Index: sys/kern/subr_log.c =================================================================== RCS file: /dump/FreeBSD-CVS/src/sys/kern/subr_log.c,v retrieving revision 1.56 diff -u -r1.56 subr_log.c --- sys/kern/subr_log.c 11 Jun 2003 00:56:57 -0000 1.56 +++ sys/kern/subr_log.c 14 Jun 2003 14:59:09 -0000 @@ -122,11 +122,12 @@ static int logread(dev_t dev, struct uio *uio, int flag) { + char buf[128]; struct msgbuf *mbp = msgbufp; int error = 0, l, s; s = splhigh(); - while (mbp->msg_bufr == mbp->msg_bufx) { + while (msgbuf_getcount(mbp) == 0) { if (flag & IO_NDELAY) { splx(s); return (EWOULDBLOCK); @@ -141,19 +142,13 @@ logsoftc.sc_state &= ~LOG_RDWAIT; while (uio->uio_resid > 0) { - l = mbp->msg_bufx - mbp->msg_bufr; - if (l < 0) - l = mbp->msg_size - mbp->msg_bufr; - l = imin(l, uio->uio_resid); + l = imin(sizeof(buf), uio->uio_resid); + l = msgbuf_getbytes(mbp, buf, l); if (l == 0) break; - error = uiomove((char *)msgbufp->msg_ptr + mbp->msg_bufr, - l, uio); + error = uiomove(buf, l, uio); if (error) break; - mbp->msg_bufr += l; - if (mbp->msg_bufr >= mbp->msg_size) - mbp->msg_bufr = 0; } return (error); } @@ -168,7 +163,7 @@ s = splhigh(); if (events & (POLLIN | POLLRDNORM)) { - if (msgbufp->msg_bufr != msgbufp->msg_bufx) + if (msgbuf_getcount(msgbufp) > 0) revents |= events & (POLLIN | POLLRDNORM); else selrecord(td, &logsoftc.sc_selp); @@ -204,18 +199,12 @@ static int logioctl(dev_t dev, u_long com, caddr_t data, int flag, struct thread *td) { - int l, s; switch (com) { /* return number of characters immediately available */ case FIONREAD: - s = splhigh(); - l = msgbufp->msg_bufx - msgbufp->msg_bufr; - splx(s); - if (l < 0) - l += msgbufp->msg_size; - *(int *)data = l; + *(int *)data = msgbuf_getcount(msgbufp); break; case FIONBIO: Index: sys/kern/subr_prf.c 
=================================================================== RCS file: /dump/FreeBSD-CVS/src/sys/kern/subr_prf.c,v retrieving revision 1.102 diff -u -r1.102 subr_prf.c --- sys/kern/subr_prf.c 11 Jun 2003 00:56:57 -0000 1.102 +++ sys/kern/subr_prf.c 15 Jun 2003 01:14:14 -0000 @@ -89,12 +89,7 @@ extern int log_open; -struct tty *constty; /* pointer to console "window" tty */ - -static void (*v_putc)(int) = cnputc; /* routine to putc on virtual console */ static void msglogchar(int c, int pri); -static void msgaddchar(int c, void *dummy); -static u_int msgbufcksum(char *cp, size_t size, u_int cksum); static void putchar(int ch, void *arg); static char *ksprintn(char *nbuf, uintmax_t num, int base, int *len); static void snprintf_func(int ch, void *arg); @@ -337,21 +332,28 @@ putchar(int c, void *arg) { struct putchar_arg *ap = (struct putchar_arg*) arg; - int flags = ap->flags; struct tty *tp = ap->tty; + int consdirect, flags = ap->flags; + + consdirect = ((flags & TOCONS) && constty == NULL) || !msgbufmapped; + /* Don't use the tty code after a panic or while in ddb. 
*/ if (panicstr) - constty = NULL; - if ((flags & TOCONS) && tp == NULL && constty) { - tp = constty; - flags |= TOTTY; - } - if ((flags & TOTTY) && tp && tputchar(c, tp) < 0 && - (flags & TOCONS) && tp == constty) - constty = NULL; + consdirect = 1; +#ifdef DDB + if (db_active) + consdirect = 1; +#endif + if (consdirect) { + if (c != '\0') + cnputc(c); + } else { + if ((flags & TOTTY) && tp != NULL) + tputchar(c, tp); + if ((flags & TOCONS) && constty != NULL) + msgbuf_addchar(&consmsgbuf, c); + } if ((flags & TOLOG)) msglogchar(c, ap->pri); - if ((flags & TOCONS) && constty == NULL && c != '\0') - (*v_putc)(c); } /* @@ -788,16 +790,16 @@ return; if (pri != -1 && pri != lastpri) { if (dangling) { - msgaddchar('\n', NULL); + msgbuf_addchar(msgbufp, '\n'); dangling = 0; } - msgaddchar('<', NULL); + msgbuf_addchar(msgbufp, '<'); for (p = ksprintn(nbuf, (uintmax_t)pri, 10, NULL); *p;) - msgaddchar(*p--, NULL); - msgaddchar('>', NULL); + msgbuf_addchar(msgbufp, *p--); + msgbuf_addchar(msgbufp, '>'); lastpri = pri; } - msgaddchar(c, NULL); + msgbuf_addchar(msgbufp, c); if (c == '\n') { dangling = 0; lastpri = -1; @@ -806,41 +808,6 @@ } } -/* - * Put char in log buffer - */ -static void -msgaddchar(int c, void *dummy) -{ - struct msgbuf *mbp; - - if (!msgbufmapped) - return; - mbp = msgbufp; - mbp->msg_cksum += (u_char)c - (u_char)mbp->msg_ptr[mbp->msg_bufx]; - mbp->msg_ptr[mbp->msg_bufx++] = c; - if (mbp->msg_bufx >= mbp->msg_size) - mbp->msg_bufx = 0; - /* If the buffer is full, keep the most recent data. 
*/ - if (mbp->msg_bufr == mbp->msg_bufx) { - if (++mbp->msg_bufr >= mbp->msg_size) - mbp->msg_bufr = 0; - } -} - -static void -msgbufcopy(struct msgbuf *oldp) -{ - int pos; - - pos = oldp->msg_bufr; - while (pos != oldp->msg_bufx) { - msglogchar(oldp->msg_ptr[pos], -1); - if (++pos >= oldp->msg_size) - pos = 0; - } -} - void msgbufinit(void *ptr, int size) { @@ -849,39 +816,14 @@ size -= sizeof(*msgbufp); cp = (char *)ptr; - msgbufp = (struct msgbuf *) (cp + size); - if (msgbufp->msg_magic != MSG_MAGIC || msgbufp->msg_size != size || - msgbufp->msg_bufx >= size || msgbufp->msg_bufx < 0 || - msgbufp->msg_bufr >= size || msgbufp->msg_bufr < 0 || - msgbufcksum(cp, size, msgbufp->msg_cksum) != msgbufp->msg_cksum) { - bzero(cp, size); - bzero(msgbufp, sizeof(*msgbufp)); - msgbufp->msg_magic = MSG_MAGIC; - msgbufp->msg_size = size; - } - msgbufp->msg_ptr = cp; + msgbufp = (struct msgbuf *)(cp + size); + msgbuf_reinit(msgbufp, cp, size); if (msgbufmapped && oldp != msgbufp) - msgbufcopy(oldp); + msgbuf_copy(oldp, msgbufp); msgbufmapped = 1; oldp = msgbufp; } -static u_int -msgbufcksum(char *cp, size_t size, u_int cksum) -{ - u_int sum; - int i; - - sum = 0; - for (i = 0; i < size; i++) - sum += (u_char)cp[i]; - if (sum != cksum) - printf("msgbuf cksum mismatch (read %x, calc %x)\n", cksum, - sum); - - return (sum); -} - SYSCTL_DECL(_security_bsd); static int unprivileged_read_msgbuf = 1; @@ -893,7 +835,8 @@ static int sysctl_kern_msgbuf(SYSCTL_HANDLER_ARGS) { - int error; + char buf[128]; + int error, len, seq; if (!unprivileged_read_msgbuf) { error = suser(req->td); @@ -901,25 +844,20 @@ return (error); } - /* - * Unwind the buffer, so that it's linear (possibly starting with - * some initial nulls). 
- */ - error = sysctl_handle_opaque(oidp, msgbufp->msg_ptr + msgbufp->msg_bufx, - msgbufp->msg_size - msgbufp->msg_bufx, req); - if (error) - return (error); - if (msgbufp->msg_bufx > 0) { - error = sysctl_handle_opaque(oidp, msgbufp->msg_ptr, - msgbufp->msg_bufx, req); + /* Read the whole buffer, one chunk at a time. */ + msgbuf_peekbytes(msgbufp, NULL, 0, &seq); + while ((len = msgbuf_peekbytes(msgbufp, buf, sizeof(buf), &seq)) > 0) { + error = sysctl_handle_opaque(oidp, buf, len, req); + if (error) + return (error); } - return (error); + return (0); } SYSCTL_PROC(_kern, OID_AUTO, msgbuf, CTLTYPE_STRING | CTLFLAG_RD, 0, 0, sysctl_kern_msgbuf, "A", "Contents of kernel message buffer"); -static int msgbuf_clear; +static int msgbuf_clearflag; static int sysctl_kern_msgbuf_clear(SYSCTL_HANDLER_ARGS) @@ -927,17 +865,14 @@ int error; error = sysctl_handle_int(oidp, oidp->oid_arg1, oidp->oid_arg2, req); if (!error && req->newptr) { - /* Clear the buffer and reset write pointer */ - bzero(msgbufp->msg_ptr, msgbufp->msg_size); - msgbufp->msg_bufr = msgbufp->msg_bufx = 0; - msgbufp->msg_cksum = 0; - msgbuf_clear = 0; + msgbuf_clear(msgbufp); + msgbuf_clearflag = 0; } return (error); } SYSCTL_PROC(_kern, OID_AUTO, msgbuf_clear, - CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_SECURE, &msgbuf_clear, 0, + CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_SECURE, &msgbuf_clearflag, 0, sysctl_kern_msgbuf_clear, "I", "Clear kernel message buffer"); #ifdef DDB @@ -951,11 +886,11 @@ return; } db_printf("msgbufp = %p\n", msgbufp); - db_printf("magic = %x, size = %d, r= %d, w = %d, ptr = %p, cksum= %d\n", - msgbufp->msg_magic, msgbufp->msg_size, msgbufp->msg_bufr, - msgbufp->msg_bufx, msgbufp->msg_ptr, msgbufp->msg_cksum); + db_printf("magic = %x, size = %d, r= %u, w = %u, ptr = %p, cksum= %u\n", + msgbufp->msg_magic, msgbufp->msg_size, msgbufp->msg_rseq, + msgbufp->msg_wseq, msgbufp->msg_ptr, msgbufp->msg_cksum); for (i = 0; i < msgbufp->msg_size; i++) { - j = (i + msgbufp->msg_bufr) % msgbufp->msg_size; + j 
= MSGBUF_SEQ_TO_POS(msgbufp, i + msgbufp->msg_rseq); db_printf("%c", msgbufp->msg_ptr[j]); } db_printf("\n"); Index: sys/conf/files =================================================================== RCS file: /dump/FreeBSD-CVS/src/sys/conf/files,v retrieving revision 1.792 diff -u -r1.792 files --- sys/conf/files 13 Jun 2003 12:08:09 -0000 1.792 +++ sys/conf/files 14 Jun 2003 14:57:22 -0000 @@ -1084,6 +1084,7 @@ kern/subr_mbuf.c standard kern/subr_mchain.c optional libmchain kern/subr_module.c standard +kern/subr_msgbuf.c standard kern/subr_param.c standard kern/subr_pcpu.c standard kern/subr_power.c standard From owner-freebsd-arch@FreeBSD.ORG Sun Jun 15 11:08:20 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 8A89437B401 for ; Sun, 15 Jun 2003 11:08:20 -0700 (PDT) Received: from dhcp01.pn.xcllnt.net (209-128-86-226.BAYAREA.NET [209.128.86.226]) by mx1.FreeBSD.org (Postfix) with ESMTP id A097943FBF for ; Sun, 15 Jun 2003 11:08:19 -0700 (PDT) (envelope-from marcel@dhcp01.pn.xcllnt.net) Received: from dhcp01.pn.xcllnt.net (localhost [127.0.0.1]) by dhcp01.pn.xcllnt.net (8.12.9/8.12.9) with ESMTP id h5FI8Jqh015728; Sun, 15 Jun 2003 11:08:19 -0700 (PDT) (envelope-from marcel@dhcp01.pn.xcllnt.net) Received: (from marcel@localhost) by dhcp01.pn.xcllnt.net (8.12.9/8.12.9/Submit) id h5FI8IRn015727; Sun, 15 Jun 2003 11:08:18 -0700 (PDT) Date: Sun, 15 Jun 2003 11:08:18 -0700 From: Marcel Moolenaar To: Ian Dowse Message-ID: <20030615180818.GA15538@dhcp01.pn.xcllnt.net> References: <200306151406.aa36218@salmon.maths.tcd.ie> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200306151406.aa36218@salmon.maths.tcd.ie> User-Agent: Mutt/1.5.4i cc: freebsd-arch@freebsd.org Subject: Re: Message buffer and printf reentrancy patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list 
List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 15 Jun 2003 18:08:20 -0000 On Sun, Jun 15, 2003 at 02:06:50PM +0100, Ian Dowse wrote: > > Index: sys/sys/msgbuf.h > =================================================================== > RCS file: /dump/FreeBSD-CVS/src/sys/sys/msgbuf.h,v > retrieving revision 1.20 > diff -u -r1.20 msgbuf.h > --- sys/sys/msgbuf.h 28 Mar 2003 02:50:10 -0000 1.20 > +++ sys/sys/msgbuf.h 15 Jun 2003 12:00:45 -0000 > @@ -41,16 +41,32 @@ > #define MSG_MAGIC 0x063062 > u_int msg_magic; > int msg_size; /* size of buffer area */ > - int msg_bufx; /* write pointer */ > - int msg_bufr; /* read pointer */ > + int msg_wseq; /* write sequence number */ > + int msg_rseq; /* read sequence number */ > + int msg_seqmod; /* range for sequence numbers */ > char *msg_ptr; /* pointer to buffer */ > u_int msg_cksum; /* checksum of contents */ > }; Nit: please move the msg_ptr field before the int fields. There used to be no internal padding, but on 64-bit platforms with an odd number of ints, this will be the case. 
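The padding concern can be illustrated with a small standalone sketch. The first struct copies the field order from the quoted patch; the second is a hypothetical reordering with the pointer first, which is only one way to read the suggestion. The struct and function names are made up for this illustration and are not part of the patch.

```c
#include <stddef.h>

/* Field order as in the quoted patch: five ints ahead of the pointer. */
struct msgbuf_as_posted {
	unsigned int	 msg_magic;
	int		 msg_size;	/* size of buffer area */
	int		 msg_wseq;	/* write sequence number */
	int		 msg_rseq;	/* read sequence number */
	int		 msg_seqmod;	/* range for sequence numbers */
	char		*msg_ptr;	/* pointer to buffer */
	unsigned int	 msg_cksum;	/* checksum of contents */
};

/* Hypothetical reordering (an assumption): pointer first, then the ints. */
struct msgbuf_ptr_first {
	char		*msg_ptr;
	unsigned int	 msg_magic;
	int		 msg_size;
	int		 msg_wseq;
	int		 msg_rseq;
	int		 msg_seqmod;
	unsigned int	 msg_cksum;
};

/* Bytes the compiler adds beyond the sum of the member sizes. */
size_t
padding_as_posted(void)
{
	return (sizeof(struct msgbuf_as_posted) -
	    (6 * sizeof(int) + sizeof(char *)));
}

size_t
padding_ptr_first(void)
{
	return (sizeof(struct msgbuf_ptr_first) -
	    (6 * sizeof(int) + sizeof(char *)));
}
```

On an ABI with 4-byte ints and 8-byte, 8-byte-aligned pointers, the posted order needs a hole before msg_ptr and again at the tail, while the pointer-first order needs none. On 32-bit platforms both layouts should come out the same size.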
-- Marcel Moolenaar USPA: A-39004 marcel@xcllnt.net From owner-freebsd-arch@FreeBSD.ORG Sun Jun 15 11:26:06 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 99AC437B401 for ; Sun, 15 Jun 2003 11:26:06 -0700 (PDT) Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id EAED743F75 for ; Sun, 15 Jun 2003 11:26:05 -0700 (PDT) (envelope-from truckman@FreeBSD.org) Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.12.9/8.12.9) with ESMTP id h5FIPvM7046944; Sun, 15 Jun 2003 11:26:01 -0700 (PDT) (envelope-from truckman@FreeBSD.org) Message-Id: <200306151826.h5FIPvM7046944@gw.catspoiler.org> Date: Sun, 15 Jun 2003 11:25:57 -0700 (PDT) From: Don Lewis To: iedowse@maths.tcd.ie In-Reply-To: <200306151406.aa36218@salmon.maths.tcd.ie> MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii cc: freebsd-arch@FreeBSD.org Subject: Re: Message buffer and printf reentrancy patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 15 Jun 2003 18:26:06 -0000 On 15 Jun, Ian Dowse wrote: > > Below is a patch that makes the implementation of the kernel message > buffer mostly reentrant and more generic, and stops printf() ever > calling directly into the tty code. This should fix panics that can > occur via tputchar() when using xconsole, and generally make the > use of printf() in the kernel a bit safer. Many of the ideas here > were suggested by Bruce Evans. > > A summary of the changes: > - Use atomic operations to update the message buffer pointers. 
> - Use a kind of sequence number for the pointers instead of just > the offset into the buffer, as this avoids the need for the read > code to touch the write pointer or the write code to touch the > read pointer. > +#define MSGBUF_SEQNORM(mbp, seq) ((seq) % (mbp)->msg_seqmod + ((seq) < 0 ? \ > + (mbp)->msg_seqmod : 0)) > +#define MSGBUF_SEQ_TO_POS(mbp, seq) ((int)((u_int)(seq) % \ > + (u_int)(mbp)->msg_size)) > +#define MSGBUF_SEQSUB(mbp, seq1, seq2) (MSGBUF_SEQNORM(mbp, (seq1) - (seq2))) > + According to my copy of K&R, there is no guarantee that ((negative_int % postive_int) <= 0) on all platforms, though this is generally true. If the sequence numbers wrap, there will be a discontinuity in the sequence of normalized sequence numbers unless msg_seqmod evenly divides the full integer range, which would indicate that msg_seqmod needs to be a power of two on the platforms of interest. Integer division is fairly slow operation for most CPUs, so why not just enforce the power of two constraint and just grab the bottom bits of the sequence numbers using a bitwise logical operation to normalize? From owner-freebsd-arch@FreeBSD.ORG Sun Jun 15 11:29:34 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 209D137B401 for ; Sun, 15 Jun 2003 11:29:34 -0700 (PDT) Received: from harmony.village.org (rover.bsdimp.com [204.144.255.66]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2659043F75 for ; Sun, 15 Jun 2003 11:29:33 -0700 (PDT) (envelope-from imp@bsdimp.com) Received: from localhost (warner@rover2.village.org [10.0.0.1]) by harmony.village.org (8.12.8/8.12.3) with ESMTP id h5FITWkA054567; Sun, 15 Jun 2003 12:29:32 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Sun, 15 Jun 2003 12:28:52 -0600 (MDT) Message-Id: <20030615.122852.26275397.imp@bsdimp.com> To: iedowse@maths.tcd.ie From: "M. 
Warner Losh" In-Reply-To: <200306151406.aa36218@salmon.maths.tcd.ie> References: <200306151406.aa36218@salmon.maths.tcd.ie> X-Mailer: Mew version 2.1 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit cc: freebsd-arch@freebsd.org Subject: Re: Message buffer and printf reentrancy patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 15 Jun 2003 18:29:34 -0000 In message: <200306151406.aa36218@salmon.maths.tcd.ie> Ian Dowse writes: : occur via tputchar() when using xconsole, and generally make the : use of printf() in the kernel a bit safer. Many of the ideas here : were suggested by Bruce Evans. Safe enough to be used for ddb? That is, can we get the ddb output in out dmesg buffers? Warner From owner-freebsd-arch@FreeBSD.ORG Sun Jun 15 11:57:00 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A70BE37B401; Sun, 15 Jun 2003 11:57:00 -0700 (PDT) Received: from salmon.maths.tcd.ie (salmon.maths.tcd.ie [134.226.81.11]) by mx1.FreeBSD.org (Postfix) with SMTP id 4A68243FB1; Sun, 15 Jun 2003 11:56:59 -0700 (PDT) (envelope-from iedowse@maths.tcd.ie) Received: from walton.maths.tcd.ie by salmon.maths.tcd.ie with SMTP id ; 15 Jun 2003 19:56:58 +0100 (BST) To: Don Lewis In-Reply-To: Your message of "Sun, 15 Jun 2003 11:25:57 PDT." 
<200306151826.h5FIPvM7046944@gw.catspoiler.org> Date: Sun, 15 Jun 2003 19:56:58 +0100 From: Ian Dowse Message-ID: <200306151956.aa86884@salmon.maths.tcd.ie> cc: freebsd-arch@FreeBSD.org Subject: Re: Message buffer and printf reentrancy patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 15 Jun 2003 18:57:00 -0000 In message <200306151826.h5FIPvM7046944@gw.catspoiler.org>, Don Lewis writes: > >> +#define MSGBUF_SEQNORM(mbp, seq) ((seq) % (mbp)->msg_seqmod + ((seq) < 0 ? \ >> + (mbp)->msg_seqmod : 0)) >> +#define MSGBUF_SEQ_TO_POS(mbp, seq) ((int)((u_int)(seq) % \ >> + (u_int)(mbp)->msg_size)) >> +#define MSGBUF_SEQSUB(mbp, seq1, seq2) (MSGBUF_SEQNORM(mbp, (seq1) - (seq2))) >> + > >According to my copy of K&R, there is no guarantee that ((negative_int % >positive_int) <= 0) on all platforms, though this is generally true. > >If the sequence numbers wrap, there will be a discontinuity in the >sequence of normalized sequence numbers unless msg_seqmod evenly divides >the full integer range, which would indicate that msg_seqmod needs to be >a power of two on the platforms of interest. > >Integer division is a fairly slow operation for most CPUs, so why not just >enforce the power of two constraint and just grab the bottom bits of the >sequence numbers using a bitwise logical operation to normalize? The sequence number mechanism could do with a few further comments, as it's not particularly obvious what is going on. As you point out, a simple mapping from a binary sequence number to an index using the modulo operation will suffer discontinuities when the sequence numbers wrap, unless the size divides into the range of the sequence numbers.
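The discontinuity can be checked with a few lines of standalone C. This is only an illustration, assuming a free-running 32-bit unsigned sequence number that is reduced with a plain modulo. The function names are invented for the example, and size must be nonzero.

```c
#include <stdint.h>

/* Map a free-running 32-bit sequence number to a buffer index. */
uint32_t
pos_from_seq32(uint32_t seq, uint32_t size)
{
	return (seq % size);
}

/*
 * Returns 1 if the index still advances by exactly one (modulo size)
 * when the sequence number wraps from UINT32_MAX back to 0, and 0 if
 * the index jumps at the wrap.
 */
int
wrap_is_continuous(uint32_t size)
{
	uint32_t last = UINT32_MAX;
	uint32_t next = last + 1;	/* wraps around to 0 */
	uint32_t before = pos_from_seq32(last, size);
	uint32_t after = pos_from_seq32(next, size);

	return ((before + 1) % size == after);
}
```

The check passes only when size divides 2^32, that is, when size is a power of two. A size such as 32768 - 40 would jump at the wrap, which is the case the normalised sequence range is meant to avoid.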
The code here (unless I've missed something) deals with that by ensuring that the range of the sequence numbers is always a multiple of the message buffer size, and that's why the odd normalisation macro is needed. The msg_seqmod field is initialised to 16 times the message buffer size, so by using MSGBUF_SEQNORM() whenever the sequence numbers are updated, there are no discontinuities in the value of MSGBUF_SEQ_TO_POS() as the sequence numbers advance. By using atomic_cmpset*, it can be guaranteed that sequence numbers outside this range never make it to the pointers. The value 16 is just chosen to make it quite unlikely for an old sequence number to be interpreted as current. Bruce originally suggested this approach, and he suggested using a power of 2 message buffer size so that a simple binary operation could be performed in MSGBUF_SEQ_TO_POS(). The problem is that MSGBUF_SIZE has been documented for a long time as only being restricted to a multiple of the page size, and then the top few bytes get taken by the msgbuf structure. This combined with the fact that the message buffer is allocated in MD code would make it waste memory (you'd always lose PAGE_SIZE - sizeof(struct msgbuf)), messy to change, and would require many people to modify their kernel configurations. I don't particularly like the divisions either, but it seems very unlikely to me that they will be significant in practice.
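The normalisation described above can be sketched in plain C. The functions mirror the quoted MSGBUF_SEQNORM(), MSGBUF_SEQ_TO_POS() and MSGBUF_SEQSUB() macros, but struct seqbuf, seqbuf_init() and seq_next() are hypothetical stand-ins for this example, and the atomic_cmpset update of the real pointers is left out.

```c
/* Hypothetical stand-in for the two fields the macros consult. */
struct seqbuf {
	int	size;	/* bytes in the buffer */
	int	seqmod;	/* sequence range, kept a multiple of size */
};

void
seqbuf_init(struct seqbuf *sb, int size)
{
	sb->size = size;
	sb->seqmod = 16 * size;	/* the factor 16 comes from the mail */
}

/*
 * Fold seq back into [0, seqmod).  With C99 semantics '%' truncates
 * towards zero, so a negative seq yields a non-positive remainder,
 * which is shifted up by seqmod.  Intended for increments and
 * differences of already-normalised values.
 */
int
seq_norm(const struct seqbuf *sb, int seq)
{
	return (seq % sb->seqmod + (seq < 0 ? sb->seqmod : 0));
}

/* Buffer index for a normalised sequence number. */
int
seq_to_pos(const struct seqbuf *sb, int seq)
{
	return ((int)((unsigned int)seq % (unsigned int)sb->size));
}

/* Distance from seq2 forward to seq1. */
int
seq_sub(const struct seqbuf *sb, int seq1, int seq2)
{
	return (seq_norm(sb, seq1 - seq2));
}

/* Advance a normalised sequence number by one character. */
int
seq_next(const struct seqbuf *sb, int seq)
{
	return (seq_norm(sb, seq + 1));
}
```

Because seqmod is a multiple of size, the step from seqmod - 1 back to 0 moves the buffer index from size - 1 to 0, so the position keeps advancing by one even though size need not be a power of two.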
Ian From owner-freebsd-arch@FreeBSD.ORG Sun Jun 15 12:02:15 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id D9E5737B401 for ; Sun, 15 Jun 2003 12:02:15 -0700 (PDT) Received: from falcon.midgard.homeip.net (h76n3fls20o913.telia.com [213.67.148.76]) by mx1.FreeBSD.org (Postfix) with SMTP id B38A943FBF for ; Sun, 15 Jun 2003 12:02:12 -0700 (PDT) (envelope-from ertr1013@student.uu.se) Received: (qmail 75511 invoked by uid 1001); 15 Jun 2003 19:02:09 -0000 Date: Sun, 15 Jun 2003 21:02:09 +0200 From: Erik Trulsson To: Don Lewis Message-ID: <20030615190209.GA75458@falcon.midgard.homeip.net> Mail-Followup-To: Don Lewis , iedowse@maths.tcd.ie, freebsd-arch@FreeBSD.org References: <200306151406.aa36218@salmon.maths.tcd.ie> <200306151826.h5FIPvM7046944@gw.catspoiler.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200306151826.h5FIPvM7046944@gw.catspoiler.org> User-Agent: Mutt/1.5.4i cc: iedowse@maths.tcd.ie cc: freebsd-arch@FreeBSD.org Subject: Re: Message buffer and printf reentrancy patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 15 Jun 2003 19:02:16 -0000 On Sun, Jun 15, 2003 at 11:25:57AM -0700, Don Lewis wrote: > On 15 Jun, Ian Dowse wrote: > > > > Below is a patch that makes the implementation of the kernel message > > buffer mostly reentrant and more generic, and stops printf() ever > > calling directly into the tty code. This should fix panics that can > > occur via tputchar() when using xconsole, and generally make the > > use of printf() in the kernel a bit safer. Many of the ideas here > > were suggested by Bruce Evans. 
> > > > A summary of the changes: > > - Use atomic operations to update the message buffer pointers. > > - Use a kind of sequence number for the pointers instead of just > > the offset into the buffer, as this avoids the need for the read > > code to touch the write pointer or the write code to touch the > > read pointer. > > > +#define MSGBUF_SEQNORM(mbp, seq) ((seq) % (mbp)->msg_seqmod + ((seq) < 0 ? \ > > + (mbp)->msg_seqmod : 0)) > > +#define MSGBUF_SEQ_TO_POS(mbp, seq) ((int)((u_int)(seq) % \ > > + (u_int)(mbp)->msg_size)) > > +#define MSGBUF_SEQSUB(mbp, seq1, seq2) (MSGBUF_SEQNORM(mbp, (seq1) - (seq2))) > > + > > According to my copy of K&R, there is no guarantee that ((negative_int % > postive_int) <= 0) on all platforms, though this is generally true. With a C99 compiler it is always true. In C89 it was implementation defined if integer division rounded towards zero or towards negative-infinity. In C99 integer division always rounds towards zero. This combined with the fact that (a/b)*b + a%b == a is always true (for integer a,b and b!=0) means that (neg_int % pos_int <= 0 ) is always true in C99, while it wasn't always true in C89. -- Erik Trulsson ertr1013@student.uu.se From owner-freebsd-arch@FreeBSD.ORG Sun Jun 15 18:01:28 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 35B6837B401 for ; Sun, 15 Jun 2003 18:01:28 -0700 (PDT) Received: from sccrmhc11.attbi.com (sccrmhc11.attbi.com [204.127.202.55]) by mx1.FreeBSD.org (Postfix) with ESMTP id 813FF43F75 for ; Sun, 15 Jun 2003 18:01:27 -0700 (PDT) (envelope-from DougB@freebsd.org) Received: from master.dougb.net (12-234-22-23.client.attbi.com[12.234.22.23](untrusted sender)) by attbi.com (sccrmhc11) with SMTP id <200306160101260110080ea2e>; Mon, 16 Jun 2003 01:01:26 +0000 Date: Sun, 15 Jun 2003 18:01:33 -0700 (PDT) From: Doug Barton To: "Michael W . 
Lucas" In-Reply-To: <20030610192329.A15847@blackhelicopters.org> Message-ID: <20030615173721.Q32802@znfgre.qbhto.arg> References: <20030610124747.A7560@phantom.cris.net> <20030610192329.A15847@blackhelicopters.org> Organization: http://www.FreeBSD.org/ X-message-flag: Outlook -- Not just for spreading viruses anymore! MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: arch@FreeBSD.org cc: Alexey Zelkin Subject: Re: removing stale files (was: Re: cvs commit: src/etc Makefile locale.alias locale.deprecated nls.alias) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 01:01:28 -0000 On Tue, 10 Jun 2003, Michael W . Lucas wrote: > [cc trimmed] > > On Tue, Jun 10, 2003 at 12:47:47PM +0300, Alexey Zelkin wrote: > > But I think there're already someone who > > has it implemented. Otherwise I'll spend some time, write and commit > > it. > > NetBSD's etcupdate has this functionality, and a bunch of other stuff. > > Of course, etcupdate started life as our mergemaster... That's not strictly correct. I had a chat with luke about this recently. He has somewhat different goals in mind than I do. etcupdate does things differently than mergemaster does it, and there is also another program called /etc/postinstall that does a few other things differently, but related. The main difference between his approach and mine is that his stuff has specific, and sometimes detailed knowledge about individual files. I purposely avoided that approach, which makes mergemaster a lot more flexible, at the cost of not necessarily making the _files_ all the same when you're done. What I wanted to do instead was to make the _configuration_ the same (i.e., updated to the latest stuff), while potentially leaving some crufty files behind that should probably be deleted by hand at some point. 
There are pluses and minuses to both approaches. My method allows the users more flexibility in customizing their stuff, at the cost of requiring them to do more maintenance by hand to keep things clean. Since stale files in /etc usually don't cause much if any harm, I felt like this was the better approach. Doug -- This .signature sanitized for your protection From owner-freebsd-arch@FreeBSD.ORG Sun Jun 15 18:21:46 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 8CDD437B401 for ; Sun, 15 Jun 2003 18:21:46 -0700 (PDT) Received: from salmon.maths.tcd.ie (salmon.maths.tcd.ie [134.226.81.11]) by mx1.FreeBSD.org (Postfix) with SMTP id 7812843FBD for ; Sun, 15 Jun 2003 18:21:45 -0700 (PDT) (envelope-from iedowse@maths.tcd.ie) Received: from walton.maths.tcd.ie by salmon.maths.tcd.ie with SMTP id ; 16 Jun 2003 02:21:44 +0100 (BST) To: "M. Warner Losh" In-Reply-To: Your message of "Sun, 15 Jun 2003 12:28:52 MDT." <20030615.122852.26275397.imp@bsdimp.com> Date: Mon, 16 Jun 2003 02:21:44 +0100 From: Ian Dowse Message-ID: <200306160221.aa52952@salmon.maths.tcd.ie> cc: freebsd-arch@freebsd.org Subject: Re: Message buffer and printf reentrancy patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 01:21:46 -0000 In message <20030615.122852.26275397.imp@bsdimp.com>, "M. Warner Losh" writes: >In message: <200306151406.aa36218@salmon.maths.tcd.ie> > Ian Dowse writes: >: occur via tputchar() when using xconsole, and generally make the >: use of printf() in the kernel a bit safer. Many of the ideas here >: were suggested by Bruce Evans. > >Safe enough to be used for ddb? That is, can we get the ddb output in >out dmesg buffers? 
That's what I use it for anyway - I use a laptop that has no serial ports but does retain its memory contents across a reboot, so having panic messages and stack traces left in the message buffer helps a lot. There are probably a few ways of doing it, but I just replaced db_putchar()'s contents with the following: printf("%c", c); if (c == '\r' || c == '\n') db_check_interrupt(); This probably needs to have a sysctl to control the behaviour, as many people won't want the message buffer to be filled up with interactive DDB output normally. Ian From owner-freebsd-arch@FreeBSD.ORG Sun Jun 15 20:37:04 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 53E7E37B401; Sun, 15 Jun 2003 20:37:04 -0700 (PDT) Received: from mail.chesapeake.net (chesapeake.net [208.142.252.6]) by mx1.FreeBSD.org (Postfix) with ESMTP id 3EA2043FD7; Sun, 15 Jun 2003 20:37:03 -0700 (PDT) (envelope-from jroberson@chesapeake.net) Received: from localhost (jroberson@localhost) by mail.chesapeake.net (8.11.6/8.11.6) with ESMTP id h5G3b2K60019; Sun, 15 Jun 2003 23:37:02 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Sun, 15 Jun 2003 23:37:02 -0400 (EDT) From: Jeff Roberson To: Don Lewis In-Reply-To: <200306141643.h5EGh7M7044570@gw.catspoiler.org> Message-ID: <20030615233553.N36168-100000@mail.chesapeake.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: arch@freebsd.org Subject: Re: vnode/buf locking deadlock between nfsiod and getblk() X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 03:37:04 -0000 On Sat, 14 Jun 2003, Don Lewis wrote: > The one remaining vnode locking issue in the NFS client code that I'm > aware of is that nfs_doio() does stuff with the vnode 
associated with > the buf passed to it that requires the vnode lock to be held, but the > vnode is not locked when nfs_nfsiod() calls nfs_doio(). > > I've made a couple of attempts to fix nfs_nfsiod() by locking the vnode, > and I've always run into deadlocks like this: > > 43 c63f9b58 e4b3c000 0 0 0 0000204 [SLP]nfs 0xc687f794] nfsiod 0 > > 570 c6728790 e6ddb000 1001 563 570 0004002 [SLP]getblk 0xd28980a4] ls > > mi_switch(c61c6000,50,c051cc8e,cd,0) at mi_switch+0x210 > msleep(c687f794,c05dffc4,50,c05299ce,0) at msleep+0x484 > acquire(e4b0ec6c,1000000,600,f1,c61c6000) at acquire+0x9e > lockmgr(c687f794,1010002,c687f6d8,c61c6000,d2897fd8) at lockmgr+0x387 > vop_sharedlock(e4b0ec9c,0,c0524105,360,e4b0ecb0) at vop_sharedlock+0x84 > vn_lock(c687f6d8,20002,c61c6000,c0529f78,0) at vn_lock+0xe9 > nfssvc_iod(c060d6e0,e4b0ed48,c051a213,30e,0) at nfssvc_iod+0x12a > fork_exit(c03e3e10,c060d6e0,e4b0ed48) at fork_exit+0xc0 > fork_trampoline() at fork_trampoline+0x1a > > mi_switch(c6729390,50,c0328ad0,c6729390,0) at mi_switch+0x210 > msleep(d28980a4,c05e06ac,50,c0522430,c8) at msleep+0x484 > acquire(e6dbc9e8,2000020,600,f1,c6729390) at acquire+0x9e > lockmgr(d28980a4,2090022,c687f6d8,c6729390,c687f6d8) at lockmgr+0x387 > BUF_TIMELOCK(d2897fd8,10022,c687f6d8,c0522430,0) at BUF_TIMELOCK+0x80 > getblk(c687f6d8,1,0,1000,0) at getblk+0x141 > nfs_getcacheblk(c687f6d8,1,0,1000,c6729390) at nfs_getcacheblk+0xc9 > nfs_bioread(c687f6d8,e6dbccb4,0,c66f2d00,165) at nfs_bioread+0x87a > nfs_readdir(e6dbcc34,c05158ca,c05b6c20,c687f6d8,e6dbccb4) at nfs_readdir+0xd4 > VOP_READDIR(c687f6d8,e6dbccb4,c66f2d00,e6dbcc84,0) at VOP_READDIR+0x67 > getdirentries(c6729390,e6dbcd10,c0537b17,3fd,4) at getdirentries+0x11d > syscall(2f,2f,2f,80e2600,80d9040) at syscall+0x26e > Xint0x80_syscall() at Xint0x80_syscall+0x1d > > In this case, 'ls' had a vnode locked and was trying to lock a buf, and > 'nfsiod' was waiting to obtain a lock on the same vnode. 
> > > I finally dug around in the code and discovered that the problem is > fairly fundamental. If a thread calls VOP_STRATEGY() for > asynchronous I/O on an NFS mounted filesystem, or if it calls > nfs_bioread() which decides to do readahead, the request is handled by > nfs_asyncio(), which uses BUF_KERNPROC() to transfer ownership of the > buf lock to the system, and queues the buf on nmp->nm_bufq for nfsiod to > handle later. > > Everything is fine if nfsiod is able to service the request before > another thread requests the buf. The problem occurs when another thread > attempts to do I/O on the file, grabs the vnode lock and then tries to > grab the buf lock before nfsiod has gotten around to servicing the request. > The thread requesting the I/O can't proceed until it gets the buf lock, > which won't happen until the queued request has been serviced, and > nfsiod can't handle the I/O request in the buf because it can't obtain the > vnode lock. The only reason that we don't see this failure is that > nfsiod is not requesting the vnode lock and is allowing nfs_doio() to > play with an unlocked vnode (or one locked by another thread). > > I came up with three possible ways of fixing this, none of which sound > very appealing: > > Fix nfs_doio() so that it and the functions that it calls don't > touch any vnode fields that require the vnode lock. > > When attempting to lock a buf whose current lockholder is > LK_KERNPROC, back off by dropping the vnode lock and retrying. There are several other places in the kernel that use the solution above. If you go backwards from the buf to the vnode lock you must trylock or equivalent. > > When attempting to lock a buf whose current lockholder is > LK_KERNPROC, steal the buf back and do the requested I/O > synchronously before proceeding if the previously requested I/O > was not already in progress. > > Comments? Suggestions?
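The "trylock or equivalent" rule can be shown with a user-space analogy. This is not lockmgr or the buf lock API: struct toy_lock and all function names are hypothetical, and C11 atomic_flag stands in for a real kernel lock. The sketch only shows the shape of taking two locks against the usual order without ever sleeping on the second one.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a kernel lock. */
struct toy_lock {
	atomic_flag held;
};

void
toy_lock_init(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/* Non-blocking acquire: true on success, false if already held. */
bool
toy_trylock(struct toy_lock *l)
{
	return (!atomic_flag_test_and_set(&l->held));
}

void
toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Take the locks in the "backwards" order, buf first and vnode second.
 * The second lock is only ever tried, never waited for.  If it is
 * busy, the first lock is released again so that a thread using the
 * normal vnode-then-buf order can make progress, and the caller is
 * expected to retry later.
 */
bool
lock_buf_then_vnode(struct toy_lock *buf, struct toy_lock *vnode)
{
	if (!toy_trylock(buf))
		return (false);
	if (!toy_trylock(vnode)) {
		toy_unlock(buf);	/* back off instead of sleeping */
		return (false);
	}
	return (true);
}
```

A caller would loop on lock_buf_then_vnode() with some delay or yield between attempts. Whether the first lock can be dropped like this in a given code path is a separate question that the sketch does not answer.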
> _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Sun Jun 15 21:16:57 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id E576D37B401; Sun, 15 Jun 2003 21:16:56 -0700 (PDT) Received: from mailman.zeta.org.au (mailman.zeta.org.au [203.26.10.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id BBB4043F75; Sun, 15 Jun 2003 21:16:54 -0700 (PDT) (envelope-from bde@zeta.org.au) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailman.zeta.org.au (8.9.3p2/8.8.7) with ESMTP id OAA27571; Mon, 16 Jun 2003 14:16:35 +1000 Date: Mon, 16 Jun 2003 14:16:34 +1000 (EST) From: Bruce Evans X-X-Sender: bde@gamplex.bde.org To: Ian Dowse In-Reply-To: <200306151956.aa86884@salmon.maths.tcd.ie> Message-ID: <20030616125941.G26874@gamplex.bde.org> References: <200306151956.aa86884@salmon.maths.tcd.ie> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: Don Lewis cc: freebsd-arch@freebsd.org Subject: Re: Message buffer and printf reentrancy patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 04:16:57 -0000 On Sun, 15 Jun 2003, Ian Dowse wrote: > In message <200306151826.h5FIPvM7046944@gw.catspoiler.org>, Don Lewis writes: > > > >> +#define MSGBUF_SEQNORM(mbp, seq) ((seq) % (mbp)->msg_seqmod + ((seq) < 0 ? 
> >\ > >> + (mbp)->msg_seqmod : 0)) > >> +#define MSGBUF_SEQ_TO_POS(mbp, seq) ((int)((u_int)(seq) % \ > >> + (u_int)(mbp)->msg_size)) > >> +#define MSGBUF_SEQSUB(mbp, seq1, seq2) (MSGBUF_SEQNORM(mbp, (seq1) - (seq2))) > >> + Sorry I didn't reply to Ian's private mail about all this last month. I'll try to get back to it. > >According to my copy of K&R, there is no guarantee that ((negative_int % > >positive_int) <= 0) on all platforms, though this is generally true. C99 guarantees this perfect brokenness of the % operator. Division should give remainders that have the same sign as the divisor, which corresponds to rounding towards minus infinity for positive divisors, but is now specified to be bug-for-bug compatible with most hardware and most C implementations (round towards zero). MSGBUF_SEQ_TO_POS() does extra work to get nonnegative remainders. This problem and many casts could be avoided by using unsigned types for most of the msgbuf fields. I forget the details of why we changed them back to signed. The log message for msgbuf.h 1.19 says that this is because we perform signed arithmetic on them. The details of this can probably be handled by the macros now. > >If the sequence numbers wrap, there will be a discontinuity in the > >sequence of normalized sequence numbers unless msg_seqmod evenly divides > >the full integer range, which would indicate that msg_seqmod needs to be > >a power of two on the platforms of interest. > > > >Integer division is a fairly slow operation for most CPUs, so why not just > >enforce the power of two constraint and just grab the bottom bits of the > >sequence numbers using a bitwise logical operation to normalize? > > The sequence number mechanism could do with a few further comments, > as it's not particularly obvious what is going on.
As you point > out, a simple mapping from a binary sequence number to an index > using the modulo operation will suffer discontinuities when the > sequence numbers wrap, unless the size divides into the range of > the sequence numbers. > > The code here (unless I've missed something) deals with that by > ensuring that the range of the sequence numbers is always a multiple > of the message buffer size, and that's why the odd normalisation > macro is needed. The msg_seqmod field is initialised to 16 times > the message buffer size, so by using MSGBUF_SEQNORM() whenever the > sequence numbers are updated, there are no discontinuities in the > value of MSGBUF_SEQ_TO_POS() as the sequence numbers advance. By > using atomic_cmpset*, it can be guaranteed that sequence numbers > outside this range never make it to the pointers. The value 16 is > just chosen to make it quite unlikely for an old sequence number > to be interpreted as current. This seems correct but messy. > Bruce originally suggested this approach, and he suggested using a > power of 2 message buffer size so that a simple binary operation > could be performed in MSGBUF_SEQ_TO_POS(). The problem is that > MSGBUF_SIZE has been documented for a long time as only being > restricted to a multiple of the page size, and then the top few > bytes get taken by the msgbuf structure. This combined with the > fact that the message buffer is allocated in MD code would make it > waste memory (you'd always lose PAGE_SIZE - sizeof(struct msgbuf)), > messy to change, and would require many people to modify their > kernel configurations. I don't particularly like the divisions > either, but it seems very unlikely to me that they will be significant > in practice. I mainly suggested the power of 2 part. My original idea for the sequence numbers was to interpret msg_bufx as a sequence number instead of as an index to fix the current races setting the index.
The most serious race is resetting the index to 0 and this goes away with sequence numbers since sequence numbers can grow without bounds (actually to the limit of the data type, which is almost enough for anyone with 32-bit ints). The details are messier in practice :-]. Bruce From owner-freebsd-arch@FreeBSD.ORG Sun Jun 15 22:07:10 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 058CC37B401 for ; Sun, 15 Jun 2003 22:07:10 -0700 (PDT) Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2097A43FBF for ; Sun, 15 Jun 2003 22:07:09 -0700 (PDT) (envelope-from truckman@FreeBSD.org) Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.12.9/8.12.9) with ESMTP id h5G56xM7047970; Sun, 15 Jun 2003 22:07:03 -0700 (PDT) (envelope-from truckman@FreeBSD.org) Message-Id: <200306160507.h5G56xM7047970@gw.catspoiler.org> Date: Sun, 15 Jun 2003 22:06:59 -0700 (PDT) From: Don Lewis To: jroberson@chesapeake.net In-Reply-To: <20030615233553.N36168-100000@mail.chesapeake.net> MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii cc: arch@FreeBSD.org Subject: Re: vnode/buf locking deadlock between nfsiod and getblk() X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 05:07:10 -0000 On 15 Jun, Jeff Roberson wrote: > On Sat, 14 Jun 2003, Don Lewis wrote: > >> I finally dug around in the code and discovered that the problem is >> fairly fundamental.
If a thread calls VOP_STRATEGY() for >> asynchronous I/O on an NFS mounted filesystem, or if it calls >> nfs_bioread() which decides to do readahead, the request is handled by >> nfs_asyncio(), which uses BUF_KERNPROC() to transfer ownership of the >> buf lock to the system, and queues the buf on nmp->nm_bufq for nfsiod to >> handle later. >> >> Everything is fine if nfsiod is able to service the request before >> another thread requests the buf. The problem occurs when another thread >> attempts to do I/O on the file, grabs the vnode lock and then tries to >> grab the buf lock before nfsiod has gotten around to servicing the request. >> The thread requesting the I/O can't proceed until it gets the buf lock, >> which won't happen until the queued request has been serviced, and >> nfsiod can't handle the I/O request in the buf because it can't obtain the >> vnode lock. The only reason that we don't see this failure is that >> nfsiod is not requesting the vnode lock and is allowing nfs_doio() to >> play with an unlocked vnode (or one locked by another thread). >> >> I came up with three possible ways of fixing this, none of which sound >> very appealing: >> >> Fix nfs_doio() so that it and the functions that it calls don't >> touch any vnode fields that require the vnode lock. >> >> When attempting to lock a buf whose current lockholder is >> LK_KERNPROC, back off by dropping the vnode lock and retrying. > > There are several other places in the kernel that use the solution above. > If you go backwards from the buf to the vnode lock you must trylock or > equivalent. I had actually thought about doing this in nfssvc_iod(), which is where we try to grab the locks in the wrong order, but I didn't attempt to actually implement this because I wasn't convinced that the implementation I had in mind wouldn't spin in a tight loop in some circumstances.
Now that I've looked at it in more detail, I'm pretty sure that we
can't drop the lock on the buf in nfssvc_iod() to back all the way out
since it looks like the buf needs to remain locked until the I/O
request is completed.  The problem is nfsiod can't complete the I/O
request without the vnode lock, and it can't get the vnode lock if some
other thread has locked the vnode and is waiting to lock the buf.

My suggestion above is to have nfs_getcacheblk() or getblk() temporarily
fail and have its caller drop any vnode locks and retry.  This strikes
me as the worst of these three choices.

>>	When attempting to lock a buf whose current lockholder is
>>	LK_KERNPROC, steal the buf back and do the requested I/O
>>	synchronously before proceeding if the previously requested I/O
>>	was not already in progress.

I actually like this one the best.  If there is a large amount of I/O
queued on the NFS mount point for nfsiod to handle, this has the effect
of giving a higher priority to I/O requests that a process is waiting
for than to write behind and speculative read ahead requests.

Another undesirable "fix" would be to give away the vnode lock to nfsiod
at the same time as the buf lock is given away.  That should do wonders
for performance ...
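The back-off option discussed above (hold the vnode lock, try the buf lock without sleeping, and drop the vnode lock before retrying if the buf is busy) can be sketched in userland with POSIX mutexes.  This is only an illustration of the lock-ordering pattern, not kernel code: struct lock_pair, lock_vnode_then_buf() and unlock_both() are hypothetical names, and the bounded retry count is an assumption added here so that the loop cannot spin forever, which is the concern raised about a tight loop.

```c
#include <errno.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical stand-ins for a vnode lock and a buf lock. */
struct lock_pair {
	pthread_mutex_t vnode_lock;
	pthread_mutex_t buf_lock;
};

/*
 * Take the vnode lock, then try the buf lock without sleeping.
 * If the buf lock is busy, drop the vnode lock so that whoever
 * needs it can make progress, yield, and retry.
 * Returns 0 with both locks held, or EBUSY with neither lock held
 * once max_tries attempts have failed.
 */
static int
lock_vnode_then_buf(struct lock_pair *lp, int max_tries)
{
	int i;

	for (i = 0; i < max_tries; i++) {
		pthread_mutex_lock(&lp->vnode_lock);
		if (pthread_mutex_trylock(&lp->buf_lock) == 0)
			return (0);
		/* Back off: never sleep on the buf while holding the vnode. */
		pthread_mutex_unlock(&lp->vnode_lock);
		sched_yield();
	}
	return (EBUSY);
}

/* Release in the reverse order of acquisition. */
static void
unlock_both(struct lock_pair *lp)
{
	pthread_mutex_unlock(&lp->buf_lock);
	pthread_mutex_unlock(&lp->vnode_lock);
}
```

A caller that gets EBUSY back holds no locks and can decide whether to retry later or fall back to something else, such as doing the I/O synchronously.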
From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 00:40:50 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 4902B37B401 for ; Mon, 16 Jun 2003 00:40:50 -0700 (PDT) Received: from mail.cyberonic.com (mail.cyberonic.com [4.17.179.4]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5D8D543F85 for ; Mon, 16 Jun 2003 00:40:49 -0700 (PDT) (envelope-from jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (node-40244c0a.sfo.onnet.us.uu.net [64.36.76.10]) by mail.cyberonic.com (8.12.8/8.12.5) with ESMTP id h5G86EMo028154 for ; Mon, 16 Jun 2003 04:06:15 -0400 Received: (from jmg@localhost) by hydrogen.funkthat.com (8.12.9/8.11.6) id h5G7fMVc082801 for arch@freebsd.org; Mon, 16 Jun 2003 00:41:22 -0700 (PDT) (envelope-from jmg) Date: Mon, 16 Jun 2003 00:41:22 -0700 From: John-Mark Gurney To: arch@freebsd.org Message-ID: <20030616074122.GF73854@funkthat.com> Mail-Followup-To: arch@freebsd.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.1i X-Operating-System: FreeBSD 4.2-RELEASE i386 X-PGP-Fingerprint: B7 EC EF F8 AE ED A7 31 96 7A 22 B3 D8 56 36 F4 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html Subject: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list Reply-To: John-Mark Gurney List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 07:40:50 -0000 Does anyone have an objection to making /dev/pci really honor the permissions, and giving normal users (or just group wheel) premission to run pciconf -l. Right now the code requires the write bit set for any operation. 
-- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 02:42:52 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 8A61837B41F for ; Mon, 16 Jun 2003 02:42:52 -0700 (PDT) Received: from mailman.zeta.org.au (mailman.zeta.org.au [203.26.10.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id E6A2143FBF for ; Mon, 16 Jun 2003 02:42:50 -0700 (PDT) (envelope-from bde@zeta.org.au) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailman.zeta.org.au (8.9.3p2/8.8.7) with ESMTP id TAA07280; Mon, 16 Jun 2003 19:42:44 +1000 Date: Mon, 16 Jun 2003 19:42:43 +1000 (EST) From: Bruce Evans X-X-Sender: bde@gamplex.bde.org To: John-Mark Gurney In-Reply-To: <20030616074122.GF73854@funkthat.com> Message-ID: <20030616193932.X27844@gamplex.bde.org> References: <20030616074122.GF73854@funkthat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: arch@freebsd.org Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 09:42:52 -0000 On Mon, 16 Jun 2003, John-Mark Gurney wrote: > Does anyone have an objection to making /dev/pci really honor the > permissions, and giving normal users (or just group wheel) premission > to run pciconf -l. Right now the code requires the write bit set for > any operation. IIRC, it is like it is because reading it may have side effects (and thus isn't really just reading). If it honored the permissions then it should have mode 600 so that normal users can't run pciconf -l :-]. 
Bruce From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 03:48:32 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 579BB37B401 for ; Mon, 16 Jun 2003 03:48:32 -0700 (PDT) Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id AB97443FA3 for ; Mon, 16 Jun 2003 03:48:31 -0700 (PDT) (envelope-from truckman@FreeBSD.org) Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.12.9/8.12.9) with ESMTP id h5GAmMM7048782; Mon, 16 Jun 2003 03:48:27 -0700 (PDT) (envelope-from truckman@FreeBSD.org) Message-Id: <200306161048.h5GAmMM7048782@gw.catspoiler.org> Date: Mon, 16 Jun 2003 03:48:22 -0700 (PDT) From: Don Lewis To: bde@zeta.org.au In-Reply-To: <20030616125941.G26874@gamplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii cc: iedowse@maths.tcd.ie cc: freebsd-arch@FreeBSD.org Subject: Re: Message buffer and printf reentrancy patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 10:48:32 -0000 On 16 Jun, Bruce Evans wrote: > On Sun, 15 Jun 2003, Ian Dowse wrote: > >> In message <200306151826.h5FIPvM7046944@gw.catspoiler.org>, Don Lewis writes: >> > >> >> +#define MSGBUF_SEQNORM(mbp, seq) ((seq) % (mbp)->msg_seqmod + ((seq) < 0 ? >> >\ >> >> + (mbp)->msg_seqmod : 0)) >> >> +#define MSGBUF_SEQ_TO_POS(mbp, seq) ((int)((u_int)(seq) % \ >> >> + (u_int)(mbp)->msg_size)) >> >> +#define MSGBUF_SEQSUB(mbp, seq1, seq2) (MSGBUF_SEQNORM(mbp, (seq1) - (seq2) >> >)) >> >> + > > Sorry I didn't reply to Ian's provate mail about all this last month. I'll > try to get back to it. 
>
>> >According to my copy of K&R, there is no guarantee that ((negative_int %
>> >positive_int) <= 0) on all platforms, though this is generally true.
>
> C99 guarantees this perfect brokenness of the % operator.  Division should
> give remainders that have the same sign as the divisor, which corresponds
> to rounding towards minus infinity for positive divisors, but is now
> specified to be bug for bug compatible with most hardware and most C
> implementations (round towards zero).
>
> MSGBUF_SEQ_TO_POS() does extra work to get nonnegative remainders.
>
> This problem and many casts could be avoided by using unsigned types
> for most of the msgbuf fields.  I forget the details of why we changed
> them back to signed.  The log message for msgbuf.h 1.19 says that this
> is because we perform signed arithmetic on them.  The details for this
> can probably be handled by the macros now.

Using unsigned types was the first thing that I thought of.  I was
wondering if the reason that this wasn't done was some sort of
portability problem with the atomic operations.

It looks like MSGBUF_SEQNORM() could avoid the conditional code and any
questions about signed remainders if it was defined like this:

#define MSGBUF_SEQNORM(mbp, seq) (((seq) + (mbp)->msg_seqmod) % \
    (mbp)->msg_seqmod)

as long as msg_seqmod < INT_MAX/2.  MSGBUF_SEQNORM() could be simplified
further if msg_seqmod was added by the caller (such as MSGBUF_SEQSUB())
if the argument could be negative.
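A minimal standalone sketch of the biased definition suggested above, for anyone who wants to check the arithmetic outside the kernel.  struct msgbuf_sketch and seqmod_ok() are hypothetical and exist only for this illustration; the macro body is the one proposed in this message.  The sketch assumes C99 semantics for % (remainder takes the sign of the dividend) and that seq is larger than -msg_seqmod.

```c
#include <limits.h>

/* Hypothetical cut-down msgbuf holding only the field the macro uses. */
struct msgbuf_sketch {
	int msg_seqmod;		/* modulus for sequence numbers */
};

/*
 * Bias by msg_seqmod before taking the remainder, so the left operand
 * of % is never negative for any seq > -msg_seqmod.  No conditional
 * and no dependence on how % treats negative operands.
 */
#define MSGBUF_SEQNORM(mbp, seq) \
	(((seq) + (mbp)->msg_seqmod) % (mbp)->msg_seqmod)

/*
 * The bias must not overflow an int: with seq < msg_seqmod, the sum
 * stays below INT_MAX as long as msg_seqmod < INT_MAX / 2.
 */
static int
seqmod_ok(int seqmod)
{
	return (seqmod > 0 && seqmod < INT_MAX / 2);
}
```

With msg_seqmod set to 128, a raw difference of -3 normalizes to 125 and 130 normalizes to 2, whereas plain -3 % 128 gives -3 under C99.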
From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 04:13:18 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 99A9E37B404; Mon, 16 Jun 2003 04:13:18 -0700 (PDT) Received: from mailman.zeta.org.au (mailman.zeta.org.au [203.26.10.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4BA0343FAF; Mon, 16 Jun 2003 04:13:17 -0700 (PDT) (envelope-from bde@zeta.org.au) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailman.zeta.org.au (8.9.3p2/8.8.7) with ESMTP id VAA16549; Mon, 16 Jun 2003 21:13:14 +1000 Date: Mon, 16 Jun 2003 21:13:13 +1000 (EST) From: Bruce Evans X-X-Sender: bde@gamplex.bde.org To: Don Lewis In-Reply-To: <200306161048.h5GAmMM7048782@gw.catspoiler.org> Message-ID: <20030616205631.F28116@gamplex.bde.org> References: <200306161048.h5GAmMM7048782@gw.catspoiler.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: iedowse@maths.tcd.ie cc: freebsd-arch@FreeBSD.org Subject: Re: Message buffer and printf reentrancy patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 11:13:18 -0000 On Mon, 16 Jun 2003, Don Lewis wrote: > On 16 Jun, Bruce Evans wrote: > > On Sun, 15 Jun 2003, Ian Dowse wrote: > >> >> ... > >> >> +#define MSGBUF_SEQSUB(mbp, seq1, seq2) (MSGBUF_SEQNORM(mbp, (seq1) - (seq2) > > ... > > This problem and many casts could be avoided by using unsigned types > > for most of the msgbuf fields. I forget the details of why we changed > > them back to signed. The log message for msgbuf.h 1.19 says that this > > is because we perform signed arithmetic on them. The details for this, > > can probably be handled by the macros now. > > Using unsigned types was the first thing that I thought of. 
I was
> wondering if the reason that this wasn't done was some sort of
> portability problem with the atomic operations.

MSGBUF_SEQSUB() takes differences of sequence numbers now.  The
differences can be negative.  Though the macro could convert to a signed
type, the range of sequence numbers must be limited for their differences
to fit in a signed type, so the type for sequence numbers may as well be
signed too.

> It looks like MSGBUF_SEQNORM() could avoid the conditional code and any
> questions about signed remainders if it was defined like this:
>
> #define MSGBUF_SEQNORM(mbp, seq) (((seq) + (mbp)->msg_seqmod) % \
>     (mbp)->msg_seqmod)
>
> as long as msg_seqmod < INT_MAX/2.  MSGBUF_SEQNORM() could be simplified
> further if msg_seqmod was added by the caller (such as MSGBUF_SEQSUB())
> if the argument could be negative.

Yes.  The negative numbers of interest seem to be limited to at most
differences of sequence numbers (or maybe differences of indexes, which
are smaller), so they are larger than -msg_seqmod.  MSGBUF_SEQSUB()
shouldn't add the bias, however, since it is used in contexts where we
really want to see the negative values.
Bruce From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 10:06:21 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A634037B401 for ; Mon, 16 Jun 2003 10:06:21 -0700 (PDT) Received: from mail.cyberonic.com (mail.cyberonic.com [4.17.179.4]) by mx1.FreeBSD.org (Postfix) with ESMTP id 7CC0B43F85 for ; Mon, 16 Jun 2003 10:06:20 -0700 (PDT) (envelope-from jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (node-40244c0a.sfo.onnet.us.uu.net [64.36.76.10]) by mail.cyberonic.com (8.12.8/8.12.5) with ESMTP id h5GHVpMo031881; Mon, 16 Jun 2003 13:31:52 -0400 Received: (from jmg@localhost) by hydrogen.funkthat.com (8.12.9/8.11.6) id h5GH6jM1009680; Mon, 16 Jun 2003 10:06:45 -0700 (PDT) (envelope-from jmg) Date: Mon, 16 Jun 2003 10:06:45 -0700 From: John-Mark Gurney To: Bruce Evans Message-ID: <20030616170645.GI73854@funkthat.com> Mail-Followup-To: Bruce Evans , arch@freebsd.org References: <20030616074122.GF73854@funkthat.com> <20030616193932.X27844@gamplex.bde.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030616193932.X27844@gamplex.bde.org> User-Agent: Mutt/1.4.1i X-Operating-System: FreeBSD 4.2-RELEASE i386 X-PGP-Fingerprint: B7 EC EF F8 AE ED A7 31 96 7A 22 B3 D8 56 36 F4 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html cc: arch@freebsd.org Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list Reply-To: John-Mark Gurney List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 17:06:21 -0000 Bruce Evans wrote this message on Mon, Jun 16, 2003 at 19:42 +1000: > On Mon, 16 Jun 2003, John-Mark Gurney wrote: > > > Does anyone have an 
objection to making /dev/pci really honor the > > permissions, and giving normal users (or just group wheel) premission > > to run pciconf -l. Right now the code requires the write bit set for > > any operation. > > IIRC, it is like it is because reading it may have side effects (and > thus isn't really just reading). If it honored the permissions then > it should have mode 600 so that normal users can't run pciconf -l :-]. Now if we were reading the pci registers with -r, then yes, but -l just copys the data from pci_devinfo. If we wanted to make -r readable, we'd have to clamp the registers passed in, and make sure that all platforms didn't trap on PCI register reads (a patch for sparc should be going in soon). -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 10:56:43 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 89CC337B401 for ; Mon, 16 Jun 2003 10:56:43 -0700 (PDT) Received: from fledge.watson.org (fledge.watson.org [204.156.12.50]) by mx1.FreeBSD.org (Postfix) with ESMTP id CD2D043F75 for ; Mon, 16 Jun 2003 10:56:42 -0700 (PDT) (envelope-from robert@fledge.watson.org) Received: from fledge.watson.org (localhost [127.0.0.1]) by fledge.watson.org (8.12.9/8.12.9) with ESMTP id h5GHt0YA008904; Mon, 16 Jun 2003 13:55:00 -0400 (EDT) (envelope-from robert@fledge.watson.org) Received: from localhost (robert@localhost)h5GHsxhq008901; Mon, 16 Jun 2003 13:54:59 -0400 (EDT) (envelope-from robert@fledge.watson.org) Date: Mon, 16 Jun 2003 13:54:59 -0400 (EDT) From: Robert Watson X-Sender: robert@fledge.watson.org To: John-Mark Gurney In-Reply-To: <20030616074122.GF73854@funkthat.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: arch@freebsd.org Subject: Re: make /dev/pci really readable X-BeenThere: 
freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 17:56:43 -0000 On Mon, 16 Jun 2003, John-Mark Gurney wrote: > Does anyone have an objection to making /dev/pci really honor the > permissions, and giving normal users (or just group wheel) premission to > run pciconf -l. Right now the code requires the write bit set for any > operation. I seem to recall that there was a problem wherein user processes could cause cause unaligned accesses using /dev/pci. There's also some rather odd use of useracc(), printf(), etc, in the ioctl code. I suspect this code needs some fairly thorough review and cleanup before we should reduce the level of privilege required to use the device (note that we make it world readable by default, so changes in the semantics of read permissions will affect all users in the system). Could you do that cleanup in the first pass, then revisit the permissions change? 
Robert N M Watson FreeBSD Core Team, TrustedBSD Projects robert@fledge.watson.org Network Associates Laboratories From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 11:40:18 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 863CB37B401 for ; Mon, 16 Jun 2003 11:40:18 -0700 (PDT) Received: from magic.adaptec.com (magic-mail.adaptec.com [208.236.45.100]) by mx1.FreeBSD.org (Postfix) with ESMTP id F190943F75 for ; Mon, 16 Jun 2003 11:40:17 -0700 (PDT) (envelope-from scottl@freebsd.org) Received: from redfish.adaptec.com (redfish.adaptec.com [162.62.50.11]) by magic.adaptec.com (8.11.6/8.11.6) with ESMTP id h5GIe7815887; Mon, 16 Jun 2003 11:40:07 -0700 Received: from freebsd.org (hollin.btc.adaptec.com [10.100.253.56]) by redfish.adaptec.com (8.8.8p2+Sun/8.8.8) with ESMTP id LAA15915; Mon, 16 Jun 2003 11:40:16 -0700 (PDT) Message-ID: <3EEE0E65.1000304@freebsd.org> Date: Mon, 16 Jun 2003 12:37:25 -0600 From: Scott Long User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.3) Gecko/20030414 X-Accept-Language: en-us, en MIME-Version: 1.0 To: John-Mark Gurney References: <20030616074122.GF73854@funkthat.com> <20030616193932.X27844@gamplex.bde.org> <20030616170645.GI73854@funkthat.com> In-Reply-To: <20030616170645.GI73854@funkthat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit cc: arch@freebsd.org Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 18:40:18 -0000 John-Mark Gurney wrote: > Bruce Evans wrote this message on Mon, Jun 16, 2003 at 19:42 +1000: > >>On Mon, 16 Jun 2003, John-Mark Gurney wrote: >> >> >>>Does anyone have an objection to making /dev/pci really honor the 
>>>permissions, and giving normal users (or just group wheel) premission >>>to run pciconf -l. Right now the code requires the write bit set for >>>any operation. >> >>IIRC, it is like it is because reading it may have side effects (and >>thus isn't really just reading). If it honored the permissions then >>it should have mode 600 so that normal users can't run pciconf -l :-]. > > > Now if we were reading the pci registers with -r, then yes, but -l just > copys the data from pci_devinfo. If we wanted to make -r readable, we'd > have to clamp the registers passed in, and make sure that all platforms > didn't trap on PCI register reads (a patch for sparc should be going in > soon). > It sounds like a reasonable idea to me. Yes, actually reading the PCI config register space from userland is generally not something that an unpriviledged user should be allowed to do because of the side effects that others have mentioned. As long as 'pciconf -l' doesn't present an information security hole or DOS opportunity, it sounds like a good idea. 
Scott From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 11:41:33 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A366737B401; Mon, 16 Jun 2003 11:41:33 -0700 (PDT) Received: from mail.cyberonic.com (mail.cyberonic.com [4.17.179.4]) by mx1.FreeBSD.org (Postfix) with ESMTP id 842DF43F75; Mon, 16 Jun 2003 11:41:32 -0700 (PDT) (envelope-from jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (node-40244c0a.sfo.onnet.us.uu.net [64.36.76.10]) by mail.cyberonic.com (8.12.8/8.12.5) with ESMTP id h5GJ77Mo015419; Mon, 16 Jun 2003 15:07:07 -0400 Received: (from jmg@localhost) by hydrogen.funkthat.com (8.12.9/8.11.6) id h5GIg4gI011116; Mon, 16 Jun 2003 11:42:04 -0700 (PDT) (envelope-from jmg) Date: Mon, 16 Jun 2003 11:42:04 -0700 From: John-Mark Gurney To: Robert Watson Message-ID: <20030616184204.GL73854@funkthat.com> Mail-Followup-To: Robert Watson , arch@freebsd.org References: <20030616074122.GF73854@funkthat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.1i X-Operating-System: FreeBSD 4.2-RELEASE i386 X-PGP-Fingerprint: B7 EC EF F8 AE ED A7 31 96 7A 22 B3 D8 56 36 F4 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html cc: arch@freebsd.org Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list Reply-To: John-Mark Gurney List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 18:41:33 -0000 Robert Watson wrote this message on Mon, Jun 16, 2003 at 13:54 -0400: > > On Mon, 16 Jun 2003, John-Mark Gurney wrote: > > > Does anyone have an objection to making /dev/pci really honor the > > permissions, and giving normal 
users (or just group wheel) premission to > > run pciconf -l. Right now the code requires the write bit set for any > > operation. > > I seem to recall that there was a problem wherein user processes could > cause cause unaligned accesses using /dev/pci. There's also some rather again, I just proposed -l, not -r to become user readable. I know that -r has problems. I've crashed the sparc box a number of times by specifing pciconf -r pci1:5:0 0x0:0xf. > odd use of useracc(), printf(), etc, in the ioctl code. I suspect this well, do you mean odd use of printf as in providing diagnostics to catch mismatched userland/kernel? for useracc, it checks to make sure that various pointers passed to it are either readable or writable. I don't see this as odd. Or is there another better method of checking user data when accessing user space buffers? other than a minor bug that could hit if there was more pci_devinfo's in the list than pci_numdevs (which should never happen, but will prevent a NULL deref), I didn't see anything wrong with -l. > code needs some fairly thorough review and cleanup before we should reduce > the level of privilege required to use the device (note that we make it > world readable by default, so changes in the semantics of read permissions > will affect all users in the system). Could you do that cleanup in the > first pass, then revisit the permissions change? sure, no problem. -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." 
From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 12:49:09 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 55E9337B401; Mon, 16 Jun 2003 12:49:09 -0700 (PDT) Received: from mail.cyberonic.com (mail.cyberonic.com [4.17.179.4]) by mx1.FreeBSD.org (Postfix) with ESMTP id 1F0E543F93; Mon, 16 Jun 2003 12:49:08 -0700 (PDT) (envelope-from jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (node-40244c0a.sfo.onnet.us.uu.net [64.36.76.10]) by mail.cyberonic.com (8.12.8/8.12.5) with ESMTP id h5GKEhMo026526; Mon, 16 Jun 2003 16:14:44 -0400 Received: (from jmg@localhost) by hydrogen.funkthat.com (8.12.9/8.11.6) id h5GJnZg0012436; Mon, 16 Jun 2003 12:49:35 -0700 (PDT) (envelope-from jmg) Date: Mon, 16 Jun 2003 12:49:35 -0700 From: John-Mark Gurney To: Robert Watson Message-ID: <20030616194935.GR73854@funkthat.com> Mail-Followup-To: Robert Watson , arch@freebsd.org References: <20030616074122.GF73854@funkthat.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="FL5UXtIhxfXey3p5" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.1i X-Operating-System: FreeBSD 4.2-RELEASE i386 X-PGP-Fingerprint: B7 EC EF F8 AE ED A7 31 96 7A 22 B3 D8 56 36 F4 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html cc: arch@freebsd.org Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list Reply-To: John-Mark Gurney List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 19:49:09 -0000 --FL5UXtIhxfXey3p5 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Robert Watson wrote this message on Mon, Jun 16, 2003 at 13:54 -0400: > On Mon, 16 Jun 2003, John-Mark Gurney wrote: > will 
affect all users in the system).  Could you do that cleanup in the
> first pass, then revisit the permissions change?

ok, I've taken a look at it, and I don't see any major problems with it
besides what I mentioned earlier.  I've attached a patch that lets a
reader without write access do only pciconf -l (query pci devices we know
about), and that also enforces the register to be in valid bounds and
makes sure it's aligned.  I think it's safest to do it here.  Should I
copy the restrictions on read into the write block (actually, take them
out of the case so they can be shared)?

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."

--FL5UXtIhxfXey3p5
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="pci_user.patch"

Index: pci_user.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/pci/pci_user.c,v
retrieving revision 1.9
diff -u -r1.9 pci_user.c
--- pci_user.c	2003/03/03 12:15:44	1.9
+++ pci_user.c	2003/06/16 19:44:39
@@ -176,7 +176,7 @@
 	const char *name;
 	int error;
-	if (!(flag & FWRITE))
+	if (!(flag & FWRITE) && cmd != PCIOCGETCONF)
 		return EPERM;
@@ -342,7 +342,7 @@
 		for (cio->num_matches = 0, error = 0, i = 0,
 		     dinfo = STAILQ_FIRST(devlist_head);
 		     (dinfo != NULL) && (cio->num_matches < ionum)
-		     && (error == 0) && (i < pci_numdevs);
+		     && (error == 0) && (i < pci_numdevs) && (dinfo != NULL);
 		     dinfo = STAILQ_NEXT(dinfo, pci_links), i++) {
 			if (i < cio->offset)
@@ -412,7 +412,10 @@
 		}
 	case PCIOCREAD:
 		io = (struct pci_io *)data;
-		switch(io->pi_width) {
+		if (io->pi_reg < 0 || io->pi_reg + io->pi_width > PCI_REGMAX ||
+		    io->pi_reg & (io->pi_width - 1))
+			error = EINVAL;
+		else switch(io->pi_width) {
 		case 4:
 		case 2:
 		case 1:
@@ -439,7 +442,7 @@
 			}
 			break;
 		default:
-			error = ENODEV;
+			error = EINVAL;
 			break;
 		}
 		break;

--FL5UXtIhxfXey3p5--
From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 13:14:26 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from
mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id D017937B401 for ; Mon, 16 Jun 2003 13:14:26 -0700 (PDT) Received: from fledge.watson.org (fledge.watson.org [204.156.12.50]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2B35143F85 for ; Mon, 16 Jun 2003 13:14:26 -0700 (PDT) (envelope-from robert@fledge.watson.org) Received: from fledge.watson.org (localhost [127.0.0.1]) by fledge.watson.org (8.12.9/8.12.9) with ESMTP id h5GKCfYA009788; Mon, 16 Jun 2003 16:12:41 -0400 (EDT) (envelope-from robert@fledge.watson.org) Received: from localhost (robert@localhost)h5GKCf8d009785; Mon, 16 Jun 2003 16:12:41 -0400 (EDT) (envelope-from robert@fledge.watson.org) Date: Mon, 16 Jun 2003 16:12:41 -0400 (EDT) From: Robert Watson X-Sender: robert@fledge.watson.org To: John-Mark Gurney In-Reply-To: <20030616184204.GL73854@funkthat.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: arch@freebsd.org Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 20:14:27 -0000 On Mon, 16 Jun 2003, John-Mark Gurney wrote: > Robert Watson wrote this message on Mon, Jun 16, 2003 at 13:54 -0400: > > > > On Mon, 16 Jun 2003, John-Mark Gurney wrote: > > > > > Does anyone have an objection to making /dev/pci really honor the > > > permissions, and giving normal users (or just group wheel) premission to > > > run pciconf -l. Right now the code requires the write bit set for any > > > operation. > > > > I seem to recall that there was a problem wherein user processes could > > cause cause unaligned accesses using /dev/pci. There's also some rather > > again, I just proposed -l, not -r to become user readable. I know that > -r has problems. 
I've crashed the sparc box a number of times by
> specifying pciconf -r pci1:5:0 0x0:0xf.

I guess I found the question unclear -- you didn't quite specify what
the change you were making was...

> > odd use of useracc(), printf(), etc, in the ioctl code.  I suspect this
>
> well, do you mean odd use of printf as in providing diagnostics to catch
> mismatched userland/kernel?

Spamming the console with messages can have debilitating effects on the
operation of the system if performed by unprivileged users, e.g., if
you're using a serial console.

> for useracc, it checks to make sure that various pointers passed to it
> are either readable or writable.  I don't see this as odd.  Or is there
> another better method of checking user data when accessing user space
> buffers?

Generally, calls to useracc() are redundant with the existing checks in
our copyin/out routines, or are signs that the proper routines aren't
being used.  All of the fuword/suword/copyin/copyout/uio routines
already perform any necessary checks; manual checking is race-prone in a
multi-threaded SMP environment regardless.  The (cio==NULL) test is also
redundant.  It looks like (although I haven't tried) user processes can
also cause the kernel to allocate unlimited amounts of kernel memory,
which is another bit we probably need to tighten down.  One of the
problems with exposing this sort of code to unprivileged consumers is
that frequently (unfortunately) the code is not robust against a
malicious consumer.

> other than a minor bug that could hit if there was more pci_devinfo's in
> the list than pci_numdevs (which should never happen, but will prevent a
> NULL deref), I didn't see anything wrong with -l.

Allowing arbitrary users to panic the system is bad :-).
> > code needs some fairly thorough review and cleanup before we should reduce > > the level of privilege required to use the device (note that we make it > > world readable by default, so changes in the semantics of read permissions > > > will affect all users in the system). Could you do that cleanup in the > > first pass, then revisit the permissions change? > > sure, no problem. Great. I think that there's nothing with the idea of loosening up the restrictions here; one question I do have to wonder about in the long term is whether this is information that should be exported using one of the existing device general-purpose device configuration monitoring interfaces, such as the sysctls used to support devinfo(8) in some more generic form. Robert N M Watson FreeBSD Core Team, TrustedBSD Projects robert@fledge.watson.org Network Associates Laboratories From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 13:20:34 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 4076737B404 for ; Mon, 16 Jun 2003 13:20:34 -0700 (PDT) Received: from fledge.watson.org (fledge.watson.org [204.156.12.50]) by mx1.FreeBSD.org (Postfix) with ESMTP id 681E343F75 for ; Mon, 16 Jun 2003 13:20:33 -0700 (PDT) (envelope-from robert@fledge.watson.org) Received: from fledge.watson.org (localhost [127.0.0.1]) by fledge.watson.org (8.12.9/8.12.9) with ESMTP id h5GKInYA009818; Mon, 16 Jun 2003 16:18:49 -0400 (EDT) (envelope-from robert@fledge.watson.org) Received: from localhost (robert@localhost)h5GKInTu009815; Mon, 16 Jun 2003 16:18:49 -0400 (EDT) (envelope-from robert@fledge.watson.org) Date: Mon, 16 Jun 2003 16:18:49 -0400 (EDT) From: Robert Watson X-Sender: robert@fledge.watson.org To: John-Mark Gurney In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: arch@freebsd.org Subject: Re: make /dev/pci really readable X-BeenThere: 
freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 20:20:34 -0000 On Mon, 16 Jun 2003, Robert Watson wrote: > Great. I think that there's nothing with the idea of loosening up the ^ wrong Robert N M Watson FreeBSD Core Team, TrustedBSD Projects robert@fledge.watson.org Network Associates Laboratories From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 13:43:15 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id BFBD537B408; Mon, 16 Jun 2003 13:43:12 -0700 (PDT) Received: from magic.adaptec.com (magic-mail.adaptec.com [208.236.45.100]) by mx1.FreeBSD.org (Postfix) with ESMTP id 0F71343F75; Mon, 16 Jun 2003 13:43:12 -0700 (PDT) (envelope-from scottl@freebsd.org) Received: from redfish.adaptec.com (redfish.adaptec.com [162.62.50.11]) by magic.adaptec.com (8.11.6/8.11.6) with ESMTP id h5GKh1807861; Mon, 16 Jun 2003 13:43:01 -0700 Received: from freebsd.org (hollin.btc.adaptec.com [10.100.253.56]) by redfish.adaptec.com (8.8.8p2+Sun/8.8.8) with ESMTP id NAA16821; Mon, 16 Jun 2003 13:43:11 -0700 (PDT) Message-ID: <3EEE2B31.4020406@freebsd.org> Date: Mon, 16 Jun 2003 14:40:17 -0600 From: Scott Long User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.3) Gecko/20030414 X-Accept-Language: en-us, en MIME-Version: 1.0 To: John-Mark Gurney References: <20030616074122.GF73854@funkthat.com> <20030616194935.GR73854@funkthat.com> In-Reply-To: <20030616194935.GR73854@funkthat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit cc: arch@freebsd.org cc: Robert Watson Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture 
List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 20:43:16 -0000 You should not always assume that reading PCI registers has no side-effects. It is certainly legal and possible for a PCI device to detect the read request and alter the contents of the register (or some other register) as a side effect, or change an internal state machine. 'Fixing' the various bits to allow unpriviledged access to 'pciconf -r' is dangerous since you would have to teach the system about every pci device in existance and how to trap on registers that have side-effects. I see little reason why unpriviledged users should be given register-level access to anything. We don't let them read /dev/mem, do we? Fixing 'pciconf -l' is fine, but it really doesn't need to extend beyond that. I would consider 'pciconf -r' to be a security risk and would treat it as such when it comes time for a release. Scott John-Mark Gurney wrote: > Robert Watson wrote this message on Mon, Jun 16, 2003 at 13:54 -0400: > >>On Mon, 16 Jun 2003, John-Mark Gurney wrote: >>will affect all users in the system). Could you do that cleanup in the >>first pass, then revisit the permissions change? > > > ok, I've taken a look at it, and I don't see any major problems with it > besides what I mentioned earlier. I've attached a patch that only lets > you do pciconf -l (query pci devices we know about), but it also enforces > the register to be in valid bounds and also makes sure it's aligned. I > think it's safest to do it here. Should I copy the restrictions on > read into the write block (actually, take them out of the case so they > can be shared)? 
>
>
>
> ------------------------------------------------------------------------
>
> Index: pci_user.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/dev/pci/pci_user.c,v
> retrieving revision 1.9
> diff -u -r1.9 pci_user.c
> --- pci_user.c 2003/03/03 12:15:44 1.9
> +++ pci_user.c 2003/06/16 19:44:39
> @@ -176,7 +176,7 @@
>      const char *name;
>      int error;
>
> -    if (!(flag & FWRITE))
> +    if (!(flag & FWRITE) && cmd != PCIOCGETCONF)
>          return EPERM;
>
>
> @@ -342,7 +342,7 @@
>          for (cio->num_matches = 0, error = 0, i = 0,
>               dinfo = STAILQ_FIRST(devlist_head);
>               (dinfo != NULL) && (cio->num_matches < ionum)
> -             && (error == 0) && (i < pci_numdevs);
> +             && (error == 0) && (i < pci_numdevs) && (dinfo != NULL);
>               dinfo = STAILQ_NEXT(dinfo, pci_links), i++) {
>
>              if (i < cio->offset)
> @@ -412,7 +412,10 @@
>          }
>      case PCIOCREAD:
>          io = (struct pci_io *)data;
> -        switch(io->pi_width) {
> +        if (io->pi_reg < 0 || io->pi_reg + io->pi_width > PCI_REGMAX ||
> +            io->pi_reg & (io->pi_width - 1))
> +            error = EINVAL;
> +        else switch(io->pi_width) {
>          case 4:
>          case 2:
>          case 1:
> @@ -439,7 +442,7 @@
>              }
>              break;
>          default:
> -            error = ENODEV;
> +            error = EINVAL;
>              break;
>          }
>          break;
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> freebsd-arch@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-arch
> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"
From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 14:28:13 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 2F85937B401; Mon, 16 Jun 2003 14:28:13 -0700 (PDT) Received: from mail.cyberonic.com (mail.cyberonic.com [4.17.179.4]) by mx1.FreeBSD.org (Postfix) with ESMTP id 33C2043F3F; Mon, 16 Jun 2003 14:28:12 -0700 (PDT) (envelope-from 
jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (node-40244c0a.sfo.onnet.us.uu.net [64.36.76.10]) by mail.cyberonic.com (8.12.8/8.12.5) with ESMTP id h5GLrnMo011094; Mon, 16 Jun 2003 17:53:49 -0400 Received: (from jmg@localhost) by hydrogen.funkthat.com (8.12.9/8.11.6) id h5GLSiFi014076; Mon, 16 Jun 2003 14:28:44 -0700 (PDT) (envelope-from jmg) Date: Mon, 16 Jun 2003 14:28:44 -0700 From: John-Mark Gurney To: Robert Watson Message-ID: <20030616212844.GU73854@funkthat.com> Mail-Followup-To: Robert Watson , arch@freebsd.org References: <20030616184204.GL73854@funkthat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.1i X-Operating-System: FreeBSD 4.2-RELEASE i386 X-PGP-Fingerprint: B7 EC EF F8 AE ED A7 31 96 7A 22 B3 D8 56 36 F4 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html cc: arch@freebsd.org Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list Reply-To: John-Mark Gurney List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 21:28:13 -0000 Robert Watson wrote this message on Mon, Jun 16, 2003 at 16:12 -0400: > I guess I found the question unclear -- you didn't quite specify what the > change you were making was... Yes, sorry, that is true. (and originally I was thinking of making -r publicly available too.) > > > odd use of useracc(), printf(), etc, in the ioctl code. I suspect this > > > > well, do you mean odd use of printf as in providing diagnostics to catch > > mismatched userland/kernel? > > Spamming the console with messages can have debilitating effects on the > operation of the system if performed by unprivileged users... I.e., if > you're using a serial console. 
Very true, I forget that there is always the DoS attack on logs too. Ok, I'll look at either adding return values or just passing EINVAL back. Also, since pciconf is currently the only consumer, most of the checks really are only for debugging purposes. > > for useracc, it checks to make sure that various pointers passed to it > > are either readable or writable. I don't see this as odd. Or is there > > another better method of checking user data when accessing user space > > buffers? > > Generally, calls to useracc() are redundant with the existing checks in > our copyin/out routines, or are signs that the proper routines aren't > being used. All of the fuword/suword/copyin/copyout/uio routines already > perform any necessary checks; manual checking is race-prone in a > multi-threaded smp environment regardless. The (cio==NULL) test is also Ahh, k, I did not know this, and nothing like this is mentioned in the manpage. I'll remove the useracc calls since they are unnecessary. > redundant. It looks like (although I haven't tried), user processes can > also cause the kernel to allocate unlimited amounts of kernel memory, > which is another bit we probably need to tighten down. Hmmm. I'll take a look at this. I did think the cio == NULL test was redundant, but "safe".. :) > > other than a minor bug that could hit if there was more pci_devinfo's in > > the list than pci_numdevs (which should never happen, but will prevent a > > NULL deref), I didn't see anything wrong with -l. > > Allowing arbitrary users to panic the system is bad :-). > > > > code needs some fairly thorough review and cleanup before we should reduce > > > the level of privilege required to use the device (note that we make it > > > world readable by default, so changes in the semantics of read permissions > > > > > will affect all users in the system). Could you do that cleanup in the > > > first pass, then revisit the permissions change? > > > > sure, no problem. > > Great. 
I think that there's nothing with the idea of loosening up the > restrictions here; one question I do have to wonder about in the long term > is whether this is information that should be exported using one of the > existing device general-purpose device configuration monitoring > interfaces, such as the sysctls used to support devinfo(8) in some more > generic form. Well, if this information can/will be exported via another mechanism, we might as well remove this interface. devinfo can support all the information that pciconf -l exports right now. It just hasn't been taught about /usr/share/misc/pci_vendors yet. I don't know of another interface that would let us read/write pci config registers safely, though. -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 14:32:41 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 0648F37B401; Mon, 16 Jun 2003 14:32:41 -0700 (PDT) Received: from mail.cyberonic.com (mail.cyberonic.com [4.17.179.4]) by mx1.FreeBSD.org (Postfix) with ESMTP id 1313843FB1; Mon, 16 Jun 2003 14:32:40 -0700 (PDT) (envelope-from jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (node-40244c0a.sfo.onnet.us.uu.net [64.36.76.10]) by mail.cyberonic.com (8.12.8/8.12.5) with ESMTP id h5GLwHMo012012; Mon, 16 Jun 2003 17:58:17 -0400 Received: (from jmg@localhost) by hydrogen.funkthat.com (8.12.9/8.11.6) id h5GLXCCO014173; Mon, 16 Jun 2003 14:33:12 -0700 (PDT) (envelope-from jmg) Date: Mon, 16 Jun 2003 14:33:12 -0700 From: John-Mark Gurney To: Scott Long Message-ID: <20030616213312.GV73854@funkthat.com> Mail-Followup-To: Scott Long , arch@freebsd.org, Robert Watson References: <20030616074122.GF73854@funkthat.com> <20030616194935.GR73854@funkthat.com> <3EEE2B31.4020406@freebsd.org> Mime-Version: 1.0 Content-Type: text/plain; 
charset=us-ascii Content-Disposition: inline In-Reply-To: <3EEE2B31.4020406@freebsd.org> User-Agent: Mutt/1.4.1i X-Operating-System: FreeBSD 4.2-RELEASE i386 X-PGP-Fingerprint: B7 EC EF F8 AE ED A7 31 96 7A 22 B3 D8 56 36 F4 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html cc: arch@freebsd.org cc: Robert Watson Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list Reply-To: John-Mark Gurney List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 21:32:41 -0000 Scott Long wrote this message on Mon, Jun 16, 2003 at 14:40 -0600: > You should not always assume that reading PCI registers has no > side-effects. It is certainly legal and possible for a PCI device to > detect the read request and alter the contents of the register (or some > other register) as a side effect, or change an internal state machine. > 'Fixing' the various bits to allow unpriviledged access to 'pciconf -r' > is dangerous since you would have to teach the system about every pci > device in existance and how to trap on registers that have side-effects. hmmm. are you sure about this? wouldn't it mean that by simply probing for a device you could end up locking up the system? > I see little reason why unpriviledged users should be given > register-level access to anything. We don't let them read /dev/mem, do > we? Fixing 'pciconf -l' is fine, but it really doesn't need to extend > beyond that. I would consider 'pciconf -r' to be a security risk and > would treat it as such when it comes time for a release. My only idea was for developers working on pci drivers. 
It was invaluable to be able to read the registers when debugging the sparc64 pci stuff and writing my zoran driver, but I didn't want to have to become root every time I wanted to look at this. The only problem is that this requires three levels of permission: list, read, and write. Changing it to support three (like overriding write to mean read, etc.) goes too much against POLA, so I abandoned it. -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 14:54:46 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 1FD5F37B401; Mon, 16 Jun 2003 14:54:46 -0700 (PDT) Received: from magic.adaptec.com (magic-mail.adaptec.com [208.236.45.100]) by mx1.FreeBSD.org (Postfix) with ESMTP id 7BCEA43FB1; Mon, 16 Jun 2003 14:54:43 -0700 (PDT) (envelope-from scottl@freebsd.org) Received: from redfish.adaptec.com (redfish.adaptec.com [162.62.50.11]) by magic.adaptec.com (8.11.6/8.11.6) with ESMTP id h5GLsW810554; Mon, 16 Jun 2003 14:54:32 -0700 Received: from freebsd.org (hollin.btc.adaptec.com [10.100.253.56]) by redfish.adaptec.com (8.8.8p2+Sun/8.8.8) with ESMTP id OAA07414; Mon, 16 Jun 2003 14:54:41 -0700 (PDT) Message-ID: <3EEE3BF2.3020809@freebsd.org> Date: Mon, 16 Jun 2003 15:51:46 -0600 From: Scott Long User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.3) Gecko/20030414 X-Accept-Language: en-us, en MIME-Version: 1.0 To: John-Mark Gurney References: <20030616074122.GF73854@funkthat.com> <20030616194935.GR73854@funkthat.com> <3EEE2B31.4020406@freebsd.org> <20030616213312.GV73854@funkthat.com> In-Reply-To: <20030616213312.GV73854@funkthat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit cc: arch@freebsd.org cc: Robert Watson Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org 
X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 21:54:46 -0000 John-Mark Gurney wrote: > Scott Long wrote this message on Mon, Jun 16, 2003 at 14:40 -0600: > >>You should not always assume that reading PCI registers has no >>side-effects. It is certainly legal and possible for a PCI device to >>detect the read request and alter the contents of the register (or some >>other register) as a side effect, or change an internal state machine. >>'Fixing' the various bits to allow unpriviledged access to 'pciconf -r' >>is dangerous since you would have to teach the system about every pci >>device in existance and how to trap on registers that have side-effects. > > > hmmm. are you sure about this? wouldn't it mean that by simply probing > for a device you could end up locking up the system? > The first 64 bytes in the space is likely safe, from bytes 65-255 it is entirely vendor specific. > >>I see little reason why unpriviledged users should be given >>register-level access to anything. We don't let them read /dev/mem, do >>we? Fixing 'pciconf -l' is fine, but it really doesn't need to extend >>beyond that. I would consider 'pciconf -r' to be a security risk and >>would treat it as such when it comes time for a release. > > > My only idea was for developers working on pci drivers. It was > invaluable to be able to read the registers when debuging the sparc64 > pci stuff and writing my zoran driver, but I didn't want to have to > become root every time I wanted to look at this. The only problem is > that this requires three levels of permission, list, read, and write.. > changing it to support three is too much against (like overriding write > to mean read, etc) POLA, so I abandonded it. > I'll not argue your development practices. 
However, I don't see it as unreasonable to ask that driver writers who are going to need root access to do their work anyway (modifying files, compiling kernels and/or loading modules) also use root to access the pci registers from userland. I seem to remember Linux having a similar feature a few years ago and naive sysadmins getting into serious trouble by pointing their tape backups at the /proc/pci directory. Scott From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 14:59:52 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id D6CBB37B401; Mon, 16 Jun 2003 14:59:52 -0700 (PDT) Received: from mail.cyberonic.com (mail.cyberonic.com [4.17.179.4]) by mx1.FreeBSD.org (Postfix) with ESMTP id D2E4B43F93; Mon, 16 Jun 2003 14:59:49 -0700 (PDT) (envelope-from jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (node-40244c0a.sfo.onnet.us.uu.net [64.36.76.10]) by mail.cyberonic.com (8.12.8/8.12.5) with ESMTP id h5GMPRMo016160; Mon, 16 Jun 2003 18:25:27 -0400 Received: (from jmg@localhost) by hydrogen.funkthat.com (8.12.9/8.11.6) id h5GM0Mjx014626; Mon, 16 Jun 2003 15:00:22 -0700 (PDT) (envelope-from jmg) Date: Mon, 16 Jun 2003 15:00:22 -0700 From: John-Mark Gurney To: Scott Long Message-ID: <20030616220022.GW73854@funkthat.com> Mail-Followup-To: Scott Long , arch@freebsd.org, Robert Watson References: <20030616074122.GF73854@funkthat.com> <20030616194935.GR73854@funkthat.com> <3EEE2B31.4020406@freebsd.org> <20030616213312.GV73854@funkthat.com> <3EEE3BF2.3020809@freebsd.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3EEE3BF2.3020809@freebsd.org> User-Agent: Mutt/1.4.1i X-Operating-System: FreeBSD 4.2-RELEASE i386 X-PGP-Fingerprint: B7 EC EF F8 AE ED A7 31 96 7A 22 B3 D8 56 36 F4 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: 
http://resnet.uoregon.edu/~gurney_j/resume.html cc: arch@freebsd.org cc: Robert Watson Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list Reply-To: John-Mark Gurney List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 21:59:53 -0000 Scott Long wrote this message on Mon, Jun 16, 2003 at 15:51 -0600: > John-Mark Gurney wrote: > >hmmm. are you sure about this? wouldn't it mean that by simply probing > >for a device you could end up locking up the system? > > > > The first 64 bytes in the space is likely safe, from bytes 65-255 it is > entirely vendor specific. ok, agreed.. Dare I ask that we let normal users read the first 64 bytes? and require write permissions to read above 64? :) just kidding.. > I'll not argue your development practices. However, I don't see it as > unreasonable to ask that driver writers who are going to need root > access to do their work anyways (modifying files, compiling kernels > and/or loading modules) also use root to access the pci registers from > userland. everything up to the loading modules I like to do as a normal user.. > I seem to remember Linux having a similar feature a few years ago and > naive sysadmins getting into serious trouble by pointing their tape > backups at the /proc/pci directory. :) well, lucky for us /dev/pci doesn't have a write interface.. :) -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." 
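A minimal sketch of the bounds and alignment checks discussed in this thread, written as plain userland C rather than as the actual pci_user.c code. The 256-byte configuration space and the 64-byte standard header are properties of conventional PCI; the names cfg_access_valid, cfg_access_in_header, CFG_SPACE_SIZE and CFG_HEADER_SIZE are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>

/* Conventional PCI configuration space is 256 bytes per function;
 * the first 64 bytes hold the standardized header. */
#define CFG_SPACE_SIZE  256
#define CFG_HEADER_SIZE 64

/*
 * Hypothetical helper: true if an access of `width` bytes at offset `reg`
 * is 1, 2 or 4 bytes wide, lies inside configuration space, and is
 * naturally aligned.
 */
bool
cfg_access_valid(int reg, int width)
{
	if (width != 1 && width != 2 && width != 4)
		return false;
	if (reg < 0 || reg > CFG_SPACE_SIZE - width)
		return false;
	/* width is a power of two, so this tests natural alignment */
	return (reg & (width - 1)) == 0;
}

/*
 * Hypothetical policy check: true only if the access is valid and stays
 * inside the standard 64-byte header, the region described in the thread
 * as likely safe to read.
 */
bool
cfg_access_in_header(int reg, int width)
{
	return cfg_access_valid(reg, width) && reg + width <= CFG_HEADER_SIZE;
}
```

With these definitions a 4-byte access at offset 252 is accepted, a 4-byte access at offset 254 or at offset 2 is rejected, and only accesses ending at or before offset 64 pass the header-only policy.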
From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 16:43:47 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 634F437B404; Mon, 16 Jun 2003 16:43:47 -0700 (PDT) Received: from www.ambrisko.com (adsl-64-174-51-42.dsl.snfc21.pacbell.net [64.174.51.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5D43743FBD; Mon, 16 Jun 2003 16:43:46 -0700 (PDT) (envelope-from ambrisko@www.ambrisko.com) Received: from www.ambrisko.com (localhost [127.0.0.1]) by www.ambrisko.com (8.12.8p1/8.12.8) with ESMTP id h5GNhdO7091212; Mon, 16 Jun 2003 16:43:39 -0700 (PDT) (envelope-from ambrisko@www.ambrisko.com) Received: (from ambrisko@localhost) by www.ambrisko.com (8.12.8p1/8.12.8/Submit) id h5GNhdnl091211; Mon, 16 Jun 2003 16:43:39 -0700 (PDT) (envelope-from ambrisko) From: Doug Ambrisko Message-Id: <200306162343.h5GNhdnl091211@www.ambrisko.com> In-Reply-To: <3EEE2B31.4020406@freebsd.org> To: Scott Long Date: Mon, 16 Jun 2003 16:43:39 -0700 (PDT) X-Mailer: ELM [version 2.4ME+ PL94b (25)] MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=US-ASCII cc: arch@freebsd.org cc: John-Mark Gurney Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 23:43:47 -0000 Scott Long writes: | You should not always assume that reading PCI registers has no | side-effects. It is certainly legal and possible for a PCI device to | detect the read request and alter the contents of the register (or some | other register) as a side effect, or change an internal state machine. 
| 'Fixing' the various bits to allow unpriviledged access to 'pciconf -r' | is dangerous since you would have to teach the system about every pci | device in existance and how to trap on registers that have side-effects. I seem to recall reading some PCI chip spec. for a chip I was working on that did a reset on read of that register. I can't recall which or where so don't take this as fact but a distant memory. Doug A. From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 16:57:54 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 77D4937B401 for ; Mon, 16 Jun 2003 16:57:54 -0700 (PDT) Received: from www.ambrisko.com (adsl-64-174-51-42.dsl.snfc21.pacbell.net [64.174.51.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id D237943FAF for ; Mon, 16 Jun 2003 16:57:53 -0700 (PDT) (envelope-from ambrisko@www.ambrisko.com) Received: from www.ambrisko.com (localhost [127.0.0.1]) by www.ambrisko.com (8.12.8p1/8.12.8) with ESMTP id h5GNvrO7091992 for ; Mon, 16 Jun 2003 16:57:53 -0700 (PDT) (envelope-from ambrisko@www.ambrisko.com) Received: (from ambrisko@localhost) by www.ambrisko.com (8.12.8p1/8.12.8/Submit) id h5GNvr2O091991 for arch@freebsd.org; Mon, 16 Jun 2003 16:57:53 -0700 (PDT) (envelope-from ambrisko) From: Doug Ambrisko Message-Id: <200306162357.h5GNvr2O091991@www.ambrisko.com> In-Reply-To: <200306162343.h5GNhdnl091211@www.ambrisko.com> To: arch@freebsd.org Date: Mon, 16 Jun 2003 16:57:52 -0700 (PDT) X-Mailer: ELM [version 2.4ME+ PL94b (25)] MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=US-ASCII Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Jun 2003 23:57:54 -0000 Doug Ambrisko writes: | Scott 
Long writes: | | You should not always assume that reading PCI registers has no | | side-effects. It is certainly legal and possible for a PCI device to | | detect the read request and alter the contents of the register (or some | | other register) as a side effect, or change an internal state machine. | | 'Fixing' the various bits to allow unpriviledged access to 'pciconf -r' | | is dangerous since you would have to teach the system about every pci | | device in existance and how to trap on registers that have side-effects. | | I seem to recall reading some PCI chip spec. for a chip I was working on | that did a reset on read of that register. I can't recall which or where | so don't take this as fact but a distant memory. I meant to add this but didn't ... If the register could get cleared then the device driver could get hosed and that would be bad. This is what I was thinking about. Doug A. From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 18:07:25 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id ED0A937B401; Mon, 16 Jun 2003 18:07:25 -0700 (PDT) Received: from c104-254.bas1.prp.dublin.eircom.net (c104-254.bas1.prp.dublin.eircom.net [159.134.104.254]) by mx1.FreeBSD.org (Postfix) with SMTP id 65AE243FA3; Mon, 16 Jun 2003 18:07:24 -0700 (PDT) (envelope-from iedowse@maths.tcd.ie) To: Bruce Evans In-Reply-To: Your message of "Mon, 16 Jun 2003 21:13:13 +1000." 
<20030616205631.F28116@gamplex.bde.org> Date: Tue, 17 Jun 2003 01:59:34 +0100 From: Ian Dowse Message-ID: <200306170159.aa26127@salmon.maths.tcd.ie> cc: Don Lewis cc: iedowse@maths.tcd.ie cc: freebsd-arch@FreeBSD.org Subject: Re: Message buffer and printf reentrancy patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 01:07:26 -0000 In message <20030616205631.F28116@gamplex.bde.org>, Bruce Evans writes: >On Mon, 16 Jun 2003, Don Lewis wrote: >> It looks like MSGBUF_SEQNORM() could avoid the conditional code and any >> questions about signed remainders if it was defined like this: >> >> #define MSGBUF_SEQNORM(mbp, seq) (((seq) + (mbp)->msg_seqmod) % \ >> (mbp)->msg_seqmod) >> >> as long as msg_seqmod < INT_MAX/2. MSGBUF_SEQNORM() could be simplified >> further if msg_seqmod was added by the caller (such as MSGBUF_SEQSUB()) >> if the argument could be negative. > >Yes. The negative numbers of interest seem to be limited to at most >differences of sequence numbers (or maybe differeces of indexes, which >are smaller), so they are larger than -msg_seqmod. MSGBUF_SEQSUB() >shouldn't add the bias, however, since it is used in contexts where >we really want to see the negative values. The only minor problem I see with the above is that it is fragile with respect to arbitrary input sequence numbers, in that it could return a negative value. However, the property of guaranteeing to return a normalised sequence number can be achieved by forcing an unsigned division like in MSGBUF_SEQ_TO_POS, i.e.: #define MSGBUF_SEQNORM(mbp, seq) ((int)((u_int)((seq) + \ (mbp)->msg_seqmod) % (mbp)->msg_seqmod)) This should do the right thing for the expected ranges, but also ensures that the macro itself can never return an out-of-range sequence number, whatever the input value. 
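To make the difference between the two definitions concrete, here is a minimal sketch with the macro bodies turned into functions. The names seqnorm_signed and seqnorm_unsigned are hypothetical, the modulus is passed directly instead of being read from a msgbuf, and the unsigned variant does the addition in unsigned arithmetic as well, which is an assumption made here to keep the sketch free of signed overflow.

```c
/*
 * Signed variant: correct as long as seq + mod is non-negative,
 * i.e. seq > -mod. In C99 the remainder takes the sign of the
 * dividend, so a more negative seq yields a negative result.
 */
int
seqnorm_signed(int seq, int mod)
{
	return (seq + mod) % mod;
}

/*
 * Unsigned variant: the remainder is taken on an unsigned value, so the
 * result is always in [0, mod) whatever seq is. For seq outside the
 * expected range the result is in range but not necessarily congruent
 * to seq modulo mod, unless mod divides 2^N for an N-bit unsigned int.
 */
int
seqnorm_unsigned(int seq, int mod)
{
	return (int)(((unsigned int)seq + (unsigned int)mod) %
	    (unsigned int)mod);
}
```

For seq = -3 and mod = 16 both variants return 13. For seq = -25 the signed variant returns -9, while the unsigned variant returns a value inside the valid range.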
Ian From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 19:36:51 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id C3A1E37B401; Mon, 16 Jun 2003 19:36:51 -0700 (PDT) Received: from mailman.zeta.org.au (mailman.zeta.org.au [203.26.10.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6B89943FDD; Mon, 16 Jun 2003 19:36:50 -0700 (PDT) (envelope-from bde@zeta.org.au) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailman.zeta.org.au (8.9.3p2/8.8.7) with ESMTP id MAA09776; Tue, 17 Jun 2003 12:36:46 +1000 Date: Tue, 17 Jun 2003 12:36:46 +1000 (EST) From: Bruce Evans X-X-Sender: bde@gamplex.bde.org To: Robert Watson In-Reply-To: Message-ID: <20030617120956.N30677@gamplex.bde.org> References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: arch@freebsd.org cc: John-Mark Gurney Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 02:36:52 -0000 On Mon, 16 Jun 2003, Robert Watson wrote: > On Mon, 16 Jun 2003, John-Mark Gurney wrote: > > > Robert Watson wrote this message on Mon, Jun 16, 2003 at 13:54 -0400: > > > > > > On Mon, 16 Jun 2003, John-Mark Gurney wrote: > > > > > > > Does anyone have an objection to making /dev/pci really honor the > > > > permissions, and giving normal users (or just group wheel) premission to > > > > run pciconf -l. Right now the code requires the write bit set for any > > > > operation. > > > > > > I seem to recall that there was a problem wherein user processes could > > > cause cause unaligned accesses using /dev/pci. There's also some rather > > > > again, I just proposed -l, not -r to become user readable. I know that > > -r has problems. 
I've crashed the sparc box a number of times by > > specifing pciconf -r pci1:5:0 0x0:0xf. Yes, it seems that -l is fairly safe, and it wasn't the default just because no one wrote the small hack to change the permissions check for the -l case only. > > > odd use of useracc(), printf(), etc, in the ioctl code. I suspect this > > > > well, do you mean odd use of printf as in providing diagnostics to catch > > mismatched userland/kernel? > > Spamming the console with messages can have debilitating effects on the > operation of the system if performed by unprivileged users... I.e., if > you're using a serial console. I.e., if you are using a console. Spamming of syscons consoles messes them up too, and is more likely to cause panics since the syscons console output is less reentrant than sio console output. > > for useracc, it checks to make sure that various pointers passed to it > > are either readable or writable. I don't see this as odd. Or is there > > another better method of checking user data when accessing user space > > buffers? > > Generally, calls to useracc() are redundant with the existing checks in > our copyin/out routines, or are signs that the proper routines aren't > being used. useracc() can be used to "improve" error handling. That is done here. The "improvements" include dangerous printf()s and returning the undocumented errno EACCES instead of the incompletely documented one EFAULT. > All of the fuword/suword/copyin/copyout/uio routines already > perform any necessary checks; manual checking is race-prone in a > multi-threaded smp environment regardless. Please check the error handling carefully if you remove the useracc()'s. The one for writing the results back to userland gets replaced by checks in copyout() for each part of the results. I think we give up after the first copyout() error and return that error. That seems right. > The (cio==NULL) test is also > redundant. It is just bogus. 
cio is a kernel pointer and we need it to be more than non-NULL to access it. sys_generic.c:ioctl() guarantees this provided the definition of PCICGETCONF is correct. > It looks like (although I haven't tried), user processes can > also cause the kernel to allocate unlimited amounts of kernel memory, > which is another bit we probably need to tighten down. Much more serious. Bruce From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 21:10:43 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 5838837B401 for ; Mon, 16 Jun 2003 21:10:43 -0700 (PDT) Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 9CC1D43F85 for ; Mon, 16 Jun 2003 21:10:42 -0700 (PDT) (envelope-from truckman@FreeBSD.org) Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.12.9/8.12.9) with ESMTP id h5H4AXM7050537; Mon, 16 Jun 2003 21:10:38 -0700 (PDT) (envelope-from truckman@FreeBSD.org) Message-Id: <200306170410.h5H4AXM7050537@gw.catspoiler.org> Date: Mon, 16 Jun 2003 21:10:33 -0700 (PDT) From: Don Lewis To: iedowse@maths.tcd.ie In-Reply-To: <200306170159.aa26127@salmon.maths.tcd.ie> MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii cc: freebsd-arch@FreeBSD.org Subject: Re: Message buffer and printf reentrancy patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 04:10:43 -0000 On 17 Jun, Ian Dowse wrote: > In message <20030616205631.F28116@gamplex.bde.org>, Bruce Evans writes: >>On Mon, 16 Jun 2003, Don Lewis wrote: >>> It looks like MSGBUF_SEQNORM() could avoid the conditional code and any >>> questions about signed remainders if it was defined like this: >>> >>> #define 
MSGBUF_SEQNORM(mbp, seq) (((seq) + (mbp)->msg_seqmod) % \
>>>     (mbp)->msg_seqmod)
>>>
>>> as long as msg_seqmod < INT_MAX/2. MSGBUF_SEQNORM() could be simplified
>>> further if msg_seqmod was added by the caller (such as MSGBUF_SEQSUB())
>>> if the argument could be negative.
>>
>>Yes. The negative numbers of interest seem to be limited to at most
>>differences of sequence numbers (or maybe differences of indexes, which
>>are smaller), so they are larger than -msg_seqmod. MSGBUF_SEQSUB()
>>shouldn't add the bias, however, since it is used in contexts where
>>we really want to see the negative values.

Since MSGBUF_SEQSUB() calls MSGBUF_SEQNORM() on the difference between
the sequence numbers, a negative value will never be returned. If you
want a signed result, you'll probably want to do something more like:

    tmp = MSGBUF_SEQNORM(mbp, (seq1) - (seq2) + (mbp)->msg_seqmod);
    return ((tmp < (mbp)->msg_seqmod / 2) ? tmp :
        (tmp - (mbp)->msg_seqmod));

and you'll have to use a slightly different function if you are
comparing indexes.

> The only minor problem I see with the above is that it is fragile
> with respect to arbitrary input sequence numbers, in that it could
> return a negative value. However, the property of guaranteeing to
> return a normalised sequence number can be achieved by forcing an
> unsigned division like in MSGBUF_SEQ_TO_POS, i.e.:
>
> #define MSGBUF_SEQNORM(mbp, seq) ((int)((u_int)((seq) + \
>     (mbp)->msg_seqmod) % (mbp)->msg_seqmod))
>
> This should do the right thing for the expected ranges, but also
> ensures that the macro itself can never return an out-of-range
> sequence number, whatever the input value.

Wouldn't it be better to have assertions to detect obviously bogus
sequence numbers rather than using them to generate a valid pointer to a
random location in the message buffer?
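
The arithmetic discussed above can be tried out in userland. What follows
is only a minimal sketch, assuming a cut-down stand-in for struct msgbuf:
struct mb, seqnorm_biased() and seqsub_signed() are hypothetical names
chosen for illustration and are not part of the patch under discussion.
It shows the biased normalisation (with the kind of range assertion
suggested above) and the folding of a difference into a signed distance.

```c
#include <assert.h>

/* Hypothetical stand-in for struct msgbuf; only the modulus matters here. */
struct mb {
    int msg_seqmod;
};

/*
 * Biased normalisation: add the modulus before taking the remainder so
 * that a small negative input still yields a value in [0, msg_seqmod).
 * The assertion rejects inputs outside (-msg_seqmod, msg_seqmod), which
 * is the range a difference of two normalised sequence numbers can take.
 */
static int
seqnorm_biased(const struct mb *mbp, int seq)
{
    assert(seq > -mbp->msg_seqmod && seq < mbp->msg_seqmod);
    return ((seq + mbp->msg_seqmod) % mbp->msg_seqmod);
}

/*
 * Signed distance from seq2 to seq1: normalise the difference, then map
 * the upper half of the range onto negative values.
 */
static int
seqsub_signed(const struct mb *mbp, int seq1, int seq2)
{
    int tmp = seqnorm_biased(mbp, seq1 - seq2);

    return ((tmp < mbp->msg_seqmod / 2) ? tmp : tmp - mbp->msg_seqmod);
}
```

With a modulus of 64, for instance, sequence number 2 comes out four steps
ahead of 62 and 62 four steps behind 2. The assertion is one possible
answer to the question above; the unsigned-division variant trades it for
a guaranteed in-range result.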
From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 21:52:54 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 1B54037B401 for ; Mon, 16 Jun 2003 21:52:54 -0700 (PDT) Received: from harmony.village.org (rover.bsdimp.com [204.144.255.66]) by mx1.FreeBSD.org (Postfix) with ESMTP id 20F6B43F93 for ; Mon, 16 Jun 2003 21:52:53 -0700 (PDT) (envelope-from imp@bsdimp.com) Received: from localhost (warner@rover2.village.org [10.0.0.1]) by harmony.village.org (8.12.8/8.12.3) with ESMTP id h5H4qmkA065185; Mon, 16 Jun 2003 22:52:49 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Mon, 16 Jun 2003 22:52:16 -0600 (MDT) Message-Id: <20030616.225216.115910026.imp@bsdimp.com> To: gurney_j@efn.org From: "M. Warner Losh" In-Reply-To: <20030616074122.GF73854@funkthat.com> References: <20030616074122.GF73854@funkthat.com> X-Mailer: Mew version 2.1 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit cc: arch@freebsd.org Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 04:52:54 -0000 In message: <20030616074122.GF73854@funkthat.com> John-Mark Gurney writes: : Does anyone have an objection to making /dev/pci really honor the : permissions, and giving normal users (or just group wheel) premission : to run pciconf -l. Right now the code requires the write bit set for : any operation. Yes. That's too dangerous and will panic a machine. You can get the same information that pciconf -l does by going through the alternative interface of devinfo. 
Warner From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 22:08:44 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id E357437B401 for ; Mon, 16 Jun 2003 22:08:44 -0700 (PDT) Received: from mail.cyberonic.com (mail.cyberonic.com [4.17.179.4]) by mx1.FreeBSD.org (Postfix) with ESMTP id 7C4C043FD7 for ; Mon, 16 Jun 2003 22:08:43 -0700 (PDT) (envelope-from jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (node-40244c0a.sfo.onnet.us.uu.net [64.36.76.10]) by mail.cyberonic.com (8.12.8/8.12.5) with ESMTP id h5H5YMMo011491; Tue, 17 Jun 2003 01:34:23 -0400 Received: (from jmg@localhost) by hydrogen.funkthat.com (8.12.9/8.11.6) id h5H595qk021343; Mon, 16 Jun 2003 22:09:05 -0700 (PDT) (envelope-from jmg) Date: Mon, 16 Jun 2003 22:09:05 -0700 From: John-Mark Gurney To: "M. Warner Losh" Message-ID: <20030617050905.GE73854@funkthat.com> Mail-Followup-To: "M. Warner Losh" , arch@freebsd.org References: <20030616074122.GF73854@funkthat.com> <20030616.225216.115910026.imp@bsdimp.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030616.225216.115910026.imp@bsdimp.com> User-Agent: Mutt/1.4.1i X-Operating-System: FreeBSD 4.2-RELEASE i386 X-PGP-Fingerprint: B7 EC EF F8 AE ED A7 31 96 7A 22 B3 D8 56 36 F4 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html cc: arch@freebsd.org Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list Reply-To: John-Mark Gurney List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 05:08:45 -0000 M. 
Warner Losh wrote this message on Mon, Jun 16, 2003 at 22:52 -0600: > In message: <20030616074122.GF73854@funkthat.com> > John-Mark Gurney writes: > : Does anyone have an objection to making /dev/pci really honor the > : permissions, and giving normal users (or just group wheel) premission > : to run pciconf -l. Right now the code requires the write bit set for > : any operation. > > Yes. That's too dangerous and will panic a machine. I'm not talking about -r... and anyways, I have fixed part of the problem with unaligned reads/writes which has been posted to this thread already.... > You can get the same information that pciconf -l does by going through > the alternative interface of devinfo. So, I hear you proposing to remove pciconf and the /dev/pci interface? -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 22:09:16 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 0CC7237B401 for ; Mon, 16 Jun 2003 22:09:16 -0700 (PDT) Received: from ns.aus.com (adsl-67-122-205-189.dsl.pltn13.pacbell.net [67.122.205.189]) by mx1.FreeBSD.org (Postfix) with ESMTP id 0A82543FE0 for ; Mon, 16 Jun 2003 22:09:15 -0700 (PDT) (envelope-from rsharpe@richardsharpe.com) Received: from localhost (rsharpe@localhost) by ns.aus.com (8.11.6/8.11.6) with ESMTP id h5H5FgL05996 for ; Mon, 16 Jun 2003 22:15:42 -0700 X-Authentication-Warning: ns.aus.com: rsharpe owned process doing -bs Date: Mon, 16 Jun 2003 22:15:42 -0700 (PDT) From: Richard Sharpe X-X-Sender: To: In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Subject: Re: sendfile(2) SF_NOPUSH flag proposal X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: 
List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 05:09:16 -0000 On Tue, 27 May 2003, Igor Sysoev wrote: > On Tue, 27 May 2003, Peter Jeremy wrote: > > > 2) The new feature provides significant performance benefit. In this > > case, I believe the overhead of calling setsockopt(2) is negligible > > so the performance gain would be negligible. > > I think the calling setsockopt(TCP_NOPUSH, 1) syscall has huge overhead > as compared to several C operators inside sendfile(2). > > The turing TF_NOPUSH off has almost the same overhead as > setsockopt(TCP_NOPUSH, 0) if you need to call tcp_output(tp) inside > sendfile(2) and has no overhead at all if you do not need to call it. > > > At this stage, I would suggest that you need to do better than "the > > change is cheap" to justify adding this feature. Can you quantify > > the performance benefits, or provide some other justification? > > My point is not "the cheap change" but "the cheap overhead". While I was chasing down a performance problem with Samba using sendfile on FreeBSD 4.6.2, I changed sendfile to: 1. Use sosend for the header, and 2. Not to push the header out if there was data following (by passing MSG_MORE to sosend, and maybe frobbing sosend to do the right things).. I was also using TCP_NODELAY, and sendfile was being used to to handle SMB Read&X calls only. The performance impact of doing this as measured by tests like NetBench was negligible. I did not test raw throughput (as NetBench is not really about raw throughput), but I suspect that it would not make much difference either. I also modified sendfile so that it uses VOP_GETPAGES rather than VOP_READ, and this had more impact, I believe. 
Regards ----- Richard Sharpe, rsharpe[at]ns.aus.com, rsharpe[at]samba.org, sharpe[at]ethereal.com, http://www.richardsharpe.com From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 22:13:37 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 1E1B437B401 for ; Mon, 16 Jun 2003 22:13:37 -0700 (PDT) Received: from harmony.village.org (rover.bsdimp.com [204.144.255.66]) by mx1.FreeBSD.org (Postfix) with ESMTP id 3811A43F3F for ; Mon, 16 Jun 2003 22:13:36 -0700 (PDT) (envelope-from imp@bsdimp.com) Received: from localhost (warner@rover2.village.org [10.0.0.1]) by harmony.village.org (8.12.8/8.12.3) with ESMTP id h5H5DUkA065417; Mon, 16 Jun 2003 23:13:30 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Mon, 16 Jun 2003 23:12:58 -0600 (MDT) Message-Id: <20030616.231258.116352275.imp@bsdimp.com> To: gurney_j@efn.org From: "M. Warner Losh" In-Reply-To: <20030617050905.GE73854@funkthat.com> References: <20030616074122.GF73854@funkthat.com> <20030616.225216.115910026.imp@bsdimp.com> <20030617050905.GE73854@funkthat.com> X-Mailer: Mew version 2.1 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit cc: arch@freebsd.org Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 05:13:37 -0000 In message: <20030617050905.GE73854@funkthat.com> John-Mark Gurney writes: : M. 
Warner Losh wrote this message on Mon, Jun 16, 2003 at 22:52 -0600:
: > In message: <20030616074122.GF73854@funkthat.com>
: > John-Mark Gurney writes:
: > : Does anyone have an objection to making /dev/pci really honor the
: > : permissions, and giving normal users (or just group wheel) permission
: > : to run pciconf -l. Right now the code requires the write bit set for
: > : any operation.
: >
: > Yes. That's too dangerous and will panic a machine.
:
: I'm not talking about -r... and anyways, I have fixed part of the
: problem with unaligned reads/writes which has been posted to this
: thread already....

Saw that. Still not sure it is a good idea, but only if the code is
reviewed heavily...

: > You can get the same information that pciconf -l does by going through
: > the alternative interface of devinfo.
:
: So, I hear you proposing to remove pciconf and the /dev/pci interface?

Not really. I'm saying that we should beef up devinfo interface so
that it can get PNP information from more databases than just the PCI
one. pciconf is the wrong place for -v to be placed. But it was the
only place to place it when it was written. Now that other busses
have begun to implement pnp info to userland, we should look at a good
way to deal.

Also, pciconf -r/-w provides information that devinfo cannot provide.
And they serve a useful purpose for low level debugging.
Warner From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 22:28:48 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id E0DCF37B401; Mon, 16 Jun 2003 22:28:47 -0700 (PDT) Received: from mail.cyberonic.com (mail.cyberonic.com [4.17.179.4]) by mx1.FreeBSD.org (Postfix) with ESMTP id D4CF643FBF; Mon, 16 Jun 2003 22:28:46 -0700 (PDT) (envelope-from jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (node-40244c0a.sfo.onnet.us.uu.net [64.36.76.10]) by mail.cyberonic.com (8.12.8/8.12.5) with ESMTP id h5H5sTMo014071; Tue, 17 Jun 2003 01:54:29 -0400 Received: (from jmg@localhost) by hydrogen.funkthat.com (8.12.9/8.11.6) id h5H5TH6g021743; Mon, 16 Jun 2003 22:29:17 -0700 (PDT) (envelope-from jmg) Date: Mon, 16 Jun 2003 22:29:17 -0700 From: John-Mark Gurney To: Bruce Evans Message-ID: <20030617052917.GF73854@funkthat.com> Mail-Followup-To: Bruce Evans , Robert Watson , arch@freebsd.org References: <20030617120956.N30677@gamplex.bde.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030617120956.N30677@gamplex.bde.org> User-Agent: Mutt/1.4.1i X-Operating-System: FreeBSD 4.2-RELEASE i386 X-PGP-Fingerprint: B7 EC EF F8 AE ED A7 31 96 7A 22 B3 D8 56 36 F4 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html cc: arch@freebsd.org cc: Robert Watson Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list Reply-To: John-Mark Gurney List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 05:28:48 -0000 Bruce Evans wrote this message on Tue, Jun 17, 2003 at 12:36 +1000: > On Mon, 16 Jun 2003, Robert Watson wrote: > > > On Mon, 16 Jun 2003, John-Mark 
Gurney wrote:
> >
> > > again, I just proposed -l, not -r to become user readable. I know that
> > > -r has problems. I've crashed the sparc box a number of times by
> > > specifying pciconf -r pci1:5:0 0x0:0xf.
>
> Yes, it seems that -l is fairly safe, and it wasn't the default just because
> no one wrote the small hack to change the permissions check for the -l case
> only.

Ok, sounds like most people agree on -l being readable...

> > > > odd use of useracc(), printf(), etc, in the ioctl code. I suspect this
> > >
> > > well, do you mean odd use of printf as in providing diagnostics to catch
> > > mismatched userland/kernel?
> >
> > Spamming the console with messages can have debilitating effects on the
> > operation of the system if performed by unprivileged users... I.e., if
> > you're using a serial console.
>
> I.e., if you are using a console. Spamming of syscons consoles messes them
> up too, and is more likely to cause panics since the syscons console output
> is less reentrant than sio console output.

I removed those printf's since they really didn't contain any information
more than the user already knew..

> > > for useracc, it checks to make sure that various pointers passed to it
> > > are either readable or writable. I don't see this as odd. Or is there
> > > another better method of checking user data when accessing user space
> > > buffers?
> >
> > Generally, calls to useracc() are redundant with the existing checks in
> > our copyin/out routines, or are signs that the proper routines aren't
> > being used.
>
> useracc() can be used to "improve" error handling. That is done here.
> The "improvements" include dangerous printf()s and returning the
> undocumented errno EACCES instead of the incompletely documented one EFAULT.

Doh, reminder, update documentation.. I'm going to let the copyout error
be returned.
> > All of the fuword/suword/copyin/copyout/uio routines already
> > perform any necessary checks; manual checking is race-prone in a
> > multi-threaded smp environment regardless.
>
> Please check the error handling carefully if you remove the useracc()'s.
> The one for writing the results back to userland gets replaced by checks
> in copyout() for each part of the results. I think we give up after the
> first copyout() error and return that error. That seems right.

Actually, we had a problem where if we tried to copy out, but failed, we
would still increment num_matches even though the last one was bogus..

> > The (cio==NULL) test is also
> > redundant.
>
> It is just bogus. cio is a kernel pointer and we need it to be more than
> non-NULL to access it. sys_generic.c:ioctl() guarantees this provided
> the definition of PCICGETCONF is correct.
>
> > It looks like (although I haven't tried), user processes can
> > also cause the kernel to allocate unlimited amounts of kernel memory,
> > which is another bit we probably need to tighten down.
>
> Much more serious.

Yep, the pattern_buf is allocated, and in some cases a break happens w/o
freeing it. So there is a memory leak here. Will be fixed soon.

-- 
John-Mark Gurney Voice: +1 415 225 5579
"All that I will do, has been done, All that I have, has not."
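
The three points raised in this message (stop at the first copyout error
and return it, count a match only after it has really been copied, and
free pattern_buf on every exit path, with a bound on how much a caller
can make us allocate) can be sketched in userland C. This is only an
illustration under stated assumptions, not the /dev/pci ioctl code:
fake_copyout(), copy_matches(), struct match and MAX_PATTERNS are all
hypothetical, and fake_copyout() merely imitates copyout(9) by failing
with EFAULT once the "mapped" part of the destination is exhausted.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PATTERNS 64     /* hypothetical cap on caller-sized allocation */

struct match {
    int bus, slot, func;
};

/* Userland stand-in for copyout(9): only the first `mapped` slots exist. */
static int
fake_copyout(const struct match *src, struct match *ubuf, size_t idx,
    size_t mapped)
{
    if (idx >= mapped)
        return (EFAULT);
    memcpy(&ubuf[idx], src, sizeof(*src));
    return (0);
}

static int
copy_matches(const struct match *found, size_t nfound, size_t npatterns,
    struct match *ubuf, size_t mapped, size_t *num_matches)
{
    struct match *pattern_buf = NULL;
    size_t i;
    int error = 0;

    *num_matches = 0;
    /* Refuse oversized requests before allocating anything. */
    if (npatterns > MAX_PATTERNS)
        return (EINVAL);
    if (npatterns > 0) {
        pattern_buf = calloc(npatterns, sizeof(*pattern_buf));
        if (pattern_buf == NULL)
            return (ENOMEM);
    }

    for (i = 0; i < nfound; i++) {
        error = fake_copyout(&found[i], ubuf, i, mapped);
        if (error != 0)
            break;          /* give up after the first failure */
        (*num_matches)++;   /* counted only once the copy succeeded */
    }

    /* Single exit path, so the buffer is released even after a break. */
    free(pattern_buf);
    return (error);
}
```

Funnelling every exit after the allocation through one free() is one way
to avoid the leak; a goto-based cleanup label would do the same job.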
From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 22:34:26 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 853C337B401 for ; Mon, 16 Jun 2003 22:34:26 -0700 (PDT) Received: from mail.cyberonic.com (mail.cyberonic.com [4.17.179.4]) by mx1.FreeBSD.org (Postfix) with ESMTP id 9ADA843F85 for ; Mon, 16 Jun 2003 22:34:25 -0700 (PDT) (envelope-from jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (node-40244c0a.sfo.onnet.us.uu.net [64.36.76.10]) by mail.cyberonic.com (8.12.8/8.12.5) with ESMTP id h5H608Mo014906; Tue, 17 Jun 2003 02:00:08 -0400 Received: (from jmg@localhost) by hydrogen.funkthat.com (8.12.9/8.11.6) id h5H5YutT021835; Mon, 16 Jun 2003 22:34:56 -0700 (PDT) (envelope-from jmg) Date: Mon, 16 Jun 2003 22:34:56 -0700 From: John-Mark Gurney To: "M. Warner Losh" Message-ID: <20030617053456.GG73854@funkthat.com> Mail-Followup-To: "M. Warner Losh" , arch@freebsd.org References: <20030616074122.GF73854@funkthat.com> <20030616.225216.115910026.imp@bsdimp.com> <20030617050905.GE73854@funkthat.com> <20030616.231258.116352275.imp@bsdimp.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030616.231258.116352275.imp@bsdimp.com> User-Agent: Mutt/1.4.1i X-Operating-System: FreeBSD 4.2-RELEASE i386 X-PGP-Fingerprint: B7 EC EF F8 AE ED A7 31 96 7A 22 B3 D8 56 36 F4 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html cc: arch@freebsd.org Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list Reply-To: John-Mark Gurney List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 05:34:26 -0000 M. 
Warner Losh wrote this message on Mon, Jun 16, 2003 at 23:12 -0600:
> In message: <20030617050905.GE73854@funkthat.com>
> John-Mark Gurney writes:
> : M. Warner Losh wrote this message on Mon, Jun 16, 2003 at 22:52 -0600:
> : > In message: <20030616074122.GF73854@funkthat.com>
> : > John-Mark Gurney writes:
> : > : Does anyone have an objection to making /dev/pci really honor the
> : > : permissions, and giving normal users (or just group wheel) permission
> : > : to run pciconf -l. Right now the code requires the write bit set for
> : > : any operation.
> : >
> : > Yes. That's too dangerous and will panic a machine.
> :
> : I'm not talking about -r... and anyways, I have fixed part of the
> : problem with unaligned reads/writes which has been posted to this
> : thread already....
>
> Saw that. Still not sure it is a good idea, but only if the code is
> reviewed heavily...

You can take a look at it if you want (I'll send you my patches
shortly). Shouldn't take but about 5-10 minutes of your time. It's
a VERY simple interface/code.

> : > You can get the same information that pciconf -l does by going through
> : > the alternative interface of devinfo.
> :
> : So, I hear you proposing to remove pciconf and the /dev/pci interface?
>
> Not really. I'm saying that we should beef up devinfo interface so
> that it can get PNP information from more databases than just the PCI
> one. pciconf is the wrong place for -v to be placed. But it was the
> only place to place it when it was written. Now that other busses
> have begun to implement pnp info to userland, we should look at a good
> way to deal.

I agree... devinfo is much better than pciconf for device information,
but this whole bit was just started to get one small piece of code
committed, at least it'll be better while it lasts.

-- 
John-Mark Gurney Voice: +1 415 225 5579
"All that I will do, has been done, All that I have, has not."
From owner-freebsd-arch@FreeBSD.ORG Mon Jun 16 22:40:13 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 2AA7737B401 for ; Mon, 16 Jun 2003 22:40:13 -0700 (PDT) Received: from harmony.village.org (rover.bsdimp.com [204.144.255.66]) by mx1.FreeBSD.org (Postfix) with ESMTP id 58A4143FAF for ; Mon, 16 Jun 2003 22:40:12 -0700 (PDT) (envelope-from imp@bsdimp.com) Received: from localhost (warner@rover2.village.org [10.0.0.1]) by harmony.village.org (8.12.8/8.12.3) with ESMTP id h5H5e0kA065562; Mon, 16 Jun 2003 23:40:00 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Mon, 16 Jun 2003 23:39:08 -0600 (MDT) Message-Id: <20030616.233908.94890442.imp@bsdimp.com> To: gurney_j@efn.org From: "M. Warner Losh" In-Reply-To: <20030617053456.GG73854@funkthat.com> References: <20030617050905.GE73854@funkthat.com> <20030616.231258.116352275.imp@bsdimp.com> <20030617053456.GG73854@funkthat.com> X-Mailer: Mew version 2.1 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit cc: arch@freebsd.org Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 05:40:13 -0000 In message: <20030617053456.GG73854@funkthat.com> John-Mark Gurney writes: : You can take a look at it if you want (I'll send you my patchse : shortly.). Shouldn't take but about 5-10 minutes of your time. It's : a VERY simple interface/code. Actually, looks like the interactions with others have done a great job of doing that. : > : > You can get the same information that pciconf -l does by going through : > : > the alternative interface of devinfo. 
: > : : > : So, I hear you proposing to remove pciconf and the /dev/pci interface? : > : > Not really. I'm saying that we should beef up devinfo interface so : > that it can get PNP information from more databases that just the PCI : > one. pciconf is the wrong place for -v to be placed. But it was the : > only place to place it when it was written. Now that other busses : > have begun to implement pnp info to userland, we should look at a good : > way to deal. : : I agree... devinfo is much beter than pciconf for device information, : but this whole bit was just started to get one small piece of code : committed, at least it'll be better while it lasts. True. It will be a while before devinfo can get the TLC I really want to give it. Warner From owner-freebsd-arch@FreeBSD.ORG Tue Jun 17 01:48:23 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id C7E3737B401 for ; Tue, 17 Jun 2003 01:48:23 -0700 (PDT) Received: from park.rambler.ru (park.rambler.ru [81.19.64.101]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5237D43F85 for ; Tue, 17 Jun 2003 01:48:22 -0700 (PDT) (envelope-from is@rambler-co.ru) Received: from is.park.rambler.ru (is.park.rambler.ru [81.19.64.102]) by park.rambler.ru (8.12.6/8.12.6) with ESMTP id h5H8mKmF043199; Tue, 17 Jun 2003 12:48:20 +0400 (MSD) Date: Tue, 17 Jun 2003 12:48:20 +0400 (MSD) From: Igor Sysoev X-Sender: is@is To: Richard Sharpe In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: arch@freebsd.org Subject: Re: sendfile(2) SF_NOPUSH flag proposal X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 08:48:24 -0000 On Mon, 16 Jun 2003, Richard Sharpe wrote: > While I was chasing down a performance 
problem with Samba using sendfile
> on FreeBSD 4.6.2, I changed sendfile to:
>
> 1. Use sosend for the header, and
> 2. Not to push the header out if there was data following (by passing
> MSG_MORE to sosend, and maybe frobbing sosend to do the right things)..
>
> I was also using TCP_NODELAY, and sendfile was being used to handle SMB
> Read&X calls only.
>
> The performance impact of doing this as measured by tests like NetBench
> was negligible.
>
> I did not test raw throughput (as NetBench is not really
> about raw throughput), but I suspect that it would not make much
> difference either.

Sending the header and the first part of the file in separate packets is
only one part of the problem. The second part is that file pages can be
sent in incomplete packets, e.g. three pages can be sent as

    1460, 1460, 1176, 1460, 1460, 1176, 1460, 1460, ...

or as

    1460, 1460, 1460, 1460, 1460, 892, 1460, 1460, ...

This can be fixed by setting the PRUS_MORETOCOME flag while sending the
pages.

> I also modified sendfile so that it uses VOP_GETPAGES rather than
> VOP_READ, and this had more impact, I believe.

What is the difference between these operations?
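
The packet sizes quoted above follow from simple arithmetic: a 4096-byte
page is 1460 + 1460 + 1176, and two pages (8192 bytes) are five full
1460-byte segments plus 892. The sketch below is a deliberately
simplified model of that arithmetic, not of the real TCP output path;
segment() and its parameters are hypothetical names. With coalesce set to
zero a short segment is flushed at the end of every chunk handed to the
socket, otherwise the remainder is carried over into the next chunk,
which is roughly the effect a "more to come" hint is meant to have.

```c
#include <stddef.h>

/*
 * Split nchunks chunks of `chunk` bytes into segments of at most `mss`
 * bytes, storing the sizes in out[] (at most `max` entries).  Returns
 * the number of segments produced.
 */
static size_t
segment(size_t chunk, size_t nchunks, size_t mss, int coalesce,
    size_t *out, size_t max)
{
    size_t pending = 0, n = 0, i;

    for (i = 0; i < nchunks; i++) {
        pending += chunk;
        /* Emit as many full-sized segments as the queued data allows. */
        while (pending >= mss && n < max) {
            out[n++] = mss;
            pending -= mss;
        }
        /* Flush the short tail unless more data is known to follow. */
        if (pending > 0 && (!coalesce || i == nchunks - 1) && n < max) {
            out[n++] = pending;
            pending = 0;
        }
    }
    return (n);
}
```

Called with chunk = 4096, nchunks = 3 and mss = 1460, the non-coalescing
mode yields a short 1176-byte segment after every page, while the
coalescing mode yields full segments followed by a single short one at
the very end.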
Igor Sysoev http://sysoev.ru/en/ From owner-freebsd-arch@FreeBSD.ORG Tue Jun 17 03:54:29 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 4167F37B401 for ; Tue, 17 Jun 2003 03:54:29 -0700 (PDT) Received: from mailhub.fokus.fraunhofer.de (mailhub.fokus.fraunhofer.de [193.174.154.14]) by mx1.FreeBSD.org (Postfix) with ESMTP id 1D5A743FDD for ; Tue, 17 Jun 2003 03:54:28 -0700 (PDT) (envelope-from brandt@fokus.fraunhofer.de) Received: from beagle (beagle [193.175.132.100])h5HAsQQ16910 for ; Tue, 17 Jun 2003 12:54:26 +0200 (MEST) Date: Tue, 17 Jun 2003 12:54:26 +0200 (CEST) From: Harti Brandt To: arch@freebsd.org Message-ID: <20030617124004.Y77677@beagle.fokus.fraunhofer.de> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Subject: busdma sync problem X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 10:54:29 -0000 Hi, I have drivers for several cards that use shared host memory for communication between the card and the driver. In most cases these memory areas are used for queues: the driver writes on one end, the card reads on the other or the other way around. The problem is, that with the current bus_dma_sync call there is no way to correctly synchronize these queues, because bus_dma_sync synchonizes always the complete map. NetBSD bus_dma_sync has two additional parameters for that case: an offset and a length describing the part of the map to synchronize. Unless I got the above entirely wrong, I suppose we also need this functionality. The question is how to introduce it. The easy way is to implement a new function, say bus_dma_sync_size with the same signature as the NetBSD one. 
The formally more correct way would be to just change our function and all its callers. This would, of course, break the interface and I suppose it's too late for 5.X. So should I go the first way? Is there anybody who would be willing to look at the patch? harti -- harti brandt, http://www.fokus.fraunhofer.de/research/cc/cats/employees/hartmut.brandt/private brandt@fokus.fraunhofer.de, harti@freebsd.org From owner-freebsd-arch@FreeBSD.ORG Tue Jun 17 04:53:30 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id E97C337B401 for ; Tue, 17 Jun 2003 04:53:30 -0700 (PDT) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.FreeBSD.org (Postfix) with ESMTP id 8310143FB1 for ; Tue, 17 Jun 2003 04:53:30 -0700 (PDT) (envelope-from mux@freebsd.org) Received: by elvis.mu.org (Postfix, from userid 1920) id 6714B2ED442; Tue, 17 Jun 2003 04:53:30 -0700 (PDT) Date: Tue, 17 Jun 2003 13:53:30 +0200 From: Maxime Henrion To: Harti Brandt Message-ID: <20030617115330.GS21011@elvis.mu.org> References: <20030617124004.Y77677@beagle.fokus.fraunhofer.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030617124004.Y77677@beagle.fokus.fraunhofer.de> User-Agent: Mutt/1.4.1i cc: arch@freebsd.org Subject: Re: busdma sync problem X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 11:53:31 -0000 Harti Brandt wrote: > > Hi, > > I have drivers for several cards that use shared host memory for > communication between the card and the driver. In most cases these memory > areas are used for queues: the driver writes on one end, the card reads on > the other or the other way around. 
The problem is that with the current > bus_dma_sync call there is no way to correctly synchronize these queues, > because bus_dma_sync always synchronizes the complete map. NetBSD's > bus_dma_sync has two additional parameters for that case: an offset and > a length describing the part of the map to synchronize. Unless I got the > above entirely wrong, I suppose we also need this functionality. Indeed. This can also be used to slightly improve performance with PAE kernels because PAE causes lots of bounce buffers to be used, and with an offset and length parameter, we can reduce the number of bytes that will be copied on bus_dmamap_sync(). I'm not sure the performance difference would be measurable though. I'm getting a bit off topic here, but for what it's worth, a way to greatly improve network performance with PAE kernels would be to teach the mbuf allocator how to try to allocate mbufs in memory where no bounce buffers are required. > The question is how to introduce it. The easy way is to implement a new > function, say bus_dma_sync_size with the same signature as the NetBSD one. > The formally more correct way would be to just change our function and all > its callers. This would, of course, break the interface and I suppose it's > too late for 5.X. > > So should I go the first way? Is there anybody who would be willing to > look at the patch? I don't really like how the NetBSD interface forces us to always specify the length and offset, even though in most cases we want to sync the whole map. However, I believe it would be more beneficial at this point in time to try to make our busdma API as close to NetBSD's as possible. I bet we'll end up using more macros, as the NetBSD folks do in their drivers, to circumvent the fact that it's annoying to always have to specify the length and offset to bus_dmamap_sync(). When you have this done, I'll be happy to take a look at it if you want me to. 
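To make the interface under discussion concrete, here is a minimal user-space sketch. It is not FreeBSD or NetBSD kernel code: it only models a bounce-buffered map whose sync routine takes an offset and a length, plus a wrapper macro for the common whole-map case mentioned above. All names (toy_dmamap, toy_dmamap_sync, TOY_SYNC_ALL) are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model of a bounce-buffered DMA map (not a kernel type). */
struct toy_dmamap {
	unsigned char *client;	/* buffer the driver reads and writes */
	unsigned char *bounce;	/* buffer the device would access */
	size_t size;		/* size of both buffers, in bytes */
};

enum toy_syncop { TOY_PREWRITE, TOY_POSTREAD };

/*
 * Sync only the range [offset, offset + len) of the map.
 * TOY_PREWRITE copies driver data into the bounce buffer before the
 * device reads it; TOY_POSTREAD copies device data back to the driver.
 * Returns the number of bytes copied.
 */
static size_t
toy_dmamap_sync(struct toy_dmamap *map, size_t offset, size_t len,
    enum toy_syncop op)
{
	/* The requested range must lie inside the map. */
	assert(offset <= map->size && len <= map->size - offset);

	if (op == TOY_PREWRITE)
		memcpy(map->bounce + offset, map->client + offset, len);
	else
		memcpy(map->client + offset, map->bounce + offset, len);
	return (len);
}

/* Convenience wrapper for the common case of syncing the whole map. */
#define TOY_SYNC_ALL(map, op) \
	toy_dmamap_sync((map), 0, (map)->size, (op))
```

With a queue living in one map, a driver following this pattern would sync just the descriptor it touched, which copies fewer bytes when bounce buffers are in use, and fall back to the wrapper macro when the whole map really is meant.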
Cheers, Maxime From owner-freebsd-arch@FreeBSD.ORG Tue Jun 17 08:09:39 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 2955837B401 for ; Tue, 17 Jun 2003 08:09:39 -0700 (PDT) Received: from ns.aus.com (adsl-67-122-204-43.dsl.snfc21.pacbell.net [67.122.204.43]) by mx1.FreeBSD.org (Postfix) with ESMTP id 56FF743FDF for ; Tue, 17 Jun 2003 08:09:38 -0700 (PDT) (envelope-from rsharpe@richardsharpe.com) Received: from localhost (rsharpe@localhost) by ns.aus.com (8.11.6/8.11.6) with ESMTP id h5HFG4p03792; Tue, 17 Jun 2003 08:16:05 -0700 X-Authentication-Warning: ns.aus.com: rsharpe owned process doing -bs Date: Tue, 17 Jun 2003 08:16:04 -0700 (PDT) From: Richard Sharpe X-X-Sender: To: Igor Sysoev In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: arch@freebsd.org Subject: Re: sendfile(2) SF_NOPUSH flag proposal X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 15:09:39 -0000 On Tue, 17 Jun 2003, Igor Sysoev wrote: > On Mon, 16 Jun 2003, Richard Sharpe wrote: > > > While I was chasing down a performance problem with Samba using sendfile > > on FreeBSD 4.6.2, I changed sendfile to: > > > > 1. Use sosend for the header, and > > 2. Not push the header out if there was data following (by passing > > MSG_MORE to sosend, and maybe frobbing sosend to do the right things). > > > > I was also using TCP_NODELAY, and sendfile was being used to handle SMB > > Read&X calls only. > > > > The performance impact of doing this as measured by tests like NetBench > > was negligible. > > > > I did not test raw throughput (as NetBench is not really > > about raw throughput), but I suspect that it would not make much > > difference either. 
> > Sending the header and the first file part in separate packets > is one part of the problem. The second part is that file pages can > be sent in incomplete packets, e.g. three pages can be sent as > 1460, 1460, 1176, 1460, 1460, 1176, 1460, 1460, ... > or as 1460, 1460, 1460, 1460, 1460, 892, 1460, 1460, ... > > And this can be fixed by the PRUS_MORETOCOME flag while sending the pages. Yes, I noticed that. What I didn't emphasize enough was that in my environment, i.e. an SMB server (CIFS, or Samba) and NetBench, the majority of the requests I care about are reads of less than 5100 bytes or so, and a very large proportion of them are less than 1448 bytes (we use SACK). So the less than full segment every third segment is not such an issue for me. This is especially so when I have bigger problems to worry about, like moving much of the SMB server into the kernel, making NFSv4 and SMB use much of the same infrastructure, and creating a unified credentials system where I can carry UIDs/GIDs and/or SIDs and/or KRB tickets around in the kernel. > > I also modified sendfile so that it uses VOP_GETPAGES rather than > > VOP_READ, and this had more impact, I believe. > > What is the difference between these operations? It was a throwaway comment. Please ignore it. 
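The segment sizes in the quoted sequences follow from simple arithmetic. The sketch below assumes 4096-byte pages and a 1460-byte MSS (both inferred from the quoted numbers, not stated in the thread) and uses hypothetical helper names. It computes the short trailing segment produced when each piece of data is flushed on its own, and compares the segment count with the coalesced case.

```c
#include <assert.h>
#include <stddef.h>

/* Segments needed for len bytes when each segment carries at most mss bytes. */
static size_t
segments_for(size_t len, size_t mss)
{
	assert(mss > 0);
	return ((len + mss - 1) / mss);
}

/* Size of the last, possibly short, segment of a len-byte piece. */
static size_t
tail_segment(size_t len, size_t mss)
{
	size_t rem;

	assert(mss > 0);
	if (len == 0)
		return (0);
	rem = len % mss;
	return (rem != 0 ? rem : mss);
}

/*
 * Segments sent when total bytes are handed over in chunk-sized pieces
 * and every piece is flushed separately, so each piece ends in its own
 * short segment instead of being merged with the data that follows.
 */
static size_t
segments_chunked(size_t total, size_t chunk, size_t mss)
{
	size_t n = 0;

	assert(chunk > 0);
	while (total > 0) {
		size_t piece = total < chunk ? total : chunk;

		n += segments_for(piece, mss);
		total -= piece;
	}
	return (n);
}
```

Under these assumptions a single page ends in a 1176-byte segment and a two-page piece ends in an 892-byte one, matching the two quoted patterns. Six pages flushed page by page take 18 segments, against 17 when the data is coalesced.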
Regards ----- Richard Sharpe, rsharpe[at]ns.aus.com, rsharpe[at]samba.org, sharpe[at]ethereal.com, http://www.richardsharpe.com From owner-freebsd-arch@FreeBSD.ORG Tue Jun 17 11:06:24 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 24CD737B407; Tue, 17 Jun 2003 11:06:24 -0700 (PDT) Received: from mail.cyberonic.com (mail.cyberonic.com [4.17.179.4]) by mx1.FreeBSD.org (Postfix) with ESMTP id 213DC43F75; Tue, 17 Jun 2003 11:06:21 -0700 (PDT) (envelope-from jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (node-40244c0a.sfo.onnet.us.uu.net [64.36.76.10]) by mail.cyberonic.com (8.12.8/8.12.5) with ESMTP id h5HIW6Mo015646; Tue, 17 Jun 2003 14:32:07 -0400 Received: (from jmg@localhost) by hydrogen.funkthat.com (8.12.9/8.11.6) id h5HI6iIq033086; Tue, 17 Jun 2003 11:06:44 -0700 (PDT) (envelope-from jmg) Date: Tue, 17 Jun 2003 11:06:44 -0700 From: John-Mark Gurney To: Maxime Henrion Message-ID: <20030617180644.GL73854@funkthat.com> Mail-Followup-To: Maxime Henrion , Harti Brandt , arch@freebsd.org References: <20030617124004.Y77677@beagle.fokus.fraunhofer.de> <20030617115330.GS21011@elvis.mu.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030617115330.GS21011@elvis.mu.org> User-Agent: Mutt/1.4.1i X-Operating-System: FreeBSD 4.2-RELEASE i386 X-PGP-Fingerprint: B7 EC EF F8 AE ED A7 31 96 7A 22 B3 D8 56 36 F4 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html cc: arch@freebsd.org Subject: Re: busdma sync problem X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list Reply-To: John-Mark Gurney List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 18:06:24 -0000 Maxime 
Henrion wrote this message on Tue, Jun 17, 2003 at 13:53 +0200: > whole map. However, I believe it would be more beneficial at this point > in time to try to make our busdma API as close as NetBSD's one as > possible. I bet we'll end up using more macros, as the NetBSD folks do Actually, right now, our bus/device interface is so far away that it'll take a lot of work. NetBSD shares bus_tag's between DMA and bus_space, so w/ FreeBSD, you already have to add a bus dma tag for each bus dmamap that you use in the driver. The dmamap is not an opaque type and you do not require callbacks to get the address for each segment. Sure you can start with some small things, but I don't think these are a big deal, (heck it might even confuse some people), till the rest of the bus_dma interface has been "merged". -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." From owner-freebsd-arch@FreeBSD.ORG Tue Jun 17 13:32:54 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id CC90A37B401; Tue, 17 Jun 2003 13:32:54 -0700 (PDT) Received: from vexpert.dbai.tuwien.ac.at (vexpert.dbai.tuwien.ac.at [128.131.111.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id 88AC843F3F; Tue, 17 Jun 2003 13:32:53 -0700 (PDT) (envelope-from pfeifer@dbai.tuwien.ac.at) Received: from [128.131.111.52] (naos [128.131.111.52]) by vexpert.dbai.tuwien.ac.at (Postfix) with ESMTP id DA5C013787; Tue, 17 Jun 2003 22:32:51 +0200 (CEST) Date: Tue, 17 Jun 2003 22:32:48 +0200 (CEST) From: Gerald Pfeifer To: deischen@freebsd.org, freebsd-arch@freebsd.org Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Subject: _THREAD_SAFE and gcc man page X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: 
List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 20:32:55 -0000 The following change to include/stdio.h revision 1.25 date: 2001/01/24 13:01:47; author: deischen; state: Exp; lines: +5 -55 Add a lock to DIR to make telldir and friends MT-safe. Clean up stdio.h a bit and remove _THREAD_SAFE. Some of the usual macros getc, putc, getchar, putchar are no longer macros. Approved by: -arch probably also should have removed the following part of the GCC man page: FreeBSD SPECIFIC OPTIONS -pthread Link a user-threaded process against libc_r instead of libc. Objects linked into user-threaded processes should be compiled with -D_THREAD_SAFE. or at least the -D_THREAD_SAFE part of it. Could one of you please take care of that? (GCC maintainers Cc:ed.) Gerald PS: This is now . -- Gerald "Jerry" pfeifer@dbai.tuwien.ac.at http://www.pfeifer.com/gerald/ From owner-freebsd-arch@FreeBSD.ORG Tue Jun 17 15:32:12 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 41B2937B40F for ; Tue, 17 Jun 2003 15:32:12 -0700 (PDT) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.FreeBSD.org (Postfix) with ESMTP id DB02543FBD for ; Tue, 17 Jun 2003 15:32:11 -0700 (PDT) (envelope-from mux@freebsd.org) Received: by elvis.mu.org (Postfix, from userid 1920) id B6C982ED416; Tue, 17 Jun 2003 15:32:11 -0700 (PDT) Date: Wed, 18 Jun 2003 00:32:11 +0200 From: Maxime Henrion To: Harti Brandt , arch@freebsd.org Message-ID: <20030617223211.GT21011@elvis.mu.org> References: <20030617124004.Y77677@beagle.fokus.fraunhofer.de> <20030617115330.GS21011@elvis.mu.org> <20030617180644.GL73854@funkthat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030617180644.GL73854@funkthat.com> User-Agent: Mutt/1.4.1i Subject: Re: busdma sync problem X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: 
list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jun 2003 22:32:12 -0000 John-Mark Gurney wrote: > Maxime Henrion wrote this message on Tue, Jun 17, 2003 at 13:53 +0200: > > whole map. However, I believe it would be more beneficial at this point > > in time to try to make our busdma API as close as NetBSD's one as > > possible. I bet we'll end up using more macros, as the NetBSD folks do > > Actually, right now, our bus/device interface is so far away that it'll > take a lot of work. NetBSD shares bus_tag's between DMA and bus_space, > so w/ FreeBSD, you already have to add a bus dma tag for each bus dmamap > that you use in the driver. The dmamap is not an opaque type and you > do not require callbacks to get the address for each segment. I am well aware of how our busdma API and NetBSD's one diverge. > Sure you can start with some small things, but I don't think these are > a big deal, (heck it might even confuse some people), till the rest of > the bus_dma interface has been "merged". The current goal is to merge with NetBSD rather than diverge, so it is meaningful to have bus_dmamap_sync() be the same, especially if you consider that bus_dmamap_sync() is a function that we call very often in the busdma aware drivers. Furthermore, since what we are talking about now isn't even implemented, it doesn't cost us much to implement it the NetBSD's way. The only drawback is that we'll have to change the existing busdma drivers, and we don't have so many busdma drivers, plus it's a mechanical change... 
Cheers, Maxime From owner-freebsd-arch@FreeBSD.ORG Tue Jun 17 17:22:44 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id D304437B404; Tue, 17 Jun 2003 17:22:44 -0700 (PDT) Received: from c104-254.bas1.prp.dublin.eircom.net (c104-254.bas1.prp.dublin.eircom.net [159.134.104.254]) by mx1.FreeBSD.org (Postfix) with SMTP id 4EB6343F3F; Tue, 17 Jun 2003 17:22:43 -0700 (PDT) (envelope-from iedowse@maths.tcd.ie) To: Don Lewis In-Reply-To: Your message of "Mon, 16 Jun 2003 21:10:33 PDT." <200306170410.h5H4AXM7050537@gw.catspoiler.org> Date: Wed, 18 Jun 2003 01:18:59 +0100 From: Ian Dowse Message-ID: <200306180119.aa03806@salmon.maths.tcd.ie> cc: iedowse@maths.tcd.ie cc: freebsd-arch@FreeBSD.org Subject: Re: Message buffer and printf reentrancy patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 00:22:45 -0000 In message <200306170410.h5H4AXM7050537@gw.catspoiler.org>, Don Lewis writes: >Since MSGBUF_SEQSUB() calls MSGBUF_SEQNORM() on the difference between >the sequence numbers, a negative value will never be returned. If you >want a signed result, you'll probably want to do something more like: > tmp = MSGBUF_SEQNORM(mbp, (seq1) - (seq2) + (mbp)->seqmod); > return ((tmp < ((mbp)->seqmod / 2)) ? tmp : (tmp - (mbp)->seqmod)); > >and you'll have to use a slightly different function if you are >comparing indexes. Oops, you're quite right - MSGBUF_SEQSUB was intended to return negative values, but got broken somewhere along the way. This appears not to affect the code that uses it, so I guess that means that the sequence numbers might as well be unsigned after all. 
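For reference, the signed subtraction Don describes can be sketched in plain C as follows. This is a standalone illustration and not the msgbuf code: seq_norm and seq_sub_signed are hypothetical stand-ins for MSGBUF_SEQNORM and a signed variant of MSGBUF_SEQSUB, and the modulus is passed explicitly instead of being read from a message buffer structure.

```c
#include <assert.h>

/* Hypothetical stand-in for MSGBUF_SEQNORM: reduce seq into [0, seqmod). */
static int
seq_norm(int seq, int seqmod)
{
	assert(seqmod > 0);
	seq %= seqmod;		/* C's % may yield a negative remainder */
	return (seq < 0 ? seq + seqmod : seq);
}

/*
 * Signed distance from seq2 to seq1 in modular arithmetic.
 * The normalised difference is folded so that values in the upper half
 * of the range are reported as negative, i.e. "seq1 is behind seq2".
 */
static int
seq_sub_signed(int seq1, int seq2, int seqmod)
{
	int tmp = seq_norm(seq1 - seq2, seqmod);

	return (tmp < seqmod / 2 ? tmp : tmp - seqmod);
}
```

With a modulus of 16, for example, going from 15 to 1 gives +2 (a wrap forwards), while going from 1 to 15 gives -2.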
>> The only minor problem I see with the above is that it is fragile >> with respect to arbitrary input sequence numbers, in that it could >> return a negative value. However, the property of guaranteeing to >> return a normalised sequence number can be achieved by forcing an >> unsigned division like in MSGBUF_SEQ_TO_POS, i.e.: > >Wouldn't it be better to have assertions to detect obviously bogus >sequence numbers rather than using them to generate a valid pointer to a >random location in the message buffer? It would if the assertion didn't trigger a panic that gets written to the message buffer via the same macros :-) Ian From owner-freebsd-arch@FreeBSD.ORG Tue Jun 17 23:20:51 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id EB98837B405 for ; Tue, 17 Jun 2003 23:20:51 -0700 (PDT) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id EC26943F93 for ; Tue, 17 Jun 2003 23:20:50 -0700 (PDT) (envelope-from phk@phk.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.9/8.12.9) with ESMTP id h5I6KmBE036656 for ; Wed, 18 Jun 2003 08:20:49 +0200 (CEST) (envelope-from phk@phk.freebsd.dk) To: arch@freebsd.org From: Poul-Henning Kamp Date: Wed, 18 Jun 2003 08:20:48 +0200 Message-ID: <36655.1055917248@critter.freebsd.dk> Subject: marking normal sleep identifiers as such. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 06:20:52 -0000 Now that we have a bunch of kernel threads which participate in the running of the system, I find that it is a tad more time consuming to figure out what the state of a crashed or hung system is. 
So I was wondering if we should institute a simple convention for the sleep identifiers to make it easier to spot, or rather: ignore, kthreads which are in their normal idle position. Since thread names are longer than the space we have in ps(1) output, using the thread name is not a feasible solution. I notice that the interrupt threads all seem to sleep on "-", and all things considered, I like that. Should we adopt that as our convention? -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Tue Jun 17 23:48:12 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 9232637B404 for ; Tue, 17 Jun 2003 23:48:12 -0700 (PDT) Received: from mail.chesapeake.net (chesapeake.net [208.142.252.6]) by mx1.FreeBSD.org (Postfix) with ESMTP id BA40843FBF for ; Tue, 17 Jun 2003 23:48:11 -0700 (PDT) (envelope-from jroberson@chesapeake.net) Received: from localhost (jroberson@localhost) by mail.chesapeake.net (8.11.6/8.11.6) with ESMTP id h5I6m4m42683; Wed, 18 Jun 2003 02:48:04 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Wed, 18 Jun 2003 02:48:04 -0400 (EDT) From: Jeff Roberson To: Poul-Henning Kamp In-Reply-To: <36655.1055917248@critter.freebsd.dk> Message-ID: <20030618024448.I36168-100000@mail.chesapeake.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: arch@freebsd.org Subject: Re: marking normal sleep identifiers as such. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 06:48:12 -0000 On Wed, 18 Jun 2003, Poul-Henning Kamp wrote: > > Now that we have a bunch of kernel threads which participate in the > running of the system, I find that it is a tad more time consuming > to figure out what the state of a crashed or hung system is. > > So I was wondering if we should instigate a simple convention for > the sleep identifiers to make it easier to spot, or rather: ignore, > kthreads which are in their normal idle position. > > Since thread names are longer than the space we have in ps(1) output > using the thread name is not feasible solution. > > I notice that the interrupt threads all seem to sleep on "-", and > all things considered, I like that. > > Should we adopt that as our convention ? I like the idea of having a convention. I think most any consistent identifier will do. I vote yes. Cheers, Jeff From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 04:53:34 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 1EF7A37B401; Wed, 18 Jun 2003 04:53:34 -0700 (PDT) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 0146643F75; Wed, 18 Jun 2003 04:53:33 -0700 (PDT) (envelope-from phk@phk.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.9/8.12.9) with ESMTP id h5IBrTBE039082; Wed, 18 Jun 2003 13:53:30 +0200 (CEST) (envelope-from phk@phk.freebsd.dk) To: Dmitry Sivachenko From: "Poul-Henning Kamp" In-Reply-To: Your message of "Wed, 18 Jun 2003 15:22:26 +0400." 
<20030618112226.GA42606@fling-wing.demos.su> Date: Wed, 18 Jun 2003 13:53:29 +0200 Message-ID: <39081.1055937209@critter.freebsd.dk> cc: "Tim J. Robbins" cc: arch@FreeBSD.org Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 11:53:34 -0000 In message <20030618112226.GA42606@fling-wing.demos.su>, Dmitry Sivachenko writes : [I've moved this to arch@] >> The main problems with nullfs seem to be locking and trying to create clones >> of the lower vnode (wrt. the VM system and special files). Once kern/51583 > >BTW, what is the reason for creating these clone vnodes? >Why we can't simply return the original vnode? This is a question of the same caliber as a kid asking mom where the babies come from :-) Back in history, when vnodes first appeared as part of stacking filesystems, there was no merged vm/buffer cache. There were also some suboptimal design "decisions" made in the VFS implementation, made to expedite the implementation, but introducing issues which "could be cleaned up later". NFS added a few interesting wrinkles to the vnode area, mostly because it does not follow the model implicitly assumed in the VFS layering. The buffer cache expects a disk device behind all buffers; that took some hacking too. Then we got a semi-merged vm/buffer cache. Semi, because it was never finished so it became some sort of hybrid almost but not quite entirely unlike either state. A few filesystems got VOP_GETPAGES, none of them got VOP_PUTPAGES as far as I recall. Then we got softupdates and snapshots, which due to shortcomings in the vm/buf area could not be implemented in the architecturally obvious way, but instead had to put fingers into specfs and the buffer cache to get the job done. 
All of this has tangled the simple component formerly known as the buffer cache up in so many ways that it is very hard for anybody to make heads or tails of it any more.

So I am tempted to answer your question with: "Because it is all a mess"

A number of us heavy-duty people have started to say rude things and do menacing gestures with our flow-diagram templates in the general direction of the buffer cache, but any real solution is unlikely to happen until we are talking 6-current.

The cleanup would probably be easier to perform if we could ditch the stuff and layers which have been glued on and reduce the code to its core functionality first, and this may indeed be what we have to do, but considering the list of stuff we are talking about, it is unlikely to be a politically feasible path to take:

vinum -- abuses getebuf(), should be GEOM class.
raidframe -- abuses getebuf(), should be GEOM class.
cluster code -- must be rewritten
snapshots -- must be untangled from the bio path.
softupdates -- ditto.
unionfs -- does not correctly layer VOP_STRATEGY
nullfs -- maybe same problem.
swap_pager -- abuses bogus vnode

I am hoping that we may be able to carve a path by changing the bio structure to operate on vm pages rather than KVM mapped byte arrays (most disk device drivers don't care about things being mapped, they use bus-master DMA and only need the physical location).

Next, giving buffers a set of object methods could maybe avoid the detour around VOP_BMAP and VOP_STRATEGY, thereby possibly making it possible for softupdates and snapshots to be implemented entirely inside UFS/FFS.

I have a couple of other ideas I want to explore as well, one of them being not doing I/O via VCHR vnodes, but either at the fdesc level (when from userland) or via a dedicated API (for disk I/O from buf/vm).

But I have only just started seriously investigating how all this can be done, and as I said, it is a royal mess, so it will take time no matter what I and others find.
With that said, I will also add, that I will take an incredibly dim view of anybody who tries to add more gunk in this area, and that I am perfectly willing to derail unionfs and nullfs (or pretty much anything else on the list above) if that is what it takes to clean up the buffer cache. Any of those facilities can be reintroduced later on in a cleaner fashion. I agree that nullfs and unionfs are useful technologies, but if they have to be reimplemented to fit our kernel, then so be it. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 05:16:31 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id F22A637B404; Wed, 18 Jun 2003 05:16:30 -0700 (PDT) Received: from axl.seasidesoftware.co.za (axl.seasidesoftware.co.za [196.31.7.201]) by mx1.FreeBSD.org (Postfix) with ESMTP id D4E1343FB1; Wed, 18 Jun 2003 05:16:26 -0700 (PDT) (envelope-from sheldonh@starjuice.net) Received: from sheldonh by axl.seasidesoftware.co.za with local (Exim 4.20) id 19SbrR-0000tH-E2; Wed, 18 Jun 2003 14:16:21 +0200 Date: Wed, 18 Jun 2003 14:16:21 +0200 From: Sheldon Hearn To: Poul-Henning Kamp Message-ID: <20030618121620.GG835@starjuice.net> Mail-Followup-To: Poul-Henning Kamp , Dmitry Sivachenko , "Tim J. Robbins" , arch@FreeBSD.org References: <20030618112226.GA42606@fling-wing.demos.su> <39081.1055937209@critter.freebsd.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <39081.1055937209@critter.freebsd.dk> User-Agent: Mutt/1.5.4i Sender: Sheldon Hearn cc: Dmitry Sivachenko cc: "Tim J. 
Robbins" cc: arch@FreeBSD.org Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 12:16:31 -0000 On (2003/06/18 13:53), Poul-Henning Kamp wrote: > With that said, I will also add, that I will take an incredibly > dim view of anybody who tries to add more gunk in this area, and > that I am perfectly willing to derail unionfs and nullfs (or pretty > much anything else on the list above) if that is what it takes to > clean up the buffer cache. Makes sense. After all, these filesystems are only just now recovering from "we can fix these later" breakage introduced years ago. What's a few more years without 'em? :-) Ciao, Sheldon. From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 05:20:58 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id F03FB37B401; Wed, 18 Jun 2003 05:20:58 -0700 (PDT) Received: from demos.su (mx.demos.su [194.87.0.32]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5829F43F85; Wed, 18 Jun 2003 05:20:57 -0700 (PDT) (envelope-from mitya@fling-wing.demos.su) Received: from [194.87.5.69] (HELO fling-wing.demos.su) by demos.su (CommuniGate Pro SMTP 4.1b7/D) with ESMTP-TLS id 76308768; Wed, 18 Jun 2003 16:20:55 +0400 Received: from fling-wing.demos.su (localhost [127.0.0.1]) by fling-wing.demos.su (8.12.9/8.12.6) with ESMTP id h5ICKt5R056795; Wed, 18 Jun 2003 16:20:55 +0400 (MSD) (envelope-from mitya@fling-wing.demos.su) Received: (from mitya@localhost) by fling-wing.demos.su (8.12.9/8.12.6/Submit) id h5ICKssV056794; Wed, 18 Jun 2003 16:20:54 +0400 (MSD) Date: Wed, 18 Jun 2003 16:20:54 +0400 From: Dmitry Sivachenko To: Poul-Henning Kamp Message-ID: 
<20030618122054.GA55870@fling-wing.demos.su> References: <20030618112226.GA42606@fling-wing.demos.su> <39081.1055937209@critter.freebsd.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=koi8-r Content-Disposition: inline In-Reply-To: <39081.1055937209@critter.freebsd.dk> WWW-Home-Page: http://mitya.pp.ru/ X-PGP-Key: http://mitya.pp.ru/mitya.asc User-Agent: Mutt/1.5.4i cc: arch@freebsd.org Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 12:20:59 -0000 On Wed, Jun 18, 2003 at 01:53:29PM +0200, Poul-Henning Kamp wrote: > All of this has tangled the simple component formerly known as the > buffer cache up in so many ways that it is very hard for anybody > to make heads or tails of it any more. > > So I am tempted to answer your question with: "Because it is all a > mess" > Is there any more-or-less correct FS implementation in the system from which one could learn how things should be done? Or any papers to read? From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 05:28:49 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 7885C37B401; Wed, 18 Jun 2003 05:28:49 -0700 (PDT) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 80EA643FDF; Wed, 18 Jun 2003 05:28:48 -0700 (PDT) (envelope-from phk@phk.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.9/8.12.9) with ESMTP id h5ICSlBE039516; Wed, 18 Jun 2003 14:28:47 +0200 (CEST) (envelope-from phk@phk.freebsd.dk) To: Dmitry Sivachenko From: "Poul-Henning Kamp" In-Reply-To: Your message of "Wed, 18 Jun 2003 16:20:54 +0400." 
<20030618122054.GA55870@fling-wing.demos.su> Date: Wed, 18 Jun 2003 14:28:47 +0200 Message-ID: <39515.1055939327@critter.freebsd.dk> cc: arch@freebsd.org Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 12:28:49 -0000 In message <20030618122054.GA55870@fling-wing.demos.su>, Dmitry Sivachenko writes : >On Wed, Jun 18, 2003 at 01:53:29PM +0200, Poul-Henning Kamp wrote: >> All of this has tangled the simple component formerly known as the >> buffer cache up in so many ways that it is very hard for anybody >> to make heads or tails of it any more. >> >> So I am tempted to answer your question with: "Because it is all a >> mess" >> > >Is there any more-or-less correct FS implementation in the system from >which one could learn how things should be done? The only arguably correct FS we have is UFS, by fiat of being the most used. Unfortunately that is also the most complex FS we have. This is also a situation we should try to fix. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 06:33:22 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id EFA3D37B401 for ; Wed, 18 Jun 2003 06:33:22 -0700 (PDT) Received: from fledge.watson.org (fledge.watson.org [204.156.12.50]) by mx1.FreeBSD.org (Postfix) with ESMTP id 47A1943F93 for ; Wed, 18 Jun 2003 06:33:21 -0700 (PDT) (envelope-from robert@fledge.watson.org) Received: from fledge.watson.org (localhost [127.0.0.1]) by fledge.watson.org (8.12.9/8.12.9) with ESMTP id h5IDXIKJ004580; Wed, 18 Jun 2003 09:33:18 -0400 (EDT) (envelope-from robert@fledge.watson.org) Received: from localhost (robert@localhost)h5IDXIXS004577; Wed, 18 Jun 2003 09:33:18 -0400 (EDT) (envelope-from robert@fledge.watson.org) Date: Wed, 18 Jun 2003 09:33:18 -0400 (EDT) From: Robert Watson X-Sender: robert@fledge.watson.org To: Poul-Henning Kamp In-Reply-To: <36655.1055917248@critter.freebsd.dk> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: arch@freebsd.org Subject: Re: marking normal sleep identifiers as such. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 13:33:23 -0000 On Wed, 18 Jun 2003, Poul-Henning Kamp wrote: > Now that we have a bunch of kernel threads which participate in the > running of the system, I find that it is a tad more time consuming to > figure out what the state of a crashed or hung system is. > > So I was wondering if we should instigate a simple convention for the > sleep identifiers to make it easier to spot, or rather: ignore, kthreads > which are in their normal idle position. > > Since thread names are longer than the space we have in ps(1) output > using the thread name is not feasible solution. 
> > I notice that the interrupt threads all seem to sleep on "-", and all > things considered, I like that. > > Should we adopt that as our convention ? I agree with the concern -- I've similarly noticed an increase in the amount of time I spend diagnosing apparent deadlocks as I attempt to determine if kernel threads are simply idle, or stuck on locks. I don't really mind what the convention is; "-" is probably as good as any. Another possible convention would be to name the state fooidle -- i.e., pageridle, acpiidle, ... Given that the purpose of the thread is documented in the thread name, generally, this is probably overkill and unnecessarily extends the number and length of strings involved. A final option that comes to mind would be simply to call the state "idle". One disadvantage of changing to a common name with no distinct string is that it makes it quite a bit harder to track down the sleep call in the kernel; you can no longer glimpse/grep on the state to find the stage in the thread event loop you've reached, which would be one reason to prefer a fooidle approach. Robert N M Watson FreeBSD Core Team, TrustedBSD Projects robert@fledge.watson.org Network Associates Laboratories From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 06:36:07 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 3C8A037B401; Wed, 18 Jun 2003 06:36:07 -0700 (PDT) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5112D43F85; Wed, 18 Jun 2003 06:36:06 -0700 (PDT) (envelope-from phk@phk.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.9/8.12.9) with ESMTP id h5IDa4BE040090; Wed, 18 Jun 2003 15:36:04 +0200 (CEST) (envelope-from phk@phk.freebsd.dk) To: Robert Watson From: "Poul-Henning Kamp" In-Reply-To: Your message of "Wed, 18 Jun 2003 09:33:18 EDT." 
Date: Wed, 18 Jun 2003 15:36:04 +0200 Message-ID: <40089.1055943364@critter.freebsd.dk> cc: arch@freebsd.org Subject: Re: marking normal sleep identifiers as such. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 13:36:07 -0000 In message , Robert Watson writes: >> Should we adopt that as our convention ? > >One >disadvantage of changing to a common name with no distinct string is that >it makes it quite a bit harder to track down the sleep call in the kernel; >you can no longer glimpse/grep on the state to find the stage in the >thread event loop you've reached, which would be one reason to prefer a >fooidle approach. I actually thought a bit more about that. I think all sleeping calls should allow a string to be specified, also when we sleep on semaphores for instance. And I think we should capture __FILE__ and __LINE__ as well, at least when DDB or DIAGNOSTIC is in play. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 08:35:22 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id B9F8337B401 for ; Wed, 18 Jun 2003 08:35:22 -0700 (PDT) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id B413443F85 for ; Wed, 18 Jun 2003 08:35:21 -0700 (PDT) (envelope-from phk@phk.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.9/8.12.9) with ESMTP id h5IFZJBE041013 for ; Wed, 18 Jun 2003 17:35:20 +0200 (CEST) (envelope-from phk@phk.freebsd.dk) To: arch@freebsd.org From: Poul-Henning Kamp Date: Wed, 18 Jun 2003 17:35:19 +0200 Message-ID: <41012.1055950519@critter.freebsd.dk> Subject: userland access to devices is moving! X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 15:35:23 -0000 I sat down and hacked up a simple prototype to test the concept I have been rambling about for some years: Going directly from filedescriptor to device driver thus bypassing the vnode, devfs and specfs layer. I implemented this for /dev/null and /dev/zero, and ran the simple benchmark "dd if=/dev/zero of=/dev/null count=1000000" Before: N 3 Average: 44.900752667 Stddev: 0.049906338 After: N 3 Average: 18.460190333 Stddev: 0.074019507 That is 26.4 microseconds saved for each read(2)+write(2) operation, or 41% improvement on my Athlon 700MHz machine. 
A bit more locking will probably be needed, so this will erode some of this number, but there will be something left I'm sure :-) The largest impact of this is that VOP_OPEN(), vn_open() and vn_open_cred() grows an argument (the fdesc index) which existing callers need to pass a -1, the rest is relatively local hacking in devfs and some adjustments in the descriptor code. I have overall found that the implementation of this is not as hard as I imagined, and if I doubted it before, I am now certain that this is the right way to go. I should have tried this long time ago... patch at: http://phk.freebsd.dk/patch/fdesc.patch -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 09:40:31 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 0A87037B401 for ; Wed, 18 Jun 2003 09:40:31 -0700 (PDT) Received: from sccrmhc11.attbi.com (sccrmhc11.comcast.net [204.127.202.55]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4BFDD43F75 for ; Wed, 18 Jun 2003 09:40:30 -0700 (PDT) (envelope-from julian@elischer.org) Received: from interjet.elischer.org ([12.233.125.100]) by attbi.com (sccrmhc11) with ESMTP id <2003061816402801100807u7e>; Wed, 18 Jun 2003 16:40:29 +0000 Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id JAA38689; Wed, 18 Jun 2003 09:40:26 -0700 (PDT) Date: Wed, 18 Jun 2003 09:40:25 -0700 (PDT) From: Julian Elischer To: Poul-Henning Kamp In-Reply-To: <41012.1055950519@critter.freebsd.dk> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: arch@freebsd.org Subject: Re: userland access to devices is moving! 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 16:40:31 -0000 cool.. how does 'mount' do this however, as it doesn't have a fd list to work with. Or are you still leaving the vnode access path in place? On Wed, 18 Jun 2003, Poul-Henning Kamp wrote: > > I sat down and hacked up a simple prototype to test the concept I > have been rambling about for some years: Going directly from > filedescriptor to device driver thus bypassing the vnode, devfs and > specfs layer. > > I implemented this for /dev/null and /dev/zero, and ran the simple > benchmark "dd if=/dev/zero of=/dev/null count=1000000" > > Before: > N 3 Average: 44.900752667 Stddev: 0.049906338 > > After: > N 3 Average: 18.460190333 Stddev: 0.074019507 > > That is 26.4 microseconds saved for each read(2)+write(2) operation, > or 41% improvement on my Athlon 700MHz machine. > > A bit more locking will probably be needed, so this will erode some > of this number, but there will be something left I'm sure :-) > > The largest impact of this is that VOP_OPEN(), vn_open() and > vn_open_cred() grows an argument (the fdesc index) which existing > callers need to pass a -1, the rest is relatively local hacking in > devfs and some adjustments in the descriptor code. > > I have overall found that the implementation of this is not as hard > as I imagined, and if I doubted it before, I am now certain that > this is the right way to go. > > I should have tried this long time ago... > > patch at: > http://phk.freebsd.dk/patch/fdesc.patch > > -- > Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > phk@FreeBSD.ORG | TCP/IP since RFC 956 > FreeBSD committer | BSD since 4.3-tahoe > Never attribute to malice what can adequately be explained by incompetence. 
From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 09:46:36 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 0485237B401 for ; Wed, 18 Jun 2003 09:46:36 -0700 (PDT) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 061AD43F75 for ; Wed, 18 Jun 2003 09:46:35 -0700 (PDT) (envelope-from phk@phk.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.9/8.12.9) with ESMTP id h5IGkRBE041955; Wed, 18 Jun 2003 18:46:29 +0200 (CEST) (envelope-from phk@phk.freebsd.dk) To: Julian Elischer From: "Poul-Henning Kamp" In-Reply-To: Your message of "Wed, 18 Jun 2003 09:40:25 PDT." Date: Wed, 18 Jun 2003 18:46:27 +0200 Message-ID: <41954.1055954787@critter.freebsd.dk> cc: arch@freebsd.org Subject: Re: userland access to devices is moving! X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 16:46:36 -0000 In message , Julian Elischer writes: >cool.. >how does 'mount' do this however, as it doesn't have a fd list to work >with. Or are you still leaving the vnode access path in place? The vnode access path remains in order to support buf/vm -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 10:47:37 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A5E2437B401 for ; Wed, 18 Jun 2003 10:47:37 -0700 (PDT) Received: from regina.plastikos.com (216-107-106-250.wan.networktel.net [216.107.106.250]) by mx1.FreeBSD.org (Postfix) with ESMTP id A8FD143F85 for ; Wed, 18 Jun 2003 10:47:36 -0700 (PDT) (envelope-from fullermd@over-yonder.net) Received: from mortis.over-yonder.net (adsl-33-235-56.jan.bellsouth.net [67.33.235.56]) (using TLSv1 with cipher EDH-RSA-DES-CBC3-SHA (168/168 bits)) (No client certificate requested) by regina.plastikos.com (Postfix) with ESMTP id DDDCE6EEB9; Wed, 18 Jun 2003 13:47:35 -0400 (EDT) Received: by mortis.over-yonder.net (Postfix, from userid 100) id D2E9620F30; Wed, 18 Jun 2003 12:47:33 -0500 (CDT) Date: Wed, 18 Jun 2003 12:47:33 -0500 From: "Matthew D. Fuller" To: Poul-Henning Kamp Message-ID: <20030618174733.GC10127@over-yonder.net> References: <41012.1055950519@critter.freebsd.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <41012.1055950519@critter.freebsd.dk> User-Agent: Mutt/1.4.1i-fullermd.1 X-Editor: vi X-OS: FreeBSD cc: arch@freebsd.org Subject: Re: userland access to devices is moving! X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 17:47:37 -0000 On Wed, Jun 18, 2003 at 05:35:19PM +0200 I heard the voice of Poul-Henning Kamp, and lo! it spake thus: > > I sat down and hacked up a simple prototype to test the concept I > have been rambling about for some years: Going directly from > filedescriptor to device driver thus bypassing the vnode, devfs and > specfs layer. 
Speaking as somebody whose reach of mailing lists notably exceeds his grasp (as it always should be; otherwise what fun is it?), I often find myself a little in the dark on what these sort of things really /mean/ to the system in the end, and I think it would be a nice extension of these sort of posts/proposals to have a sentence of summary, along the lines of: What does this change /mean/ to the system as a whole? Is this A) Cleaner code, so bugs can be found and fixed quicker and better, B) Architectural improvement, so new features are easier to add on cleanly and well, or C) A real-world performance improvement. The benchmark you posted certainly shows a significant improvement in SOMETHING; but is it a something that will make mail servers or web servers or file servers or workstations perk up? I realize that they're not really exclusive conditions, and are mostly intertangled. And, for that matter, that most changes don't get done because of A, B, C, or any combination thereof, but more often because "This is the ugliest mess I've ever seen and it's been haunting my dreams for years, and anyway I thought it would be fun to mess with." But I for one would appreciate a quick note of a higher-level view of where this can move us. Of course this means I'm out of my depth. 
But everybody needs a hobby :-} -- Matthew Fuller (MF4839) | fullermd@over-yonder.net Systems/Network Administrator | http://www.over-yonder.net/~fullermd/ "The only reason I'm burning my candle at both ends, is because I haven't figured out how to light the middle yet" From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 11:42:19 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 22A5A37B404 for ; Wed, 18 Jun 2003 11:42:18 -0700 (PDT) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4C2D143FCB for ; Wed, 18 Jun 2003 11:42:16 -0700 (PDT) (envelope-from phk@phk.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.9/8.12.9) with ESMTP id h5IIgEBE042576; Wed, 18 Jun 2003 20:42:15 +0200 (CEST) (envelope-from phk@phk.freebsd.dk) To: "Matthew D. Fuller" From: "Poul-Henning Kamp" In-Reply-To: Your message of "Wed, 18 Jun 2003 12:47:33 CDT." <20030618174733.GC10127@over-yonder.net> Date: Wed, 18 Jun 2003 20:42:14 +0200 Message-ID: <42575.1055961734@critter.freebsd.dk> cc: arch@freebsd.org Subject: Re: userland access to devices is moving! X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 18:42:19 -0000 In message <20030618174733.GC10127@over-yonder.net>, "Matthew D. 
Fuller" writes: > >Speaking as somebody whose reach of mailing lists notably exceeds his >grasp (as it always should be; otherwise what fun is it?), I often find >myself a little in the dark on what these sort of things really /mean/ to >the system in the end, and I think it would be a nice extension of these >sort of posts/proposals to have a sentence of summary, along the lines >of: Well, what can I say but: "You're right". I do on the other hand not think that emails to arch@ is the best forum for the in-depth explanations. The "blueprint" articles which I am trying to restart in DæmonNews may not be either, but I think they are more the right kind of forum for it. I am still warming up to the article format, and the amount of feedback so far has not really given me any feel for how complex issues I can tackle without the readers just skipping the article. The _real_ problem of course is that whenever I try to explain some of this stuff in text I end up with a totally incomprehensible piece of prose which entirely fails to communicate the simplicity of the underlying issue (and I suck at drawing graphics too). It is in other words equally frustrating from this end to not be able to communicate it better :-( -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 12:00:35 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 85D8E37B401 for ; Wed, 18 Jun 2003 12:00:35 -0700 (PDT) Received: from regina.plastikos.com (216-107-106-250.wan.networktel.net [216.107.106.250]) by mx1.FreeBSD.org (Postfix) with ESMTP id 8564243F75 for ; Wed, 18 Jun 2003 12:00:34 -0700 (PDT) (envelope-from fullermd@over-yonder.net) Received: from mortis.over-yonder.net (adsl-33-235-56.jan.bellsouth.net [67.33.235.56]) (using TLSv1 with cipher EDH-RSA-DES-CBC3-SHA (168/168 bits)) (No client certificate requested) by regina.plastikos.com (Postfix) with ESMTP id A72BD6EEB9; Wed, 18 Jun 2003 15:00:33 -0400 (EDT) Received: by mortis.over-yonder.net (Postfix, from userid 100) id 362ED20F03; Wed, 18 Jun 2003 14:00:32 -0500 (CDT) Date: Wed, 18 Jun 2003 14:00:32 -0500 From: "Matthew D. Fuller" To: Poul-Henning Kamp Message-ID: <20030618190032.GG10127@over-yonder.net> References: <20030618174733.GC10127@over-yonder.net> <42575.1055961734@critter.freebsd.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=unknown-8bit Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <42575.1055961734@critter.freebsd.dk> User-Agent: Mutt/1.4.1i-fullermd.1 X-Editor: vi X-OS: FreeBSD cc: arch@freebsd.org Subject: Meta: explain what where when? (was Re: userland access to devices is moving!) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 19:00:36 -0000 On Wed, Jun 18, 2003 at 08:42:14PM +0200 I heard the voice of Poul-Henning Kamp, and lo! it spake thus: > In message <20030618174733.GC10127@over-yonder.net>, "Matthew D. 
Fuller" writes: > > > >Speaking as somebody whose reach of mailing lists notably exceeds his > >grasp (as it always should be; otherwise what fun is it?), I often find > >myself a little in the dark on what these sort of things really /mean/ to > >the system in the end, and I think it would be a nice extension of these > >sort of posts/proposals to have a sentence of summary, along the lines > >of: > > Well, what can I say but: "You're right". > > I do on the other hand not think that emails to arch@ is the best forum > for the in-depth explanations. Oh, absolutely! And to understand an in-depth explanation, I'd have to dig into the code anyway. I was aiming more at a throwaway statement like, "This will clean up a lot of code and make it easier to understand and debug," or "This straightens out the code structure and lets us add more things onto it more easily," or "This will improve performance for things that do a lot of grubbing to /dev nodes. Your dump(8) will run 3% faster." That last is actually more specific than I have in mind. No guarantee, just a general statement of direction. From my armchair, it's fairly easy to give a "75% cleanup, 20% architecture, 5% performance" 3-axis guesstimate of how this moves us forward, and even such a general direction is an enormous aid to those of us who don't really understand where this plugs into the system. Reading the discussion of this change, I'd say "This is a structural cleanup that eliminates some complexity and makes it easier to understand and add onto, with the 'cleanup' features related to the reduced complexity. It may also yield a small real-world performance improvement for things that do a lot of /dev/* fiddling." Just a thumbnail sketch of whether this is moving us down the path, or hacking out thorns that are keeping us from moving down the path, etc. 
> The "blueprint" articles which I am trying to restart in DæmonNews > may not be either, but I think they are more the right kind of > forum for it. I am still warming up to the article format, and the > amount of feedback so far has not really given me any feel for how > complex issues I can tackle without the readers just skipping the > article. That's an interesting way to go, and certainly something worthwhile. Poking at the site now, I see the one from May on userland/kernel interfaces; has there been more? -- Matthew Fuller (MF4839) | fullermd@over-yonder.net Systems/Network Administrator | http://www.over-yonder.net/~fullermd/ "The only reason I'm burning my candle at both ends, is because I haven't figured out how to light the middle yet" From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 12:05:20 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id D16C137B401 for ; Wed, 18 Jun 2003 12:05:20 -0700 (PDT) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id E4A9543FCB for ; Wed, 18 Jun 2003 12:05:19 -0700 (PDT) (envelope-from phk@phk.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.9/8.12.9) with ESMTP id h5IJ5IBE042800; Wed, 18 Jun 2003 21:05:18 +0200 (CEST) (envelope-from phk@phk.freebsd.dk) To: "Matthew D. Fuller" From: "Poul-Henning Kamp" In-Reply-To: Your message of "Wed, 18 Jun 2003 14:00:32 CDT." <20030618190032.GG10127@over-yonder.net> Date: Wed, 18 Jun 2003 21:05:18 +0200 Message-ID: <42799.1055963118@critter.freebsd.dk> cc: arch@freebsd.org Subject: Re: Meta: explain what where when? (was Re: userland access to devices is moving!) 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 19:05:21 -0000 In message <20030618190032.GG10127@over-yonder.net>, "Matthew D. Fuller" writes: >> I do on the other hand not think that emails to arch@ is the best forum >> for the in-depth explanations. > >Oh, absolutely! And to understand an in-depth explanation, I'd have to >dig into the code anyway to understand it. > >I was aiming more at a throwaway statement like, "This will clean up a >lot of code and make it easier to understand and debug," or [...] Ahh, ok. Yeah, I guess we have a tendency to write with the usual suspects as implied target. I'll try to remember that. >> The "blueprint" articles which I am trying to restart in DæmonNews >> may not be either, but I think they are more the right kind of >> forum for it. I am still warming up to the article format, and the >> amount of feedback so far has not really given me any feel for how >> complex issues I can tackle without the readers just skipping the >> article. > >That's an interesting way to go, and certainly something worthwhile. >Poking at the site now, I see the one from May on userland/kernel >interfaces; has there been more? I hope the next one is in the pipeline... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 16:28:45 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id BFE1537B401; Wed, 18 Jun 2003 16:28:45 -0700 (PDT) Received: from mobile.hub.org (u153n214.eastlink.ca [24.224.153.214]) by mx1.FreeBSD.org (Postfix) with ESMTP id CE0E743F75; Wed, 18 Jun 2003 16:28:44 -0700 (PDT) (envelope-from scrappy@hub.org) Received: by mobile.hub.org (Postfix, from userid 1001) id 59347553; Wed, 18 Jun 2003 20:28:38 -0300 (ADT) Received: from localhost (localhost [127.0.0.1]) by mobile.hub.org (Postfix) with ESMTP id D2E273B9; Wed, 18 Jun 2003 20:28:37 -0300 (ADT) Date: Wed, 18 Jun 2003 20:28:37 -0300 (ADT) From: The Hermit Hacker To: Sheldon Hearn In-Reply-To: <20030618121620.GG835@starjuice.net> Message-ID: <20030618202302.W51411@hub.org> References: <20030618112226.GA42606@fling-wing.demos.su> <39081.1055937209@critter.freebsd.dk> <20030618121620.GG835@starjuice.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: Dmitry Sivachenko cc: Poul-Henning Kamp cc: "Tim J. Robbins" cc: arch@FreeBSD.org Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jun 2003 23:28:46 -0000 On Wed, 18 Jun 2003, Sheldon Hearn wrote: > On (2003/06/18 13:53), Poul-Henning Kamp wrote: > > > With that said, I will also add, that I will take an incredibly > > dim view of anybody who tries to add more gunk in this area, and > > that I am perfectly willing to derail unionfs and nullfs (or pretty > > much anything else on the list above) if that is what it takes to > > clean up the buffer cache. > > Makes sense. 
After all, these filesystems are only just now recovering > from "we can fix these later" breakage introduced years ago. What's a > few more years without 'em? :-) 'K, this kinda hurts ... there are a growing # of us that are actually using unionfs and nullfs on production systems ... not small servers, but several thousand processes with over 100 union mounts ... other than the vnode leak stuff that David has been investigating, I've yet to see anything that I would consider warranting the 'DO NOT USE / CAVEAT EMPTOR' that is in the man pages ... :( From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 19:14:36 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id AB48E37B401; Wed, 18 Jun 2003 19:14:36 -0700 (PDT) Received: from smtp01.syd.iprimus.net.au (smtp01.syd.iprimus.net.au [210.50.30.52]) by mx1.FreeBSD.org (Postfix) with ESMTP id 0DAB043FB1; Wed, 18 Jun 2003 19:14:36 -0700 (PDT) (envelope-from tim@robbins.dropbear.id.au) Received: from dilbert.robbins.dropbear.id.au (210.50.200.191) by smtp01.syd.iprimus.net.au (7.0.015) id 3EED7FDA0016D9D2; Thu, 19 Jun 2003 12:14:34 +1000 Received: by dilbert.robbins.dropbear.id.au (Postfix, from userid 1000) id ED659C91F; Thu, 19 Jun 2003 12:14:30 +1000 (EST) Date: Thu, 19 Jun 2003 12:14:30 +1000 From: "Tim J. 
Robbins" To: The Hermit Hacker Message-ID: <20030619121430.A29274@dilbert.robbins.dropbear.id.au> References: <20030618112226.GA42606@fling-wing.demos.su> <39081.1055937209@critter.freebsd.dk> <20030618121620.GG835@starjuice.net> <20030618202302.W51411@hub.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <20030618202302.W51411@hub.org>; from scrappy@hub.org on Wed, Jun 18, 2003 at 08:28:37PM -0300 cc: Dmitry Sivachenko cc: Poul-Henning Kamp cc: arch@FreeBSD.org Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jun 2003 02:14:37 -0000 On Wed, Jun 18, 2003 at 08:28:37PM -0300, The Hermit Hacker wrote: > On Wed, 18 Jun 2003, Sheldon Hearn wrote: > > > On (2003/06/18 13:53), Poul-Henning Kamp wrote: > > > > > With that said, I will also add, that I will take an incredibly > > > dim view of anybody who tries to add more gunk in this area, and > > > that I am perfectly willing to derail unionfs and nullfs (or pretty > > > much anything else on the list above) if that is what it takes to > > > clean up the buffer cache. > > > > Makes sense. After all, these filesystems are only just now recovering > > from "we can fix these later" breakage introduced years ago. What's a > > few more years without 'em? :-) > > 'K, this kinda hurts ... there are a growing # of us that are actually > using unionfs and nullfs on production systems ... not small servers, but > several thousand processes with over 100 union mounts ... other than the > vnode leak stuff that David has been investigating, I've yet to see > anything that I would consider warranting the 'DO NOT USE / CAVEAT > EMPTOR' that is in the man pages ... 
:( At least one of the sections is well-deserved: umapfs is horribly broken on -current, and only works by accident on previous releases. I'm actually considering putting an even stronger warning on that one. The others (null and union) aren't nearly as bad, and have been fixed significantly since the notice was put on the manpages. Tim From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 19:17:05 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A09C537B401; Wed, 18 Jun 2003 19:17:05 -0700 (PDT) Received: from hub.org (hub.org [64.117.225.220]) by mx1.FreeBSD.org (Postfix) with ESMTP id 019F543F3F; Wed, 18 Jun 2003 19:17:05 -0700 (PDT) (envelope-from scrappy@hub.org) Received: from hub.org (unknown [64.117.225.220]) by hub.org (Postfix) with ESMTP id 98D336BA8F2; Wed, 18 Jun 2003 23:16:58 -0300 (ADT) Date: Wed, 18 Jun 2003 23:16:58 -0300 (ADT) From: "Marc G. Fournier" To: "Tim J. Robbins" In-Reply-To: <20030619121430.A29274@dilbert.robbins.dropbear.id.au> Message-ID: <20030618231640.T8920@hub.org> References: <20030618112226.GA42606@fling-wing.demos.su> <20030618121620.GG835@starjuice.net> <20030619121430.A29274@dilbert.robbins.dropbear.id.au> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: Dmitry Sivachenko cc: Poul-Henning Kamp cc: arch@FreeBSD.org Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jun 2003 02:17:05 -0000 On Thu, 19 Jun 2003, Tim J. Robbins wrote: > At least one of the sections is well-deserved: umapfs is horribly broken > on -current, and only works by accident on previous releases. I'm > actually considering putting an even stronger warning on that one. 
The > others (null and union) aren't nearly as bad, and have been fixed > significantly since the notice was put on the manpages. what is umapfs? I don't have a man page for that one on my 4.x box ... ? From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 22:01:18 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id C03B237B401; Wed, 18 Jun 2003 22:01:18 -0700 (PDT) Received: from smtp3.server.rpi.edu (smtp3.server.rpi.edu [128.113.2.3]) by mx1.FreeBSD.org (Postfix) with ESMTP id 102B143F75; Wed, 18 Jun 2003 22:01:18 -0700 (PDT) (envelope-from drosih@rpi.edu) Received: from [128.113.24.47] (gilead.netel.rpi.edu [128.113.24.47]) by smtp3.server.rpi.edu (8.12.9/8.12.9) with ESMTP id h5J51GiJ017578; Thu, 19 Jun 2003 01:01:16 -0400 Mime-Version: 1.0 X-Sender: drosih@mail.rpi.edu Message-Id: In-Reply-To: References: Date: Thu, 19 Jun 2003 01:01:15 -0400 To: Robert Watson , Poul-Henning Kamp From: Garance A Drosihn Content-Type: text/plain; charset="us-ascii" ; format="flowed" X-Scanned-By: MIMEDefang 2.28 cc: arch@freebsd.org Subject: Re: marking normal sleep identifiers as such. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jun 2003 05:01:19 -0000 At 9:33 AM -0400 6/18/03, Robert Watson wrote: >On Wed, 18 Jun 2003, Poul-Henning Kamp wrote: > > > Since thread names are longer than the space we have in ps(1) > > output using the thread name is not feasible solution. > > > > I notice that the interrupt threads all seem to sleep on "-", > > and all things considered, I like that. > > >> Should we adopt that as our convention ? 
> >I agree with the concern -- I've similarly noticed an increase >in the amount of time I spend diagnosing apparent deadlocks as >I attempt to determine if kernel threads are simply idle, or >stuck on locks. I don't really mind what the convention is; >"-" is probably as good as any. Another possible convention >would be to name the state fooidle -- i.e., pageridle, acpiidle, ... Long ago and in an operating-system far away (and which is not running anywhere now), we had a similar problem. We ended up adding a mechanism here the sleeper could specify a character string which would show up in our equivalent of 'ps'. This was implemented by having one hardware register which held the address of the string to display. Perhaps something similar could be done in freebsd. -- Garance Alistair Drosehn = gad@gilead.netel.rpi.edu Senior Systems Programmer or gad@freebsd.org Rensselaer Polytechnic Institute or drosih@rpi.edu From owner-freebsd-arch@FreeBSD.ORG Wed Jun 18 23:40:35 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 6A96E37B401; Wed, 18 Jun 2003 23:40:35 -0700 (PDT) Received: from mail.chesapeake.net (chesapeake.net [208.142.252.6]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4481B43FE3; Wed, 18 Jun 2003 23:40:33 -0700 (PDT) (envelope-from jroberson@chesapeake.net) Received: from localhost (jroberson@localhost) by mail.chesapeake.net (8.11.6/8.11.6) with ESMTP id h5J6eJF22750; Thu, 19 Jun 2003 02:40:20 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Thu, 19 Jun 2003 02:40:19 -0400 (EDT) From: Jeff Roberson To: The Hermit Hacker In-Reply-To: <20030618202302.W51411@hub.org> Message-ID: <20030619023935.W36168-100000@mail.chesapeake.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: Dmitry Sivachenko cc: Poul-Henning Kamp cc: arch@freebsd.org Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jun 2003 06:40:35 -0000 On Wed, 18 Jun 2003, The Hermit Hacker wrote: > On Wed, 18 Jun 2003, Sheldon Hearn wrote: > > > On (2003/06/18 13:53), Poul-Henning Kamp wrote: > > > > > With that said, I will also add, that I will take an incredibly > > > dim view of anybody who tries to add more gunk in this area, and > > > that I am perfectly willing to derail unionfs and nullfs (or pretty > > > much anything else on the list above) if that is what it takes to > > > clean up the buffer cache. > > > > Makes sense. After all, these filesystems are only just now recovering > > from "we can fix these later" breakage introduced years ago. What's a > > few more years without 'em? :-) > > 'K, this kinda hurts ... there are a growing # of us that are actually > using unionfs and nullfs on production systems ... not small servers, but > several thousand processes with over 100 union mounts ... other then the > vnode leak stuff that David has been investigating, I've yet to see > anything that I would considering warranting the 'DO NOT USE / CAVEAT > EMPTOR' that is in the man pages ... :( > Yes, I also have great issue with breaking the stacking layers. Fixing the buffer cache should have no impact on them if this is done correctly. Lets please not try to break any more functionality. 
Cheers, Jeff From owner-freebsd-arch@FreeBSD.ORG Thu Jun 19 00:10:07 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 2C5B837B401 for ; Thu, 19 Jun 2003 00:10:07 -0700 (PDT) Received: from heron.mail.pas.earthlink.net (heron.mail.pas.earthlink.net [207.217.120.189]) by mx1.FreeBSD.org (Postfix) with ESMTP id 71B3843F93 for ; Thu, 19 Jun 2003 00:10:06 -0700 (PDT) (envelope-from tlambert2@mindspring.com) Received: from user-2ivfk2f.dialup.mindspring.com ([165.247.208.79] helo=mindspring.com) by heron.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 19StYP-0007Rg-00; Thu, 19 Jun 2003 00:09:53 -0700 Message-ID: <3EF1617F.C1EC5C12@mindspring.com> Date: Thu, 19 Jun 2003 00:08:47 -0700 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: "Matthew D. Fuller" References: <20030618174733.GC10127@over-yonder.net> <20030618190032.GG10127@over-yonder.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a4b3d42763fbf89967bbac5bd940dc7857a7ce0e8f8d31aa3f350badd9bab72f9c350badd9bab72f9c cc: arch@freebsd.org cc: Poul-Henning Kamp Subject: Re: Meta: explain what where when? (was Re: userland access todevicesis moving!) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jun 2003 07:10:07 -0000 "Matthew D. Fuller" wrote: > Reading the discussion of this change, I'd say "This is a structural > cleanup that eliminates some complexity and makes it easier to understand > and add onto, with the 'cleanup' features related to the reduced > complexity. 
It may also yield a small real-world performance improvement > for things that do a lot of /dev/* fiddling." Just a thumbnail sketch of > whether this is moving us down the path, or hacking out thorns that are > keeping us from moving down the path, etc. That's more like a marketing blurb. It does not evenly present both the perceived benefits, and the potential negative consequences. I can see several. I think much of the claim to gain here can be won back by not gathering per-layer statistics at the GEOM level, and collapsing the GEOM layers to direct block references, when possible (for example). I also think it's sort of a half-approach to getting rid of struct fileops, which is the real source of the problem here, not the fact that the thing holding the struct fileops pointer happens to be a vnode. How's this going to affect diskless boots? What about the mmap() of /dev/zero for anonymous pages? What about passing a descriptor off to another program? What does the author honestly think it will break, such that it needs a "Heads Up!" warning? Does everyone value the things that will break as little as the author, or is it just something he doesn't use, so it's not important to him? I really hate when someone posts something that is effectively nothing more than propaganda in favor of something that they haven't documented in sufficient detail and/or provided their own list of the negative consequences, such that people can form an informed opinion on the merits of the idea, instead of deciding based on personalities, or how effective someone is at writing propaganda in favor of what they intend to do anyway. 
-- Terry From owner-freebsd-arch@FreeBSD.ORG Thu Jun 19 01:19:33 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 9F78D37B404; Thu, 19 Jun 2003 01:19:33 -0700 (PDT) Received: from flood.ping.uio.no (flood.ping.uio.no [129.240.78.31]) by mx1.FreeBSD.org (Postfix) with ESMTP id 542D743F3F; Thu, 19 Jun 2003 01:19:32 -0700 (PDT) (envelope-from des@ofug.org) Received: by flood.ping.uio.no (Postfix, from userid 2602) id 49FD8530F; Thu, 19 Jun 2003 10:19:30 +0200 (CEST) X-URL: http://www.ofug.org/~des/ X-Disclaimer: The views expressed in this message do not necessarily coincide with those of any organisation or company with which I am or have been affiliated. To: Garance A Drosihn References: From: Dag-Erling Smorgrav Date: Thu, 19 Jun 2003 10:19:30 +0200 In-Reply-To: (Garance A. Drosihn's message of "Thu, 19 Jun 2003 01:01:15 -0400") Message-ID: User-Agent: Gnus/5.1001 (Gnus v5.10.1) Emacs/21.3 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii cc: arch@freebsd.org cc: Poul-Henning Kamp cc: Robert Watson Subject: Re: marking normal sleep identifiers as such. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jun 2003 08:19:34 -0000 Garance A Drosihn writes: > Long ago and in an operating-system far away (and which is not > running anywhere now), we had a similar problem. We ended up > adding a mechanism here the sleeper could specify a character > string which would show up in our equivalent of 'ps'. This > was implemented by having one hardware register which held the > address of the string to display. > > Perhaps something similar could be done in freebsd. You mean like this? 
des@meali ~% ps -opid,mwchan,command
  PID MWCHAN COMMAND
 7761 pause  -zsh (zsh)
16447 select ssh mikrobe
79090 pause  -zsh (zsh)
41402 select ssh flood.ping.uio.no
41403 select ssh flood.ping.uio.no
80739 pause  -zsh (zsh)
 7801 ttyin  -zsh (zsh)
41672 select /usr/local/bin/emacs -geometry 90x56-0+0
44723 ttyin  -zsh (zsh)
92780 ttyin  -zsh (zsh)
 2316 select xscreensaver
 3767 ttyin  -zsh (zsh)
 2711 ttyin  -zsh (zsh)
15511 ttyin  -zsh (zsh)
19014 pause  -zsh (zsh)
19033 -      ps -opid,mwchan,command

DES
-- 
Dag-Erling Smorgrav - des@ofug.org

From owner-freebsd-arch@FreeBSD.ORG Thu Jun 19 01:25:37 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id DBCD637B401; Thu, 19 Jun 2003 01:25:37 -0700 (PDT) Received: from stork.mail.pas.earthlink.net (stork.mail.pas.earthlink.net [207.217.120.188]) by mx1.FreeBSD.org (Postfix) with ESMTP id 0DC9343FD7; Thu, 19 Jun 2003 01:25:36 -0700 (PDT) (envelope-from tlambert2@mindspring.com) Received: from user-2ivfk2f.dialup.mindspring.com ([165.247.208.79] helo=mindspring.com) by stork.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 19Suix-0007Eg-00; Thu, 19 Jun 2003 01:24:52 -0700 Message-ID: <3EF172EF.1248AD97@mindspring.com> Date: Thu, 19 Jun 2003 01:23:11 -0700 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: The Hermit Hacker References: <20030618112226.GA42606@fling-wing.demos.su> <20030618121620.GG835@starjuice.net> <20030618202302.W51411@hub.org> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a4480afc112eafc866ef1b3e2e8a97c640a2d4e88014a4647c350badd9bab72f9c350badd9bab72f9c cc: Dmitry Sivachenko cc: Poul-Henning Kamp cc: "Tim J. 
Robbins" cc: arch@FreeBSD.org Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list Reply-To: fs@freebsd.org List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jun 2003 08:25:38 -0000 The Hermit Hacker wrote: > 'K, this kinda hurts ... there are a growing # of us that are actually > using unionfs and nullfs on production systems ... not small servers, but > several thousand processes with over 100 union mounts ... other then the > vnode leak stuff that David has been investigating, I've yet to see > anything that I would considering warranting the 'DO NOT USE / CAVEAT > EMPTOR' that is in the man pages ... :( Use mmap on a bunch of files on a nullfs, and don't do msync() to perform an explicit coherency cycle. Modify the original underlying files. Do this for different areas of partial pages on both the nullfs and the FS the nullfs is covering. 1) There is no explicit coherency notification to the covering FS when the covered FS's vnode data is modified. 2) There is no explicit coherency cycle for mapped pages when a write occurs, if the page being written is in core. Basically, in order to support this, you will have to unmap the pages for write, take the fault, and then restart the write with the knowledge that you need to trigger a write-through (or a write-back) as a result of having triggered the fault: in other words, an explicit coherency cycle. The current nullfs code avoids this by having a 1:1 page mapping and using a trick I came up with, which is to get the underlying vm_object_t from the underlying vnode, instead of the nullfs vnode. But it pays a rather large performance penalty. 
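A rough userland sketch of the test described above, assuming the same file is reachable through two paths (for example, one through a nullfs mount and one on the covered filesystem). The helper name mapping_sees_write and both path parameters are hypothetical and do not come from this thread. The code only reports whether a MAP_SHARED mapping observes a write made through the other path when no msync() is issued; it does not exercise the partial-page cases mentioned above.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): seed a file through
 * write_path, map it MAP_SHARED through map_path, overwrite it through
 * write_path, and compare what the mapping sees.  No msync() is issued.
 * Returns 1 if the mapping sees the new data, 0 if it is stale,
 * and -1 on any failure.
 */
int
mapping_sees_write(const char *map_path, const char *write_path)
{
    static const char before[] = "AAAAAAAA";
    static const char after[] = "BBBBBBBB";
    const size_t len = sizeof(after) - 1;
    int result = -1;
    int wfd, mfd;
    char *p;

    assert(map_path != NULL && write_path != NULL);

    /* Create and seed the file through the "underlying" path. */
    wfd = open(write_path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (wfd < 0)
        return (-1);
    if (write(wfd, before, len) != (ssize_t)len)
        goto out_w;

    /* Map the same file through the "covering" path. */
    mfd = open(map_path, O_RDWR);
    if (mfd < 0)
        goto out_w;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, mfd, 0);
    if (p == MAP_FAILED)
        goto out_m;

    /* Touch the page so that it is resident before the write. */
    if (memcmp(p, before, len) != 0)
        goto out_u;

    /* Modify the file underneath the mapping. */
    if (pwrite(wfd, after, len, 0) != (ssize_t)len)
        goto out_u;

    /* Coherent if the mapping already shows the new bytes. */
    result = (memcmp(p, after, len) == 0);

out_u:
    munmap(p, len);
out_m:
    close(mfd);
out_w:
    close(wfd);
    return (result);
}
```

Calling it with the nullfs path as map_path and the covered path as write_path (and then the other way around) gives a quick yes/no answer for whole-page coherency on a given system.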
The other problem is that it gives the wrong impression about FS stacking in FreeBSD: it gives the impression that it works in other than the specialized contrived case of nullfs. This does not (and can not) work with transformative stacking layers, such as a crypto stacking layer, a character set translation stacking layer (e.g. a Koi-8 FS NFS mounted on an ISO-8859-1 Locale system, which needs the Koi-8 data UTF-8 encoded before it can be displayed in a file browser), and a number of other layers. The page trick suggested above also fails in some cases; for example, consider the case where you have a very fast disk for the first 2K of each file, and a slower disk for the remainder of each file (if any). The data break spans a page boundary, and therefore you can't deal with it. In a similar vein, if you proxy your VOP descriptors to another address space, you are screwed, because vnodes are assumed to contain vm_object_t's, and these are assumed to be locally accessible to the address space in question (how do you implement a VOP_GETVOBJECT() when the vnode you are referencing is in user space? Is on another node? Etc.?). Paging VOPs almost need an internal payload of a page or page set, coupled with an address space descriptor, in order to let them know if the called party can access them directly, rather than needing to call a rendezvous data copy operation. If you read John Heidemann's Master's thesis (ftp.cs.ucla.edu), or the Ficus documentation (same FTP server), which are the basis of the stacking vnode framework in BSD4.4-Lite2, and thus in FreeBSD, you'll see that these problems have already got answers; they just aren't being implemented in FreeBSD, and as FreeBSD moves further from the original intended design, it's only going to get harder to recover the functionality. Really, the stacking in FreeBSD today is pretty much a toy. 
The reason FFS can stack on UFS is that the VOPs that are being exported are not really stacked, because they represent two non-intersecting sets of VOPs: one is for a flat numeric namespace (inode numbers) FS, called UFS (or UFS2, or also... formerly.. MFS), and the upper layer FFS implements a hierarchical namespace in the context of the underlying flat numeric namespace. There are a couple of interesting things you can do without really stacking (causing the VOP namespaces to intersect, thus introducing the coherency issue); one of these would be to separate out the disk quota interface. With the exception of the quota VOP that's needed, everything else is non-intersecting in the same way that the nullfs is non-intersecting: there's no upper layer vm_object_t reference needed to implement it. Combine that with the VOP for the quota control operations being non-intersecting in the VOP namespace (like the VOP for directory operations not being in the UFS namespace), and you have sufficient separation to implement quotas in the context of a decoherent stacked cache, because you never need to reference both the upper and lower vnode's vm_object_t for a given particular vnode. But the FreeBSD implementation is probably far from useful, without the coherency notification mechanisms for "upper dirty/write through to lower" and "lower dirty/invalidate upper cached copy". Those just aren't there, and the framework totally lacks the necessary semantics for the second one, at the present time. There are a number of deadlock issues in the unionfs case; most people don't use that, and use the union mount option, which is not the same thing at all. Most of these problems are centered around things like relookup, etc., which have to drop and then reacquire a lock to avoid an internal deadlock (e.g. 
"rename"); by doing this, they open a small race window, in which it's possible, with the right call-path pressure, to create a deadlock between concurrently executing threads of control. The window is much more pronounced on SMP systems, which are statistically much more likely to hit it. Followups set to Freebsd-FS. -- Terry From owner-freebsd-arch@FreeBSD.ORG Thu Jun 19 01:58:37 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A7D1137B401; Thu, 19 Jun 2003 01:58:37 -0700 (PDT) Received: from stork.mail.pas.earthlink.net (stork.mail.pas.earthlink.net [207.217.120.188]) by mx1.FreeBSD.org (Postfix) with ESMTP id B4E4D43FA3; Thu, 19 Jun 2003 01:58:36 -0700 (PDT) (envelope-from tlambert2@mindspring.com) Received: from user-2ivfk2f.dialup.mindspring.com ([165.247.208.79] helo=mindspring.com) by stork.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 19SvFa-0002fR-00; Thu, 19 Jun 2003 01:58:35 -0700 Message-ID: <3EF17AEC.D3AC9D3F@mindspring.com> Date: Thu, 19 Jun 2003 01:57:16 -0700 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: "Marc G. Fournier" References: <20030618112226.GA42606@fling-wing.demos.su> <20030618121620.GG835@starjuice.net><20030618231640.T8920@hub.org> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a40faa8c204324bffae55c50f52f33395d387f7b89c61deb1d350badd9bab72f9c350badd9bab72f9c cc: Dmitry Sivachenko cc: Poul-Henning Kamp cc: "Tim J. 
Robbins" cc: arch@FreeBSD.org Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jun 2003 08:58:38 -0000 "Marc G. Fournier" wrote: > On Thu, 19 Jun 2003, Tim J. Robbins wrote: > > At least one of the sections is well-deserved: umapfs is horribly broken > > on -current, and only works by accident on previous releases. I'm > > actually considering putting an even stronger warning on that one. The > > others (null and union) aren't nearly as bad, and have been fixed > > significantly since the notice was put on the manpages. > > what is umapfs? I don't have a man page for that one on my 4.x box ... ? The umapfs attempts to map user ids between one namespace and another, such as you might need to do when merging two companies, each with a UNIX box, with intersecting UID sets with collisions (i.e. is uid 105 "john" or "frank"?). Using umapfs, you can have both "john" and "frank" keep their local accounts, and for network mounts, translate the uid space to present a different view on it. The umapfs is actually just another toy FS (per my other posting in re: unionfs-not-union and nullfs). You could easily "fix" umapfs by copying the nullfs code, and using the underlying vm_object_t's directly via VOP_GETVOBJECT, since there is no data associated with the upper vnode that's not in the lower vnode. The translation and lookup would need to be done in two more places than it is now, so there would be a slightly increased performance penalty over that of unionfs. The only thing you really care about is the credentials, which can be crdup/crmod on the way down, and the stat information which can be reverse translated and passed back up. 
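The translation described above can be sketched in plain userland C. This is only an illustration under stated assumptions: the uid_map structure, the two function names, and every uid value other than 105 are hypothetical and are not umapfs code, and passing unmapped ids through unchanged is a simplification chosen for the sketch. A real layer would apply the forward mapping to the credentials handed to the lower vnode and the reverse mapping to the stat results on the way back up.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* One entry of a hypothetical uid translation table. */
struct uid_map {
    uid_t upper;    /* uid as seen above the mapping layer */
    uid_t lower;    /* uid as stored on the underlying FS */
};

/* Forward translation: applied to credentials on the way down. */
uid_t
uid_map_down(const struct uid_map *tbl, size_t n, uid_t upper)
{
    size_t i;

    assert(tbl != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (tbl[i].upper == upper)
            return (tbl[i].lower);
    return (upper);     /* assumption: unmapped ids pass through */
}

/* Reverse translation: applied to stat information on the way up. */
uid_t
uid_map_up(const struct uid_map *tbl, size_t n, uid_t lower)
{
    size_t i;

    assert(tbl != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (tbl[i].lower == lower)
            return (tbl[i].upper);
    return (lower);     /* assumption: unmapped ids pass through */
}
```

With a table entry { 105, 2105 }, a request from uid 105 would reach the lower layer as uid 2105, and a file owned by 2105 below would be reported as owned by 105 above.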
This would probably take about a day to implement, and it would be a good introduction to the code for anyone who thinks there aren't real problems that can't be avoided without some architectural changes. -- Terry From owner-freebsd-arch@FreeBSD.ORG Thu Jun 19 04:35:00 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id C9AB037B401; Thu, 19 Jun 2003 04:35:00 -0700 (PDT) Received: from HAL9000.homeunix.com (ip114.bella-vista.sfo.interquest.net [66.199.86.114]) by mx1.FreeBSD.org (Postfix) with ESMTP id E12EC43F3F; Thu, 19 Jun 2003 04:34:59 -0700 (PDT) (envelope-from das@FreeBSD.ORG) Received: from HAL9000.homeunix.com (localhost [127.0.0.1]) by HAL9000.homeunix.com (8.12.9/8.12.9) with ESMTP id h5JBYwJa081635; Thu, 19 Jun 2003 04:34:58 -0700 (PDT) (envelope-from das@FreeBSD.ORG) Received: (from das@localhost) by HAL9000.homeunix.com (8.12.9/8.12.9/Submit) id h5JBYwZc081634; Thu, 19 Jun 2003 04:34:58 -0700 (PDT) (envelope-from das@FreeBSD.ORG) Date: Thu, 19 Jun 2003 04:34:58 -0700 From: David Schultz To: Poul-Henning Kamp Message-ID: <20030619113457.GA80739@HAL9000.homeunix.com> Mail-Followup-To: Poul-Henning Kamp , Dmitry Sivachenko , "Tim J. Robbins" , arch@freebsd.org References: <20030618112226.GA42606@fling-wing.demos.su> <39081.1055937209@critter.freebsd.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <39081.1055937209@critter.freebsd.dk> cc: Dmitry Sivachenko cc: "Tim J. 
Robbins" cc: arch@FreeBSD.ORG Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jun 2003 11:35:01 -0000 On Wed, Jun 18, 2003, Poul-Henning Kamp wrote: > In message <20030618112226.GA42606@fling-wing.demos.su>, Dmitry Sivachenko writes > I am hoping that we may be able to carve a path by changing the > bio structure to operate on vm pages rather than KVM mapped > byte arrays (most disk device drivers don't care for things being > mapped, they use bus-master DMA and only need physical location). It would seem to me that we would ultimately want filesystems to be doing page-based I/O, rather than crafting elaborate illusions to deal with FS blocks being smaller or larger than the VM page size. As a side note, I also think it's important that the new implementation have a clean separation between user data and FS metadata, so that they are not in direct competition with each other for memory. The present buffer cache may be too limited for the massive number of dependencies softupdates needs to track for FS-intensive loads, but we also don't want lots of accumulated dirty buffers from heavy FS activity to force application data out of memory. > With that said, I will also add, that I will take an incredibly > dim view of anybody who tries to add more gunk in this area, and > that I am perfectly willing to derail unionfs and nullfs (or pretty > much anything else on the list above) if that is what it takes to > clean up the buffer cache. Any of those facilities can be reintroduced > later on in a cleaner fashion. > > I agree that nullfs and unionfs are useful technologies, but if > they have to be reimplemented to fit our kernel, then so be it. 
The original buffer cache design is untenable largely because Dyson wanted to maintain compatibility with existing FS interfaces. Therefore, I would expect that moving forward and doing things right would require changes to existing filesystems. However, if your changes make nullfs and unionfs substantially *more* difficult to implement, then you've done something wrong. If I'm kept informed, I'm willing to contribute to this aspect of the task to the extent that I have time. From owner-freebsd-arch@FreeBSD.ORG Thu Jun 19 04:43:51 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 5786737B401; Thu, 19 Jun 2003 04:43:51 -0700 (PDT) Received: from critter.freebsd.dk (esplanaden.cybercity.dk [212.242.40.114]) by mx1.FreeBSD.org (Postfix) with ESMTP id 616BC43F93; Thu, 19 Jun 2003 04:43:50 -0700 (PDT) (envelope-from phk@phk.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.9/8.12.9) with ESMTP id h5JBhnjR002688; Thu, 19 Jun 2003 13:43:49 +0200 (CEST) (envelope-from phk@phk.freebsd.dk) To: David Schultz From: "Poul-Henning Kamp" In-Reply-To: Your message of "Thu, 19 Jun 2003 04:34:58 PDT." <20030619113457.GA80739@HAL9000.homeunix.com> Date: Thu, 19 Jun 2003 13:43:49 +0200 Message-ID: <2687.1056023029@critter.freebsd.dk> cc: Dmitry Sivachenko cc: "Tim J. Robbins" cc: arch@FreeBSD.ORG Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jun 2003 11:43:51 -0000 You are pretty much spot on with your observations. 
I don't think there is much fundamental disagreement about what needs to happen, where we need to go, if you ask the different people who have studied this mess of code, so I am not so worried about us not ending up in the right place. The danger of course is if somebody attacks a subset of the problem without holding the entire problem in focus. My comments about nullfs and unionfs should not be construed as meaning I want to kill those features; it was more meant as "they will not be my primary priorities and if they break temporarily, so be it." Stackable filesystems are not exactly mandatory, but I think we need to have them for a number of important applications. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Thu Jun 19 08:16:41 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 97BBB37B401; Thu, 19 Jun 2003 08:16:41 -0700 (PDT) Received: from mobile.hub.org (u153n214.eastlink.ca [24.224.153.214]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2793F43F93; Thu, 19 Jun 2003 08:16:40 -0700 (PDT) (envelope-from scrappy@hub.org) Received: by mobile.hub.org (Postfix, from userid 1001) id C16B6397; Thu, 19 Jun 2003 12:16:38 -0300 (ADT) Received: from localhost (localhost [127.0.0.1]) by mobile.hub.org (Postfix) with ESMTP id BCBBC1A0; Thu, 19 Jun 2003 12:16:38 -0300 (ADT) Date: Thu, 19 Jun 2003 12:16:38 -0300 (ADT) From: The Hermit Hacker To: Poul-Henning Kamp In-Reply-To: <2687.1056023029@critter.freebsd.dk> Message-ID: <20030619121550.H51411@hub.org> References: <2687.1056023029@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: Dmitry Sivachenko cc: David Schultz cc: "Tim J. 
Robbins" cc: arch@FreeBSD.ORG Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jun 2003 15:16:42 -0000 On Thu, 19 Jun 2003, Poul-Henning Kamp wrote: > My comments about nullfs and unionfs, shoul not be construed as I want > to kill those features, it was more meant as "they will not be my > primary priorities and if they break temporarily, so be it." Ah, well, 'break temporarily' doesn't worry me too much ... I just got the impression of "rip out completely" from your one note, which had me concerned :( From owner-freebsd-arch@FreeBSD.ORG Thu Jun 19 08:28:53 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 3B77B37B401 for ; Thu, 19 Jun 2003 08:28:53 -0700 (PDT) Received: from axl.seasidesoftware.co.za (axl.seasidesoftware.co.za [196.31.7.201]) by mx1.FreeBSD.org (Postfix) with ESMTP id 9865543F75 for ; Thu, 19 Jun 2003 08:28:49 -0700 (PDT) (envelope-from sheldonh@starjuice.net) Received: from sheldonh by axl.seasidesoftware.co.za with local (Exim 4.20) id 19T1LB-0004Bi-Dp; Thu, 19 Jun 2003 17:28:45 +0200 Date: Thu, 19 Jun 2003 17:28:45 +0200 From: Sheldon Hearn To: The Hermit Hacker Message-ID: <20030619152845.GP13111@starjuice.net> Mail-Followup-To: The Hermit Hacker , arch@FreeBSD.ORG References: <2687.1056023029@critter.freebsd.dk> <20030619121550.H51411@hub.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030619121550.H51411@hub.org> User-Agent: Mutt/1.5.4i Sender: Sheldon Hearn cc: arch@FreeBSD.ORG Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 
2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jun 2003 15:28:53 -0000 On (2003/06/19 12:16), The Hermit Hacker wrote: > > My comments about nullfs and unionfs, shoul not be construed as I want > > to kill those features, it was more meant as "they will not be my > > primary priorities and if they break temporarily, so be it." > > Ah, well, 'break temporarily' doesn't worry me too much ... I just got the > impression of "rip out completely" from your one note, which had me > concerned :( That's not a completely unrealistic impression to walk away with. The last time unionfs and nullfs were broken, it was also temporarily. It's all relative. :-) Ciao, Sheldon. From owner-freebsd-arch@FreeBSD.ORG Thu Jun 19 11:28:27 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 7DCB037B401 for ; Thu, 19 Jun 2003 11:28:27 -0700 (PDT) Received: from smtp4.server.rpi.edu (smtp4.server.rpi.edu [128.113.2.4]) by mx1.FreeBSD.org (Postfix) with ESMTP id BEE0643F75 for ; Thu, 19 Jun 2003 11:28:26 -0700 (PDT) (envelope-from drosih@rpi.edu) Received: from [128.113.24.47] (gilead.netel.rpi.edu [128.113.24.47]) by smtp4.server.rpi.edu (8.12.9/8.12.9) with ESMTP id h5JISOPx028268; Thu, 19 Jun 2003 14:28:24 -0400 Mime-Version: 1.0 X-Sender: drosih@mail.rpi.edu Message-Id: In-Reply-To: <20030619121550.H51411@hub.org> References: <2687.1056023029@critter.freebsd.dk> <20030619121550.H51411@hub.org> Date: Thu, 19 Jun 2003 14:28:23 -0400 To: The Hermit Hacker , Poul-Henning Kamp From: Garance A Drosihn Content-Type: text/plain; charset="us-ascii" ; format="flowed" X-Scanned-By: MIMEDefang 2.28 cc: arch@freebsd.org Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 
Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jun 2003 18:28:27 -0000 At 12:16 PM -0300 6/19/03, The Hermit Hacker wrote: >On Thu, 19 Jun 2003, Poul-Henning Kamp wrote: > > > My comments about nullfs and unionfs, shoul not be construed > > as I want to kill those features, it was more meant as "they > > will not be my primary priorities and if they break > > temporarily, so be it." > >Ah, well, 'break temporarily' doesn't worry me too much ... I >just got the impression of "rip out completely" from your one >note, which had me concerned :( If you have a bad enough situation, then sometimes the best way to start fixing it is by just ripping out what's there... But if I remember right, Poul was talking about work that would be started in the 6.x-branch. There should be plenty of time before 6.x is the branch that anyone would *need* to use in production, and presumably the new improved versions would be in place before that time. 
-- Garance Alistair Drosehn = gad@gilead.netel.rpi.edu Senior Systems Programmer or gad@freebsd.org Rensselaer Polytechnic Institute or drosih@rpi.edu From owner-freebsd-arch@FreeBSD.ORG Thu Jun 19 12:03:51 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 330B737B401 for ; Thu, 19 Jun 2003 12:03:51 -0700 (PDT) Received: from regina.plastikos.com (216-107-106-250.wan.networktel.net [216.107.106.250]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4E29643FBF for ; Thu, 19 Jun 2003 12:03:50 -0700 (PDT) (envelope-from fullermd@over-yonder.net) Received: from mortis.over-yonder.net (adsl-33-235-56.jan.bellsouth.net [67.33.235.56]) (using TLSv1 with cipher EDH-RSA-DES-CBC3-SHA (168/168 bits)) (No client certificate requested) by regina.plastikos.com (Postfix) with ESMTP id E08F66EEB9; Thu, 19 Jun 2003 15:03:48 -0400 (EDT) Received: by mortis.over-yonder.net (Postfix, from userid 100) id 4F8D520F03; Thu, 19 Jun 2003 14:03:47 -0500 (CDT) Date: Thu, 19 Jun 2003 14:03:47 -0500 From: "Matthew D. Fuller" To: Terry Lambert Message-ID: <20030619190346.GT10127@over-yonder.net> References: <20030618174733.GC10127@over-yonder.net> <20030618190032.GG10127@over-yonder.net> <3EF1617F.C1EC5C12@mindspring.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3EF1617F.C1EC5C12@mindspring.com> User-Agent: Mutt/1.4.1i-fullermd.1 X-Editor: vi X-OS: FreeBSD cc: arch@freebsd.org cc: Poul-Henning Kamp Subject: Re: Meta: explain what where when? (was Re: userland access todevicesis moving!) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jun 2003 19:03:51 -0000 On Thu, Jun 19, 2003 at 12:08:47AM -0700 I heard the voice of Terry Lambert, and lo! 
it spake thus: > > That's more like a marketing blurb. It does not evenly Hi, Terry! Welcome to a completely different tangent than I was aiming for in this subthread! [ Bunches of in-depth detail-overview snipped ] The point of the exercise is not "marketing blurb". The point of the exercise is to provide a hint so those of us whose heads only work in userland-mode have a clue what this change is expected to mean to us. -- Matthew Fuller (MF4839) | fullermd@over-yonder.net Systems/Network Administrator | http://www.over-yonder.net/~fullermd/ "The only reason I'm burning my candle at both ends, is because I haven't figured out how to light the middle yet" From owner-freebsd-arch@FreeBSD.ORG Thu Jun 19 12:29:53 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 4D89837B401 for ; Thu, 19 Jun 2003 12:29:53 -0700 (PDT) Received: from mail.soaustin.net (mail.soaustin.net [207.200.4.66]) by mx1.FreeBSD.org (Postfix) with ESMTP id A6E4543F93 for ; Thu, 19 Jun 2003 12:29:52 -0700 (PDT) (envelope-from linimon@lonesome.com) Received: from lonesome.lonesome.com (cs242746-11.austin.rr.com [24.27.46.11]) (using TLSv1 with cipher RC4-MD5 (128/128 bits)) (No client certificate requested) by mail.soaustin.net (Postfix) with ESMTP id 454DD140CB; Thu, 19 Jun 2003 14:29:51 -0500 (CDT) From: Mark Linimon Organization: Lonesome Dove Computing Services To: Terry Lambert , "Matthew D. Fuller" Date: Thu, 19 Jun 2003 14:34:00 -0500 User-Agent: KMail/1.5.2 References: <20030618174733.GC10127@over-yonder.net> <20030618190032.GG10127@over-yonder.net> <3EF1617F.C1EC5C12@mindspring.com> In-Reply-To: <3EF1617F.C1EC5C12@mindspring.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200306191434.00922.linimon@lonesome.com> cc: arch@freebsd.org Subject: Congratulations Terry! 
(was: Re: Meta: explain what where when?) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jun 2003 19:29:53 -0000 On Thursday 19 June 2003 02:08 am, Terry Lambert wrote: > I really hate when someone posts something that is effectively > nothing more than propaganda in favor of something that they > haven't documented in sufficient detail and/or provided their > own list of the negative consequences, such that people can > form an informed opinion on the merits of the idea, instead of > deciding based on personalities, or how effective someone is at > writing propaganda in favor of what they intend to do anyway. Congratulations! According to my mailing list scan scripts, this is your One Thousandth Post to freebsd-* since January 1st of this year, making you by far and away the most frequent poster to the combined lists. In fact, this total is almost twice as many as the next two posters who are exactly tied at 568. For comparison purposes, Poul-Henning comes in with a mere 283 posts, making your own postings more than 3 times as frequent. I'm too amazed by this to even comment on the above quoted paragraph. 
Mark Linimon From owner-freebsd-arch@FreeBSD.ORG Thu Jun 19 22:08:53 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 225F237B405; Thu, 19 Jun 2003 22:08:53 -0700 (PDT) Received: from heron.mail.pas.earthlink.net (heron.mail.pas.earthlink.net [207.217.120.189]) by mx1.FreeBSD.org (Postfix) with ESMTP id 06E1343FE5; Thu, 19 Jun 2003 22:08:52 -0700 (PDT) (envelope-from tlambert2@mindspring.com) Received: from user-uinj93o.dialup.mindspring.com ([165.121.164.120] helo=mindspring.com) by heron.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 19TE8o-00066q-00; Thu, 19 Jun 2003 22:08:51 -0700 Message-ID: <3EF2969F.4EE7D6D4@mindspring.com> Date: Thu, 19 Jun 2003 22:07:43 -0700 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: David Schultz References: <20030618112226.GA42606@fling-wing.demos.su> <20030619113457.GA80739@HAL9000.homeunix.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a499737279834c55796b66a5bdf9b10eb8a7ce0e8f8d31aa3f350badd9bab72f9c350badd9bab72f9c cc: Dmitry Sivachenko cc: Poul-Henning Kamp cc: "Tim J. Robbins" cc: arch@FreeBSD.ORG Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 Jun 2003 05:08:53 -0000 David Schultz wrote: > As a side note, I also think it's important that the new > implementation have a clean separation between user data and FS > metadata, so that they are not in direct competition with each > other for memory. This was the rationale behind the original VM and buffer cache separation. 
Instead of coming from a limited system resource shared between the two, they came from a limited system resource shared between the two, and scavenged pages from each other and caused thrashing. This was especially obvious in programs that mmap'ed a lot of file data into memory (e.g. "ld"), and then by seeking around, thrashed all the code pages out of core.

The net result of this approach is an HI disconnect when doing large compiles in an X term, when all of X's pages are thrashed out, and you move the mouse and the cursor does... nothing... for... a... very... long... time... -- not a good situation.

> The present buffer cache may be too limited for
> the massive number of dependencies softupdates needs to track for
> FS-intensive loads, but we also don't want lots of accumulated dirty
> buffers from heavy FS activity to force application data out of memory.

This basically says that you need to stall dependency memory allocation at a high watermark, and force the update clock to tick until the problem is eliminated. The acceleration of the update clock that takes place today is insufficient for this: you need to force the tick, wait for the completion, and force the next tick, etc., until you get back to your low water mark. If you just accelerate the clock, the hysteresis will keep you in a constant state of thrashing.

> The original buffer cache design is untenable largely because
> Dyson wanted to maintain compatibility with existing FS
> interfaces.

At the time, the problem was that the vmobject_t's were not reference counted, and not allowed to be aliased. This was more or less a debugging decision, which was made because there were a couple of places where the system created unintentional aliases for VM objects, and had some pretty severe crashes as a result. Once these were tracked down, intentional aliases would have been an acceptable approach.
But instead, what happened was that the buffer cache entry became married to the vnode structure, on a 1:1 basis, forever more. When the pager changed to assume this, then everyone's fate was irrevocably sealed. 8-(.

-- Terry

From owner-freebsd-arch@FreeBSD.ORG Thu Jun 19 22:53:54 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 2391437B401 for ; Thu, 19 Jun 2003 22:53:54 -0700 (PDT) Received: from stork.mail.pas.earthlink.net (stork.mail.pas.earthlink.net [207.217.120.188]) by mx1.FreeBSD.org (Postfix) with ESMTP id 1EA8143FAF for ; Thu, 19 Jun 2003 22:53:53 -0700 (PDT) (envelope-from tlambert2@mindspring.com) Received: from user-uinj93o.dialup.mindspring.com ([165.121.164.120] helo=mindspring.com) by stork.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 19TEqN-0000L2-00; Thu, 19 Jun 2003 22:53:51 -0700 Message-ID: <3EF2A12D.58A9524E@mindspring.com> Date: Thu, 19 Jun 2003 22:52:45 -0700 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: "Matthew D. Fuller" References: <20030618174733.GC10127@over-yonder.net> <20030618190032.GG10127@over-yonder.net> <3EF1617F.C1EC5C12@mindspring.com> <20030619190346.GT10127@over-yonder.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a43e56f85eb96dc12c5e72f4cd7080ebe7350badd9bab72f9c350badd9bab72f9c350badd9bab72f9c cc: arch@freebsd.org cc: Poul-Henning Kamp Subject: Re: Meta: explain what where when? (was Re: userland accesstodevicesis moving!) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 Jun 2003 05:53:54 -0000 "Matthew D. 
Fuller" wrote: > [ Bunches of in-depth detail-overview snipped ] > > The point of the exercise is not "marketing blurb". The point of the > exercise is to provide a hint so those of us whose heads only work in > userland-mode have a clue what this change is expected to mean to us. The author of a patch *ALWAYS* expects that things will be better than they were, and that *NO ONE* uses any feature that they want to deprecate, and *SOMETIMES* doesn't even agree that the things being deprecated were in fact features in the first place. The last case is the one which worries me most. -- Terry From owner-freebsd-arch@FreeBSD.ORG Thu Jun 19 23:10:14 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id D481937B401; Thu, 19 Jun 2003 23:10:14 -0700 (PDT) Received: from HAL9000.homeunix.com (ip114.bella-vista.sfo.interquest.net [66.199.86.114]) by mx1.FreeBSD.org (Postfix) with ESMTP id 00BC243FBD; Thu, 19 Jun 2003 23:10:14 -0700 (PDT) (envelope-from das@freebsd.org) Received: from HAL9000.homeunix.com (localhost [127.0.0.1]) by HAL9000.homeunix.com (8.12.9/8.12.9) with ESMTP id h5K6ABJa085977; Thu, 19 Jun 2003 23:10:11 -0700 (PDT) (envelope-from das@freebsd.org) Received: (from das@localhost) by HAL9000.homeunix.com (8.12.9/8.12.9/Submit) id h5K6AAQr085976; Thu, 19 Jun 2003 23:10:10 -0700 (PDT) (envelope-from das@freebsd.org) Date: Thu, 19 Jun 2003 23:10:10 -0700 From: David Schultz To: Terry Lambert Message-ID: <20030620061010.GA85747@HAL9000.homeunix.com> Mail-Followup-To: Terry Lambert , Poul-Henning Kamp , Dmitry Sivachenko , "Tim J. 
Robbins" , arch@freebsd.org References: <20030618112226.GA42606@fling-wing.demos.su> <39081.1055937209@critter.freebsd.dk> <20030619113457.GA80739@HAL9000.homeunix.com> <3EF2969F.4EE7D6D4@mindspring.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3EF2969F.4EE7D6D4@mindspring.com> cc: Dmitry Sivachenko cc: Poul-Henning Kamp cc: arch@freebsd.org Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 Jun 2003 06:10:15 -0000 On Thu, Jun 19, 2003, Terry Lambert wrote: > David Schultz wrote: > > As a side note, I also think it's important that the new > > implementation have a clean separation between user data and FS > > metadata, so that they are not in direct competition with each > > other for memory. > > This was the rationale behind the original VM and buffer cache > separation. Instead of coming from a limited system resource > shared between the two, they came from a limited system resource > shared between the two, and scavanged pages from each other and > caused thrashing. This was especially obvious in programs that > mmap'ed a lot of file data into memory (e.g. "ld"), and then by > seeking around, thrashed all the code pages out of core. Yes, and my point was that it's important to maintain the separation, at least implicitly, in any new design. I think this point was obvious to the people concerned before I even mentioned it, so there's no need to rehash it, but the designers of certain other operating systems seem to have missed it. 
> > The present buffer cache may be too limited for > > the massive number of dependencies softupdates needs to track for > > FS-intensive loads, but we also don't want lots of accumulated dirty > > buffers from heavy FS activity to force application data out of memory. > > This basically says that you need to stall dependency memory > allocation at a high watermark, and force the update clock to > tick until the problem is eliminated. The acceleration of the > update clock that takes place today is insufficient for this: > you need to force the tick, wait for the completion, and force > the next tick, etc., until you get back to your low water mark. > If you just accelerate the clock, the hysteresis will keep you > in a constant state of thrashing. Last year I was saying something similar to what you just said, before Kirk convinced me that I was wrong. ;-) The main problem isn't metastability or the lack of deadlock detection, it's that some workloads reasonably require more dependency tracking than the buffer cache can accomodate. At present, we can't track more than about 50 directories in the buffer cache. Still, the opposite problem of allowing the accumulation of many dependencies that have to be written anyway concerns me. I guess that's where a clever flushing algorithm comes in. [1] points out that Solaris 2.6 and 7 had a clever balancing algorithm between the FS and VM caches, too, but that wound up being tossed out in favor of a separate FS metadata cache in Solaris 8. But Solaris doesn't do softupdates, so it doesn't have a tradeoff between memory pressure and effective dependency tracking. So I don't know what the right answer is for FreeBSD. > > The original buffer cache design is untenable largely because > > Dyson wanted to maintain compatibility with existing FS > > interfaces. > > At the time, the problem was that the vmobject_t's were not > reference counted, and allowed to be aliased. [...] 
You're describing a separate problem from the one I'm thinking of, but probably also a valid one. My knowledge of BSD doesn't extend back that far. [1] Mauro and McDougall. Solaris Internals: Core Kernel Architecture, Prentice Hall (2001). From owner-freebsd-arch@FreeBSD.ORG Fri Jun 20 01:34:40 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 32B7C37B401; Fri, 20 Jun 2003 01:34:40 -0700 (PDT) Received: from bluejay.mail.pas.earthlink.net (bluejay.mail.pas.earthlink.net [207.217.120.218]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5007C43F75; Fri, 20 Jun 2003 01:34:39 -0700 (PDT) (envelope-from tlambert2@mindspring.com) Received: from user-uinj93o.dialup.mindspring.com ([165.121.164.120] helo=mindspring.com) by bluejay.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 19THLv-0003oS-00; Fri, 20 Jun 2003 01:34:36 -0700 Message-ID: <3EF2C67D.65F8A635@mindspring.com> Date: Fri, 20 Jun 2003 01:31:57 -0700 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: David Schultz References: <20030618112226.GA42606@fling-wing.demos.su> <39081.1055937209@critter.freebsd.dk> <20030619113457.GA80739@HAL9000.homeunix.com> <3EF2969F.4EE7D6D4@mindspring.com> <20030620061010.GA85747@HAL9000.homeunix.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a463f9e0c5b10abbcd11017d1b92c96a5b548b785378294e88350badd9bab72f9c350badd9bab72f9c cc: Dmitry Sivachenko cc: Poul-Henning Kamp cc: arch@freebsd.org Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 Jun 2003 
08:34:40 -0000

David Schultz wrote:
> Yes, and my point was that it's important to maintain the
> separation, at least implicitly, in any new design. I think this
> point was obvious to the people concerned before I even mentioned
> it, so there's no need to rehash it, but the designers of certain
> other operating systems seem to have missed it.

Well, Solaris "reinvented" the separate VM and buffer cache in Solaris 8. 8-(. I wasn't sure what you were recommending from what you said.

> > This basically says that you need to stall dependency memory
> > allocation at a high watermark, and force the update clock to
> > tick until the problem is eliminated. The acceleration of the
> > update clock that takes place today is insufficient for this:
> > you need to force the tick, wait for the completion, and force
> > the next tick, etc., until you get back to your low water mark.
> > If you just accelerate the clock, the hysteresis will keep you
> > in a constant state of thrashing.
>
> Last year I was saying something similar to what you just said,
> before Kirk convinced me that I was wrong. ;-)

8-) 8-).

> The main problem isn't metastability or the lack of deadlock
> detection, it's that some workloads reasonably require more
> dependency tracking than the buffer cache can accommodate. At
> present, we can't track more than about 50 directories in the
> buffer cache.

I don't know if I buy this directly. It's probably possible to commit an incomplete tree, as long as it's complete from the root, at any subtree point. Doing this, though, you would have to switch from isosynchronous to synchronous processing on the subtree for the remainder of its duration. This works, because you use the associative property of the tree above to replace it with a single edge segment; other orphan subtrees of the same tree all have to fall into the same mode.
This seems smart, but it's incredibly nasty, if you don't insert a stall barrier, and permit the existing elements to flush out synchronously before adding more dependencies. Otherwise, you end up getting a load spike, and (effectively) switching from soft dependencies to synchronous writes, at least for the most busy part of your dependency graph. Hmmph.

I don't see a way around this, short of making the update clock wheel bigger, and I don't see an easy way of doing that while it has entries in it at all. I think no matter what, such a workload is going to end up in a degenerate case and get you thrashing. What was Kirk's answer?

> Still, the opposite problem of allowing the accumulation of many
> dependencies that have to be written anyway concerns me. I guess
> that's where a clever flushing algorithm comes in. [1] points out
> that Solaris 2.6 and 7 had a clever balancing algorithm between
> the FS and VM caches, too, but that wound up being tossed out in
> favor of a separate FS metadata cache in Solaris 8. But Solaris
> doesn't do softupdates, so it doesn't have a tradeoff between
> memory pressure and effective dependency tracking. So I don't know
> what the right answer is for FreeBSD.

Solaris 8 and up has its own bogons because of their reseparation of the cache (as previously noted). I understand from a complexity perspective why they made the choice, but I'm not sure it was right, even if they did have to face the problem FreeBSD faces in this case.

Maybe the answer is to not let the relationship graph ever get that big in the first place; effectively, you would have to be however many edges deep as it took to circle the entire soft updates clock wheel.

One thing that occurs to me is to not tick over the wheel until you have data on it, and run a third hand on the two-handed clock to make the decision on advancing the insertion pointer vs. advancing the flushing pointer. This would keep time-sparse, locality-dense operations (e.g. put operation A in slot X, and operation B in slot X+n) from getting too separated on the wheel.

This wouldn't solve the problem, of course, but it would greatly reduce the sparseness for consecutive dependent operations that didn't happen back-to-back temporally. It would probably save you a factor of exp2(n-1), on average, for a forced insertion separation between dependent operations of 'n'. Your wheel could handle that many times more depth in the graph.

But the quoted "50" is the ideal, when all dependent operations occur in the same tick, given the current wheel size; all this strategy does is up the number (the real number isn't 50, it's unfortunately 'size - max_n - 1') by making them occur virtually in the same tick, even if they are spread out temporally otherwise.

I think the only answer is to come up with something other than the wheel, or bite the bullet and stall all operations that want to write to the wheel and flush it completely, when you hit some high water mark. The interactive response in this case could be a pretty long dramatic pause... the same thing we had when the lock on the buffers was owned by the writers, once queued, instead of the queue (so they could be second-chanced and re-queued... Matt Dillon's work, if I remember correctly).

> > > The original buffer cache design is untenable largely because
> > > Dyson wanted to maintain compatibility with existing FS
> > > interfaces.
> >
> > At the time, the problem was that the vmobject_t's were not
> > reference counted, and allowed to be aliased. [...]
>
> You're describing a separate problem from the one I'm thinking of,
> but probably also a valid one. My knowledge of BSD doesn't extend
> back that far.

Not really; the issue arose in the first place because the VM implementation was, as Poul put it, "incomplete". That was an apt insight by Poul, and an important one.
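To make the stall-at-high-water-mark idea discussed above concrete, here is a minimal C sketch. It is a toy model under stated assumptions, not softupdates code: struct dep_pool, dep_alloc() and force_tick_and_wait() are hypothetical names, and a "tick" here simply retires a fixed number of dependency records. What it tries to show is the difference from merely accelerating the clock: once the high water mark is hit, the allocator blocks and forces one completed tick after another until usage is back down at the low water mark.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a dependency-record pool; not a FreeBSD structure. */
struct dep_pool {
	size_t in_use;		/* dependency records currently allocated */
	size_t low_water;	/* forced flushing stops at or below this */
	size_t high_water;	/* new allocations stall at or above this */
	size_t per_tick;	/* records retired by one tick (toy assumption) */
	unsigned forced_ticks;	/* how many ticks were forced so far */
};

/* Toy stand-in for "force the tick and wait for its completion". */
void
force_tick_and_wait(struct dep_pool *p)
{
	size_t n = (p->per_tick < p->in_use) ? p->per_tick : p->in_use;

	p->in_use -= n;
	p->forced_ticks++;
}

/*
 * Allocate one dependency record.  At the high water mark the caller is
 * stalled: ticks are forced back-to-back until the low water mark is
 * reached, and only then is the new record admitted.
 */
void
dep_alloc(struct dep_pool *p)
{
	assert(p->low_water < p->high_water && p->per_tick > 0);

	if (p->in_use >= p->high_water) {
		while (p->in_use > p->low_water)
			force_tick_and_wait(p);
	}
	p->in_use++;
}
```

With low_water = 2, high_water = 5 and per_tick = 2, the first five calls to dep_alloc() go straight through; the sixth stalls, forces two ticks (5 -> 3 -> 1), and then admits the new record. The gap between the two marks is what keeps the pool from bouncing right back into the stall.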
The point I wanted to make is that FreeBSD should not throw the baby out with the bathwater: the VM and buffer cache unification was right, IMO, for a lot of reasons, even if disallowing the intentional aliases after the unintentional ones were fixed, and making every vnode require a separate vmobject_t, were not.

Yes, FreeBSD has historical baggage that needs someone to clear it away, but undoing the VM and buffer cache unification is not part of that, it's just something that happened concurrently. The unification *process* is probably the root cause of some of the current evil, but the unification *per se* is not. I just wanted to make it very clear.

I would desperately hate to lose the page-flipping trick that FreeBSD plays that makes its pipes and UNIX domain sockets so blazingly fast, compared to everyone else (and that's just one example of many).

-- Terry

From owner-freebsd-arch@FreeBSD.ORG Fri Jun 20 02:30:22 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id AF38D37B401; Fri, 20 Jun 2003 02:30:22 -0700 (PDT) Received: from HAL9000.homeunix.com (ip114.bella-vista.sfo.interquest.net [66.199.86.114]) by mx1.FreeBSD.org (Postfix) with ESMTP id CED1243F3F; Fri, 20 Jun 2003 02:30:21 -0700 (PDT) (envelope-from das@freebsd.org) Received: from HAL9000.homeunix.com (localhost [127.0.0.1]) by HAL9000.homeunix.com (8.12.9/8.12.9) with ESMTP id h5K9UCJa087120; Fri, 20 Jun 2003 02:30:12 -0700 (PDT) (envelope-from das@freebsd.org) Received: (from das@localhost) by HAL9000.homeunix.com (8.12.9/8.12.9/Submit) id h5K9U4Iw087119; Fri, 20 Jun 2003 02:30:04 -0700 (PDT) (envelope-from das@freebsd.org) Date: Fri, 20 Jun 2003 02:30:04 -0700 From: David Schultz To: Terry Lambert Message-ID: <20030620093004.GA86924@HAL9000.homeunix.com> Mail-Followup-To: Terry Lambert , Poul-Henning Kamp , Dmitry Sivachenko , "Tim J. 
Robbins" , arch@freebsd.org References: <20030618112226.GA42606@fling-wing.demos.su> <39081.1055937209@critter.freebsd.dk> <20030619113457.GA80739@HAL9000.homeunix.com> <3EF2969F.4EE7D6D4@mindspring.com> <20030620061010.GA85747@HAL9000.homeunix.com> <3EF2C67D.65F8A635@mindspring.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3EF2C67D.65F8A635@mindspring.com> cc: Dmitry Sivachenko cc: Poul-Henning Kamp cc: arch@freebsd.org Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 Jun 2003 09:30:23 -0000 On Fri, Jun 20, 2003, Terry Lambert wrote: > David Schultz wrote: > > Yes, and my point was that it's important to maintain the > > separation, at least implicitly, in any new design. I think this > > point was obvious to the people concerned before I even mentioned > > it, so there's no need to rehash it, but the designers of certain > > other operating systems seem to have missed it. > > Well, Solaris "reinvented" the seperate VM and buffer cache > in Solaris 2.8. 8-(. I wasn't sure what you were recommending > from what you said. Let me make it clear that I'm not advocating the Solaris 8 approach. But it would seem that the FS metadata cache needs to be insulated from the VM cache better than priority paging can provide. Perhaps it would be possible to enforce a sort of self-tuning version of separate VM and buffer caches, where the buffer cache has a carefully managed RSS that can scale based on both FS activity and memory pressure. That way, I/O-intensive workloads will not be allowed to suck too many pages away from user processes and the VM system will be able to better estimate actual memory pressure. 
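As a rough illustration of what a self-tuning limit of this kind might look like, here is a small C sketch. Everything in it is an assumption made for the example: struct cache_limits, fs_cache_target() and the two 0..100 load indicators are hypothetical, and the grow/shrink policy (move by about one eighth of the current size per adjustment) is arbitrary. The only point is the shape of the idea: the FS cache gets a target size that follows FS demand and memory pressure, but is clamped so it can neither be starved nor crowd out user pages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bounds on the FS metadata cache, in pages. */
struct cache_limits {
	size_t min_pages;	/* floor: FS cache is never shrunk below this */
	size_t max_pages;	/* ceiling: FS activity can never take more */
};

/*
 * Compute the next target size for the FS cache.
 * fs_demand_pct and mem_pressure_pct are made-up 0..100 indicators;
 * a real system would have to derive them from its own statistics.
 */
size_t
fs_cache_target(struct cache_limits lim, size_t cur,
    unsigned fs_demand_pct, unsigned mem_pressure_pct)
{
	size_t step = cur / 8 + 1;	/* arbitrary adjustment granularity */
	size_t next = cur;

	assert(lim.min_pages <= lim.max_pages);

	if (fs_demand_pct > mem_pressure_pct)
		next = cur + step;			/* FS is the hungrier side */
	else if (mem_pressure_pct > fs_demand_pct)
		next = (cur > step) ? cur - step : 0;	/* give pages back */

	/* Clamp so neither side can take over completely. */
	if (next < lim.min_pages)
		next = lim.min_pages;
	if (next > lim.max_pages)
		next = lim.max_pages;
	return next;
}
```

Calling this periodically with limits of 100 and 1000 pages lets a busy file system grow its cache from 800 to 901 pages, but never past 1000, while sustained memory pressure walks it back down no further than 100.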
> > The main problem isn't metastability or the lack of deadlock > > detection, it's that some workloads reasonably require more > > dependency tracking than the buffer cache can accomodate. At > > present, we can't track more than about 50 directories in the > > buffer cache. > > I don't know if I buy this directly. It's probably possible > to commit an incomplete tree, as long as it's complete from > the root, at any subtree point. Doing this, though, you would > have to switch from isosynchronous to synchronus processing on > the subtree for the remainder of its duration. This works, > because you use the associative property of the tree above to > replace it with a single edge segment; other orphan subtrees > of the same tree all have to fall into the same mode. I don't understand what you're getting at here. If you don't have enough space to cache more than 50 dependencies, you lose performance when your working set exceeds 50 directories, period. Trying to address this issue by making the softupdates flushing code smarter is only working around the limitations of the present buffer cache. > What was Kirk's answer? He didn't give me one, aside from advocating backing dependencies with the VM system. This issue just came up in passing a while ago in relation to a pathological case for softupdates that resulted in an explosion of dependencies that filled up the buffer cache and caused a deadlock. ;-) (The problem has since been hacked around, BTW.) > But the quoted "50" is the ideal, when all dependent operations > occur in the same tick, given the current wheel size; all this > strategy does is up the number (the real number isn't 50, it's > unfortunately 'size - max_n - 1') by making them occur virtually > in the same tick, even if they are spread out temporally otherwise. I think 50 is merely a number that makes softupdates not fill up the buffer cache and deadlock. Keep in mind that the dependency graph could have a large fanout, or it could be a multigraph. 
There's no magical association between ~50 directories and the maximum path length in the graph. Again, it's the buffer cache that's the primary problem, not softupdates. From owner-freebsd-arch@FreeBSD.ORG Fri Jun 20 18:09:36 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 157C437B401; Fri, 20 Jun 2003 18:09:36 -0700 (PDT) Received: from mail.cyberonic.com (mail.cyberonic.com [4.17.179.4]) by mx1.FreeBSD.org (Postfix) with ESMTP id E491B43F3F; Fri, 20 Jun 2003 18:09:34 -0700 (PDT) (envelope-from jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (node-40244c0a.sfo.onnet.us.uu.net [64.36.76.10]) by mail.cyberonic.com (8.12.8/8.12.5) with ESMTP id h5L1aVMo017653; Fri, 20 Jun 2003 21:36:32 -0400 Received: (from jmg@localhost) by hydrogen.funkthat.com (8.12.9/8.11.6) id h5L1A2B5030917; Fri, 20 Jun 2003 18:10:02 -0700 (PDT) (envelope-from jmg) Date: Fri, 20 Jun 2003 18:10:02 -0700 From: John-Mark Gurney To: Bruce Evans , Robert Watson , arch@freebsd.org Message-ID: <20030621011002.GG15336@funkthat.com> Mail-Followup-To: Bruce Evans , Robert Watson , arch@freebsd.org References: <20030617120956.N30677@gamplex.bde.org> <20030617052917.GF73854@funkthat.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="ReaqsoxgOBHFXBhH" Content-Disposition: inline In-Reply-To: <20030617052917.GF73854@funkthat.com> User-Agent: Mutt/1.4.1i X-Operating-System: FreeBSD 4.2-RELEASE i386 X-PGP-Fingerprint: B7 EC EF F8 AE ED A7 31 96 7A 22 B3 D8 56 36 F4 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html Subject: Re: make /dev/pci really readable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list Reply-To: John-Mark Gurney List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: 
List-Subscribe: , X-List-Received-Date: Sat, 21 Jun 2003 01:09:36 -0000 --ReaqsoxgOBHFXBhH Content-Type: text/plain; charset=us-ascii Content-Disposition: inline

John-Mark Gurney wrote this message on Mon, Jun 16, 2003 at 22:29 -0700:
> Bruce Evans wrote this message on Tue, Jun 17, 2003 at 12:36 +1000:
> > On Mon, 16 Jun 2003, Robert Watson wrote:
> > > It looks like (although I haven't tried), user processes can
> > > also cause the kernel to allocate unlimited amounts of kernel memory,
> > > which is another bit we probably need to tighten down.
> >
> > Much more serious.
>
> Yep, the pattern_buf is allocated, and in some cases a break happens
> w/o freeing it.  So there is a memory leak here.  Will be fixed soon.

Ok, I think I have a good patch.  It's attached.  It fixes the memory
leak.  I have also fixed the pci manpage to talk about the errors, but
that change isn't included in the patch.

--
John-Mark Gurney Voice: +1 415 225 5579
"All that I will do, has been done, All that I have, has not."

--ReaqsoxgOBHFXBhH Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="pci_user2.patch" Index: pci_user.c =================================================================== RCS file: /home/ncvs/src/sys/dev/pci/pci_user.c,v retrieving revision 1.9 diff -u -r1.9 pci_user.c --- pci_user.c 2003/03/03 12:15:44 1.9 +++ pci_user.c 2003/06/21 00:29:48 @@ -176,9 +176,14 @@ const char *name; int error; - if (!(flag & FWRITE)) + if (!(flag & FWRITE) && cmd != PCIOCGETCONF) return EPERM; + /* make sure register is in bounds and aligned */ + if (cmd == PCIOCREAD || cmd == PCIOCWRITE) + if (io->pi_reg < 0 || io->pi_reg + io->pi_width > PCI_REGMAX || + io->pi_reg & (io->pi_width - 1)) + error = EINVAL; switch(cmd) { case PCIOCGETCONF: @@ -197,15 +202,6 @@ dinfo = NULL; /* - * Hopefully the user won't pass in a null pointer, but it - * can't hurt to check. 
- */ - if (cio == NULL) { - error = EINVAL; - break; - } - - /* * If the user specified an offset into the device list, * but the list has changed since they last called this * ioctl, tell them that the list has changed. They will @@ -272,42 +268,22 @@ sizeof(struct pci_match_conf)) != cio->pat_buf_len){ /* The user made a mistake, return an error*/ cio->status = PCI_GETCONF_ERROR; - printf("pci_ioctl: pat_buf_len %d != " - "num_patterns (%d) * sizeof(struct " - "pci_match_conf) (%d)\npci_ioctl: " - "pat_buf_len should be = %d\n", - cio->pat_buf_len, cio->num_patterns, - (int)sizeof(struct pci_match_conf), - (int)sizeof(struct pci_match_conf) * - cio->num_patterns); - printf("pci_ioctl: do your headers match your " - "kernel?\n"); cio->num_matches = 0; error = EINVAL; break; } /* - * Check the user's buffer to make sure it's readable. - */ - if (!useracc((caddr_t)cio->patterns, - cio->pat_buf_len, VM_PROT_READ)) { - printf("pci_ioctl: pattern buffer %p, " - "length %u isn't user accessible for" - " READ\n", cio->patterns, - cio->pat_buf_len); - error = EACCES; - break; - } - /* * Allocate a buffer to hold the patterns. */ pattern_buf = malloc(cio->pat_buf_len, M_TEMP, M_WAITOK); error = copyin(cio->patterns, pattern_buf, cio->pat_buf_len); - if (error != 0) - break; + if (error != 0) { + error = EINVAL; + goto getconfexit; + } num_patterns = cio->num_patterns; } else if ((cio->num_patterns > 0) @@ -317,32 +293,19 @@ */ cio->status = PCI_GETCONF_ERROR; cio->num_matches = 0; - printf("pci_ioctl: invalid GETCONF arguments\n"); error = EINVAL; break; } else pattern_buf = NULL; /* - * Make sure we can write to the match buffer. - */ - if (!useracc((caddr_t)cio->matches, - cio->match_buf_len, VM_PROT_WRITE)) { - printf("pci_ioctl: match buffer %p, length %u " - "isn't user accessible for WRITE\n", - cio->matches, cio->match_buf_len); - error = EACCES; - break; - } - - /* * Go through the list of devices and copy out the devices * that match the user's criteria. 
*/ for (cio->num_matches = 0, error = 0, i = 0, dinfo = STAILQ_FIRST(devlist_head); (dinfo != NULL) && (cio->num_matches < ionum) - && (error == 0) && (i < pci_numdevs); + && (error == 0) && (i < pci_numdevs) && (dinfo != NULL); dinfo = STAILQ_NEXT(dinfo, pci_links), i++) { if (i < cio->offset) @@ -375,10 +338,12 @@ if (cio->num_matches >= ionum) break; - error = copyout(&dinfo->conf, - &cio->matches[cio->num_matches], - sizeof(struct pci_conf)); - cio->num_matches++; + /* only if can copy it out do we count it */ + if (!(error = copyout(&dinfo->conf, + &cio->matches[cio->num_matches], + sizeof(struct pci_conf)))) { + cio->num_matches++; + } } } @@ -405,6 +370,7 @@ else cio->status = PCI_GETCONF_MORE_DEVS; +getconfexit: if (pattern_buf != NULL) free(pattern_buf, M_TEMP); @@ -439,7 +405,7 @@ } break; default: - error = ENODEV; + error = EINVAL; break; } break; @@ -473,7 +439,7 @@ } break; default: - error = ENODEV; + error = EINVAL; break; } break; --ReaqsoxgOBHFXBhH-- From owner-freebsd-arch@FreeBSD.ORG Fri Jun 20 19:22:02 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 355D737B401 for ; Fri, 20 Jun 2003 19:22:02 -0700 (PDT) Received: from magic.adaptec.com (magic-mail.adaptec.com [208.236.45.100]) by mx1.FreeBSD.org (Postfix) with ESMTP id AB30943F3F for ; Fri, 20 Jun 2003 19:22:01 -0700 (PDT) (envelope-from scott_long@btc.adaptec.com) Received: from redfish.adaptec.com (redfish.adaptec.com [162.62.50.11]) by magic.adaptec.com (8.11.6/8.11.6) with ESMTP id h5L2LZ810122 for ; Fri, 20 Jun 2003 19:21:35 -0700 Received: from btc.adaptec.com (hollin.btc.adaptec.com [10.100.253.56]) by redfish.adaptec.com (8.8.8p2+Sun/8.8.8) with ESMTP id TAA26139 for ; Fri, 20 Jun 2003 19:21:59 -0700 (PDT) Message-ID: <3EF3C12F.9060303@btc.adaptec.com> Date: Fri, 20 Jun 2003 20:21:35 -0600 From: Scott Long User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.3) 
Gecko/20030425 X-Accept-Language: en-us, en MIME-Version: 1.0 To: freebsd-arch@freebsd.org Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Subject: API change for bus_dma X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 21 Jun 2003 02:22:02 -0000

All,

As I work towards locking down storage drivers, I'm also preening their
use of busdma.  A common theme among them is a misuse of
bus_dmamap_load() and the associated callback mechanism.  For most, the
consequence is harmless as long as the card can support the amount of
physical memory in the system (systems with IOMMUs notwithstanding).
However, in cases such as PAE where busdma might have to use bounce
buffers, most drivers don't handle the possibility of bus_dmamap_load()
returning EINPROGRESS.  The consequence of this is twofold:
bus_dmamap_load() returns without the callback being called, but the
driver doesn't detect this and merrily goes on its way.  Later on the
callback does get called, and any state that was shared with it gets
corrupted.  This is a problem even for drivers that are under Giant.

The solution for this is mostly a mechanical cut-n-paste of the code
dealing with the callback.  However, locking down the drivers presents a
new problem with the callback.  Since the callback can be called
asynchronously from an SWI, it needs some way to synchronize with the
driver.  Adding code to each callback to conditionally grab the driver
mutex incurs a penalty (albeit small) and requires more effort.  The
better solution is to export the driver mutex to busdma and have the SWI
that runs the callback lock the mutex before calling the callback.  This
requires adding a 'struct mtx *' argument to bus_dma_tag_create() to
hold the mutex to be exported.
For drivers that are under Giant and/or decide not to use this functionality, passing NULL for this argument is accepted. Therefore, the change is fairly low-impact, though it incurs an API change. Since locking the peripheral drivers is a major goal for 5.2 and 5-STABLE, it is time to bite the bullet and do this now. A few 3rd-party drivers stand to be affected, and hopefully their maintainers will react accordingly. Comments? Scott From owner-freebsd-arch@FreeBSD.ORG Sat Jun 21 17:51:33 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 661A037B401 for ; Sat, 21 Jun 2003 17:51:33 -0700 (PDT) Received: from HAL9000.homeunix.com (ip114.bella-vista.sfo.interquest.net [66.199.86.114]) by mx1.FreeBSD.org (Postfix) with ESMTP id 82B0943FBD for ; Sat, 21 Jun 2003 17:51:31 -0700 (PDT) (envelope-from dschultz@OCF.Berkeley.EDU) Received: from HAL9000.homeunix.com (localhost [127.0.0.1]) by HAL9000.homeunix.com (8.12.9/8.12.9) with ESMTP id h5M0pQJa059806 for ; Sat, 21 Jun 2003 17:51:27 -0700 (PDT) (envelope-from dschultz@OCF.Berkeley.EDU) Received: (from das@localhost) by HAL9000.homeunix.com (8.12.9/8.12.9/Submit) id h5M0pOYY059805 for arch@FreeBSD.ORG; Sat, 21 Jun 2003 17:51:24 -0700 (PDT) (envelope-from dschultz@OCF.Berkeley.EDU) Date: Sat, 21 Jun 2003 17:51:24 -0700 From: David Schultz To: arch@FreeBSD.ORG Message-ID: <20030622005124.GA59673@HAL9000.homeunix.com> Mail-Followup-To: arch@FreeBSD.ORG Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Subject: Per-source CFLAGS X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 22 Jun 2003 00:51:33 -0000 The following patch adds support for per-file CFLAGS, which gets appended to the command 
line after the global CFLAGS. I would like to commit this in a few days. Do people want a similar feature for LDFLAGS? If we do add it to LDFLAGS, I will have to rework the following patch to use ${CFLAGS_${.TARGET}} instead of ${CFLAGS_${.IMPSRC}}, for consistency. The present version only works on C sources. I intend to use this feature for gdtoa, which is technically part of libc, but also on a vendor branch and intended to stay that way. The problem being addressed is that gcc at higher warning levels has some inane warnings that the vendor and I consider wrong, and yet people want to be able to compile libc cleanly at these warning levels. As an example, gcc complains that the expression 'a << b - c' must have parentheses because obviously nobody remembers C's precedence rules. So here's just one potential use of the new feature: Index: lib/libc/gdtoa/Makefile.inc =================================================================== RCS file: /cvs/src/lib/libc/gdtoa/Makefile.inc,v retrieving revision 1.3 diff -u -r1.3 Makefile.inc --- lib/libc/gdtoa/Makefile.inc 5 Apr 2003 22:10:13 -0000 1.3 +++ lib/libc/gdtoa/Makefile.inc 2 May 2003 09:31:15 -0000 @@ -16,6 +16,7 @@ .for src in ${GDTOASRCS} MISRCS+=gdtoa_${src} CLEANFILES+=gdtoa_${src} +CFLAGS_gdtoa_${src}+=-w gdtoa_${src}: ln -sf ${.CURDIR}/../../contrib/gdtoa/${src} ${.TARGET} .endfor The patch I would actually like reviewed is this one: Index: share/mk/bsd.lib.mk =================================================================== RCS file: /cvs/src/share/mk/bsd.lib.mk,v retrieving revision 1.140 diff -u -r1.140 bsd.lib.mk --- share/mk/bsd.lib.mk 10 Jun 2003 04:47:49 -0000 1.140 +++ share/mk/bsd.lib.mk 21 Jun 2003 08:39:35 -0000 @@ -53,17 +53,18 @@ touch ${.TARGET} .c.o: - ${CC} ${CFLAGS} -c ${.IMPSRC} -o ${.TARGET} + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} -c ${.IMPSRC} -o ${.TARGET} @${LD} -o ${.TARGET}.tmp -x -r ${.TARGET} @mv ${.TARGET}.tmp ${.TARGET} .c.po: - ${CC} -pg ${CFLAGS} -c ${.IMPSRC} -o ${.TARGET} + 
${CC} -pg ${CFLAGS} ${CFLAGS_${.IMPSRC}} -c ${.IMPSRC} -o ${.TARGET} @${LD} -o ${.TARGET}.tmp -X -r ${.TARGET} @mv ${.TARGET}.tmp ${.TARGET} .c.So: - ${CC} ${PICFLAG} -DPIC ${CFLAGS} -c ${.IMPSRC} -o ${.TARGET} + ${CC} ${PICFLAG} -DPIC ${CFLAGS} ${CFLAGS_${.IMPSRC}} \ + -c ${.IMPSRC} -o ${.TARGET} @${LD} ${LDFLAGS} -o ${.TARGET}.tmp -x -r ${.TARGET} @mv ${.TARGET}.tmp ${.TARGET} @@ -113,36 +114,38 @@ @mv ${.TARGET}.tmp ${.TARGET} .s.o .asm.o: - ${CC} -x assembler-with-cpp ${CFLAGS} ${AINC} -c \ + ${CC} -x assembler-with-cpp ${CFLAGS} ${CFLAGS_${.IMPSRC}} ${AINC} -c \ ${.IMPSRC} -o ${.TARGET} @${LD} ${LDFLAGS} -o ${.TARGET}.tmp -x -r ${.TARGET} @mv ${.TARGET}.tmp ${.TARGET} .s.po .asm.po: - ${CC} -x assembler-with-cpp -DPROF ${CFLAGS} ${AINC} -c \ - ${.IMPSRC} -o ${.TARGET} + ${CC} -x assembler-with-cpp -DPROF ${CFLAGS} ${CFLAGS_${.IMPSRC}} \ + ${AINC} -c ${.IMPSRC} -o ${.TARGET} @${LD} ${LDFLAGS} -o ${.TARGET}.tmp -X -r ${.TARGET} @mv ${.TARGET}.tmp ${.TARGET} .s.So .asm.So: ${CC} -x assembler-with-cpp ${PICFLAG} -DPIC ${CFLAGS} \ - ${AINC} -c ${.IMPSRC} -o ${.TARGET} + ${CFLAGS_${.IMPSRC}} ${AINC} -c ${.IMPSRC} -o ${.TARGET} @${LD} -o ${.TARGET}.tmp -x -r ${.TARGET} @mv ${.TARGET}.tmp ${.TARGET} .S.o: - ${CC} ${CFLAGS} ${AINC} -c ${.IMPSRC} -o ${.TARGET} + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} ${AINC} \ + -c ${.IMPSRC} -o ${.TARGET} @${LD} ${LDFLAGS} -o ${.TARGET}.tmp -x -r ${.TARGET} @mv ${.TARGET}.tmp ${.TARGET} .S.po: - ${CC} -DPROF ${CFLAGS} ${AINC} -c ${.IMPSRC} -o ${.TARGET} + ${CC} -DPROF ${CFLAGS} ${CFLAGS_${.IMPSRC}} ${AINC} \ + -c ${.IMPSRC} -o ${.TARGET} @${LD} ${LDFLAGS} -o ${.TARGET}.tmp -X -r ${.TARGET} @mv ${.TARGET}.tmp ${.TARGET} .S.So: - ${CC} ${PICFLAG} -DPIC ${CFLAGS} ${AINC} -c ${.IMPSRC} \ - -o ${.TARGET} + ${CC} ${PICFLAG} -DPIC ${CFLAGS} ${CFLAGS_${.IMPSRC}} ${AINC} \ + -c ${.IMPSRC} -o ${.TARGET} @${LD} -o ${.TARGET}.tmp -x -r ${.TARGET} @mv ${.TARGET}.tmp ${.TARGET} Index: share/mk/sys.mk 
=================================================================== RCS file: /cvs/src/share/mk/sys.mk,v retrieving revision 1.67 diff -u -r1.67 sys.mk --- share/mk/sys.mk 1 Jun 2003 22:13:45 -0000 1.67 +++ share/mk/sys.mk 21 Jun 2003 08:56:15 -0000 @@ -117,7 +117,8 @@ # SINGLE SUFFIX RULES .c: - ${CC} ${CFLAGS} ${LDFLAGS} -o ${.TARGET} ${.IMPSRC} + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} ${LDFLAGS} \ + -o ${.TARGET} ${.IMPSRC} .f: ${FC} ${FFLAGS} ${LDFLAGS} -o ${.TARGET} ${.IMPSRC} @@ -129,20 +130,20 @@ # DOUBLE SUFFIX RULES .c.o: - ${CC} ${CFLAGS} -c ${.IMPSRC} + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} -c ${.IMPSRC} .f.o: ${FC} ${FFLAGS} -c ${.IMPSRC} .y.o: ${YACC} ${YFLAGS} ${.IMPSRC} - ${CC} ${CFLAGS} -c y.tab.c + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} -c y.tab.c rm -f y.tab.c mv y.tab.o ${.TARGET} .l.o: ${LEX} ${LFLAGS} ${.IMPSRC} - ${CC} ${CFLAGS} -c lex.yy.c + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} -c lex.yy.c rm -f lex.yy.c mv lex.yy.o ${.TARGET} @@ -155,7 +156,7 @@ mv lex.yy.c ${.TARGET} .c.a: - ${CC} ${CFLAGS} -c ${.IMPSRC} + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} -c ${.IMPSRC} ${AR} ${ARFLAGS} ${.TARGET} ${.PREFIX}.o rm -f ${.PREFIX}.o @@ -181,10 +182,11 @@ touch ${.TARGET} .c: - ${CC} ${CFLAGS} ${LDFLAGS} ${.IMPSRC} ${LDLIBS} -o ${.TARGET} + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} ${LDFLAGS} ${.IMPSRC} ${LDLIBS} \ + -o ${.TARGET} .c.o: - ${CC} ${CFLAGS} -c ${.IMPSRC} + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} -c ${.IMPSRC} .cc .cpp .cxx .C: ${CXX} ${CXXFLAGS} ${LDFLAGS} ${.IMPSRC} ${LDLIBS} -o ${.TARGET} @@ -206,7 +208,7 @@ ${FC} ${RFLAGS} ${EFLAGS} ${FFLAGS} -c ${.IMPSRC} .S.o: - ${CC} ${CFLAGS} -c ${.IMPSRC} + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} -c ${.IMPSRC} .s.o .asm.o: ${AS} ${AFLAGS} -o ${.TARGET} ${.IMPSRC} @@ -214,12 +216,12 @@ # XXX not -j safe .y.o: ${YACC} ${YFLAGS} ${.IMPSRC} - ${CC} ${CFLAGS} -c y.tab.c -o ${.TARGET} + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} -c y.tab.c -o ${.TARGET} rm -f y.tab.c .l.o: ${LEX} -t ${LFLAGS} ${.IMPSRC} > ${.PREFIX}.tmp.c - ${CC} 
${CFLAGS} -c ${.PREFIX}.tmp.c -o ${.TARGET} + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} -c ${.PREFIX}.tmp.c -o ${.TARGET} rm -f ${.PREFIX}.tmp.c # XXX not -j safe @@ -231,7 +233,8 @@ ${LEX} -t ${LFLAGS} ${.IMPSRC} > ${.TARGET} .s.out .c.out .o.out: - ${CC} ${CFLAGS} ${LDFLAGS} ${.IMPSRC} ${LDLIBS} -o ${.TARGET} + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} ${LDFLAGS} ${.IMPSRC} ${LDLIBS} \ + -o ${.TARGET} .f.out .F.out .r.out .e.out: ${FC} ${EFLAGS} ${RFLAGS} ${FFLAGS} ${LDFLAGS} ${.IMPSRC} \ @@ -241,12 +244,14 @@ # XXX not -j safe .y.out: ${YACC} ${YFLAGS} ${.IMPSRC} - ${CC} ${CFLAGS} ${LDFLAGS} y.tab.c ${LDLIBS} -ly -o ${.TARGET} + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} ${LDFLAGS} y.tab.c ${LDLIBS} \ + -ly -o ${.TARGET} rm -f y.tab.c .l.out: ${LEX} -t ${LFLAGS} ${.IMPSRC} > ${.PREFIX}.tmp.c - ${CC} ${CFLAGS} ${LDFLAGS} ${.PREFIX}.tmp.c ${LDLIBS} -ll -o ${.TARGET} + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} ${LDFLAGS} ${.PREFIX}.tmp.c \ + ${LDLIBS} -ll -o ${.TARGET} rm -f ${.PREFIX}.tmp.c .endif From owner-freebsd-arch@FreeBSD.ORG Sat Jun 21 19:08:05 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id BCECD37B401 for ; Sat, 21 Jun 2003 19:08:05 -0700 (PDT) Received: from mailman.zeta.org.au (mailman.zeta.org.au [203.26.10.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id 8851743FAF for ; Sat, 21 Jun 2003 19:08:04 -0700 (PDT) (envelope-from bde@zeta.org.au) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailman.zeta.org.au (8.9.3p2/8.8.7) with ESMTP id MAA30555; Sun, 22 Jun 2003 12:07:58 +1000 Date: Sun, 22 Jun 2003 12:07:58 +1000 (EST) From: Bruce Evans X-X-Sender: bde@gamplex.bde.org To: David Schultz In-Reply-To: <20030622005124.GA59673@HAL9000.homeunix.com> Message-ID: <20030622114150.L54976@gamplex.bde.org> References: <20030622005124.GA59673@HAL9000.homeunix.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: arch@freebsd.org 
Subject: Re: Per-source CFLAGS X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 22 Jun 2003 02:08:06 -0000 On Sat, 21 Jun 2003, David Schultz wrote: > The following patch adds support for per-file CFLAGS, which gets > appended to the command line after the global CFLAGS. I would > ... > I intend to use this feature for gdtoa, which is technically part > of libc, but also on a vendor branch and intended to stay that > way. The problem being addressed is that gcc at higher warning > levels has some inane warnings that the vendor and I consider > wrong, and yet people want to be able to compile libc cleanly at > these warning levels. As an example, gcc complains that the > expression 'a << b - c' must have parentheses because obviously > nobody remembers C's precedence rules. So here's just one > potential use of the new feature: For this, you really want per-file WARNS, since among other reasons compiler-dependent flags shouldn't be put in individual Makefiles. > Index: lib/libc/gdtoa/Makefile.inc > =================================================================== > RCS file: /cvs/src/lib/libc/gdtoa/Makefile.inc,v > retrieving revision 1.3 > diff -u -r1.3 Makefile.inc > --- lib/libc/gdtoa/Makefile.inc 5 Apr 2003 22:10:13 -0000 1.3 > +++ lib/libc/gdtoa/Makefile.inc 2 May 2003 09:31:15 -0000 > @@ -16,6 +16,7 @@ > .for src in ${GDTOASRCS} > MISRCS+=gdtoa_${src} > CLEANFILES+=gdtoa_${src} > +CFLAGS_gdtoa_${src}+=-w Do you need to turn off all warnings or just ones for non-broken precedence and a few other non-broken things? gcc doesn't give enough control over individual warnings, but it has -Wno-parentheses for turning off not much more than bogus warnings about natural precedence. > The patch I would actually like reviewed is this one: > ... 
> Index: share/mk/sys.mk > =================================================================== > RCS file: /cvs/src/share/mk/sys.mk,v > retrieving revision 1.67 > diff -u -r1.67 sys.mk > --- share/mk/sys.mk 1 Jun 2003 22:13:45 -0000 1.67 > +++ share/mk/sys.mk 21 Jun 2003 08:56:15 -0000 > ... > @@ -117,7 +117,8 @@ > > # SINGLE SUFFIX RULES > .c: > - ${CC} ${CFLAGS} ${LDFLAGS} -o ${.TARGET} ${.IMPSRC} > + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} ${LDFLAGS} \ > + -o ${.TARGET} ${.IMPSRC} > ... Some rules are specified by POSIX, so they can't be changed. I don't see how ${CFLAGS} can be per-file directly, so the POSIX spec seems to be actively opposed to per-file CFLAGS. Bruce From owner-freebsd-arch@FreeBSD.ORG Sat Jun 21 20:53:10 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id DDDFA37B401 for ; Sat, 21 Jun 2003 20:53:09 -0700 (PDT) Received: from HAL9000.homeunix.com (ip114.bella-vista.sfo.interquest.net [66.199.86.114]) by mx1.FreeBSD.org (Postfix) with ESMTP id 48BEC43F93 for ; Sat, 21 Jun 2003 20:53:08 -0700 (PDT) (envelope-from dschultz@OCF.Berkeley.EDU) Received: from HAL9000.homeunix.com (localhost [127.0.0.1]) by HAL9000.homeunix.com (8.12.9/8.12.9) with ESMTP id h5M3r1Ja060590; Sat, 21 Jun 2003 20:53:03 -0700 (PDT) (envelope-from dschultz@OCF.Berkeley.EDU) Received: (from das@localhost) by HAL9000.homeunix.com (8.12.9/8.12.9/Submit) id h5M3qwQg060589; Sat, 21 Jun 2003 20:52:58 -0700 (PDT) (envelope-from dschultz@OCF.Berkeley.EDU) Date: Sat, 21 Jun 2003 20:52:58 -0700 From: David Schultz To: Bruce Evans Message-ID: <20030622035258.GB60460@HAL9000.homeunix.com> Mail-Followup-To: Bruce Evans , arch@freebsd.org References: <20030622005124.GA59673@HAL9000.homeunix.com> <20030622114150.L54976@gamplex.bde.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030622114150.L54976@gamplex.bde.org> cc: 
arch@freebsd.org Subject: Re: Per-source CFLAGS X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 22 Jun 2003 03:53:10 -0000 On Sun, Jun 22, 2003, Bruce Evans wrote: > On Sat, 21 Jun 2003, David Schultz wrote: > > > The following patch adds support for per-file CFLAGS, which gets > > appended to the command line after the global CFLAGS. I would > > ... > > I intend to use this feature for gdtoa, which is technically part > > of libc, but also on a vendor branch and intended to stay that > > way. The problem being addressed is that gcc at higher warning > > levels has some inane warnings that the vendor and I consider > > wrong, and yet people want to be able to compile libc cleanly at > > these warning levels. As an example, gcc complains that the > > expression 'a << b - c' must have parentheses because obviously > > nobody remembers C's precedence rules. So here's just one > > potential use of the new feature: > > For this, you really want per-file WARNS, since among other reasons > compiler-dependent flags shouldn't be put in individual Makefiles. > > > Index: lib/libc/gdtoa/Makefile.inc > > =================================================================== > > RCS file: /cvs/src/lib/libc/gdtoa/Makefile.inc,v > > retrieving revision 1.3 > > diff -u -r1.3 Makefile.inc > > --- lib/libc/gdtoa/Makefile.inc 5 Apr 2003 22:10:13 -0000 1.3 > > +++ lib/libc/gdtoa/Makefile.inc 2 May 2003 09:31:15 -0000 > > @@ -16,6 +16,7 @@ > > .for src in ${GDTOASRCS} > > MISRCS+=gdtoa_${src} > > CLEANFILES+=gdtoa_${src} > > +CFLAGS_gdtoa_${src}+=-w > > Do you need to turn off all warnings or just ones for non-broken > precedence and a few other non-broken things? 
gcc doesn't give
> enough control over individual warnings, but it has -Wno-parentheses
> for turning off not much more than bogus warnings about natural
> precedence.

In this case, we really do want to ignore all the warnings.  This
is vendor code, written in a style that makes it easiest for the
author to maintain.  It so happens that -w is a de facto (if not
de jure) standard; it is supported by the GNU, Intel, and Sun C
compilers at least.  Per-file CFLAGS can't be used to disable
warnings both selectively and portably, but I believe that this
mechanism is more generically useful than are per-file WARNS.  The
latter would be useful, too, but notice that it is a natural
extension of per-file CFLAGS.  ;-)

> > # SINGLE SUFFIX RULES
> > .c:
> > - ${CC} ${CFLAGS} ${LDFLAGS} -o ${.TARGET} ${.IMPSRC}
> > + ${CC} ${CFLAGS} ${CFLAGS_${.IMPSRC}} ${LDFLAGS} \
> > + -o ${.TARGET} ${.IMPSRC}
> > ...
>
> Some rules are specified by POSIX, so they can't be changed.  I don't
> see how ${CFLAGS} can be per-file directly, so the POSIX spec seems to
> be actively opposed to per-file CFLAGS.

???  You mean we can't add a variable that will normally expand to
nil?  This seems like a compatible change, unless you're worried
about someone's makefile breaking because they defined
CFLAGS_foo.c to mean something else.
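
For readers following the precedence point quoted above: in C, the
additive operators bind more tightly than the shift operators, so
'a << b - c' is parsed as 'a << (b - c)'.  The following is only a
minimal sketch in C; the function names are hypothetical and appear
nowhere in gdtoa or in the patch.  It spells out the parse that gcc's
parentheses warning asks the author to make explicit:

```c
#include <assert.h>

/* Hypothetical helper: the unparenthesized form, as vendor code might write it. */
unsigned shift_unparenthesized(unsigned a, unsigned b, unsigned c)
{
	return a << b - c;	/* parsed as a << (b - c) */
}

/* Hypothetical helper: the same expression with the parse made explicit. */
unsigned shift_parenthesized(unsigned a, unsigned b, unsigned c)
{
	return a << (b - c);
}

/* Hypothetical helper: the other grouping, which is NOT what C chooses. */
unsigned shift_then_subtract(unsigned a, unsigned b, unsigned c)
{
	return (a << b) - c;
}

/* Checks that the unparenthesized form matches a << (b - c). */
void check_precedence(void)
{
	assert(shift_unparenthesized(1, 5, 2) == shift_parenthesized(1, 5, 2));
	assert(shift_unparenthesized(1, 5, 2) == 8);	/* 1 << 3 */
	assert(shift_then_subtract(1, 5, 2) == 30);	/* 32 - 2 */
}
```

Both spellings of the first expression compile to the same thing; the
warning concerns readability only, which is why silencing it for
unmodified vendor sources does not change behaviour.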
From owner-freebsd-arch@FreeBSD.ORG Sat Jun 21 21:55:31 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 9205E37B401 for ; Sat, 21 Jun 2003 21:55:31 -0700 (PDT) Received: from ns1.xcllnt.net (209-128-86-226.BAYAREA.NET [209.128.86.226]) by mx1.FreeBSD.org (Postfix) with ESMTP id 73E3F43F85 for ; Sat, 21 Jun 2003 21:55:30 -0700 (PDT) (envelope-from marcel@xcllnt.net) Received: from dhcp01.pn.xcllnt.net (dhcp01.pn.xcllnt.net [192.168.4.201]) by ns1.xcllnt.net (8.12.9/8.12.9) with ESMTP id h5M4tUDZ088555 for ; Sat, 21 Jun 2003 21:55:30 -0700 (PDT) (envelope-from marcel@piii.pn.xcllnt.net) Received: from dhcp01.pn.xcllnt.net (localhost [127.0.0.1]) by dhcp01.pn.xcllnt.net (8.12.9/8.12.9) with ESMTP id h5M4tTSx080511 for ; Sat, 21 Jun 2003 21:55:30 -0700 (PDT) (envelope-from marcel@dhcp01.pn.xcllnt.net) Received: (from marcel@localhost) by dhcp01.pn.xcllnt.net (8.12.9/8.12.9/Submit) id h5M4tTCY080510 for arch@FreeBSD.ORG; Sat, 21 Jun 2003 21:55:29 -0700 (PDT) (envelope-from marcel) Date: Sat, 21 Jun 2003 21:55:29 -0700 From: Marcel Moolenaar To: arch@FreeBSD.ORG Message-ID: <20030622045529.GA80446@dhcp01.pn.xcllnt.net> References: <20030622005124.GA59673@HAL9000.homeunix.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030622005124.GA59673@HAL9000.homeunix.com> User-Agent: Mutt/1.5.4i Subject: Re: Per-source CFLAGS X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 22 Jun 2003 04:55:31 -0000 On Sat, Jun 21, 2003 at 05:51:24PM -0700, David Schultz wrote: > The following patch adds support for per-file CFLAGS, which gets > appended to the command line after the global CFLAGS. 
Per file compilation options are in direct conflict with make
invocator control, by way of it being a makefile writer knob.
Put differently: it's a feature for developers, not builders.
We already see the problem with that when we define CFLAGS on
the make command line, rather than in the environment. I'm
not opposed to per-file options, but it seems to push the
need to split make invocator knobs from makefile writer knobs.
Until we have such separation, I request that per-file options
be made conditional so that make invocators still have control
without being powerless.

--
Marcel Moolenaar USPA: A-39004 marcel@xcllnt.net

From owner-freebsd-arch@FreeBSD.ORG Sat Jun 21 23:45:29 2003 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 0F64937B401 for ; Sat, 21 Jun 2003 23:45:29 -0700 (PDT) Received: from HAL9000.homeunix.com (ip114.bella-vista.sfo.interquest.net [66.199.86.114]) by mx1.FreeBSD.org (Postfix) with ESMTP id A0F2243FB1 for ; Sat, 21 Jun 2003 23:45:27 -0700 (PDT) (envelope-from das@FreeBSD.ORG) Received: from HAL9000.homeunix.com (localhost [127.0.0.1]) by HAL9000.homeunix.com (8.12.9/8.12.9) with ESMTP id h5M6jMJa061263; Sat, 21 Jun 2003 23:45:23 -0700 (PDT) (envelope-from das@FreeBSD.ORG) Received: (from das@localhost) by HAL9000.homeunix.com (8.12.9/8.12.9/Submit) id h5M6jMM3061262; Sat, 21 Jun 2003 23:45:22 -0700 (PDT) (envelope-from das@FreeBSD.ORG) Date: Sat, 21 Jun 2003 23:45:22 -0700 From: David Schultz To: Marcel Moolenaar Message-ID: <20030622064521.GA61030@HAL9000.homeunix.com> Mail-Followup-To: Marcel Moolenaar , arch@freebsd.org References: <20030622005124.GA59673@HAL9000.homeunix.com> <20030622045529.GA80446@dhcp01.pn.xcllnt.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030622045529.GA80446@dhcp01.pn.xcllnt.net> cc: arch@FreeBSD.ORG Subject: Re: Per-source CFLAGS 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 22 Jun 2003 06:45:29 -0000

On Sat, Jun 21, 2003, Marcel Moolenaar wrote:
> On Sat, Jun 21, 2003 at 05:51:24PM -0700, David Schultz wrote:
> > The following patch adds support for per-file CFLAGS, which gets
> > appended to the command line after the global CFLAGS.
>
> Per file compilation options are in direct conflict with make
> invocator control, by way of it being a makefile writer knob.
> Put differently: it's a feature for developers, not builders.
> We already see the problem with that when we define CFLAGS on
> the make command line, rather than in the environment. I'm
> not opposed to per-file options, but it seems to push the
> need to split make invocator knobs from makefile writer knobs.
> Until we have such separation, I request that per-file options
> be made conditional so that make invocators still have control
> without being powerless.

I expect that this feature would not be used except in very
special cases, and I would be opposed to gratuitous use of it.  In
fact, most of these cases are so special that the relevant file
probably won't even work without the extra option.  For example,
Peter mentioned a while ago that vfprintf.c was causing an ICE
unless -O was turned off.

Since these things are only used selectively, it only makes sense
to disable them selectively.  For instance, if we set it on two
files to temporarily work around a gcc bug, and on another file
because it's vendor code that we don't want to see warnings for, a
big knob that says ``Turn off all the special cases'' wouldn't
make much sense.

However, if what you're looking for is the ability to say
GDTOA_WARNS=YES in your make.conf, that can certainly be done on a
case by case basis.  Would this satisfy your concerns?
I realize it isn't perfect, but I'm not prepared to rewrite the entire build infrastructure over an issue (gdtoa warnings) that I don't really want to deal with. I already tried getting the vendor to conform to GNU's preferred style, and I already tried convincing people that gdtoa doesn't really have to be vendor code, but to no avail.