From owner-freebsd-arch@FreeBSD.ORG Mon Feb 5 00:00:33 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id C605116A400; Mon, 5 Feb 2007 00:00:32 +0000 (UTC) (envelope-from alc@cs.rice.edu) Received: from mail.cs.rice.edu (mail.cs.rice.edu [128.42.1.31]) by mx1.freebsd.org (Postfix) with ESMTP id 7B43113C491; Mon, 5 Feb 2007 00:00:32 +0000 (UTC) (envelope-from alc@cs.rice.edu) Received: from mail.cs.rice.edu (localhost.localdomain [127.0.0.1]) by mail.cs.rice.edu (Postfix) with ESMTP id 833BA2C2AC9; Sun, 4 Feb 2007 17:41:10 -0600 (CST) X-Virus-Scanned: by amavis-2.4.0 at mail.cs.rice.edu Received: from mail.cs.rice.edu ([127.0.0.1]) by mail.cs.rice.edu (mail.cs.rice.edu [127.0.0.1]) (amavisd-new, port 10024) with LMTP id GXjMwLHzpR6a; Sun, 4 Feb 2007 17:41:09 -0600 (CST) Received: from [216.63.78.18] (adsl-216-63-78-18.dsl.hstntx.swbell.net [216.63.78.18]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.cs.rice.edu (Postfix) with ESMTP id A0B822C2ABD; Sun, 4 Feb 2007 17:41:09 -0600 (CST) Message-ID: <45C66F14.2020907@cs.rice.edu> Date: Sun, 04 Feb 2007 17:41:08 -0600 From: Alan Cox User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.13) Gecko/20061115 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Suleiman Souhlal References: <25784.67.188.127.3.1170310494.squirrel@ssl.mu.org> In-Reply-To: <25784.67.188.127.3.1170310494.squirrel@ssl.mu.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: alc@freebsd.org, arch@freebsd.org Subject: Re: Reducing vm page queue mutex contention X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 05 Feb 2007 00:00:33 -0000 Suleiman Souhlal wrote: >Hello Alan, > >Profiling shows that the vm page queue mutex is the most contended lock in >the kernel, maybe apart from sched_lock. It seems that this is in part >because this lock protects a lot of things: page queues, pv entries, page >flags, page hold count, page wired count.. > > > To start, I would concentrate on entirely eliminating the use of the page queues lock from pmap_enter() and pmap_removes_pages(). I've inlined specific comments below. >I came up with a possible plan to reduce contention on this lock, >concentrating on the amd64 pmap (although these should be applicable to >the other architectures as well): > >- Make vm_page_flag_set/clear() just use atomic operations to get rid of > the page queues lock dependency. > I'm still not entirely convinced this is entirely safe. > > > I would not do this. Instead, I would create a separate machine-dependent flags field that is synchronized by the same lock as the pv lists. This field would then be used for flags such as PG_REFERENCED and PG_WRITEABLE. (In fact, I believe that there is wasted space due to alignment in amd64's page structure that could be used for this.) [snip] >- Add a mutex pool for vm pages to protect the pv entries lists. > I'm currently working on this. > My current approach makes struct pv_entry larger because it needs to store > a pointer to the pte in each pv_entry. > > > While a mutex pool may ultimately be needed, I would start with a simpler approach and then reevaluate what should be the next step. Specifically, I would start with a single lock for the pv lists and machine-dependent flags. Then, you won't need the pointer. Moreover, the use of a mutex pool is going to add overhead to the ranged operations like pmap_remove_pages() and pmap_remove(). > Another way that might be better is to move to per-object pv entries, > which is what Linux does. It would greatly reduce memory usage when > mapping large objects in a lot of processes, although it might be slower > for sparsely faulted objects mapped in a large number of processes. > This approach would be a lot of work, which is why I'm leaning towards > keeping per-page pv entries. > > > I have addressed this problem in a different way. With the superpages support in perforce, I create a single pv entry per superpage mapping. >- It should be possible to make vm_page->wired_count use atomic operations > instead of needing a lock, similarly to what I did for the hold_count. > This might be a bit tricky, but hopefully possible. > Alternatively, we could use the mutex pool described above to protect it. > > > Page table page's (ab)use their wire_count field as a reference count. The pmap lock is already held whenever this count is changed, or more generally when the page table is changed. So, at least for page table pages nothing needs to change. >- We can change pmap_unuse_pt and free_pv_entry to just mark the pages they > want to free in an array allocated by the caller. > The caller will then free those pages after it drops the pmap lock. > > For example: > > struct pages_to_free { > vm_page_t *page[MAX_PAGES]; > int num_pages; > }; > > void pmap_remove(...) > { > struct pages_to_free pages; > > PMAP_LOCK(pmap); > ... > pmap_unuse_pt(..., &pages); > ... > PMAP_UNLOCK(pmap); > vm_page_lock_queues(); > for (i = 0; i < pages.num_pages; i++) > vm_page_free(pages.page[i]); > vm_page_unlock_queues(); > } > > This way, pmap_remove can be mostly without queues lock. > > > I can make vm_page_free() callable without the page queues lock for page table and pv entry pages, i.e., pages that don't belong to a vm object. Then, you don't need to do this. However, there is another issue that you don't touch on here. The page queues lock is being used to synchronize changes to the page's dirty field and the PTE's PG_M bit against testing for dirty pages in the machine-independent (MI) code. There are ordering issues in both the pmap and MI code, e.g., pmap_enter() clears PG_M on an old mapping before setting the page's dirty field and vm_page_dontneed() reads the page's dirty field and then conditionally call pmap_is_modified(). >- Once the above are done, it should be possible to make pmap_enter() run > mostly queue lock free by: > - Pre-allocating a pv chunk early in pmap_enter, if there are no free > ones, so that we never have to allocate new chunks in pmap_insert_entry. > - Dropping the page queues lock immediately after the pmap_allocpte in > pmap_enter. > > Actually, you shouldn't need to acquire the page queues lock at all or move the allocation of a pv_chunk. Regards, Alan From owner-freebsd-arch@FreeBSD.ORG Tue Feb 6 13:13:04 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D79B716A400 for ; Tue, 6 Feb 2007 13:13:02 +0000 (UTC) (envelope-from antinvidia@gmail.com) Received: from nz-out-0506.google.com (nz-out-0506.google.com [64.233.162.237]) by mx1.freebsd.org (Postfix) with ESMTP id 9CA9B13C428 for ; Tue, 6 Feb 2007 13:13:02 +0000 (UTC) (envelope-from antinvidia@gmail.com) Received: by nz-out-0506.google.com with SMTP id i11so1933822nzh for ; Tue, 06 Feb 2007 05:13:02 -0800 (PST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=received:message-id:date:from:to:subject:mime-version:content-type; b=faAfCsasl4ohMZpwLlIDKBH5gXRBDdfMDqCpLwQM7mqXPQM2KbKy6sZ2JdAL55I+ZnmM7gZK0YtVbybmzdXqyiIxuVSvzppMI6bleCRbfD2eCirxV1L6KBNK40+cTH4Smc/pI3byhVWbRO16+Disj52Nmdn15aJ/YxPJk1Fg/g4= Received: by 10.35.111.17 with SMTP id o17mr16215835pym.1170765929243; Tue, 06 Feb 2007 04:45:29 -0800 (PST) Received: by 10.35.58.13 with HTTP; Tue, 6 Feb 2007 04:45:29 -0800 (PST) Message-ID: Date: Tue, 6 Feb 2007 12:45:29 +0000 From: MQ To: freebsd-arch@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Subject: a question about prison_ip X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 06 Feb 2007 13:13:04 -0000 Hi everyone, I was reading jail(8) related codes in the file kern_jail.c recently. I found that prison_ip provided three parameters, the second of which is a flag that indicate whether the third argument ip is host byte order or network byte order. I skimmed the codes that invoke prison_ip function, all of them calls prison_ip with flag=0 to indicate that the ip argument is in network byte order. I'm considering if we can remove the flag parameter, and modify corresponding source codes, say, userland program jail.c and jls.c to keep consistent with the kernel. This can avoid some useless action taken by current kernel. But I don't know if current implementation of the prison_ip has some special reasons to act like this? Does anyone have opinions with this problem? Best regards. MQ From owner-freebsd-arch@FreeBSD.ORG Tue Feb 6 20:37:33 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D88C116A403; Tue, 6 Feb 2007 20:37:33 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (66-23-211-162.clients.speedfactory.net [66.23.211.162]) by mx1.freebsd.org (Postfix) with ESMTP id 644AE13C46B; Tue, 6 Feb 2007 20:37:33 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.6/8.13.6) with ESMTP id l16KbUGW072639; Tue, 6 Feb 2007 15:37:31 -0500 (EST) (envelope-from jhb@freebsd.org) From: John Baldwin To: arch@freebsd.org Date: Tue, 6 Feb 2007 15:37:53 -0500 User-Agent: KMail/1.9.1 MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200702061537.54056.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Tue, 06 Feb 2007 15:37:31 -0500 (EST) X-Virus-Scanned: ClamAV 0.88.3/2529/Tue Feb 6 14:25:02 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Subject: Restoring crash dump ability for kvm_getswapinfo(3) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 06 Feb 2007 20:37:34 -0000 The patch below fixes kvm_getswapinfo(3) (and thus pstat -T/pstat -s/swapinfo) on crash dumps. The one ugliness in it is that since 'struct swdevt' was moved into swap_pager.c, there is no longer a header that I can include to get it (used to be in vm/swap_pager.h) so I have a copy of the structure in kvm_getswapinfo.c. I'd prefer to move the structures back into vm/swap_pager.h. Index: kvm_getswapinfo.c =================================================================== RCS file: /usr/cvs/src/lib/libkvm/kvm_getswapinfo.c,v retrieving revision 1.26 diff -u -r1.26 kvm_getswapinfo.c --- kvm_getswapinfo.c 31 Jul 2004 18:49:53 -0000 1.26 +++ kvm_getswapinfo.c 6 Feb 2007 20:31:09 -0000 @@ -49,18 +49,59 @@ #include "kvm_private.h" -#define NL_SWAPBLIST 0 -#define NL_SWDEVT 1 -#define NL_NSWDEV 2 -#define NL_DMMAX 3 +/* Grrr, this is hidden in swap_pager.c now, how annoying. */ +typedef int32_t swblk_t; + +struct buf; +struct blist; + +struct swdevt; +typedef void sw_strategy_t(struct buf *bp, struct swdevt *sw); +typedef void sw_close_t(struct thread *td, struct swdevt *sw); + +struct swdevt { + int sw_flags; + int sw_nblks; + int sw_used; + dev_t sw_dev; + struct vnode *sw_vp; + void *sw_id; + swblk_t sw_first; + swblk_t sw_end; + struct blist *sw_blist; + TAILQ_ENTRY(swdevt) sw_list; + sw_strategy_t *sw_strategy; + sw_close_t *sw_close; +}; + +static struct nlist kvm_swap_nl[] = { + { "_swtailq" }, /* list of swap devices and sizes */ + { "_dmmax" }, /* maximum size of a swap block */ + { NULL } +}; + +#define NL_SWTAILQ 0 +#define NL_DMMAX 1 static int kvm_swap_nl_cached = 0; static int unswdev; /* number of found swap dev's */ static int dmmax; +static int kvm_getswapinfo_kvm(kvm_t *, struct kvm_swap *, int, int); static int kvm_getswapinfo_sysctl(kvm_t *, struct kvm_swap *, int, int); +static int nlist_init(kvm_t *); static int getsysctl(kvm_t *, char *, void *, size_t); +#define KREAD(kd, addr, obj) \ + (kvm_read(kd, addr, (char *)(obj), sizeof(*obj)) != sizeof(*obj)) +#define KGET(idx, var) \ + KGET2(kvm_swap_nl[(idx)].n_value, var, kvm_swap_nl[(idx)].n_name) +#define KGET2(addr, var, msg) \ + if (KREAD(kd, (u_long)(addr), (var))) { \ + _kvm_err(kd, kd->program, "cannot read %s", msg); \ + return (-1); \ + } + #define GETSWDEVNAME(dev, str, flags) \ if (dev == NODEV) { \ strlcpy(str, "[NFS swap]", sizeof(str)); \ @@ -91,8 +132,50 @@ if (ISALIVE(kd)) { return kvm_getswapinfo_sysctl(kd, swap_ary, swap_max, flags); } else { - return -1; + return kvm_getswapinfo_kvm(kd, swap_ary, swap_max, flags); + } +} + +int +kvm_getswapinfo_kvm( + kvm_t *kd, + struct kvm_swap *swap_ary, + int swap_max, + int flags +) { + int i, ttl; + TAILQ_HEAD(, swdevt) swtailq; + struct swdevt *sp, swinfo; + struct kvm_swap tot; + + if (!nlist_init(kd)) + return (-1); + + bzero(&tot, sizeof(tot)); + KGET(NL_SWTAILQ, &swtailq); + sp = TAILQ_FIRST(&swtailq); + for (i = 0; sp != NULL; i++) { + KGET2(sp, &swinfo, "swinfo"); + ttl = swinfo.sw_nblks - dmmax; + if (i < swap_max - 1) { + bzero(&swap_ary[i], sizeof(swap_ary[i])); + swap_ary[i].ksw_total = ttl; + swap_ary[i].ksw_used = swinfo.sw_used; + swap_ary[i].ksw_flags = swinfo.sw_flags; + GETSWDEVNAME(swinfo.sw_dev, swap_ary[i].ksw_devname, + flags); + } + tot.ksw_total += ttl; + tot.ksw_used += swinfo.sw_used; + sp = TAILQ_NEXT(&swinfo, sw_list); } + + if (i >= swap_max) + i = swap_max - 1; + if (i >= 0) + swap_ary[i] = tot; + + return(i); } #define GETSYSCTL(kd, name, var) \ @@ -168,6 +251,36 @@ } static int +nlist_init(kvm_t *kd) +{ + TAILQ_HEAD(, swdevt) swtailq; + struct swdevt *sp, swinfo; + + if (kvm_swap_nl_cached) + return (1); + + if (kvm_nlist(kd, kvm_swap_nl) < 0) + return (0); + + /* Required entries */ + if (kvm_swap_nl[NL_SWTAILQ].n_value == 0) { + _kvm_err(kd, kd->program, "unable to find swtailq"); + return (0); + } + + if (kvm_swap_nl[NL_DMMAX].n_value == 0) { + _kvm_err(kd, kd->program, "unable to find dmmax"); + return (0); + } + + /* Get globals, type of swap */ + KGET(NL_DMMAX, &dmmax); + + kvm_swap_nl_cached = 1; + return (1); +} + +static int getsysctl ( kvm_t *kd, char *name, -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Tue Feb 6 20:57:02 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5D80E16A400; Tue, 6 Feb 2007 20:57:02 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (66-23-211-162.clients.speedfactory.net [66.23.211.162]) by mx1.freebsd.org (Postfix) with ESMTP id B90CD13C48E; Tue, 6 Feb 2007 20:57:01 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.6/8.13.6) with ESMTP id l16KuwUI072784; Tue, 6 Feb 2007 15:56:58 -0500 (EST) (envelope-from jhb@freebsd.org) From: John Baldwin To: "Attilio Rao" Date: Tue, 6 Feb 2007 15:57:24 -0500 User-Agent: KMail/1.9.1 References: <200702061537.54056.jhb@freebsd.org> <3bbf2fe10702061246ofdc00a6kc9ba5366aea4e2e2@mail.gmail.com> In-Reply-To: <3bbf2fe10702061246ofdc00a6kc9ba5366aea4e2e2@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200702061557.24646.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Tue, 06 Feb 2007 15:56:58 -0500 (EST) X-Virus-Scanned: ClamAV 0.88.3/2529/Tue Feb 6 14:25:02 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: arch@freebsd.org Subject: Re: Restoring crash dump ability for kvm_getswapinfo(3) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 06 Feb 2007 20:57:02 -0000 On Tuesday 06 February 2007 15:46, Attilio Rao wrote: > 2007/2/6, John Baldwin : > > The patch below fixes kvm_getswapinfo(3) (and thus pstat -T/pstat -s/swapinfo) > > on crash dumps. The one ugliness in it is that since 'struct swdevt' was > > moved into swap_pager.c, there is no longer a header that I can include to > > get it (used to be in vm/swap_pager.h) so I have a copy of the structure in > > kvm_getswapinfo.c. I'd prefer to move the structures back into > > vm/swap_pager.h. > > > > Index: kvm_getswapinfo.c > > =================================================================== > > RCS file: /usr/cvs/src/lib/libkvm/kvm_getswapinfo.c,v > > retrieving revision 1.26 > > diff -u -r1.26 kvm_getswapinfo.c > > --- kvm_getswapinfo.c 31 Jul 2004 18:49:53 -0000 1.26 > > +++ kvm_getswapinfo.c 6 Feb 2007 20:31:09 -0000 > > @@ -49,18 +49,59 @@ > > > > #include "kvm_private.h" > > > > -#define NL_SWAPBLIST 0 > > -#define NL_SWDEVT 1 > > -#define NL_NSWDEV 2 > > -#define NL_DMMAX 3 > > +/* Grrr, this is hidden in swap_pager.c now, how annoying. */ > > +typedef int32_t swblk_t; > > It would not be better exporting swblk_t definition into swap_pager.h, > while you are here? Yes, as I mentioned, I would prefer that, but they were explicitly moved from swap_pager.h to swap_pager.c previously, so I'd like to make sure it is ok to move them back. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Tue Feb 6 21:02:03 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 25C7316A400; Tue, 6 Feb 2007 21:02:03 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id C211B13C461; Tue, 6 Feb 2007 21:02:02 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.48.2]) by phk.freebsd.dk (Postfix) with ESMTP id 6F2C91747C; Tue, 6 Feb 2007 21:02:01 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.13.8/8.13.8) with ESMTP id l16L20Fl094788; Tue, 6 Feb 2007 21:02:01 GMT (envelope-from phk@critter.freebsd.dk) To: John Baldwin From: "Poul-Henning Kamp" In-Reply-To: Your message of "Tue, 06 Feb 2007 15:57:24 EST." <200702061557.24646.jhb@freebsd.org> Date: Tue, 06 Feb 2007 21:02:00 +0000 Message-ID: <94787.1170795720@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: Attilio Rao , arch@freebsd.org Subject: Re: Restoring crash dump ability for kvm_getswapinfo(3) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 06 Feb 2007 21:02:03 -0000 In message <200702061557.24646.jhb@freebsd.org>, John Baldwin writes: >Yes, as I mentioned, I would prefer that, but they were explicitly moved from >swap_pager.h to swap_pager.c previously, so I'd like to make sure it is ok to >move them back. No objection from me. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Tue Feb 6 21:13:16 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8989016A402 for ; Tue, 6 Feb 2007 21:13:16 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from nf-out-0910.google.com (nf-out-0910.google.com [64.233.182.188]) by mx1.freebsd.org (Postfix) with ESMTP id 26E5B13C494 for ; Tue, 6 Feb 2007 21:13:15 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: by nf-out-0910.google.com with SMTP id m19so278780nfc for ; Tue, 06 Feb 2007 13:13:15 -0800 (PST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=received:message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references:x-google-sender-auth; b=crQrQREhBSOq22Oe0TcHP9Qc0g41ybbdR7sapJViDQ+4HtYKaUxc6ioSZ/C1AkyqmUWtxaSPlN10fpfJqADJTe9kHtrqCZg+NI8YRU7JjkDo6GcjSn+fCNre4CHhVkb8M2nR1yTK0MMCrzAmBxJ7JSAtnYwd3o7zljJHx9TnI/Q= Received: by 10.82.136.4 with SMTP id j4mr5192053bud.1170794774248; Tue, 06 Feb 2007 12:46:14 -0800 (PST) Received: by 10.48.238.9 with HTTP; Tue, 6 Feb 2007 12:46:14 -0800 (PST) Message-ID: <3bbf2fe10702061246ofdc00a6kc9ba5366aea4e2e2@mail.gmail.com> Date: Tue, 6 Feb 2007 21:46:14 +0100 From: "Attilio Rao" Sender: asmrookie@gmail.com To: "John Baldwin" In-Reply-To: <200702061537.54056.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <200702061537.54056.jhb@freebsd.org> X-Google-Sender-Auth: 3b650e85f0c74580 Cc: arch@freebsd.org Subject: Re: Restoring crash dump ability for kvm_getswapinfo(3) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 06 Feb 2007 21:13:16 -0000 2007/2/6, John Baldwin : > The patch below fixes kvm_getswapinfo(3) (and thus pstat -T/pstat -s/swapinfo) > on crash dumps. The one ugliness in it is that since 'struct swdevt' was > moved into swap_pager.c, there is no longer a header that I can include to > get it (used to be in vm/swap_pager.h) so I have a copy of the structure in > kvm_getswapinfo.c. I'd prefer to move the structures back into > vm/swap_pager.h. > > Index: kvm_getswapinfo.c > =================================================================== > RCS file: /usr/cvs/src/lib/libkvm/kvm_getswapinfo.c,v > retrieving revision 1.26 > diff -u -r1.26 kvm_getswapinfo.c > --- kvm_getswapinfo.c 31 Jul 2004 18:49:53 -0000 1.26 > +++ kvm_getswapinfo.c 6 Feb 2007 20:31:09 -0000 > @@ -49,18 +49,59 @@ > > #include "kvm_private.h" > > -#define NL_SWAPBLIST 0 > -#define NL_SWDEVT 1 > -#define NL_NSWDEV 2 > -#define NL_DMMAX 3 > +/* Grrr, this is hidden in swap_pager.c now, how annoying. */ > +typedef int32_t swblk_t; It would not be better exporting swblk_t definition into swap_pager.h, while you are here? Attilio -- Peace can only be achieved by understanding - A. Einstein From owner-freebsd-arch@FreeBSD.ORG Tue Feb 6 21:31:34 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 2251416A401 for ; Tue, 6 Feb 2007 21:31:34 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outY.internet-mail-service.net (outY.internet-mail-service.net [216.240.47.248]) by mx1.freebsd.org (Postfix) with ESMTP id 0D0E813C4B5 for ; Tue, 6 Feb 2007 21:31:33 +0000 (UTC) (envelope-from julian@elischer.org) Received: from shell.idiom.com (HELO idiom.com) (216.240.47.20) by out.internet-mail-service.net (qpsmtpd/0.32) with ESMTP; Tue, 06 Feb 2007 13:09:12 -0800 Received: from [192.168.2.5] (home.elischer.org [216.240.48.38]) by idiom.com (Postfix) with ESMTP id 91EAF125B31; Tue, 6 Feb 2007 13:31:32 -0800 (PST) Message-ID: <45C8F3B4.8060400@elischer.org> Date: Tue, 06 Feb 2007 13:31:32 -0800 From: Julian Elischer User-Agent: Thunderbird 1.5.0.9 (Macintosh/20061207) MIME-Version: 1.0 To: John Baldwin References: <200702061537.54056.jhb@freebsd.org> In-Reply-To: <200702061537.54056.jhb@freebsd.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: arch@freebsd.org Subject: Re: Restoring crash dump ability for kvm_getswapinfo(3) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 06 Feb 2007 21:31:34 -0000 John Baldwin wrote: > The patch below fixes kvm_getswapinfo(3) (and thus pstat -T/pstat -s/swapinfo) > on crash dumps. The one ugliness in it is that since 'struct swdevt' was > moved into swap_pager.c, there is no longer a header that I can include to > get it (used to be in vm/swap_pager.h) so I have a copy of the structure in > kvm_getswapinfo.c. I'd prefer to move the structures back into > vm/swap_pager.h. I agree From owner-freebsd-arch@FreeBSD.ORG Tue Feb 6 22:36:33 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8D4B616A414; Tue, 6 Feb 2007 22:36:33 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (66-23-211-162.clients.speedfactory.net [66.23.211.162]) by mx1.freebsd.org (Postfix) with ESMTP id 1A73B13C467; Tue, 6 Feb 2007 22:36:32 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.6/8.13.6) with ESMTP id l16MaU7U073514; Tue, 6 Feb 2007 17:36:31 -0500 (EST) (envelope-from jhb@freebsd.org) From: John Baldwin To: "Poul-Henning Kamp" Date: Tue, 6 Feb 2007 17:36:54 -0500 User-Agent: KMail/1.9.1 References: <94787.1170795720@critter.freebsd.dk> In-Reply-To: <94787.1170795720@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200702061736.54926.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Tue, 06 Feb 2007 17:36:31 -0500 (EST) X-Virus-Scanned: ClamAV 0.88.3/2529/Tue Feb 6 14:25:02 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Attilio Rao , arch@freebsd.org Subject: Re: Restoring crash dump ability for kvm_getswapinfo(3) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 06 Feb 2007 22:36:33 -0000 On Tuesday 06 February 2007 16:02, Poul-Henning Kamp wrote: > In message <200702061557.24646.jhb@freebsd.org>, John Baldwin writes: > > >Yes, as I mentioned, I would prefer that, but they were explicitly moved from > >swap_pager.h to swap_pager.c previously, so I'd like to make sure it is ok to > >move them back. > > No objection from me. Ok, cool. This is the updated patch: Index: lib/libkvm/kvm_getswapinfo.c =================================================================== RCS file: /usr/cvs/src/lib/libkvm/kvm_getswapinfo.c,v retrieving revision 1.26 diff -u -r1.26 kvm_getswapinfo.c --- lib/libkvm/kvm_getswapinfo.c 31 Jul 2004 18:49:53 -0000 1.26 +++ lib/libkvm/kvm_getswapinfo.c 6 Feb 2007 21:27:52 -0000 @@ -33,6 +33,7 @@ #include #include +#include #include #include @@ -49,18 +50,34 @@ #include "kvm_private.h" -#define NL_SWAPBLIST 0 -#define NL_SWDEVT 1 -#define NL_NSWDEV 2 -#define NL_DMMAX 3 +static struct nlist kvm_swap_nl[] = { + { "_swtailq" }, /* list of swap devices and sizes */ + { "_dmmax" }, /* maximum size of a swap block */ + { NULL } +}; + +#define NL_SWTAILQ 0 +#define NL_DMMAX 1 static int kvm_swap_nl_cached = 0; static int unswdev; /* number of found swap dev's */ static int dmmax; +static int kvm_getswapinfo_kvm(kvm_t *, struct kvm_swap *, int, int); static int kvm_getswapinfo_sysctl(kvm_t *, struct kvm_swap *, int, int); +static int nlist_init(kvm_t *); static int getsysctl(kvm_t *, char *, void *, size_t); +#define KREAD(kd, addr, obj) \ + (kvm_read(kd, addr, (char *)(obj), sizeof(*obj)) != sizeof(*obj)) +#define KGET(idx, var) \ + KGET2(kvm_swap_nl[(idx)].n_value, var, kvm_swap_nl[(idx)].n_name) +#define KGET2(addr, var, msg) \ + if (KREAD(kd, (u_long)(addr), (var))) { \ + _kvm_err(kd, kd->program, "cannot read %s", msg); \ + return (-1); \ + } + #define GETSWDEVNAME(dev, str, flags) \ if (dev == NODEV) { \ strlcpy(str, "[NFS swap]", sizeof(str)); \ @@ -91,8 +108,50 @@ if (ISALIVE(kd)) { return kvm_getswapinfo_sysctl(kd, swap_ary, swap_max, flags); } else { - return -1; + return kvm_getswapinfo_kvm(kd, swap_ary, swap_max, flags); + } +} + +int +kvm_getswapinfo_kvm( + kvm_t *kd, + struct kvm_swap *swap_ary, + int swap_max, + int flags +) { + int i, ttl; + TAILQ_HEAD(, swdevt) swtailq; + struct swdevt *sp, swinfo; + struct kvm_swap tot; + + if (!nlist_init(kd)) + return (-1); + + bzero(&tot, sizeof(tot)); + KGET(NL_SWTAILQ, &swtailq); + sp = TAILQ_FIRST(&swtailq); + for (i = 0; sp != NULL; i++) { + KGET2(sp, &swinfo, "swinfo"); + ttl = swinfo.sw_nblks - dmmax; + if (i < swap_max - 1) { + bzero(&swap_ary[i], sizeof(swap_ary[i])); + swap_ary[i].ksw_total = ttl; + swap_ary[i].ksw_used = swinfo.sw_used; + swap_ary[i].ksw_flags = swinfo.sw_flags; + GETSWDEVNAME(swinfo.sw_dev, swap_ary[i].ksw_devname, + flags); + } + tot.ksw_total += ttl; + tot.ksw_used += swinfo.sw_used; + sp = TAILQ_NEXT(&swinfo, sw_list); } + + if (i >= swap_max) + i = swap_max - 1; + if (i >= 0) + swap_ary[i] = tot; + + return(i); } #define GETSYSCTL(kd, name, var) \ @@ -168,6 +227,36 @@ } static int +nlist_init(kvm_t *kd) +{ + TAILQ_HEAD(, swdevt) swtailq; + struct swdevt *sp, swinfo; + + if (kvm_swap_nl_cached) + return (1); + + if (kvm_nlist(kd, kvm_swap_nl) < 0) + return (0); + + /* Required entries */ + if (kvm_swap_nl[NL_SWTAILQ].n_value == 0) { + _kvm_err(kd, kd->program, "unable to find swtailq"); + return (0); + } + + if (kvm_swap_nl[NL_DMMAX].n_value == 0) { + _kvm_err(kd, kd->program, "unable to find dmmax"); + return (0); + } + + /* Get globals, type of swap */ + KGET(NL_DMMAX, &dmmax); + + kvm_swap_nl_cached = 1; + return (1); +} + +static int getsysctl ( kvm_t *kd, char *name, Index: sys/vm/swap_pager.c =================================================================== RCS file: /usr/cvs/src/sys/vm/swap_pager.c,v retrieving revision 1.286 diff -u -r1.286 swap_pager.c --- sys/vm/swap_pager.c 5 Jan 2007 19:09:01 -0000 1.286 +++ sys/vm/swap_pager.c 6 Feb 2007 21:30:28 -0000 @@ -138,37 +138,6 @@ #define SWAP_META_PAGES (SWB_NPAGES * 2) #define SWAP_META_MASK (SWAP_META_PAGES - 1) -typedef int32_t swblk_t; /* - * swap offset. This is the type used to - * address the "virtual swap device" and - * therefore the maximum swap space is - * 2^32 pages. - */ - -struct swdevt; -typedef void sw_strategy_t(struct buf *bp, struct swdevt *sw); -typedef void sw_close_t(struct thread *td, struct swdevt *sw); - -/* - * Swap device table - */ -struct swdevt { - int sw_flags; - int sw_nblks; - int sw_used; - dev_t sw_dev; - struct vnode *sw_vp; - void *sw_id; - swblk_t sw_first; - swblk_t sw_end; - struct blist *sw_blist; - TAILQ_ENTRY(swdevt) sw_list; - sw_strategy_t *sw_strategy; - sw_close_t *sw_close; -}; - -#define SW_CLOSING 0x04 - struct swblock { struct swblock *swb_hnext; vm_object_t swb_object; Index: sys/vm/swap_pager.h =================================================================== RCS file: /usr/cvs/src/sys/vm/swap_pager.h,v retrieving revision 1.51 diff -u -r1.51 swap_pager.h --- sys/vm/swap_pager.h 10 Apr 2006 10:03:41 -0000 1.51 +++ sys/vm/swap_pager.h 6 Feb 2007 21:27:33 -0000 @@ -38,6 +38,38 @@ #ifndef _VM_SWAP_PAGER_H_ #define _VM_SWAP_PAGER_H_ 1 +typedef int32_t swblk_t; /* + * swap offset. This is the type used to + * address the "virtual swap device" and + * therefore the maximum swap space is + * 2^32 pages. + */ + +struct buf; +struct swdevt; +typedef void sw_strategy_t(struct buf *, struct swdevt *); +typedef void sw_close_t(struct thread *, struct swdevt *); + +/* + * Swap device table + */ +struct swdevt { + int sw_flags; + int sw_nblks; + int sw_used; + dev_t sw_dev; + struct vnode *sw_vp; + void *sw_id; + swblk_t sw_first; + swblk_t sw_end; + struct blist *sw_blist; + TAILQ_ENTRY(swdevt) sw_list; + sw_strategy_t *sw_strategy; + sw_close_t *sw_close; +}; + +#define SW_CLOSING 0x04 + #ifdef _KERNEL extern int swap_pager_full; -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Fri Feb 9 00:00:47 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E042816A402 for ; Fri, 9 Feb 2007 00:00:47 +0000 (UTC) (envelope-from gnn@neville-neil.com) Received: from mrout3.yahoo.com (mrout3.yahoo.com [216.145.54.173]) by mx1.freebsd.org (Postfix) with ESMTP id C423813C428 for ; Fri, 9 Feb 2007 00:00:47 +0000 (UTC) (envelope-from gnn@neville-neil.com) Received: from minion.neville-neil.com (proxy8.corp.yahoo.com [216.145.48.13]) by mrout3.yahoo.com (8.13.6/8.13.6/y.out) with ESMTP id l18NoH6T064239; Thu, 8 Feb 2007 15:50:18 -0800 (PST) Date: Thu, 08 Feb 2007 15:50:10 -0800 Message-ID: From: "George V. Neville-Neil" To: Julian Elischer In-Reply-To: <45BFC246.2000005@elischer.org> References: <45BFC246.2000005@elischer.org> User-Agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (=?ISO-8859-4?Q?Shij=F2?=) APEL/10.6 Emacs/22.0.92 (i386-apple-darwin8.8.2) MULE/5.0 (SAKAKI) MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka") Content-Type: text/plain; charset=US-ASCII Cc: arch@freebsd.org, Marko Zec , Robert Watson Subject: Re: vimage and modules X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 09 Feb 2007 00:00:48 -0000 At Tue, 30 Jan 2007 14:10:14 -0800, julian wrote: > > I've looked at the current vimage code and I'm impressed. > > The question remains to me that we need to have vimage and modules > interact well. > > One question is, "should existing vimages suddenly get > new capabilities when new kernel modules are loaded?" > One alternative is to only allow them to have access to modules that > were loaded before the creation of the vimage. > > An analogy in the TLS (thread-local storage) world is that > when a new shared library is loaded, TLS variables are > immediately available to it. However this may not be needed > in a virtual machine. > > I heard someone mention the following idea in passing and the > more I think about it, the more I like it.. > > Virtual machines are 'booted without loaded modules.' > They have however, available to them all the modules loaded > into the base system at the time that they are looking, and > can 'load them' just as the base system can load kernel > modules. > > The difference is that they are not able to load new modules, > but rather, only to do the 'vm_linkage' stage required on already > loaded modules. > > The vm linkage would be a separate subset of what needs to > be done when a module is loaded. It would be a separate > entry-point that would only be present on modules that > supported vimages. > > An example would be some module 'x'. > > It would have a function that would set up any > per-vimage linkages needed (mallocing and linking > its own vimage-specific variables structure into the > list of modules for that vimage. > > Currently we have a load and unload method. This would > suggest adding a vm_load method as well. creating a new > vimage could run through all the existing modules, and > call that functions, or we could do it as part of the booting of the > vimage from (say) /etc/rc or similar. > > > If a module had no vimage-specific behaviour it would > not have the extra entry-point. > > It does mean that we need to look at the inter-module dependencies > as if you wanted to have one module call into another, you'd need to > have dependencies well documented. In the current code you get a > linkage failure, if there are global variables accessed between them > but with a 'per vmimage' structure of variables this becomes a > runtime problem, unless we explicitly have the dependencies > registered. I suspect that what we need a diagram of how we think modules and vimages relate. Each vimage may wind up being what was called in another system, a "protection domain" for lack of a better word. If a full vimage is really a copy of the whole kernel, modules and all, then we probably need to think about a fork/exec model for kernels. That is, creating a new vimage is the equivalent of a fork() where you get copies of everything and some stuff is reset. What exactly is kept or reset is very important to work out. I would suspect that we would clear a lot of stuff, such as fd's, sockets etc. but keep a lot of other state, like the RIB and FIB. Best, George From owner-freebsd-arch@FreeBSD.ORG Fri Feb 9 06:04:04 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 664B316A400 for ; Fri, 9 Feb 2007 06:04:04 +0000 (UTC) (envelope-from delphij@delphij.net) Received: from tarsier.geekcn.org (tarsier.geekcn.org [210.51.165.229]) by mx1.freebsd.org (Postfix) with ESMTP id AC1B913C47E for ; Fri, 9 Feb 2007 06:04:03 +0000 (UTC) (envelope-from delphij@delphij.net) Received: from localhost (tarsier.geekcn.org [210.51.165.229]) by tarsier.geekcn.org (Postfix) with ESMTP id C3EABEB205A for ; Fri, 9 Feb 2007 14:03:59 +0800 (CST) X-Virus-Scanned: amavisd-new at geekcn.org Received: from tarsier.geekcn.org ([210.51.165.229]) by localhost (mail.geekcn.org [210.51.165.229]) (amavisd-new, port 10024) with ESMTP id xJBxxOtPVYDo for ; Fri, 9 Feb 2007 14:03:49 +0800 (CST) Received: from [10.217.12.154] (sina152-194.staff.sina.com.cn [61.135.152.194]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by tarsier.geekcn.org (Postfix) with ESMTP id 76571EB08C6 for ; Fri, 9 Feb 2007 14:03:48 +0800 (CST) DomainKey-Signature: a=rsa-sha1; s=default; d=delphij.net; c=nofws; q=dns; h=message-id:date:from:organization:user-agent:mime-version:to: subject:x-enigmail-version:content-type; b=Cr6FDuo8r5EzVlAyTDsB+mR7PIeiQhnAfWDq809MGGLZmBype4sGkE6UVJOP3bHVY pqMxMKJxztiKk0zHYc5pg== Message-ID: <45CC0EB9.7030400@delphij.net> Date: Fri, 09 Feb 2007 14:03:37 +0800 From: LI Xin Organization: The FreeBSD Project User-Agent: Thunderbird 1.5.0.9 (Macintosh/20061207) MIME-Version: 1.0 To: freebsd-arch@freebsd.org X-Enigmail-Version: 0.94.1.0 Content-Type: multipart/signed; micalg=pgp-ripemd160; protocol="application/pgp-signature"; boundary="------------enig1C02D17C056105FC2139BFB9" Subject: Patch for review: resolve a race condition in [sg]etpriority() X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 09 Feb 2007 06:04:04 -0000 This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig1C02D17C056105FC2139BFB9 Content-Type: multipart/mixed; boundary="------------070209090508050504040207" This is a multi-part message in MIME format. --------------070209090508050504040207 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Hi, Here is a patch which resolves a race condition in [sg]etpriority(), but I think there might be some better idea to resolve the problem, so I would appreciate if someone would shed me some light :-) Description of the problem: In PRIO_USER of both [sg]etpriority(), p_ucred is expected to be a valid reference to a valid user credential. However, for PRS_NEW processes, there is chances where it is not properly initialized, causing a fatal trap 12 [1]. On the other hand, just determining whether a process is in PRS_NEW state and skip those who are newly born is not enough for semantical correctness. A process could either have p_{start,end}copy section copied or not, which needs to be handled with care. The attached solution: What I did in the attached patch is basically: - Before allproc_lock is sx_xunlock'ed in fork1(), lock both the parent and the child. - After releasing allproc_lock, do process p_{start,end}copy, p_{start,end}zero and crhold() immediately, and release parent and child's p_mtx as soon as possible. - In getpriority(), simply skip PRS_NEW processes because they does not affect PRIO_USER path's result. This part is not really necessary for correctness, though. The pros for this is that it does not require an extensive API change, and in theory it does not require that the allproc lock to be held for a extended period; the cons is that this adds some overhead because it adds two pairs of mutex lock/unlock. In a private discussion, there was also some other ideas. One is to just move the unlock to where the process is completely initialized, another is to introduce an event handler and msleep() p_state when necessary, and do a wakeup once we bumped the state, but I think these solution have their own limitations just like mine. [1] Reproducing script can be obtained from kern/108071. Cheers, --=20 Xin LI http://www.delphij.net/ FreeBSD - The Power to Serve! --------------070209090508050504040207 Content-Type: text/plain; x-mac-type="0"; x-mac-creator="0"; name="patch-priority.20070124" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline; filename="patch-priority.20070124" Index: kern_fork.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D RCS file: /home/ncvs/src/sys/kern/kern_fork.c,v retrieving revision 1.266 diff -u -p -r1.266 kern_fork.c --- kern_fork.c 23 Jan 2007 08:46:50 -0000 1.266 +++ kern_fork.c 24 Jan 2007 08:35:31 -0000 @@ -423,8 +423,22 @@ again: AUDIT_ARG(pid, p2->p_pid); LIST_INSERT_HEAD(&allproc, p2, p_list); LIST_INSERT_HEAD(PIDHASH(p2->p_pid), p2, p_hash); + + PROC_LOCK(p2); + PROC_LOCK(p1); + sx_xunlock(&allproc_lock); =20 + bcopy(&p1->p_startcopy, &p2->p_startcopy, + __rangeof(struct proc, p_startcopy, p_endcopy)); + PROC_UNLOCK(p1); + + bzero(&p2->p_startzero, + __rangeof(struct proc, p_startzero, p_endzero)); + + p2->p_ucred =3D crhold(td->td_ucred); + PROC_UNLOCK(p2); + /* * Malloc things while we don't hold any locks. */ @@ -482,13 +496,9 @@ again: PROC_LOCK(p2); PROC_LOCK(p1); =20 - bzero(&p2->p_startzero, - __rangeof(struct proc, p_startzero, p_endzero)); bzero(&td2->td_startzero, __rangeof(struct thread, td_startzero, td_endzero)); =20 - bcopy(&p1->p_startcopy, &p2->p_startcopy, - __rangeof(struct proc, p_startcopy, p_endcopy)); bcopy(&td->td_startcopy, &td2->td_startcopy, __rangeof(struct thread, td_startcopy, td_endcopy)); =20 @@ -511,7 +521,6 @@ again: sched_fork(td, td2); =20 mtx_unlock_spin(&sched_lock); - p2->p_ucred =3D crhold(td->td_ucred); td2->td_ucred =3D crhold(p2->p_ucred); #ifdef AUDIT audit_proc_fork(p1, p2); Index: kern_resource.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D RCS file: /home/ncvs/src/sys/kern/kern_resource.c,v retrieving revision 1.165 diff -u -p -r1.165 kern_resource.c --- kern_resource.c 17 Jan 2007 14:58:53 -0000 1.165 +++ kern_resource.c 24 Jan 2007 02:06:13 -0000 @@ -143,6 +143,9 @@ getpriority(td, uap) uap->who =3D td->td_ucred->cr_uid; sx_slock(&allproc_lock); FOREACH_PROC_IN_SYSTEM(p) { + /* Do not bother to check PRS_NEW processes */ + if (p->p_state =3D=3D PRS_NEW) + continue; PROC_LOCK(p); if (!p_cansee(td, p) && p->p_ucred->cr_uid =3D=3D uap->who) { --------------070209090508050504040207-- --------------enig1C02D17C056105FC2139BFB9 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (Darwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFFzA65OfuToMruuMARA7cHAJsGWVZlMcoepLM7h1+zT014Cj6u7gCgge6e RL6f6wX1VMZHkO8l84269xQ= =fkK6 -----END PGP SIGNATURE----- --------------enig1C02D17C056105FC2139BFB9-- From owner-freebsd-arch@FreeBSD.ORG Fri Feb 9 14:04:17 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8587916A541 for ; Fri, 9 Feb 2007 14:04:15 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (66-23-211-162.clients.speedfactory.net [66.23.211.162]) by mx1.freebsd.org (Postfix) with ESMTP id DE69C13C4A6 for ; Fri, 9 Feb 2007 14:04:14 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from zion.baldwin.cx (zion.baldwin.cx [192.168.0.7]) (authenticated bits=0) by server.baldwin.cx (8.13.6/8.13.6) with ESMTP id l19E3ple096573; Fri, 9 Feb 2007 09:03:54 -0500 (EST) (envelope-from jhb@freebsd.org) From: John Baldwin To: freebsd-arch@freebsd.org Date: Fri, 9 Feb 2007 08:37:04 -0500 User-Agent: KMail/1.9.4 References: <45CC0EB9.7030400@delphij.net> In-Reply-To: <45CC0EB9.7030400@delphij.net> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200702090837.04495.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [192.168.0.1]); Fri, 09 Feb 2007 09:03:55 -0500 (EST) X-Virus-Scanned: ClamAV 0.88.3/2544/Fri Feb 9 03:44:48 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: LI Xin Subject: Re: Patch for review: resolve a race condition in [sg]etpriority() X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 09 Feb 2007 14:04:17 -0000 On Friday 09 February 2007 01:03, LI Xin wrote: > Hi, > > Here is a patch which resolves a race condition in [sg]etpriority(), but > I think there might be some better idea to resolve the problem, so I > would appreciate if someone would shed me some light :-) > > Description of the problem: > > In PRIO_USER of both [sg]etpriority(), p_ucred is expected to be a valid > reference to a valid user credential. However, for PRS_NEW processes, > there is chances where it is not properly initialized, causing a fatal > trap 12 [1]. > > On the other hand, just determining whether a process is in PRS_NEW > state and skip those who are newly born is not enough for semantical > correctness. A process could either have p_{start,end}copy section > copied or not, which needs to be handled with care. > > The attached solution: > > What I did in the attached patch is basically: > > - Before allproc_lock is sx_xunlock'ed in fork1(), lock both the parent > and the child. > - After releasing allproc_lock, do process p_{start,end}copy, > p_{start,end}zero and crhold() immediately, and release parent and > child's p_mtx as soon as possible. > - In getpriority(), simply skip PRS_NEW processes because they does not > affect PRIO_USER path's result. This part is not really necessary for > correctness, though. > > The pros for this is that it does not require an extensive API change, > and in theory it does not require that the allproc lock to be held for a > extended period; the cons is that this adds some overhead because it > adds two pairs of mutex lock/unlock. > > In a private discussion, there was also some other ideas. One is to > just move the unlock to where the process is completely initialized, > another is to introduce an event handler and msleep() p_state when > necessary, and do a wakeup once we bumped the state, but I think these > solution have their own limitations just like mine. My only reason for favoring the wakeup for complete initialization is that while this patch may solve the getprio/setprio race, it doesn't solve all PRS_NEW-related races, which the sleep/wakeup proposal did. -- John Baldwin