Date: Thu, 13 Feb 2014 15:57:33 -0800
Subject: can the scheduler decide to schedule an interrupted but runnable thread on another CPU core? What are the implications for code?
From: Adrian Chadd
To: "freebsd-arch@freebsd.org", "freebsd-hackers@freebsd.org"

Hi,

Whilst digging into collisions in the flowtable code I discovered that a bunch of them are due to some of the flowtable_lookup() code running on a different CPU - the contention on the l2/l3 lookup lock(s) was enough to block things so they'd get an obvious chance to be migrated.

So this led me to wonder whether, in a fully preemptive kernel, a running kernel thread would stay on the current CPU until it hit a very specific subset of things (exited to userland, hit a lock, etc.)

Apparently (according to kan and rwatson) it would not, and that is how we define fully preemptive - a running task is not just interruptible at almost any point, it may also be interrupted and rescheduled on a different CPU outside of specific critical sections.

This means that for the flowtable case, the current setup (without atomics to maintain the lists) can only work if all threads doing work with the flowtable structures (i.e. lookup, insert, purge) are CPU pinned. Otherwise we may have the situation where:

sequentially:

* lookup occurs on CPU A;
* lookup succeeds on CPU A for some almost-expired entry;
* preemption occurs, and the thread gets rescheduled on CPU B;

then simultaneously:

* CPU A's flowtable purge code runs, and decides to purge entries including the current one;
* the code now running on CPU B has an item from the CPU A flowtable, and dereferences it as it's being freed, leading to potential badness.

Now, it's a ridiculously small window of opportunity, but I'd rather the code be written to be correct and mostly-fast versus very fast and potentially exploding. I'm sure those in operations would agree.
:-)

So, my questions:

* is this actually how fully preemptive kernels _may_ behave?
* I believe there's a difference between what 4BSD and ULE will do here - is this the case? What are the scheduler behaviours?
* are there any other areas in the kernel that rely on pcpu uma zones / curcpu indexes for things outside of sched_pin()?

Thanks,

-a
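
For illustration, the lookup / migrate / purge sequence described above can be modelled in a few lines of userland C. This is only a sketch under stated assumptions, not the flowtable code: every `model_*` name is hypothetical, `model_sched_pin()` / `model_sched_unpin()` merely stand in for the kernel's sched_pin() / sched_unpin(), and the model assumes that a CPU's purge only runs on that CPU and never interleaves with a lookup running there (something real code would have to guarantee separately).

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NCPU 2

/* Hypothetical flow entry; 'freed' stands in for memory handed back to the allocator. */
struct model_flow {
	bool freed;
};

/* One slot per CPU, loosely modelled on a per-CPU flowtable. */
static struct model_flow model_table[MODEL_NCPU];

/* Hypothetical thread state: current CPU plus a pin nesting count. */
struct model_thread {
	int cpu;
	int pin_count;
};

/* Stand-in for sched_pin(): forbid migration until unpinned. */
static void
model_sched_pin(struct model_thread *td)
{
	td->pin_count++;
}

/* Stand-in for sched_unpin(). */
static void
model_sched_unpin(struct model_thread *td)
{
	assert(td->pin_count > 0);
	td->pin_count--;
}

/* Scheduler stand-in: a pinned thread is never moved to another CPU. */
static bool
model_migrate(struct model_thread *td, int new_cpu)
{
	if (td->pin_count > 0)
		return (false);
	td->cpu = new_cpu;
	return (true);
}

/*
 * Purge stand-in for 'cpu'. ASSUMPTION: if the user thread is still on
 * this CPU the purge is serialized with it and is deferred here; if the
 * thread has moved away, the purge runs concurrently and frees the entry.
 */
static bool
model_purge(int cpu, const struct model_thread *user)
{
	if (user->cpu == cpu)
		return (false);
	model_table[cpu].freed = true;
	return (true);
}

/* Returns true if the lookup thread ends up holding a freed entry. */
bool
model_run_scenario(bool use_pin)
{
	struct model_thread td = { .cpu = 0, .pin_count = 0 };
	struct model_flow *fle;
	bool stale;

	model_table[0].freed = false;

	if (use_pin)
		model_sched_pin(&td);
	fle = &model_table[td.cpu];	/* lookup succeeds on CPU A (0) */
	(void)model_migrate(&td, 1);	/* preemption tries to move us to CPU B (1) */
	(void)model_purge(0, &td);	/* CPU A's purge wants this entry */
	stale = fle->freed;		/* the "dereference" of the looked-up entry */
	if (use_pin)
		model_sched_unpin(&td);
	return (stale);
}
```

In this model, `model_run_scenario(false)` ends with the thread holding a freed entry, while `model_run_scenario(true)` does not, because the migration to CPU B is refused while the thread is pinned.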