From: Rick Macklem
To: "freebsd-current@freebsd.org"
CC: "kib@FreeBSD.org", Alexander Motin
Subject: nfsd kernel threads won't die via SIGKILL
Date: Sat, 23 Jun 2018 21:03:02 +0000

During testing of the pNFS server I have been frequently killing/restarting
the nfsd.
Once in a while, the "slave" nfsd process doesn't terminate and a "ps axHl"
shows:

  0 48889 1 0  20 0 5884 812 svcexit D - 0:00.01 nfsd: server
  0 48889 1 0  40 0 5884 812 rpcsvc  I - 0:00.00 nfsd: server
  ... more of the same
  0 48889 1 0  40 0 5884 812 rpcsvc  I - 0:00.00 nfsd: server
  0 48889 1 0  -8 0 5884 812 rpcsvc  I - 1:51.78 nfsd: server
  0 48889 1 0  -8 0 5884 812 rpcsvc  I - 2:27.75 nfsd: server

You can see that the top thread (the one that was created with the process)
is stuck in "D" on "svcexit".
The rest of the threads are still servicing NFS RPCs.
If you still have an NFS mount on the server, the mount continues to work
and the CPU time for the last two threads slowly climbs, due to NFS RPC
activity. A SIGKILL was posted for the process and these threads (created
by kthread_add) are waiting here, but cv_wait_sig()/cv_timedwait_sig()
never seems to return EINTR for these other threads:

	if (ismaster || (!ismaster &&
	    grp->sg_threadcount > grp->sg_minthreads))
		error = cv_timedwait_sig(&st->st_cond,
		    &grp->sg_lock, 5 * hz);
	else
		error = cv_wait_sig(&st->st_cond,
		    &grp->sg_lock);

The top thread (referred to in svc.c as "ismaster") did return from here
with EINTR and has now done an msleep() here, waiting for the other threads
to terminate:

	/* Waiting for threads to stop. */
	for (g = 0; g < pool->sp_groupcount; g++) {
		grp = &pool->sp_groups[g];
		mtx_lock(&grp->sg_lock);
		while (grp->sg_threadcount > 0)
			msleep(grp, &grp->sg_lock, 0, "svcexit", 0);
		mtx_unlock(&grp->sg_lock);
	}

Although I can't be sure that this patch has fixed the problem, because it
happens intermittently, I have not seen the problem since applying it:

--- rpc/svc.c.sav	2018-06-21 22:52:11.623955000 -0400
+++ rpc/svc.c	2018-06-22 09:01:40.271803000 -0400
@@ -1388,7 +1388,7 @@ svc_run(SVCPOOL *pool)
 		grp = &pool->sp_groups[g];
 		mtx_lock(&grp->sg_lock);
 		while (grp->sg_threadcount > 0)
-			msleep(grp, &grp->sg_lock, 0, "svcexit", 0);
+			msleep(grp, &grp->sg_lock, 0, "svcexit", 1);
 		mtx_unlock(&grp->sg_lock);
 	}
 }

As you can see, all it does is add a timeout to the msleep().
I am not familiar with the signal delivery code in sleepqueue, so it
probably isn't correct, but my theory is along the lines of...

Since the msleep() doesn't have PCATCH, it does not set TDF_SINTR, and if
that happens before the other threads return EINTR from cv_wait_sig(),
they no longer do so?
And I thought that waking up from the msleep() via timeouts would maybe
allow the other threads to return EINTR from cv_wait_sig()?

Does this make sense? rick
ps: I'll post if I see the problem again with the patch applied.
pps: This is a single core i386 system, just in case that might affect
    this.