From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 02:42:47 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 209E187A;
 Sun, 10 Mar 2013 02:42:47 +0000 (UTC)
 (envelope-from rmacklem@uoguelph.ca)
Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca
 [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id B9EF47B5;
 Sun, 10 Mar 2013 02:42:46 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqEEACryO1GDaFvO/2dsb2JhbABCiCi8JIF0dIIsAQEBAwEBAQEgBCcgCxsYAgINGQIpAQkmBggHBAEcBIdsBgyqI5FvgSOMOn00B4ItgRMDiHKLJYI+gR6PV4MoHjKBBTU
X-IronPort-AV: E=Sophos;i="4.84,816,1355115600"; d="scan'208";a="20301161"
Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca)
 ([131.104.91.206])
 by esa-jnhn.mail.uoguelph.ca with ESMTP; 09 Mar 2013 21:42:45 -0500
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1])
 by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id B9596B3F46;
 Sat,  9 Mar 2013 21:42:45 -0500 (EST)
Date: Sat, 9 Mar 2013 21:42:45 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Garrett Wollman <wollman@freebsd.org>
Message-ID: <663916089.3736429.1362883365710.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <20795.30884.330015.123616@hergotha.csail.mit.edu>
Subject: Re: NFS DRC size
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [172.17.91.202]
X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692)
Cc: freebsd-fs@freebsd.org, freebsd-net@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 02:42:47 -0000

Garrett Wollman wrote:
> <<On Sat, 9 Mar 2013 11:27:32 -0500 (EST), Rick Macklem
> <rmacklem@uoguelph.ca> said:
> 
> > around the highwater mark basically indicates this is working. If it
> > wasn't throwing away replies where the receipt has been ack'd at the
> > TCP level, the cache would grow very large, since they would only be
> > discarded after a loonnngg timeout (12 hours unless you've changed
> > NFSRVCACHE_TCPTIMEOUT in sys/fs/nfs/nfs.h).
> 
> That seems unreasonably large.
> 
I suppose. How long a network partitioning do you want the cache to
deal with? (My original design was trying to achieve a high level of
correctness by default.)

The only time cache entries normally hang around this long is when a
client has dismounted the volume(s) using the TCP connection. The
cached replies for the last few requests will then hang around until
the timeout. For a few clients this isn't an issue. For 2,000 clients,
I can see that it might be, if the clients choose to dismount volumes
(using something like amd).

Feel free to make it smaller, based on the longest network partitioning
that you anticipate might occur.

> > Well, the DRC will try to cache replies until the client's TCP layer
> > acknowledges receipt of the reply. It is hard to say how many replies
> > that is for a given TCP connection, since it is a function of the
> > level of concurrency (# of nfsiod threads in the FreeBSD client)
> > in the client. I'd guess it's somewhere between 1<->20?
> 
> Nearly all our clients are Linux, so it's likely to be whatever Debian
> does by default.
> 
> > Multiply that by the number of TCP connections from all clients and
> > you have about how big the server's DRC will be. (Some clients use
> > a single TCP connection for the client whereas others use a separate
> > TCP connection for each mount point.)
> 
> The Debian client appears to use a single TCP connection for
> everything.
> 
> So if I want to support 2,000 clients each with 20 requests in flight,
> that would suggest that I need a DRC size of 40,000, which my
> experience shows is not sufficient with even a much smaller number of
> clients.
> 
Well, especially since Debian is using one TCP connection for everything
from a client, the guess of 20 could be way low.
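
As a rough sketch of the sizing arithmetic discussed above (this is not
code from the NFS server; the function and parameter names are made up
for illustration), the estimate is simply the number of TCP connections
times the number of replies still waiting for a TCP-level ack:

```python
def estimate_drc_entries(clients, connections_per_client, replies_in_flight):
    """Rough estimate of cached replies: one entry per reply that has not
    yet been acknowledged, on every TCP connection.
    All names here are illustrative, not taken from the kernel sources."""
    return clients * connections_per_client * replies_in_flight


# Garrett's example: 2,000 clients, one TCP connection per client,
# and the guessed 20 concurrent requests per connection.
print(estimate_drc_entries(2000, 1, 20))
```

If the real concurrency per connection is higher than 20, the estimate
scales linearly with it, which is one possible reason 40,000 turns out
to be too small in practice.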

rick

> -GAWollman

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 05:13:31 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 995D65AB
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 05:13:31 +0000 (UTC)
 (envelope-from cr@caltel.com)
Received: from mail2.caltel.com (mail2.caltel.com [66.102.145.6])
 by mx1.freebsd.org (Postfix) with ESMTP id 82C84D10
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 05:13:31 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Ap4EAJQVPFFCZpCq/2dsb2JhbABDwW6EVXSCKwEBBDlGNzQCWQgBAYgJBptdoDySVQOIco1jhWeLDoMrGw
X-IPAS-Result: Ap4EAJQVPFFCZpCq/2dsb2JhbABDwW6EVXSCKwEBBDlGNzQCWQgBAYgJBptdoDySVQOIco1jhWeLDoMrGw
X-IronPort-AV: E=Sophos;i="4.84,817,1355126400"; 
   d="scan'208";a="775349"
Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local)
 ([66.102.144.170])
 by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA;
 09 Mar 2013 21:12:10 -0800
Message-ID: <513C1629.50501@caltel.com>
Date: Sat, 09 Mar 2013 21:12:09 -0800
From: Cody Ritts <cr@caltel.com>
Organization: CalTel
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6;
 rv:17.0) Gecko/20130216 Thunderbird/17.0.3
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Aligning MBR for ZFS boot help
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 05:13:31 -0000

Hello all,

I am really struggling to understand what is going on. If anyone could
tell me where I am going wrong, I would greatly appreciate it.

I have a new Intel Atom appliance that will not boot from a GPT
partition table.  It came with an SSD, so I am trying to align it to 1MB
for the erase block size.

All of these commands are being run from a 9.1-RELEASE-amd64-memstick.


These commands partition the drive, and the system boots just fine:
> gpart create -s mbr ada0
> gpart add -t freebsd ada0
> gpart create -s bsd ada0s1
> gpart add -s 52862M -t freebsd-zfs  ada0s1
> gpart add -s 8G  -t freebsd-swap ada0s1
> gpart set -a active -i 1 ada0
> gpart bootcode -b /boot/mbr ada0
> dd if=/boot/zfsboot of=/dev/ada0s1 count=1
> dd if=/boot/zfsboot of=/dev/ada0s1a skip=1 seek=1024

This is the gpart show output after those commands:
> =>       63  125045361  ada0  MBR  (59G)
>          63  125045361     1  freebsd  [active]  (59G)
>
> =>        0  125045361  ada0s1  BSD  (59G)
>           0  108261376       1  freebsd-zfs  (51G)
>   108261376   16777216       2  freebsd-swap  (8.0G)
>   125038592       6769          - free -  (3.3M)

Here is my disk info:
> root@:/root # diskinfo -v ada0
> ada0
> 	512         	# sectorsize
> 	64023257088 	# mediasize in bytes (59G)
> 	125045424   	# mediasize in sectors
> 	0           	# stripesize
> 	0           	# stripeoffset
> 	124053      	# Cylinders according to firmware.
> 	16          	# Heads according to firmware.
> 	63          	# Sectors according to firmware.

125045361 + 63 = 125045424
So gpart is for sure printing sectors.
freebsd-zfs starts at sector 63
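
A quick way to double-check that reading of the numbers (a sketch only,
using the values printed by diskinfo and gpart above; the variable and
function names are mine):

```python
# Values copied from the diskinfo -v and gpart output above.
SECTOR_SIZE = 512
MEDIA_BYTES = 64023257088
MEDIA_SECTORS = 125045424
CYLINDERS, HEADS, SECTORS_PER_TRACK = 124053, 16, 63
SLICE_START, SLICE_SIZE = 63, 125045361


def geometry_is_consistent():
    """True if the byte size, the sector count, the firmware CHS geometry,
    and the MBR slice extent all describe the same disk."""
    return (MEDIA_SECTORS * SECTOR_SIZE == MEDIA_BYTES
            and CYLINDERS * HEADS * SECTORS_PER_TRACK == MEDIA_SECTORS
            and SLICE_START + SLICE_SIZE == MEDIA_SECTORS)


print(geometry_is_consistent())
```

Note that the slice start of 63 is exactly one firmware-reported track
(63 sectors) into the disk.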

So, I need that freebsd-zfs slice to start at 1MB
1MB = 2048s
2048 - 63 = 1985
so if I add an offset to my slice:
> gpart add -b 1985 -s 52862M -t freebsd-zfs ada0s1

should start me at 2048.
> =>       63  125045361  ada0  MBR  (59G)
>          63  125045361     1  freebsd  [active]  (59G)
> =>        0  125045361  ada0s1  BSD  (59G)
>           0       1985          - free -  (992k)
>        1985  108261376       1  freebsd-zfs  (51G)

BUT, when I boot, I get this:
> zfsboot: No ZFS Pools located, can't boot

I think I remember reading that freebsd-zfs had to be the first slice (I
cannot remember where I read that).  And it apparently does not think an
offset is funny.
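
The offset arithmetic above can be sketched like this. It assumes that
offsets inside the BSD label are relative to the start of the slice
(which is what the "=> 0" in the ada0s1 output suggests); the helper
names are made up. It only checks the alignment arithmetic and says
nothing about why zfsboot cannot find the pool.

```python
SECTOR_SIZE = 512
MIB = 1024 * 1024


def absolute_start(slice_start, offset_in_slice):
    """Absolute disk sector of a partition that sits offset_in_slice
    sectors into a slice beginning at slice_start (assumed relative)."""
    return slice_start + offset_in_slice


def is_mib_aligned(sector):
    """True if the sector begins on a 1 MiB boundary."""
    return (sector * SECTOR_SIZE) % MIB == 0


start = absolute_start(63, 1985)
print(start, is_mib_aligned(start))           # with the 1985-sector offset
print(is_mib_aligned(absolute_start(63, 0)))  # the original layout
```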

So, that leaves me with trying to adjust my MBR partition, so I start 
over and run:
> gpart add -b 1985 -t freebsd ada0

but that gives me:
> =>       63  125045361  ada0  MBR  (59G)
>          63       1953        - free -  (976k)
>        2016  125043408     1  freebsd  (59G)

HHHMMMMM????  Well, 2016 - 1953 = 63.  Coincidence?  I doubt it, but I
don't get it.

Poking around on the internet, it looks like gpart is possibly enforcing
geometry boundaries?  So I do the following:

> sysctl kern.geom.part.check_integrity=0
> root@:/root # gpart add -a 1m -t freebsd ada0
> ada0s1 added
> root@:/root # gpart show
> =>       63  125045361  ada0  MBR  (59G)
>          63       2016        - free -  (1M)
>        2079  125042652     1  freebsd  (59G)
>   125044731        693        - free -  (346k)

Obviously that still didn't work.


I try a 10MB offset.
10MB = 20480s
20480-63  = 20417s
> gpart add -b 20417 -t freebsd ada0
> =>       63  125045361  ada0  MBR  (59G)
>          63      20412        - free -  (10M)
>       20475  125024949     1  freebsd  (59G)

It is still just a few sectors off.  So what if I let gpart
automatically align it?

> gpart add -a 1m -t freebsd ada0

> =>       63  125045361  ada0  MBR  (59G)
>          63       2016        - free -  (1M)
>        2079  125042652     1  freebsd  (59G)
>   125044731        693        - free -  (346k)


And 2079 is still != 2048.
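
One reading of the numbers above, offered as a guess rather than a
statement about gpart internals: every start sector gpart chose is a
multiple of 63, the sectors-per-track value reported by diskinfo. The
sketch below assumes the MBR scheme rounds the requested start up to the
next track boundary; the function name is made up.

```python
import math

SECTORS_PER_TRACK = 63   # from diskinfo -v
SECTORS_PER_MIB = 2048   # 1 MiB / 512-byte sectors


def round_up_to_track(lba, spt=SECTORS_PER_TRACK):
    """Smallest multiple of spt that is >= lba (assumed gpart behaviour)."""
    return -(-lba // spt) * spt


# Requested starts from the attempts above: -b 1985, -a 1m, -b 20417.
for requested in (1985, 2048, 20417):
    print(requested, "->", round_up_to_track(requested))

# First non-zero sector that is both track-aligned and 1 MiB-aligned.
first_common = SECTORS_PER_TRACK * SECTORS_PER_MIB // math.gcd(
    SECTORS_PER_TRACK, SECTORS_PER_MIB)
print(first_common, first_common // SECTORS_PER_MIB, "MiB")
```

If that assumption holds, 2048 can never be chosen because it is not a
multiple of 63, and the free gap of 1953 sectors is just 2016 - 63.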

I have tried adjusting those numbers one by one, and it just hops around
the number I am looking for.  I have tried adding partitions in front of
it, setting the alignment to 1s, and adjusting the size.  I cannot get
it to land on 2048.

It does boot with the padding in the MBR table, but I don't think it is
aligned.  Maybe it is aligned, and I just don't know any better.

I am at a loss.

Any suggestions would be greatly appreciated.

Thanks,

Cody

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 06:12:24 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 001A0F1B
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 06:12:23 +0000 (UTC)
 (envelope-from jdavidlists@gmail.com)
Received: from mail-ia0-x234.google.com (mail-ia0-x234.google.com
 [IPv6:2607:f8b0:4001:c02::234])
 by mx1.freebsd.org (Postfix) with ESMTP id B82EEE39
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 06:12:23 +0000 (UTC)
Received: by mail-ia0-f180.google.com with SMTP id f27so2641701iae.39
 for <freebsd-fs@freebsd.org>; Sat, 09 Mar 2013 22:12:23 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:sender:in-reply-to:references:date
 :x-google-sender-auth:message-id:subject:from:to:cc:content-type;
 bh=w6/rYYWSyXwzz89OE/o5rZnnfZ/WdbZVvdodMSfphAQ=;
 b=INB4eZmwilgwy7KnF1pJpESNXnW+/q3M8nhKnNH1Q54bcDG3dm7DfrxFkz9PZYHEeK
 iYe4AwHa8vooJp0DdLGHOamvYLTHRH8HPmpdPLayrMN8fU7XbGA12rQ2u3VStNZsCVkq
 g66DSBYUoVs4AHJU8wp/h5INoAa2wdhYviotWLekYg5ueammOhqcgHIXi2GwT2EtjkiD
 uGwev1g2ojX2R+3ZYlBvlvuva4Ik3nxqSM9N6b8LOnrwfRAwz58cbJ2OIdut+MvJHtzG
 nyzCss7jDBHn4aPnEZW7/4VHssLZBidhhUFrKGh5481PL6WdkQBmHSoVT++lDC0cpOJt
 2w6A==
MIME-Version: 1.0
X-Received: by 10.50.57.166 with SMTP id j6mr4152902igq.21.1362895943253; Sat,
 09 Mar 2013 22:12:23 -0800 (PST)
Sender: jdavidlists@gmail.com
Received: by 10.42.153.133 with HTTP; Sat, 9 Mar 2013 22:12:23 -0800 (PST)
In-Reply-To: <513C1629.50501@caltel.com>
References: <513C1629.50501@caltel.com>
Date: Sun, 10 Mar 2013 01:12:23 -0500
X-Google-Sender-Auth: 7XRxo5UqAysIqAivjxj_0tEH3xQ
Message-ID: <CABXB=RRE3+nq9RioVi4Er4kRP_=Tbonoh=rnh91Ew=3hzYapbw@mail.gmail.com>
Subject: Re: Aligning MBR for ZFS boot help
From: J David <j.david.lists@gmail.com>
To: Cody Ritts <cr@caltel.com>
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 06:12:24 -0000

On Sun, Mar 10, 2013 at 12:12 AM, Cody Ritts <cr@caltel.com> wrote:

> I have a new intel atom appliance that will not boot from a GPT partition
> table.  It came with an SSD, so I am trying to align it to 1MB for the
> erase block size.
>

I looked and looked and I don't see where you're creating a GPT partition
table or indeed doing anything with GPT.  You create an MBR table here:


> gpart create -s mbr ada0
>
And seem to stick with it through the rest of your example.

If you adjust this to:

gpart create -s gpt ada0

You may get better results, because MBR is indeed going to saddle you with
cylinder boundaries using some inscrutable probably-fictional geometry.

I think you'd want something like

gpart add -t freebsd-boot -b 34 -s 128 ada0
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
gpart add -b 2048 -s 51G -l zroot -t freebsd-zfs ada0
gpart add -s 8G -t freebsd-swap ada0

But that might need some tweaking.  Your zpool will then use the "zroot"
partition / ada0p2.

Hope that is helpful.

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 06:30:50 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id DBC313FE
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 06:30:50 +0000 (UTC)
 (envelope-from jdavidlists@gmail.com)
Received: from mail-ie0-x234.google.com (mail-ie0-x234.google.com
 [IPv6:2607:f8b0:4001:c03::234])
 by mx1.freebsd.org (Postfix) with ESMTP id B2DF4EAC
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 06:30:50 +0000 (UTC)
Received: by mail-ie0-f180.google.com with SMTP id bn7so3643220ieb.11
 for <freebsd-fs@freebsd.org>; Sat, 09 Mar 2013 22:30:50 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:sender:in-reply-to:references:date
 :x-google-sender-auth:message-id:subject:from:to:cc:content-type;
 bh=PPfhhwj4qBZdpZY65kPMNJPcfF5HJlBWcYg3l9WdqZ4=;
 b=O7xsxN2Anse9VIrq66HDC3SKK345gu65+CxW7RkMKkmIPWSAaP28B5r0MHlh1Yitjr
 AU9OhQa4CRLl6LkggD8E7MMNIobcZK7FyOATtn5fgVZ6bNOggUgCS0eBjRBAmbkeeuhk
 g0aDmpchYgOmhZHsn4xO4p1ircledgxx8kxH1CzDRom1rOJZ7JUYEtBfkK0QX66T7Vb2
 NIGVNJguBZpB8v1i88Za6lJe3Oxo/hTTWDXifCQtBmvILpikUGEkd4QsmcjAFkbfcDuC
 CD9d+Eny0Kob9kz/CbasY7wXZr3lOzt10gaNCw3eOQXDZ8xm5cozEu2MNISpVzr2fCUg
 yARg==
MIME-Version: 1.0
X-Received: by 10.42.150.131 with SMTP id a3mr5691655icw.8.1362897050476; Sat,
 09 Mar 2013 22:30:50 -0800 (PST)
Sender: jdavidlists@gmail.com
Received: by 10.42.153.133 with HTTP; Sat, 9 Mar 2013 22:30:50 -0800 (PST)
In-Reply-To: <CABXB=RRE3+nq9RioVi4Er4kRP_=Tbonoh=rnh91Ew=3hzYapbw@mail.gmail.com>
References: <513C1629.50501@caltel.com>
 <CABXB=RRE3+nq9RioVi4Er4kRP_=Tbonoh=rnh91Ew=3hzYapbw@mail.gmail.com>
Date: Sun, 10 Mar 2013 01:30:50 -0500
X-Google-Sender-Auth: 7GB0uRqxB3vUfBQTUhypSAnV5hU
Message-ID: <CABXB=RRds2bU+3LhykkzV=tt4_PSgqPEZnLFeeptMRpy7Hz3zw@mail.gmail.com>
Subject: Re: Aligning MBR for ZFS boot help
From: J David <j.david.lists@gmail.com>
To: Cody Ritts <cr@caltel.com>
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 06:30:50 -0000

Just to check myself, I ran this real quick on a virstor:

# truncate -s 1G deleteme
# mdconfig -a -t vnode -f deleteme
md0
# gvirstor label -s 62522712k fakessd md0
Resizing virtual size to be a multiple of chunk size.
New virtual size: 61056 MB
Resizing virtual size to fit virstor structures.
New virtual size: 61184 MB (32 new chunks)
# gpart create -s gpt /dev/virstor/fakessd
virstor/fakessd created
# gpart add -t freebsd-boot -b 34 -s 128 /dev/virstor/fakessd
virstor/fakessdp1 added
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 /dev/virstor/fakessd
bootcode written to virstor/fakessd
# gpart add -b 2048 -s 51G -l zroot -t freebsd-zfs /dev/virstor/fakessd
virstor/fakessdp2 added
# gpart add -t freebsd-swap /dev/virstor/fakessd  # no -s = use all space left
virstor/fakessdp3 added
# gpart show /dev/virstor/fakessd
=>       34  125304765  virstor/fakessd  GPT  (59G)
         34        128                1  freebsd-boot  (64k)
        162       1886                   - free -  (943k)
       2048  106954752                2  freebsd-zfs  (51G)
  106956800   18347999                3  freebsd-swap  (8.8G)
# zpool create zroot /dev/gpt/zroot
# zpool status
  pool: zroot
 state: ONLINE
  scan: none requested
config:

NAME         STATE     READ WRITE CKSUM
zroot        ONLINE       0     0     0
  gpt/zroot  ONLINE       0     0     0

errors: No known data errors


I won't have much luck booting a virstor to test this :) but it sure looks
pretty, so hopefully it will work for you.
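
For anyone who wants to sanity-check a layout like the one above without
doing the sums by hand, here is a small sketch. The table is copied from
the gpart show output; the helper names are made up, and 512-byte
sectors are assumed.

```python
SECTOR_SIZE = 512
MIB = 1024 * 1024

# (start sector, size in sectors, type) copied from gpart show above.
LAYOUT = [
    (34, 128, "freebsd-boot"),
    (162, 1886, "free"),
    (2048, 106954752, "freebsd-zfs"),
    (106956800, 18347999, "freebsd-swap"),
]


def is_mib_aligned(start):
    """True if the start sector falls on a 1 MiB boundary."""
    return (start * SECTOR_SIZE) % MIB == 0


def is_contiguous(layout):
    """True if every entry begins where the previous one ends."""
    return all(prev[0] + prev[1] == cur[0]
               for prev, cur in zip(layout, layout[1:]))


for start, size, kind in LAYOUT:
    print(kind, start, "aligned" if is_mib_aligned(start) else "unaligned")
print("contiguous:", is_contiguous(LAYOUT))
```

Only the freebsd-boot partition is unaligned here, which should not
matter much since it holds boot code rather than filesystem data.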

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 08:22:09 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 0CF582CB;
 Sun, 10 Mar 2013 08:22:09 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail02.syd.optusnet.com.au (mail02.syd.optusnet.com.au
 [211.29.132.183])
 by mx1.freebsd.org (Postfix) with ESMTP id 640A21F3;
 Sun, 10 Mar 2013 08:22:07 +0000 (UTC)
Received: from c211-30-173-106.carlnfd1.nsw.optusnet.com.au
 (c211-30-173-106.carlnfd1.nsw.optusnet.com.au [211.30.173.106])
 by mail02.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r2A8LvSb028477
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Sun, 10 Mar 2013 19:21:59 +1100
Date: Sun, 10 Mar 2013 19:21:57 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: "Kenneth D. Merry" <ken@FreeBSD.org>
Subject: Re: patches to add new stat(2) file flags
In-Reply-To: <20130308232155.GA47062@nargothrond.kdm.org>
Message-ID: <20130310181127.D2309@besplex.bde.org>
References: <20130307000533.GA38950@nargothrond.kdm.org>
 <20130307222553.P981@besplex.bde.org>
 <20130308232155.GA47062@nargothrond.kdm.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.0 cv=bNdOu4CZ c=1 sm=1 a=n2O7wv11oSwA:10
 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=YOiZBDKP_E4A:10
 a=7RUHTw8TR_SxG-0Q0ScA:9 a=CjuIK1q_8ugA:10 a=iA5AuRVOsPQzuK-W:21
 a=yF7AlGMdZxl7LVJH:21 a=TEtd8y5WR3g2ypngnwZWYw==:117
Cc: arch@FreeBSD.org, fs@FreeBSD.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 08:22:09 -0000

On Fri, 8 Mar 2013, Kenneth D. Merry wrote:

> On Fri, Mar 08, 2013 at 00:37:15 +1100, Bruce Evans wrote:
>> On Wed, 6 Mar 2013, Kenneth D. Merry wrote:
>>
>>> I have attached diffs against head for some additional stat(2) file flags.
>>>
>>> The primary purpose of these flags is to improve compatibility with CIFS,
>>> both from the client and the server side.
>>> ...
>>
>> I missed looking at the diffs in my previous reply.
>>
>> % --- //depot/users/kenm/FreeBSD-test3/bin/chflags/chflags.1	2013-03-04
>> 17:51:12.000000000 -0700
>> % +++ /usr/home/kenm/perforce4/kenm/FreeBSD-test3/bin/chflags/chflags.1
>> 2013-03-04 17:51:12.000000000 -0700
>> % --- /tmp/tmp.49594.86	2013-03-06 16:42:43.000000000 -0700
>> % +++ /usr/home/kenm/perforce4/kenm/FreeBSD-test3/bin/chflags/chflags.1
>> 2013-03-06 14:47:25.987128763 -0700
>> % @@ -117,6 +117,16 @@
>> %  set the user immutable flag (owner or super-user only)
>> %  .It Cm uunlnk , uunlink
>> %  set the user undeletable flag (owner or super-user only)
>> % +.It Cm system , usystem
>> % +set the Windows system flag (owner or super-user only)
>>
>> This begins unsorting of the list.
>
> Fixed.
>
>> It's not just a Windows flag, since it also works in DOS.
>
> Fixed.

Thanks.  Hopefully all the simple bugs are fixed now.

>> "Owner or" is too strict for msdosfs, since files can only have a
>> single owner, so controlling access using groups is needed.  I
>> use owner root and group msdosfs for msdosfs mounts.  This works for
>> normal operations like open/read/write, but fails for most attributes
>> including file flags.  msdosfs doesn't support many attributes but
>> this change is supposed to add support for 3 new file flags so it would
>> be good if it didn't restrict the support to root.
>
> I wasn't trying to change the existing security model for msdosfs, but if
> you've got a suggested patch to fix it I can add that in.

I can't think of anything better than making group write permission enough
for attributes.

msdosfs also has some style bugs in this area.  It uses VOP_ACCESS()
with VADMIN for the non-VA_UTIMES_NULL case of utimes(), but for all
other attributes it hard-codes a direct uid check followed by a
priv_check_cred() with PRIV_VFS_ADMIN.  VADMIN requires even more than
user write permission for POSIX file systems and using it unchanged
for all the attributes would be even more restrictive unless we changed
it, but it would be easier to make it uniformly less restrictive for
msdosfs by using it consistently.

Oops, that was in the old version of ffs.  ffs now has related
complications and unnecessary style bugs (verboseness and misformatting)
to support ACLs.  It now uses VOP_ACCESSX() with VWRITE_ATTRIBUTES for
utimes(), and VOP_ACCESSX() with other VFOO for all attributes except
flags.  It still uses VOP_ACCESS() with VADMIN for flags.

>> ...
>> %  .It Dv SF_ARCHIVED
>> ...
>> % +Filesystems in FreeBSD may or may not have special handling for this
>> flag.
>> % +For instance, ZFS tracks changes to files and will clear this bit when a
>> % +file is updated.
>> % +UFS only stores the flag, and relies on the application to change it when
>> % +needed.
>>
>> I think that is useless, since changing it is needed whenever the file
>> changes, and applications can do that (short of running as daemons and
>> watching for changes).
>
> Do you mean applications can't do that or can?

Oops, can't.

It is still hard for users to know what their file system supports.
Even programmers don't know that it is backwards :-).

>> % --- //depot/users/kenm/FreeBSD-test3/sys/fs/msdosfs/msdosfs_vnops.c
>> 2013-03-04 17:51:12.000000000 -0700
>> % +++
>> /usr/home/kenm/perforce4/kenm/FreeBSD-test3/sys/fs/msdosfs/msdosfs_vnops.c
>> 2013-03-04 17:51:12.000000000 -0700
>> % --- /tmp/tmp.49594.370	2013-03-06 16:42:43.000000000 -0700
>> % +++
>> /usr/home/kenm/perforce4/kenm/FreeBSD-test3/sys/fs/msdosfs/msdosfs_vnops.c
>> 2013-03-06 14:49:47.179130318 -0700
>> % @@ -345,8 +345,17 @@
>> %  		vap->va_birthtime.tv_nsec = 0;
>> %  	}
>> %  	vap->va_flags = 0;
>> % +	/*
>> % +	 * The DOS Archive attribute means that a file needs to be
>> % +	 * archived.  The BSD SF_ARCHIVED attribute means that a file has
>> % +	 * been archived.  Thus the inversion here.
>> % +	 */
>>
>> No need to document it again.  It goes without saying that ARCHIVE
>> != ARCHIVED.
>
> I disagree.  It wasn't immediately obvious to me that SF_ARCHIVED was
> generally used as the inverse of the DOS Archived bit until I started
> digging into this.  If this helps anyone figure that out more quickly, it's
> useful.

The surprising thing is that it is backwards in FreeBSD and not really
supported except in msdosfs.  Now several file systems have the comment
about it being inverted, but man pages still don't.
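
To make the inversion concrete, here is a small model of the getattr
direction. It is a sketch, not the msdosfs code: the function name is
made up, and the constant values are what I believe the headers define
(ATTR_ARCHIVE in the msdosfs direntry header, SF_ARCHIVED in
sys/stat.h), so treat them as assumptions.

```python
ATTR_ARCHIVE = 0x20        # DOS attribute: file needs to be archived (assumed value)
SF_ARCHIVED = 0x00010000   # BSD flag: file has been archived (assumed value)


def dos_attrs_to_bsd_flags(attrs):
    """Report SF_ARCHIVED exactly when the DOS ARCHIVE bit is clear."""
    return 0 if attrs & ATTR_ARCHIVE else SF_ARCHIVED


print(hex(dos_attrs_to_bsd_flags(ATTR_ARCHIVE)))  # freshly modified file
print(hex(dos_attrs_to_bsd_flags(0)))             # file already backed up
```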

>> % @@ -420,12 +429,21 @@
>> %  			if (error)
>> %  				return (error);
>> %  		}
>>
>> The permissions check before this is delicate and was broken and is
>> more broken now.  It is still short-circuit to handle setting the
>> single flag that used to be supported, and is slightly broken for that:
>> - unprivileged user asks to set ARCHIVE by passing !SF_ARCHIVED.  We
>>   allow that, although this may toggle the flag and normal semantics
>>   for SF flags is to not allow toggling.
>> - unprivileged user asks to clear ARCHIVE by passing SF_ARCHIVED.  We
>>   don't allow that.  But we should allow preserving ARCHIVE if it is
>>   already clear.
>> The bug wasn't very important when only 1 flag was supported.  Now it
>> prevents unprivileged users managing the new UF flags if ARCHIVE is
>> clear.  Fortunately, this is the unusual case.  Anyway, unprivileged
>> users can set ARCHIVE by doing some other operation.  Even the chflags()
>> operation should set ARCHIVE and thus allow further chflags()'s that now
>> keep ARCHIVE set.  Except it is very confusing if a chflags() asks for
>> ARCHIVE to be clear.  This request might be just to try to preserve
>> the current setting and not want it if other things are changed, or
>> it might be to purposely clear it.  Changing it from set to clear should
>> still be privileged.
>
> I changed it to allow setting or clearing SF_ARCHIVED.  Now I can set or
> clear the flag as non-root:

Actually, it seems OK, since there are no old or new SF_ immutable flags.
Some of the actions are broken in the old and new code for directories --
see below.

>> See the more complicated permissions check in ffs.  It would be safest
>> to duplicate most of it, to get different permissions checking for the
>> SF and UF flags.  Then decide if we want to keep allowing setting
>> ARCHIVE without privilege.
>
> I think we should allow getting and setting SF_ARCHIVED without special
> privileges.  Given how it is generally used, I don't think it should be
> restricted to the super-user.

I don't really like that since changing the flags is mainly needed for
the fairly privileged operation of managing other OS's file systems.
However, since we're mapping the SYSTEM flag to a UF_ flag, the SYSTEM
flag will require less privilege than the ARCHIVE flag.  This is backwards,
so we might as well require less privilege for ARCHIVE too.  I think we,
that is, you should use a new UF_ARCHIVE flag with the correct sense.

> Can you provide some code demonstrating how the permissions code should
> be changed in msdosfs?  I don't know that much about that sort of thing,
> so I'll probably spend an inordinate amount of time stumbling
> through it.

Now I think only cleanups are needed.

>> %  			return EOPNOTSUPP;
>> %  		if (vap->va_flags & SF_ARCHIVED)
>> %  			dep->de_Attributes &= ~ATTR_ARCHIVE;
>> %  		else if (!(dep->de_Attributes & ATTR_DIRECTORY))
>> %  			dep->de_Attributes |= ATTR_ARCHIVE;
>>
>> The comment before this says that we ignore attempts to set ATTR_ARCHIVE
>> for directories.  However, it is out of date.  WinXP allows setting it
>> and all the new flags for directories, and so do we.
>
> Do you mean we allow setting it in UFS, or where?  Obviously the code above
> won't set it on a directory.

I meant it here.  Actually, the comment matches the code -- I somehow missed
the test in the code.  However, the code is wrong.  All directories except
the root directory have this and other attributes, but FreeBSD refuses to
set them.  More below.
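
A small model of the quoted setattr branch shows the directory case.
This is a sketch of the logic in the quoted lines only, with made-up
function names and constant values that I believe match the headers
(treat them as assumptions); it ignores the permission checks.

```python
ATTR_DIRECTORY = 0x10      # DOS attribute bits (assumed values)
ATTR_ARCHIVE = 0x20
SF_ARCHIVED = 0x00010000   # BSD flag (assumed value)


def apply_archived_flag(attrs, va_flags):
    """Mirror the quoted branch: SF_ARCHIVED clears ATTR_ARCHIVE; its
    absence sets ATTR_ARCHIVE, but only for non-directories."""
    if va_flags & SF_ARCHIVED:
        return attrs & ~ATTR_ARCHIVE
    if not (attrs & ATTR_DIRECTORY):
        return attrs | ATTR_ARCHIVE
    return attrs  # directory: the request to set ARCHIVE is silently ignored


print(hex(apply_archived_flag(0, 0)))               # regular file
print(hex(apply_archived_flag(ATTR_DIRECTORY, 0)))  # directory
```

So in this model a request to set ARCHIVE on a directory is dropped
without an error, while clearing it on a directory still works.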

>> The WinXP attrib command (at least on a FAT32 fs) doesn't allow setting
>> or clearing ARCHIVE (even if it is already set or clear) if any of
>> HIDDEN, READONLY or SYSTEM is already set and remains set after the
>> command.  Thus the HRS attributes act a bit like immutable flags, but
>> subtly differently.  (ffs has the quite different and worse behaviour
>> of allowing chflags() irrespective of immutable flags being set before
>> or after, provided there is enough privilege to change the immutable
>> flags.) Anyway, they should all give some aspects of immutability.
>
> We could do that for msdosfs, but why add more things for the user to trip
> over given how the filesystem is typically used?  Most people probably
> use it for USB thumb drives these days.  Or perhaps on a dual boot system
> to access their Windows partition.

The small data drives won't have many files with attributes (except
ARCHIVE).  For multiple-boot, I think the permissions shouldn't be too
much different than the foreign OS's.  I used not to worry about this
and liked deleting WinXP files without asking it, but recently I spent
a lot of time recovering a WinXP ntfs partition and changed a bit too
much using FreeBSD and Cygwin because I didn't understand the
permissions (especially ACLs).  ntfs in FreeBSD was less than r/o so it
couldn't even back up the permissions (for file flags, it returned the
garbage in its internal inode flags without translation...).

> *** src/bin/chflags/chflags.1.orig
> --- src/bin/chflags/chflags.1
> ***************
> *** 101,120 ****
>   .Bl -tag -offset indent -width ".Cm opaque"
>   .It Cm arch , archived
>   set the archived flag (super-user only)
>   .It Cm opaque
>   set the opaque flag (owner or super-user only)
> - .It Cm nodump
> - set the nodump flag (owner or super-user only)
>   .It Cm sappnd , sappend

The opaque flag is UF_ too.

> + .It Cm snapshot
> + set the snapshot flag (most filesystems do not allow changing this flag)

I think none do.  It can only be displayed.

chflags(1) doesn't display flags, so this shouldn't be here.  The problem
is that this man page is the only place where the flag names are documented.
ls(1) and strtofflags(3) just point to here.  strtofflags(3) says that the
flag names are documented here, but ls(1) just has an Xref to here.

> *** src/lib/libc/sys/chflags.2.orig
> --- src/lib/libc/sys/chflags.2
> --- 71,127 ----
>   the following values
>   .Pp
>   .Bl -tag -width ".Dv SF_IMMUTABLE" -compact -offset indent
> ! .It Dv SF_APPEND
>   The file may only be appended to.
>   .It Dv SF_ARCHIVED
> ! The file has been archived.
> ! This flag means the opposite of the Windows and CIFS FILE_ATTRIBUTE_ARCHIVE

DOS, Windows and CIFS...

> ! attribute.
> ! That attribute means that the file should be archived, whereas
> ! .Dv SF_ARCHIVED
> ! means that the file has been archived.
> ! Filesystems in FreeBSD may or may not have special handling for this flag.
> ! For instance, ZFS tracks changes to files and will clear this bit when a
> ! file is updated.

Does zfs clear it in other circumstances?  WinXP doesn't for msdosfs (or
ntfs?), but FreeBSD clears it when changing some attributes, even for
null changes (these are: times except for atimes, and the HIDDEN attribute
when it is changed by chmod() -- even for null changes --, but not for
the HIDDEN attribute when it is changed (or preserved) by chflags() in
your new code).  I want it to be cleared for metadata so that backup
utilities can trust the ARCHIVE flag for metadata changes.

> + .It Dv UF_IMMUTABLE
> + The file may not be changed.
> + Filesystems may use this flag to maintain compatibility with the Windows and
> + CIFS FILE_ATTRIBUTE_READONLY attribute.

So READONLY is only mapped to UF_IMMUTABLE if it gives immutability?

> *** src/sys/fs/msdosfs/msdosfs_vnops.c.orig
> --- src/sys/fs/msdosfs/msdosfs_vnops.c
> ***************
> *** 415,431 ****
>   		 * set ATTR_ARCHIVE for directories `cp -pr' from a more
>   		 * sensible filesystem attempts it a lot.
>   		 */
> ! 		if (vap->va_flags & SF_SETTABLE) {
>   			error = priv_check_cred(cred, PRIV_VFS_SYSFLAGS, 0);
>   			if (error)
>   				return (error);
>   		}
> ! 		if (vap->va_flags & ~SF_ARCHIVED)
>   			return EOPNOTSUPP;
>   		if (vap->va_flags & SF_ARCHIVED)
>   			dep->de_Attributes &= ~ATTR_ARCHIVE;
>   		else if (!(dep->de_Attributes & ATTR_DIRECTORY))
>   			dep->de_Attributes |= ATTR_ARCHIVE;
>   		dep->de_flag |= DE_MODIFIED;
>   	}
> 
> --- 424,448 ----
>   		 * set ATTR_ARCHIVE for directories `cp -pr' from a more
>   		 * sensible filesystem attempts it a lot.
>   		 */
> ! 		if (vap->va_flags & (SF_SETTABLE & ~(SF_ARCHIVED))) {

Excessive parentheses.

>   			error = priv_check_cred(cred, PRIV_VFS_SYSFLAGS, 0);
>   			if (error)
>   				return (error);
>   		}

VADMIN is still needed, and that is too strict.  This is a general problem
and should be fixed separately.

> ! 		if (vap->va_flags & ~(SF_ARCHIVED | UF_HIDDEN | UF_SYSTEM))
>   			return EOPNOTSUPP;
>   		if (vap->va_flags & SF_ARCHIVED)
>   			dep->de_Attributes &= ~ATTR_ARCHIVE;
>   		else if (!(dep->de_Attributes & ATTR_DIRECTORY))
>   			dep->de_Attributes |= ATTR_ARCHIVE;
> + 		if (vap->va_flags & UF_HIDDEN)
> + 			dep->de_Attributes |= ATTR_HIDDEN;
> + 		else
> + 			dep->de_Attributes &= ~ATTR_HIDDEN;
> + 		if (vap->va_flags & UF_SYSTEM)
> + 			dep->de_Attributes |= ATTR_SYSTEM;
> + 		else
> + 			dep->de_Attributes &= ~ATTR_SYSTEM;
>   		dep->de_flag |= DE_MODIFIED;
>   	}

Technical old and new problems with msdosfs:
- all directories except the root directory support the 3 attributes
   handled above, and READONLY
- the special case for the root directory is because before FAT32, the
   root directory didn't have an entry for itself (and was otherwise
   special).  With FAT32, the root directory is not so special, but
   still doesn't have an entry for itself.
- thus the old code in the above is wrong for all directories except
   the root directory
- thus the new code in the above is wrong for the root directory.  It
   will make changes to the in-core denode.  These can be seen by stat()
   for a while, but go away when the vnode is recycled.
- other code is wrong for directories too.  deupdat() refuses to
   convert from the in-core denode to the disk directory entry for
   directories.  So even when the above changes values for directories,
   the changes only get synced to the disk accidentally when there is
   a large change (such as for extending the directory), to the directory
   entry.
- being the root directory is best tested for using VV_ROOT.  I use the
   following to fix the corresponding bugs in utimes():

 		/* Was: silently ignore the non-error or error for all dirs. */
 		if (DETOV(dep)->v_vflag & VV_ROOT)
 			return (EINVAL);
 		/* Otherwise valid. */

   deupdat() needs a similar change to not ignore all directories.

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 15:57:37 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@smarthost.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id CFF4C4E6;
 Sun, 10 Mar 2013 15:57:37 +0000 (UTC)
 (envelope-from mckusick@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
 [IPv6:2001:1900:2254:206c::16:87])
 by mx1.freebsd.org (Postfix) with ESMTP id A10F2350;
 Sun, 10 Mar 2013 15:57:37 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
 by freefall.freebsd.org (8.14.6/8.14.6) with ESMTP id r2AFvb9H065897;
 Sun, 10 Mar 2013 15:57:37 GMT
 (envelope-from mckusick@freefall.freebsd.org)
Received: (from mckusick@localhost)
 by freefall.freebsd.org (8.14.6/8.14.6/Submit) id r2AFva5E065896;
 Sun, 10 Mar 2013 15:57:36 GMT (envelope-from mckusick)
Date: Sun, 10 Mar 2013 15:57:36 GMT
Message-Id: <201303101557.r2AFva5E065896@freefall.freebsd.org>
To: kvedulv@kvedulv.de, mckusick@FreeBSD.org, freebsd-fs@FreeBSD.org
From: mckusick@FreeBSD.org
Subject: Re: kern/162362: [snapshots] [panic] ufs with snapshot(s) panics when
 getting full
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 15:57:37 -0000

Synopsis: [snapshots] [panic] ufs with snapshot(s) panics when getting full

State-Changed-From-To: open->closed
State-Changed-By: mckusick
State-Changed-When: Sun Mar 10 15:57:04 UTC 2013
State-Changed-Why: 
Closed at the request of the submitter.

http://www.freebsd.org/cgi/query-pr.cgi?pr=162362

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 16:22:43 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 9CEC199F
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 16:22:43 +0000 (UTC)
 (envelope-from wblock@wonkity.com)
Received: from wonkity.com (wonkity.com [67.158.26.137])
 by mx1.freebsd.org (Postfix) with ESMTP id 62D90639
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 16:22:43 +0000 (UTC)
Received: from wonkity.com (localhost [127.0.0.1])
 by wonkity.com (8.14.6/8.14.6) with ESMTP id r2AGMfa6006199;
 Sun, 10 Mar 2013 10:22:41 -0600 (MDT)
 (envelope-from wblock@wonkity.com)
Received: from localhost (wblock@localhost)
 by wonkity.com (8.14.6/8.14.6/Submit) with ESMTP id r2AGMfSc006196;
 Sun, 10 Mar 2013 10:22:41 -0600 (MDT)
 (envelope-from wblock@wonkity.com)
Date: Sun, 10 Mar 2013 10:22:41 -0600 (MDT)
From: Warren Block <wblock@wonkity.com>
To: Cody Ritts <cr@caltel.com>
Subject: Re: Aligning MBR for ZFS boot help
In-Reply-To: <513C1629.50501@caltel.com>
Message-ID: <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
References: <513C1629.50501@caltel.com>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7
 (wonkity.com [127.0.0.1]); Sun, 10 Mar 2013 10:22:41 -0600 (MDT)
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 16:22:43 -0000

On Sat, 9 Mar 2013, Cody Ritts wrote:

> Poking around on the internet, it looks like gpart is possibly enforcing 
> geometry boundaries?

Not gpart, but the kernel.  At present, I don't know of any way to use 
FreeBSD for creating MBR slices aligned to anything other than 63 
blocks.  FreeBSD partitions can be aligned inside a slice with an 
offset.  Putting ZFS on one of those partitions may be the easiest way 
to do this.  Put the slice at block 2016, then align the first FreeBSD 
partition inside that slice to 1M and it should land at block 2048.

Another option is to create the MBR with aligned slices using another 
operating system, one that allows deviation from the MBR standard. 
Ronald Guilmette recently showed an interesting approach of starting the 
slice at 63M, the least common multiple of 63 and 1M.

If the BIOS does not like GPT, check for BIOS updates.  And make sure 
the vendor knows about the problem.

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 16:35:23 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id A2CC9CAE
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 16:35:23 +0000 (UTC)
 (envelope-from cr@caltel.com)
Received: from mail1.caltel.com (mail1.caltel.com [66.102.144.6])
 by mx1.freebsd.org (Postfix) with ESMTP id 6C861698
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 16:35:23 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqAEAOa1PFFCZpCq/2dsb2JhbABDxD+BYHSCJgEBBTg1CxELGAkWCAcJAwIBAgE0ERMGAgEBiA+oVpJzjxUWgyoDiHKNY4Vniw6DKhw
X-IPAS-Result: AqAEAOa1PFFCZpCq/2dsb2JhbABDxD+BYHSCJgEBBTg1CxELGAkWCAcJAwIBAgE0ERMGAgEBiA+oVpJzjxUWgyoDiHKNY4Vniw6DKhw
X-IronPort-AV: E=Sophos;i="4.84,819,1355126400"; d="scan'208";a="16937060"
Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local)
 ([66.102.144.170])
 by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA;
 10 Mar 2013 09:34:33 -0700
Message-ID: <513CB619.9050201@caltel.com>
Date: Sun, 10 Mar 2013 09:34:33 -0700
From: Cody Ritts <cr@caltel.com>
Organization: CalTel
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6;
 rv:17.0) Gecko/20130216 Thunderbird/17.0.3
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com>
 <CABXB=RRE3+nq9RioVi4Er4kRP_=Tbonoh=rnh91Ew=3hzYapbw@mail.gmail.com>
In-Reply-To: <CABXB=RRE3+nq9RioVi4Er4kRP_=Tbonoh=rnh91Ew=3hzYapbw@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 16:35:23 -0000

That is the thing, I am struggling with MBR because the computer just 
completely ignores booting from the default GPT.  I spent about 2 hours 
learning that the hard way the other day.  Who would have thought a 
brand new machine would not boot from a GPT disk?  I will be testing if 
it is just FreeBSD's GPT or others as well.

Thanks,

Cody

On 3/9/13 10:12 PM, J David wrote:
>
> On Sun, Mar 10, 2013 at 12:12 AM, Cody Ritts <cr@caltel.com
> <mailto:cr@caltel.com>> wrote:
>
>     I have a new intel atom appliance that will not boot from a GPT
>     partition table.  It came with an SSD, so I am trying to align it to
>     1MB for the erase block size.
>
>
> I looked and looked and I don't see where you're creating a GPT
> partition table or indeed doing anything with GPT.  You create an MBR
> table here:
>
>         gpart create -s mbr ada0
>
>
> And seem to stick with it through the rest of your example.
>
> If you adjust this to:
>
> gpart create -s gpt ada0
>
> You may get better results, because MBR is indeed going to saddle you
> with cylinder boundaries using some inscrutable probably-fictional geometry.
>
> I think you'd want something like
>
> gpart add -t freebsd-boot -b 34 -s 128 ada0
> gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
> gpart add -b 2048 -s 51G -l zroot -t freebsd-zfs ada0
> gpart add -s 8G -t freebsd-swap ada0
>
> But that might need some tweaking.  Your zpool will then use the "zroot"
> partition / ada0p2.
>
> Hope that is helpful.
>

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 16:53:03 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id DFD15F32
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 16:53:03 +0000 (UTC)
 (envelope-from cr@caltel.com)
Received: from mail1.caltel.com (mail1.caltel.com [66.102.144.6])
 by mx1.freebsd.org (Postfix) with ESMTP id C92D472B
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 16:53:03 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Ap8EABi6PFFCZpCq/2dsb2JhbABDxECBYHSCJgEBBAE4QBELGAkWDwkDAgECAUUTCAEBiAkGu0uPFRaDKgOIco1jhWeLDoMqHA
X-IPAS-Result: Ap8EABi6PFFCZpCq/2dsb2JhbABDxECBYHSCJgEBBAE4QBELGAkWDwkDAgECAUUTCAEBiAkGu0uPFRaDKgOIco1jhWeLDoMqHA
X-IronPort-AV: E=Sophos;i="4.84,819,1355126400"; d="scan'208";a="16937576"
Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local)
 ([66.102.144.170])
 by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA;
 10 Mar 2013 09:53:03 -0700
Message-ID: <513CBA6E.1090808@caltel.com>
Date: Sun, 10 Mar 2013 09:53:02 -0700
From: Cody Ritts <cr@caltel.com>
Organization: CalTel
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6;
 rv:17.0) Gecko/20130216 Thunderbird/17.0.3
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
In-Reply-To: <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 16:53:03 -0000

Offsetting the zfs slice was one of the first things I tried, but 
when I boot the loader tells me:

  > zfsboot: No ZFS Pools located, can't boot

I get the feeling that these need to be next to each other
   dd if=/boot/zfsboot of=/dev/ada0s1 count=1
   dd if=/boot/zfsboot of=/dev/ada0s1a skip=1 seek=1024

Good idea on using another fdisk.  I will fire up Arch and give it a go. 
  That will also let me test if the system will not boot with any GPT, 
or if there is something specific to FreeBSD's.  Once I isolate it, I will 
see if I can figure out how to make a bug report to Foxconn.

And putting things in perspective, 63M out of 65536M is really nbd.  I 
wish I would have thought of that, so simple.  I guess my head is still 
stuck in 1996 when drives were still measured in MB :)

Thanks,

Cody


On 3/10/13 9:22 AM, Warren Block wrote:
> On Sat, 9 Mar 2013, Cody Ritts wrote:
>
>> Poking around on the internet, it looks like gpart is possibly
>> enforcing geometry boundaries?
>
> Not gpart, but the kernel.  At present, I don't know of any way to use
> FreeBSD for creating MBR slices aligned to anything other than 63
> blocks.  FreeBSD partitions can be aligned inside a slice with an
> offset.  Putting ZFS on one of those partitions may be the easiest way
> to do this.  Put the slice at block 2016, then align the first FreeBSD
> partition inside that slice to 1M and it should land at block 2048.
>
> Another option is to create the MBR with aligned slices using another
> operating system, one that allows deviation from the MBR standard.
> Ronald Guilmette recently showed an interesting approach of starting the
> slice at 63M, the least common multiple of 63 and 1M.
>
> If the BIOS does not like GPT, check for BIOS updates.  And make sure
> the vendor knows about the problem.
>

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 19:06:26 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id B2AB7D5
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 19:06:26 +0000 (UTC)
 (envelope-from cr@caltel.com)
Received: from mail1.caltel.com (mail1.caltel.com [66.102.144.6])
 by mx1.freebsd.org (Postfix) with ESMTP id 550D6B2F
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 19:06:26 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqAEAGnYPFFCZpCq/2dsb2JhbABCwWCCYYFgdIImAQEEAThABgsLGAkWDwkDAgECAUUTCAEBiAkGu1KPFYNAA4hyjWOFZ4sOgyoc
X-IPAS-Result: AqAEAGnYPFFCZpCq/2dsb2JhbABCwWCCYYFgdIImAQEEAThABgsLGAkWDwkDAgECAUUTCAEBiAkGu1KPFYNAA4hyjWOFZ4sOgyoc
X-IronPort-AV: E=Sophos;i="4.84,819,1355126400"; d="scan'208";a="16945837"
Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local)
 ([66.102.144.170])
 by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA;
 10 Mar 2013 12:06:19 -0700
Message-ID: <513CD9AB.5080903@caltel.com>
Date: Sun, 10 Mar 2013 12:06:19 -0700
From: Cody Ritts <cr@caltel.com>
Organization: CalTel
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6;
 rv:17.0) Gecko/20130216 Thunderbird/17.0.3
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
In-Reply-To: <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 19:06:26 -0000

So, aligning to 63MB was still tricky.  I found your thread, but I could 
not find step by step how to calculate the offset.

So to put a close to this thread (hopefully), here is how I calculated 
my MBR alignment.

The MBR seems to force partitions to start on a track boundary.

There are 63 sectors/track

Rule of thumb for SSD Erase Blocks is align to 1MB (2048s)

I can start a partition on every 63rd sector:
63,126,189,252... etc

63 and 2048 have no common factors, so their least common multiple is
63x2048 = 129024

> root@:/root # gpart add -b 129024 -t freebsd ada0
> ada0s1 added
> root@:/root # gpart show ada0
> =>       63  125045361  ada0  MBR  (59G)
>          63     128961        - free -  (63M)
>      129024  124916400     1  freebsd  (59G)

YAY, my partition now starts at track 2048.
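
The arithmetic above can be checked with a few lines of sh.  This is only a
sketch: gcd and lcm are helper names made up here for illustration (they are
not part of gpart), and 512-byte sectors are assumed.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of any tool.

# Greatest common divisor by Euclid's algorithm.
gcd() {
    a=$1
    b=$2
    while [ "$b" -ne 0 ]; do
        t=$((a % b))
        a=$b
        b=$t
    done
    echo "$a"
}

# Least common multiple of two sector counts.
lcm() {
    g=$(gcd "$1" "$2")
    echo $(( $1 / g * $2 ))
}

track=63        # sectors per track, as used in the thread
erase=2048      # 1MB expressed in 512-byte sectors
start=$(lcm "$track" "$erase")

echo "first start sector aligned to both: $start"
# Both remainders should be zero if the start sector satisfies both alignments.
echo "remainders: $((start % track)) $((start % erase))"
```

Because the gcd of 63 and 2048 is 1, the least common multiple is simply the
product, which is why the first usable aligned start is so far into the disk.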


1MB for SSDs is a rule of thumb and most erase blocks are 128K, 256K or 
512K (so I have read on the internet).

Odds are, my SSD has an erase block of 512K or less, so, I can choose a 
smaller offset:

512K = 1024 sectors
1024*63 = 64512
> root@:/root # gpart add -b 64512 -t freebsd ada0
> ada0s1 added
> root@:/root # gpart show ada0
> =>       63  125045361  ada0  MBR  (59G)
>          63      64449        - free -  (31M)
>       64512  124980912     1  freebsd  (59G)
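
The same calculation for other erase block sizes, as a rough sh sketch.
aligned_start is a made-up helper name; it assumes 512-byte sectors, a
63-sector track, and a power-of-two erase block (so that 63 shares no factor
with it and the plain product is the least common multiple).

```shell
#!/bin/sh
# Hypothetical helper: first sector that is a multiple of both the
# 63-sector track and an erase block given in KiB.
aligned_start() {
    erase_sectors=$(( $1 * 1024 / 512 ))   # KiB -> 512-byte sectors
    echo $(( erase_sectors * 63 ))
}

for kib in 128 256 512 1024; do
    start=$(aligned_start "$kib")
    # Sectors left unused between the first usable sector (63) and the slice.
    echo "${kib}K erase block: start=$start, free before slice=$((start - 63))"
done
```

For 512K and 1024K the "free before slice" figures should agree with the
"- free -" rows in the two gpart show listings in this message.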


Anyway, that is good enough for me.
In the end, here are my working commands to create an aligned MBR 
partition ready for zpool creation and boot.
> gpart create -s mbr ada0
> gpart add -b 64512 -t freebsd ada0
> gpart create -s bsd ada0s1
> gpart add -s 52833M -t freebsd-zfs  ada0s1
> gpart add -t freebsd-swap ada0s1
> gpart set -a active -i 1 ada0
> gpart bootcode -b /boot/mbr ada0
> dd if=/boot/zfsboot of=/dev/ada0s1 count=1
> dd if=/boot/zfsboot of=/dev/ada0s1a skip=1 seek=1024

Also as related bonus, if you are reading about alignment, here is how 
to get 4k blocks for your zpool on an SSD or AF/4K hard drive.
> glabel label zfs  /dev/ada0s1a
> gnop create -S 4096 /dev/label/zfs
> zpool create -R /mnt tank /dev/label/zfs.nop
> zpool export tank
> gnop destroy /dev/label/zfs.nop
> zpool import -R /mnt -o cachefile=/tmp/zpool.cache tank
> zdb -U /tmp/zpool.cache | grep ashift
>    ashift: 12
   2^12 = 4096
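
For reference, ashift is the base-2 logarithm of the sector size the pool
uses.  A small sh sketch of that relationship follows; ashift_for is a made-up
helper name and it only handles power-of-two sizes.

```shell
#!/bin/sh
# Hypothetical helper: base-2 logarithm of a power-of-two sector size.
ashift_for() {
    size=$1
    shift_bits=0
    while [ "$size" -gt 1 ]; do
        size=$((size / 2))
        shift_bits=$((shift_bits + 1))
    done
    echo "$shift_bits"
}

echo "512-byte sectors  -> ashift $(ashift_for 512)"
echo "4096-byte sectors -> ashift $(ashift_for 4096)"
```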


Thanks,

Cody



On 3/10/13 9:22 AM, Warren Block wrote:
> On Sat, 9 Mar 2013, Cody Ritts wrote:
>
>> Poking around on the internet, it looks like gpart is possibly
>> enforcing geometry boundaries?
>
> Not gpart, but the kernel.  At present, I don't know of any way to use
> FreeBSD for creating MBR slices aligned to anything other than 63
> blocks.  FreeBSD partitions can be aligned inside a slice with an
> offset.  Putting ZFS on one of those partitions may be the easiest way
> to do this.  Put the slice at block 2016, then align the first FreeBSD
> partition inside that slice to 1M and it should land at block 2048.
>
> Another option is to create the MBR with aligned slices using another
> operating system, one that allows deviation from the MBR standard.
> Ronald Guilmette recently showed an interesting approach of starting the
> slice at 63M, the least common multiple of 63 and 1M.
>
> If the BIOS does not like GPT, check for BIOS updates.  And make sure
> the vendor knows about the problem.
>

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 19:34:51 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 3BBF982E
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 19:34:51 +0000 (UTC)
 (envelope-from wblock@wonkity.com)
Received: from wonkity.com (wonkity.com [67.158.26.137])
 by mx1.freebsd.org (Postfix) with ESMTP id D8282CE1
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 19:34:50 +0000 (UTC)
Received: from wonkity.com (localhost [127.0.0.1])
 by wonkity.com (8.14.6/8.14.6) with ESMTP id r2AJYmrI007464;
 Sun, 10 Mar 2013 13:34:48 -0600 (MDT)
 (envelope-from wblock@wonkity.com)
Received: from localhost (wblock@localhost)
 by wonkity.com (8.14.6/8.14.6/Submit) with ESMTP id r2AJYmoi007461;
 Sun, 10 Mar 2013 13:34:48 -0600 (MDT)
 (envelope-from wblock@wonkity.com)
Date: Sun, 10 Mar 2013 13:34:48 -0600 (MDT)
From: Warren Block <wblock@wonkity.com>
To: Cody Ritts <cr@caltel.com>
Subject: Re: Aligning MBR for ZFS boot help
In-Reply-To: <513CD9AB.5080903@caltel.com>
Message-ID: <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7
 (wonkity.com [127.0.0.1]); Sun, 10 Mar 2013 13:34:48 -0600 (MDT)
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 19:34:51 -0000

On Sun, 10 Mar 2013, Cody Ritts wrote:

> So, aligning to 63MB was still tricky.  I found your thread, but I could not 
> find step by step how to calculate the offset.

Here is the procedure I had in mind:

# gpart create -s mbr da0
da0 created
root@lightning# gpart add -t freebsd -b 2016 da0
da0s1 added
# gpart show da0
=>      63  39070017  da0  MBR  (18G)
         63      1953       - free -  (976k)
       2016  39068064    1  freebsd  (18G)

# gpart create -s bsd da0s1
da0s1 created
# gpart add -t freebsd-zfs -a 1m da0s1
da0s1a added
root@lightning# gpart show da0s1
=>       0  39068064  da0s1  BSD  (18G)
          0        32         - free -  (16k)
         32  39067648      1  freebsd-zfs  (18G)
   39067680       384         - free -  (192k)

The first slice starts at the last CHS-aligned block before 1M, or 2016. 
Misaligned, but not a problem because nothing will be reading from that 
location.

The freebsd-zfs partition is created, letting gpart align it to 1M. 
gpart starts the partition at an offset of 32, making it the 1M-aligned 
block 2048 of the disk.
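
The two numbers used above (2016 and 32) can be derived with shell
arithmetic.  This is only an illustrative sketch; the variable names are made
up and 512-byte sectors are assumed.

```shell
#!/bin/sh
# Illustrative arithmetic only; variable names are made up for this sketch.
track=63       # sectors per track (the CHS alignment unit)
target=2048    # 1M in 512-byte sectors

# Integer division rounds down, giving the last track boundary
# at or before the target.
slice_start=$(( target / track * track ))
# Distance from the slice start to the 1M boundary.
offset=$(( target - slice_start ))

echo "slice start: $slice_start"
echo "partition offset inside slice: $offset"
echo "absolute partition start: $((slice_start + offset))"
```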

gpart should also be able to install the bootcode correctly, but I have 
not tried it for MBR and ZFS.

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 19:47:54 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id CB72D928
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 19:47:54 +0000 (UTC)
 (envelope-from cr@caltel.com)
Received: from mail1.caltel.com (mail1.caltel.com [66.102.144.6])
 by mx1.freebsd.org (Postfix) with ESMTP id 4066CD3B
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 19:47:53 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqAEACziPFFCZpCq/2dsb2JhbABCwWCCYYFgdIImAQEEAThABgsLGAkWDwkDAgECAUUTCAEBiAkGu12PFRaDKgOIco1jhWeLDoMqHA
X-IPAS-Result: AqAEACziPFFCZpCq/2dsb2JhbABCwWCCYYFgdIImAQEEAThABgsLGAkWDwkDAgECAUUTCAEBiAkGu12PFRaDKgOIco1jhWeLDoMqHA
X-IronPort-AV: E=Sophos;i="4.84,819,1355126400"; d="scan'208";a="16947068"
Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local)
 ([66.102.144.170])
 by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA;
 10 Mar 2013 12:47:53 -0700
Message-ID: <513CE369.4030303@caltel.com>
Date: Sun, 10 Mar 2013 12:47:53 -0700
From: Cody Ritts <cr@caltel.com>
Organization: CalTel
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6;
 rv:17.0) Gecko/20130216 Thunderbird/17.0.3
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com>
 <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
In-Reply-To: <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 19:47:54 -0000

Yeah, just for clarity:

> root@lightning# gpart show da0s1
> =>       0  39068064  da0s1  BSD  (18G)
>           0        32         - free -  (16k)
>          32  39067648      1  freebsd-zfs  (18G)
>    39067680       384         - free -  (192k)

and

> dd if=/boot/zfsboot of=/dev/da0s1 count=1
> dd if=/boot/zfsboot of=/dev/da0s1a skip=1 seek=1024

will not boot.
and will result in:

> zfsboot: No ZFS Pools located, can't boot

The freebsd-zfs slice cannot have an offset.

I tried it several different ways first, since it was the easiest to 
align, and as soon as I added that offset, the boot strap process would 
break.

Thanks

Cody

On 3/10/13 12:34 PM, Warren Block wrote:
> On Sun, 10 Mar 2013, Cody Ritts wrote:
>
>> So, aligning to 63MB was still tricky.  I found your thread, but I
>> could not find step by step how to calculate the offset.
>
> Here is the procedure I had in mind:
>
> # gpart create -s mbr da0
> da0 created
> root@lightning# gpart add -t freebsd -b 2016 da0
> da0s1 added
> # gpart show da0
> =>      63  39070017  da0  MBR  (18G)
>          63      1953       - free -  (976k)
>        2016  39068064    1  freebsd  (18G)
>
> # gpart create -s bsd da0s1
> da0s1 created
> # gpart add -t freebsd-zfs -a 1m da0s1
> da0s1a added
> root@lightning# gpart show da0s1
> =>       0  39068064  da0s1  BSD  (18G)
>           0        32         - free -  (16k)
>          32  39067648      1  freebsd-zfs  (18G)
>    39067680       384         - free -  (192k)
>
> The first slice starts at the last CHS-aligned block before 1M, or 2016.
> Misaligned, but not a problem because nothing will be reading from that
> location.
>
> The freebsd-zfs partition is created, letting gpart align it to 1M.
> gpart starts the partition at an offset of 32, making it the 1M-aligned
> block 2048 of the disk.
>
> gpart should also be able to install the bootcode correctly, but I have
> not tried it for MBR and ZFS.
>

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 19:51:20 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 3B703A88
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 19:51:20 +0000 (UTC)
 (envelope-from wblock@wonkity.com)
Received: from wonkity.com (wonkity.com [67.158.26.137])
 by mx1.freebsd.org (Postfix) with ESMTP id E5943D60
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 19:51:19 +0000 (UTC)
Received: from wonkity.com (localhost [127.0.0.1])
 by wonkity.com (8.14.6/8.14.6) with ESMTP id r2AJpHdv007678;
 Sun, 10 Mar 2013 13:51:18 -0600 (MDT)
 (envelope-from wblock@wonkity.com)
Received: from localhost (wblock@localhost)
 by wonkity.com (8.14.6/8.14.6/Submit) with ESMTP id r2AJpHSZ007675;
 Sun, 10 Mar 2013 13:51:17 -0600 (MDT)
 (envelope-from wblock@wonkity.com)
Date: Sun, 10 Mar 2013 13:51:17 -0600 (MDT)
From: Warren Block <wblock@wonkity.com>
To: Cody Ritts <cr@caltel.com>
Subject: Re: Aligning MBR for ZFS boot help
In-Reply-To: <513CE369.4030303@caltel.com>
Message-ID: <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com>
 <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
 <513CE369.4030303@caltel.com>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7
 (wonkity.com [127.0.0.1]); Sun, 10 Mar 2013 13:51:18 -0600 (MDT)
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 19:51:20 -0000

On Sun, 10 Mar 2013, Cody Ritts wrote:

> Yeah, just for clarity:
>
>> root@lightning# gpart show da0s1
>> =>       0  39068064  da0s1  BSD  (18G)
>>           0        32         - free -  (16k)
>>          32  39067648      1  freebsd-zfs  (18G)
>>    39067680       384         - free -  (192k)
>
> and
>
>> dd if=/boot/zfsboot of=/dev/da0s1 count=1
>> dd if=/boot/zfsboot of=/dev/da0s1a skip=1 seek=1024
>
> will not boot.

But is that putting zfsboot in the right place?  Try installing zfsboot 
with gpart.

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 20:00:01 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id A26D5BE9;
 Sun, 10 Mar 2013 20:00:01 +0000 (UTC)
 (envelope-from truckman@FreeBSD.org)
Received: from gw.catspoiler.org (gw.catspoiler.org [75.1.14.242])
 by mx1.freebsd.org (Postfix) with ESMTP id 6B045DA3;
 Sun, 10 Mar 2013 20:00:01 +0000 (UTC)
Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2])
 by gw.catspoiler.org (8.13.3/8.13.3) with ESMTP id r2AJxkIg047829;
 Sun, 10 Mar 2013 11:59:50 -0800 (PST)
 (envelope-from truckman@FreeBSD.org)
Message-Id: <201303101959.r2AJxkIg047829@gw.catspoiler.org>
Date: Sun, 10 Mar 2013 12:59:46 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift
 topic to ZFS!
To: lev@FreeBSD.org
In-Reply-To: <1809201254.20130309160817@serebryakov.spb.ru>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=iso-8859-5
Content-Transfer-Encoding: 8BIT
Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 20:00:01 -0000

On  9 Mar, Lev Serebryakov wrote:
> Hello, Don.
> You wrote 9 March 2013, 7:03:52:
> 
> 
>>>     But anyway, torrent client is bad benchmark if we start to speak
>>>   about some real experiments to decide what could be improved in
>>>   FFS/GEOM stack, as it is not very repeatable.
> DL> I seem to recall that you mentioning that the raid5 geom layer is doing
> DL> a lot of caching, presumably to coalesce writes.  If this causes the
> DL> responses to writes to be delayed too much, then the geom layer could
> DL> end up starved for writes because the vfs.hirunningspace limit will be
> DL> reached.  If this happens, you'll see threads waiting on wdrain.  You
> DL> could also monitor vfs.runningbufspace to see how close it is getting to
> DL> the limit.  If this is the problem, you might want to try cranking up
>   Strangely  enough,  vfs.runningbufspace  is  always zero, even under
> load.

That's very odd ...

> My geom_raid5 is configured to delay writes up to 15 seconds...
> 
> DL> Something else to look at is what problems might the delayed write
> DL> completion notifications from the drives cause in the raid5 layer
> DL> itself.  Could that be preventing the raid5 layer from sending other I/O
> DL> commands to the drives?   Between the time a write command has been sent
>    Nope. It should not. I'm not sure for 100%, as I picked up these
> sources from original author and sources are rather cryptic, but I
> could not see any throttling in it.
> DL> to a drive and the drive reports the completion of the write, what
> DL> happens if something wants to touch that buffer?
> 
> DL> What size writes does the application typically do?  What is the UFS
>   64K writes, 32K blocksize, 128K stripe size... Now I'm analyzing
> traces from this device to understand exact write patterns.

It would be interesting to see what percentage of the writes are full
stripe versus a partial stripe to see how effective the caching is.  The
partial stripe writes probably have to read the parity in order to
update it.
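
One rough way to get that percentage from a trace is sketched below in sh. It is only an illustration under several assumptions: the trace has already been reduced to lines of "offset length" in bytes (a hypothetical format), and a write counts as full-stripe only if it starts on a stripe boundary and covers a whole number of stripes. Whether the 128K figure is the per-disk chunk or the full data stripe is not stated in this thread, so the stripe size is passed in as a parameter.

```sh
#!/bin/sh
# Classify one write as full- or partial-stripe.
# $1 = byte offset, $2 = length in bytes, $3 = stripe size in bytes.
classify_write() {
    if [ $(( $1 % $3 )) -eq 0 ] && [ "$2" -gt 0 ] && [ $(( $2 % $3 )) -eq 0 ]; then
        echo full
    else
        echo partial
    fi
}

# Read "offset length" pairs on stdin and print the integer percentage
# (rounded down) of writes that are full-stripe for stripe size $1.
full_stripe_percent() {
    stripe=$1
    total=0
    full=0
    while read -r off len; do
        total=$(( total + 1 ))
        if [ "$(classify_write "$off" "$len" "$stripe")" = full ]; then
            full=$(( full + 1 ))
        fi
    done
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $(( full * 100 / total ))
    fi
}
```

For example, `printf '0 65536\n131072 131072\n' | full_stripe_percent 131072` treats the 64K write as partial and the aligned 128K write as full.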

It would also be interesting to monitor the number of commands each
drive is handling.  In ahci.c, it looks like that would be
ch->numrslots, assuming that you aren't using a port multiplier.

> DL> blocksize?  What is the raid5 stripe size?  With this access pattern,
> DL> you may get poor results if the stripe size is much greater than the
> DL> block and write sizes.
> 


From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 20:37:05 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 693822C5
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 20:37:05 +0000 (UTC)
 (envelope-from cr@caltel.com)
Received: from mail1.caltel.com (mail1.caltel.com [66.102.144.6])
 by mx1.freebsd.org (Postfix) with ESMTP id 3620EEC0
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 20:37:04 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Ap8EACPuPFFCZpCq/2dsb2JhbABCxESBYXSCJgEBBThAEQsYCRYPCQMCAQIBRRMIAQGIDwy7TgSPFRaDKgOIco1jhWeLDoMqHA
X-IPAS-Result: Ap8EACPuPFFCZpCq/2dsb2JhbABCxESBYXSCJgEBBThAEQsYCRYPCQMCAQIBRRMIAQGIDwy7TgSPFRaDKgOIco1jhWeLDoMqHA
X-IronPort-AV: E=Sophos;i="4.84,819,1355126400"; d="scan'208";a="16947872"
Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local)
 ([66.102.144.170])
 by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA;
 10 Mar 2013 13:36:55 -0700
Message-ID: <513CEEE7.8090400@caltel.com>
Date: Sun, 10 Mar 2013 13:36:55 -0700
From: Cody Ritts <cr@caltel.com>
Organization: CalTel
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6;
 rv:17.0) Gecko/20130216 Thunderbird/17.0.3
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com>
 <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
 <513CE369.4030303@caltel.com>
 <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
In-Reply-To: <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 20:37:05 -0000

I have never seen ANY reference to installing /boot/zfsboot with 
gpart.  And if you look at how that ZFS boot code is installed, it is 
not like any of the other boot codes that gpart does install.  I don't 
even know how you would construct the syntax.  The bootcode arguments 
seem to be GPT and MBR specific, and mutually exclusive.

https://wiki.freebsd.org/RootOnZFS/ZFSBootPartition#A_Installing_FreeBSD_to_the_ZFS_filesystem

I know that if I leave either of those dd commands out when installing, 
the system will not boot.  If I leave out the ada0s1 boot code, the 
system just hangs after mbr is run.  If I leave out the ada0s1a boot 
code, I get an error. (I don't remember what.)
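
For reference, the two dd commands write to different places, which may be why neither can be left out. A minimal sh sketch of the offsets, assuming dd's default 512-byte block size (no bs= is given in the commands); the helper name is hypothetical:

```sh
#!/bin/sh
# Byte offset that a dd seek= (or skip=) operand corresponds to.
# $1 = block count, $2 = block size in bytes (dd's default is 512).
dd_offset_bytes() {
    echo $(( $1 * ${2:-512} ))
}

dd_offset_bytes 1      # 512: count=1 copies only the first 512 bytes
dd_offset_bytes 1024   # 524288: seek=1024 starts writing 512 KiB in
```

So the first command copies the first sector of zfsboot to the start of the target, and the second copies the rest of the file (skip=1) to a point 512 KiB into its target.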

Ultimately, I am aligned and running so I am happy enough. 32MB is an 
acceptable loss.  I am burnt out on struggling with the bootloader.  I 
have been doing it in various ways for DAYS now trying to just figure 
out ZFS boot at all.  There are a lot of broken (or old?) ZFS boot 
howtos that made the process more difficult than it should be for me 
because I wasn't lucky enough to stumble onto the "correct" articles right 
off the bat.  I do everything the hard way :)

Thanks

Cody

On 3/10/13 12:51 PM, Warren Block wrote:
> On Sun, 10 Mar 2013, Cody Ritts wrote:
>
>> Yeah, just for clarity:
>>
>>> root@lightning# gpart show da0s1
>>> =>       0  39068064  da0s1  BSD  (18G)
>>>           0        32         - free -  (16k)
>>>          32  39067648      1  freebsd-zfs  (18G)
>>>    39067680       384         - free -  (192k)
>>
>> and
>>
>>> dd if=/boot/zfsboot of=/dev/da0s1 count=1
>>> dd if=/boot/zfsboot of=/dev/da0s1a skip=1 seek=1024
>>
>> will not boot.
>
> But is that putting zfsboot in the right place?  Try installing zfsboot
> with gpart.
>

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 20:43:06 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 44ADE624
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 20:43:06 +0000 (UTC)
 (envelope-from cross+freebsd@distal.com)
Received: from mail.distal.com (mail.distal.com [IPv6:2001:470:e24c:200::ae25])
 by mx1.freebsd.org (Postfix) with ESMTP id 1FC5FF0A
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 20:43:05 +0000 (UTC)
Received: from magrathea.distal.com (magrathea.distal.com [206.138.151.12])
 (authenticated bits=0)
 by mail.distal.com (8.14.3/8.14.3) with ESMTP id r2AKfmNE029825
 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NO);
 Sun, 10 Mar 2013 16:41:49 -0400 (EDT)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\))
Subject: Re: Aligning MBR for ZFS boot help
From: Chris Ross <cross+freebsd@distal.com>
In-Reply-To: <513CEEE7.8090400@caltel.com>
Date: Sun, 10 Mar 2013 16:41:48 -0400
Content-Transfer-Encoding: quoted-printable
Message-Id: <6A16D732-01B5-43A8-B676-65B9B35C1FDA@distal.com>
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com>
 <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
 <513CE369.4030303@caltel.com>
 <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
 <513CEEE7.8090400@caltel.com>
To: Cody Ritts <cr@caltel.com>
X-Mailer: Apple Mail (2.1499)
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 20:43:06 -0000


On Mar 10, 2013, at 16:36 , Cody Ritts <cr@caltel.com> wrote:
> I have never seen ANY reference to installing the /boot/zfsboot with
> gpart.  And if you look at how that ZFS boot code is installed, it is
> not like any of the other boot codes that gpart does install.  I don't
> even know how you would construct the syntax.  The bootcode arguments
> seem to be GPT and MBR specific, and mutually exclusive.

  This is a sparc64 reference, but it does install /boot/zfsboot with
gpart.  FYI.

http://lists.freebsd.org/pipermail/freebsd-sparc64/2012-July/008489.html

                   - Chris



From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 21:40:09 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 07D9E5EA
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 21:40:09 +0000 (UTC)
 (envelope-from freebsd@penx.com)
Received: from btw.pki2.com (btw.pki2.com [IPv6:2001:470:a:6fd::2])
 by mx1.freebsd.org (Postfix) with ESMTP id C514B182
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 21:40:08 +0000 (UTC)
Received: from [127.0.0.1] (localhost [127.0.0.1])
 by btw.pki2.com (8.14.6/8.14.5) with ESMTP id r2ALduqo078414;
 Sun, 10 Mar 2013 14:39:56 -0700 (PDT)
 (envelope-from freebsd@penx.com)
Subject: Re: Aligning MBR for ZFS boot help
From: Dennis Glatting <freebsd@penx.com>
To: Warren Block <wblock@wonkity.com>
In-Reply-To: <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com>
 <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
 <513CE369.4030303@caltel.com>
 <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
Content-Type: text/plain; charset="us-ascii"
Date: Sun, 10 Mar 2013 14:39:55 -0700
Message-ID: <1362951595.99445.2.camel@btw.pki2.com>
Mime-Version: 1.0
X-Mailer: Evolution 2.32.1 FreeBSD GNOME Team Port 
Content-Transfer-Encoding: 7bit
X-yoursite-MailScanner-Information: Dennis Glatting
X-yoursite-MailScanner-ID: r2ALduqo078414
X-yoursite-MailScanner: Found to be clean
X-MailScanner-From: freebsd@penx.com
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 21:40:09 -0000

Sorry for the stupid question, but are these issues and procedures
written up somewhere?


On Sun, 2013-03-10 at 13:51 -0600, Warren Block wrote:
> On Sun, 10 Mar 2013, Cody Ritts wrote:
> 
> > Yeah, just for clarity:
> >
> >> root@lightning# gpart show da0s1
> >> =>       0  39068064  da0s1  BSD  (18G)
> >>           0        32         - free -  (16k)
> >>          32  39067648      1  freebsd-zfs  (18G)
> >>    39067680       384         - free -  (192k)
> >
> > and
> >
> >> dd if=/boot/zfsboot of=/dev/da0s1 count=1
> >> dd if=/boot/zfsboot of=/dev/da0s1a skip=1 seek=1024
> >
> > will not boot.
> 
> But is that putting zfsboot in the right place?  Try installing zfsboot 
> with gpart.



From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 22:57:36 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 51B4A24F
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 22:57:36 +0000 (UTC)
 (envelope-from nowakpl@platinum.linux.pl)
Received: from platinum.linux.pl (platinum.edu.pl [81.161.192.4])
 by mx1.freebsd.org (Postfix) with ESMTP id EE51865A
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 22:57:35 +0000 (UTC)
Received: by platinum.linux.pl (Postfix, from userid 87)
 id A3FCA47E1D; Sun, 10 Mar 2013 23:52:22 +0100 (CET)
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on platinum.linux.pl
X-Spam-Level: 
X-Spam-Status: No, score=-1.3 required=3.0 tests=ALL_TRUSTED,AWL
 autolearn=disabled version=3.3.2
Received: from [10.255.1.2] (unknown [83.151.38.73])
 by platinum.linux.pl (Postfix) with ESMTPA id 09DEB47E11
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 23:52:22 +0100 (CET)
Message-ID: <513D0E90.5090105@platinum.linux.pl>
Date: Sun, 10 Mar 2013 23:52:00 +0100
From: Adam Nowacki <nowakpl@platinum.linux.pl>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:17.0) Gecko/20130215 Thunderbird/17.0.3
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com>
In-Reply-To: <513C1629.50501@caltel.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 22:57:36 -0000

I don't think zfsboot is aware of the BSD disklabel (offsets other than 0 
won't boot). Is there any reason you are using a BSD disklabel and not a 
two-partition MBR?

I also don't think there is any merit in aligning to 1MiB. Most ZFS IOs 
will be aligned to the sector size (ashift). Unless the ZFS pool is created 
with a higher ashift, the 63-sector offset is as good as any.

gpart create -s mbr ada0
gpart add -s 52862M -t freebsd ada0
gpart add -s 8G -t freebsd ada0

gpart bootcode -b /boot/mbr ada0
dd if=/boot/zfsboot of=/dev/ada0s1 count=1
dd if=/boot/zfsboot of=/dev/ada0s1 skip=1 seek=1024

zpool create ... /dev/ada0s1
swapon /dev/ada0s2

If you still want that 1MB alignment then you will have to (as explained 
by others) align the MBR partition.

On 2013-03-10 06:12, Cody Ritts wrote:
> Hello all,
>
> I am really struggling to understand what is going on, if anyone could
> tell me where I am going wrong, I would greatly appreciate it.
>
> I have a new intel atom appliance that will not boot from a GPT
> partition table.  It came with an SSD, so I am trying to align it to 1MB
> for the erase block size.
>
> All of these commands are being run from a 9.1-RELEASE-amd64-memstick
>
>
> These commands partition the drive, the system boots just fine:
>> gpart create -s mbr ada0
>> gpart add -t freebsd ada0
>> gpart create -s bsd ada0s1
>> gpart add -s 52862M -t freebsd-zfs  ada0s1
>> gpart add -s 8G  -t freebsd-swap ada0s1
>> gpart set -a active -i 1 ada0
>> gpart bootcode -b /boot/mbr ada0
>> dd if=/boot/zfsboot of=/dev/ada0s1 count=1
>> dd if=/boot/zfsboot of=/dev/ada0s1a skip=1 seek=1024
>
> This is the gpart print output of those commands
>> =>       63  125045361  ada0  MBR  (59G)
>>          63  125045361     1  freebsd  [active]  (59G)
>>
>> =>        0  125045361  ada0s1  BSD  (59G)
>>           0  108261376       1  freebsd-zfs  (51G)
>>   108261376   16777216       2  freebsd-swap  (8.0G)
>>   125038592       6769          - free -  (3.3M)
>
> Here is my disk info
>> root@:/root # diskinfo -v ada0
>> ada0
>>     512             # sectorsize
>>     64023257088     # mediasize in bytes (59G)
>>     125045424       # mediasize in sectors
>>     0               # stripesize
>>     0               # stripeoffset
>>     124053          # Cylinders according to firmware.
>>     16              # Heads according to firmware.
>>     63              # Sectors according to firmware.
>
> 125045361 + 63 = 125045424
> So gpart is for sure printing sectors.
> freebsd-zfs starts at sector 63
>
> So, I need that freebsd-zfs slice to start at 1MB
> 1MB = 2048s
> 2048 - 63 = 1985
> so if I add an offset to my slice:
>> gpart add -b 1985 -s 52862M -t freebsd-zfs ada0s1
>
> should start me at 2048.
>> =>       63  125045361  ada0  MBR  (59G)
>>          63  125045361     1  freebsd  [active]  (59G)
>> =>        0  125045361  ada0s1  BSD  (59G)
>>           0       1985          - free -  (992k)
>>        1985  108261376       1  freebsd-zfs  (51G)
>
> BUT, when I boot, I get this:
>> zfsboot: No ZFS Pools located, can't boot
>
> I think I remember reading that freebsd-zfs had to be the first slice (I
> cannot remember where I read that).  And it apparently does not think an
> offset is funny.
>
> So, that leaves me with trying to adjust my MBR partition, so I start
> over and run:
>> gpart add -b 1985 -t freebsd ada0
>
> but that gives me:
>> =>       63  125045361  ada0  MBR  (59G)
>>          63       1953        - free -  (976k)
>>        2016  125043408     1  freebsd  (59G)
>
> HHHMMMMM????  Well, 2016 - 1953 = 63.  Coincidence?  I doubt it, but I
> don't get it.
>
> Poking around on the internet, it looks like gpart is possibly enforcing
> geometry boundaries? so I do the following:
>
>> sysctl kern.geom.part.check_integrity=0
>> root@:/root # gpart add -a 1m -t freebsd ada0
>> ada0s1 added
>> root@:/root # gpart show
>> =>       63  125045361  ada0  MBR  (59G)
>>          63       2016        - free -  (1M)
>>        2079  125042652     1  freebsd  (59G)
>>   125044731        693        - free -  (346k)
>
> Obviously still didn't work.
>
>
> I try a 10MB offset.
> 10MB = 20480s
> 20480-63  = 20417s
>> gpart add -b 20417 -t freebsd ada0
>> =>       63  125045361  ada0  MBR  (59G)
>>          63      20412        - free -  (10M)
>>       20475  125024949     1  freebsd  (59G)
>
> It is still just a few sectors off.  So what if I let gpart
> automatically align it?
>
>> gpart add -a 1m -t freebsd ada0
>
>> =>       63  125045361  ada0  MBR  (59G)
>>          63       2016        - free -  (1M)
>>        2079  125042652     1  freebsd  (59G)
>>   125044731        693        - free -  (346k)
>
>
> And 2079 is still != 2048.
>
> I have tried adjusting those numbers one by one, and it just hops around
> the number I am looking for.  I have tried adding partitions in front of
> it, setting the alignment to 1s, and adjusting the size.  I cannot get
> it to land on 2048.
>
> It does boot with the padding in the MBR table, but I don't think it is
> aligned.  Maybe it is aligned, and I just don't know any better.
>
> I am at a loss.
>
> Any suggestions would be greatly appreciated.
>
> Thanks,
>
> Cody


From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 10 23:59:18 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id E40C3D6E
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 23:59:18 +0000 (UTC)
 (envelope-from wblock@wonkity.com)
Received: from wonkity.com (wonkity.com [67.158.26.137])
 by mx1.freebsd.org (Postfix) with ESMTP id 9B306827
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 23:59:18 +0000 (UTC)
Received: from wonkity.com (localhost [127.0.0.1])
 by wonkity.com (8.14.6/8.14.6) with ESMTP id r2ANxHxN009350;
 Sun, 10 Mar 2013 17:59:17 -0600 (MDT)
 (envelope-from wblock@wonkity.com)
Received: from localhost (wblock@localhost)
 by wonkity.com (8.14.6/8.14.6/Submit) with ESMTP id r2ANxHob009347;
 Sun, 10 Mar 2013 17:59:17 -0600 (MDT)
 (envelope-from wblock@wonkity.com)
Date: Sun, 10 Mar 2013 17:59:17 -0600 (MDT)
From: Warren Block <wblock@wonkity.com>
To: Adam Nowacki <nowakpl@platinum.linux.pl>
Subject: Re: Aligning MBR for ZFS boot help
In-Reply-To: <513D0E90.5090105@platinum.linux.pl>
Message-ID: <alpine.BSF.2.00.1303101749370.8481@wonkity.com>
References: <513C1629.50501@caltel.com> <513D0E90.5090105@platinum.linux.pl>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7
 (wonkity.com [127.0.0.1]); Sun, 10 Mar 2013 17:59:17 -0600 (MDT)
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Mar 2013 23:59:19 -0000

On Sun, 10 Mar 2013, Adam Nowacki wrote:

> I don't think zfsboot is aware of BSD disklabel (offsets other than 0 won't 
> boot). Is there any reason you are using BSD disklabel and not two partition 
> MBR?

MBR slices created on FreeBSD are forced to CHS alignment, pretty much 
always misaligned for 4K-block hard drives or SSDs.

> I also don't think there is any merit in aligning to 1MiB. Most ZFS IOs will 
> be aligned to sector size (ashift). Unless ZFS pool is created with higher 
> ashift then the 63 sector offset is as good as any.

If the drive has 4K sectors, that will be misaligned, potentially 
cutting speeds drastically even if ashift is 9.
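
As a quick check of that claim, here is a small sh sketch that computes how far a start sector sits from a physical block boundary. It assumes 512-byte logical sectors and a 4096-byte physical block, as discussed above; the function name is hypothetical.

```sh
#!/bin/sh
# Remainder, in bytes, of a start sector's byte offset modulo the
# physical block size.  Zero means the start is aligned.
# $1 = start sector, $2 = logical sector size, $3 = physical block size.
misalignment() {
    echo $(( ($1 * $2) % $3 ))
}

misalignment 63 512 4096     # 3584: sector 63 is not on a 4K boundary
misalignment 2048 512 4096   # 0: sector 2048 is on a 4K boundary
```

With a start at sector 63, every 4K block written by the filesystem would straddle two physical 4K blocks on such a drive.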

From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 11 00:39:33 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 083632B1
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 00:39:33 +0000 (UTC)
 (envelope-from wblock@wonkity.com)
Received: from wonkity.com (wonkity.com [67.158.26.137])
 by mx1.freebsd.org (Postfix) with ESMTP id AFC05974
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 00:39:32 +0000 (UTC)
Received: from wonkity.com (localhost [127.0.0.1])
 by wonkity.com (8.14.6/8.14.6) with ESMTP id r2B0dU2X009599;
 Sun, 10 Mar 2013 18:39:30 -0600 (MDT)
 (envelope-from wblock@wonkity.com)
Received: from localhost (wblock@localhost)
 by wonkity.com (8.14.6/8.14.6/Submit) with ESMTP id r2B0dUKW009596;
 Sun, 10 Mar 2013 18:39:30 -0600 (MDT)
 (envelope-from wblock@wonkity.com)
Date: Sun, 10 Mar 2013 18:39:30 -0600 (MDT)
From: Warren Block <wblock@wonkity.com>
To: Dennis Glatting <freebsd@penx.com>
Subject: Re: Aligning MBR for ZFS boot help
In-Reply-To: <1362951595.99445.2.camel@btw.pki2.com>
Message-ID: <alpine.BSF.2.00.1303101807550.8481@wonkity.com>
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com>
 <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
 <513CE369.4030303@caltel.com>
 <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
 <1362951595.99445.2.camel@btw.pki2.com>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7
 (wonkity.com [127.0.0.1]); Sun, 10 Mar 2013 18:39:30 -0600 (MDT)
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 11 Mar 2013 00:39:33 -0000

On Sun, 10 Mar 2013, Dennis Glatting wrote:

> Sorry for the stupid question but is this issue (and issues) and
> procedures written up somewhere?

The wiki shows using GPT:

https://wiki.freebsd.org/RootOnZFS/GPTZFSBoot/9.0-RELEASE

The issue with MBR is that FreeBSD strictly follows the standard of 
alignment to CHS values.  However, for the last couple of decades, 
drives have used variable geometry to fit more data on the outside 
tracks, so CHS values don't apply any more.  Combine this with the new 
need to align MBR slices to particular values of 4K or 1M, and FreeBSD 
has a problem.

Solution: use GPT when possible.  But some systems won't boot from GPT.

If MBR partitioning is required, and alignment is needed, use some other 
operating system to create the MBR, and don't try to edit the slices on 
FreeBSD.

I don't know if anyone has documented all this in one place.  If FreeBSD 
had a way to turn off the strict enforcement of CHS values for MBR, it 
would make that unnecessary.
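
The start sectors reported earlier in the thread are consistent with the requested start being rounded up to a multiple of the drive's 63 sectors per track. The sh sketch below only reproduces that arithmetic; it is an assumption about the observed behavior, not a description of gpart's actual code, and the function name is hypothetical.

```sh
#!/bin/sh
# Round a requested start sector up to the next multiple of the
# sectors-per-track value ($2, defaulting to 63 as reported by diskinfo).
round_up_to_track() {
    spt=${2:-63}
    echo $(( ( ($1 + spt - 1) / spt ) * spt ))
}

round_up_to_track 1985    # 2016, as seen with "gpart add -b 1985"
round_up_to_track 2048    # 2079, as seen with "gpart add -a 1m"
round_up_to_track 20417   # 20475, as seen with "gpart add -b 20417"
```

Since 2048 is not a multiple of 63, no requested start can land on it under this rounding, which matches the "hops around the number" behavior described in the original post.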

From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 11 04:19:40 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id E08A121F
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 04:19:40 +0000 (UTC)
 (envelope-from jdavidlists@gmail.com)
Received: from mail-ie0-x22d.google.com (mail-ie0-x22d.google.com
 [IPv6:2607:f8b0:4001:c03::22d])
 by mx1.freebsd.org (Postfix) with ESMTP id B496D309
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 04:19:40 +0000 (UTC)
Received: by mail-ie0-f173.google.com with SMTP id 9so4251761iec.32
 for <freebsd-fs@freebsd.org>; Sun, 10 Mar 2013 21:19:40 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:sender:in-reply-to:references:date
 :x-google-sender-auth:message-id:subject:from:to:cc:content-type;
 bh=Dxu8GRKfhgaKjyOXqfsUVmWp9yzwDVq/TOrQjV66Nds=;
 b=wSI2t8Pp84JbALL1tizqWjHdOAMxgbvcZcR5TC2naSL8V9eEs4YBiai0EZf12cNfwc
 v1d1+EjpDBxJ7VMXuZmJfXxCtH78i95JdSBw7Ib0hUMVkGVZ4XYbIRw05B2yytDUjRfe
 eakYl+vWmJBtBaU/AX5nUFG6zATJXUk1CtR6ZMcvv+WTvit+Kz6NHmUkGwMCy0/pRWDJ
 BzE+4J583fEiZNWsYDvyQUvJUhoBfkDjP0RnpAuCwbt8TiD27w/GdAXphjoW3ycQVMMY
 njsnJasc4lmVqv39IPOoNpaQXf7D0kyNja68Rbcn2c/neXWtrR60Rymg6sGqeo9YGZZD
 9rEA==
MIME-Version: 1.0
X-Received: by 10.42.150.131 with SMTP id a3mr7347785icw.8.1362975580465; Sun,
 10 Mar 2013 21:19:40 -0700 (PDT)
Sender: jdavidlists@gmail.com
Received: by 10.42.153.133 with HTTP; Sun, 10 Mar 2013 21:19:40 -0700 (PDT)
In-Reply-To: <alpine.BSF.2.00.1303101807550.8481@wonkity.com>
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com>
 <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
 <513CE369.4030303@caltel.com>
 <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
 <1362951595.99445.2.camel@btw.pki2.com>
 <alpine.BSF.2.00.1303101807550.8481@wonkity.com>
Date: Mon, 11 Mar 2013 00:19:40 -0400
X-Google-Sender-Auth: eGnFIBBt0Gm9FcXDKjkiA8kgoS8
Message-ID: <CABXB=RTt-j0SGxktWMfLcgLAEN6Vi+f=psBuN0jQaJthk_3cbw@mail.gmail.com>
Subject: Re: Aligning MBR for ZFS boot help
From: J David <j.david.lists@gmail.com>
To: Warren Block <wblock@wonkity.com>
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 11 Mar 2013 04:19:40 -0000

On Sun, Mar 10, 2013 at 8:39 PM, Warren Block <wblock@wonkity.com> wrote:

> If FreeBSD had a way to turn off the strict enforcement of CHS values for
> MBR, it would make that unnecessary.
>

The solution to that for a drive of this size *might* be to use fdisk
instead of gpart to do the partitioning, and just flat out lie to it about
the geometry when it asks you if you want to use something other than what
it read.  (Which was also a lie anyway.)

From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 11 07:47:18 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 1961714B;
 Mon, 11 Mar 2013 07:47:18 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
 by mx1.freebsd.org (Postfix) with ESMTP id 859EFAFA;
 Mon, 11 Mar 2013 07:47:17 +0000 (UTC)
Received: from tom.home (kostik@localhost [127.0.0.1])
 by kib.kiev.ua (8.14.6/8.14.6) with ESMTP id r2B7lD7w092264;
 Mon, 11 Mar 2013 09:47:13 +0200 (EET)
 (envelope-from kostikbel@gmail.com)
DKIM-Filter: OpenDKIM Filter v2.8.0 kib.kiev.ua r2B7lD7w092264
Received: (from kostik@localhost)
 by tom.home (8.14.6/8.14.6/Submit) id r2B7lDQg092263;
 Mon, 11 Mar 2013 09:47:13 +0200 (EET)
 (envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com
 using -f
Date: Mon, 11 Mar 2013 09:47:13 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: mckusick@FreeBSD.org
Subject: Re: kern/162362: [snapshots] [panic] ufs with snapshot(s) panics
 when getting full
Message-ID: <20130311074713.GO3794@kib.kiev.ua>
References: <201303101557.r2AFva5E065896@freefall.freebsd.org>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="8CLqiYUo6qiluZPB"
Content-Disposition: inline
In-Reply-To: <201303101557.r2AFva5E065896@freefall.freebsd.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
 version=3.3.2
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home
Cc: freebsd-fs@FreeBSD.org, pho@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 11 Mar 2013 07:47:18 -0000


--8CLqiYUo6qiluZPB
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sun, Mar 10, 2013 at 03:57:36PM +0000, mckusick@FreeBSD.org wrote:
> Synopsis: [snapshots] [panic] ufs with snapshot(s) panics when getting full
>
> State-Changed-From-To: open->closed
> State-Changed-By: mckusick
> State-Changed-When: Sun Mar 10 15:57:04 UTC 2013
> State-Changed-Why:
> Closed at the request of the submitter.
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=162362

This is a known and still unresolved issue. It is reproducible on HEAD
as well.

--8CLqiYUo6qiluZPB
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iQIcBAEBAgAGBQJRPYwBAAoJEJDCuSvBvK1BatgP/3xU6T5cJW0fuNWGJP1HFw7Q
JiC/DiqHyCse7PG+FFTlY7ACw/U1oYHu2tvCzE2q4AMwOCzz3Wup503VcVa0OuC3
BSKDDiqF6HnLI3a6XC0bX3R34BTbAzAAxUbap1O1GtEEAhiUsYFOadyYD+WK8sTz
ASuFV6VGBCX6tFtP77RO3B67Hb6RW1m9L+BAPKnaMAN85Fdgau71/rnxVEGkRrHl
X48a7a9H3OyD3L5RC/a0NKe/gFHUOQ7S64hyIook0yoeakmRD1+Rt4ZgVlsqkw+W
c4CN80VcR4pHMamdG4LRAxypAl4cgzImm13xoennKjgp9Tt85ZSAaLI6r1yC5SRS
/vS9VVr27SPnyX7dqySpaCYVs36K+o7LOhhog8rPEph8C8L6Z+qWmzZjbnnTrifD
oX2zFPWcxpte16mCGfc+GevYNc2SmZ17zOqt9yWYbbPoMCA7adfs+nMtLv93eGVX
/0zpyCgQikRk/tyLwHySSNeU4RRS3vayDECsVJceQLgAGSBXWMIKt6B1r/l3d8bq
6gw2heWtSLlCK6bJuKTkf48Lp/+zc2uH05dhIUTiAezEn1+IVAowZP0OZujUSg5Z
O0Tb2+GBe4fLNVmlgLc3+lxxBsAkwy00eFCGU+u5PlkQAXrxqIJULzhxAzJ8EbjP
wh7nttKcevl42LRn4Ihe
=KM3W
-----END PGP SIGNATURE-----

--8CLqiYUo6qiluZPB--

From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 11 11:06:42 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 0859CA8E
 for <freebsd-fs@FreeBSD.org>; Mon, 11 Mar 2013 11:06:42 +0000 (UTC)
 (envelope-from owner-bugmaster@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
 [IPv6:2001:1900:2254:206c::16:87])
 by mx1.freebsd.org (Postfix) with ESMTP id EDAB77C4
 for <freebsd-fs@FreeBSD.org>; Mon, 11 Mar 2013 11:06:41 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
 by freefall.freebsd.org (8.14.6/8.14.6) with ESMTP id r2BB6fsV088967
 for <freebsd-fs@FreeBSD.org>; Mon, 11 Mar 2013 11:06:41 GMT
 (envelope-from owner-bugmaster@FreeBSD.org)
Received: (from gnats@localhost)
 by freefall.freebsd.org (8.14.6/8.14.6/Submit) id r2BB6fRd088965
 for freebsd-fs@FreeBSD.org; Mon, 11 Mar 2013 11:06:41 GMT
 (envelope-from owner-bugmaster@FreeBSD.org)
Date: Mon, 11 Mar 2013 11:06:41 GMT
Message-Id: <201303111106.r2BB6fRd088965@freefall.freebsd.org>
X-Authentication-Warning: freefall.freebsd.org: gnats set sender to
 owner-bugmaster@FreeBSD.org using -f
From: FreeBSD bugmaster <bugmaster@freebsd.org>
To: freebsd-fs@FreeBSD.org
Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 11 Mar 2013 11:06:42 -0000

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.


S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o bin/176253   fs         zpool(8): zfs pool indentation is misleading/wrong
o kern/176141  fs         [zfs] sharesmb=on makes errors for sharenfs, and still
o kern/175950  fs         [zfs] Possible deadlock in zfs after long uptime
o kern/175897  fs         [zfs] operations on readonly zpool hang
o kern/175179  fs         [zfs] ZFS may attach wrong device on move
o kern/175071  fs         [ufs] [panic] softdep_deallocate_dependencies: unrecov
o kern/174372  fs         [zfs] Pagefault appears to be related to ZFS
o kern/174315  fs         [zfs] chflags uchg not supported
o kern/174310  fs         [zfs] root point mounting broken on CURRENT with multi
o kern/174279  fs         [ufs] UFS2-SU+J journal and filesystem corruption
o kern/174060  fs         [ext2fs] Ext2FS system crashes (buffer overflow?)
o kern/173830  fs         [zfs] Brain-dead simple change to ZFS error descriptio
o kern/173718  fs         [zfs] phantom directory in zraid2 pool
f kern/173657  fs         [nfs] strange UID map with nfsuserd
o kern/173363  fs         [zfs] [panic] Panic on 'zpool replace' on readonly poo
o kern/173136  fs         [unionfs] mounting above the NFS read-only share panic
o kern/172348  fs         [unionfs] umount -f of filesystem in use with readonly
o kern/172334  fs         [unionfs] unionfs permits recursive union mounts; caus
o kern/171626  fs         [tmpfs] tmpfs should be noisier when the requested siz
o kern/171415  fs         [zfs] zfs recv fails with "cannot receive incremental 
o kern/170945  fs         [gpt] disk layout not portable between direct connect 
o bin/170778   fs         [zfs] [panic] FreeBSD panics randomly
o kern/170680  fs         [nfs] Multiple NFS Client bug in the FreeBSD 7.4-RELEA
o kern/170497  fs         [xfs][panic] kernel will panic whenever I ls a mounted
o kern/169945  fs         [zfs] [panic] Kernel panic while importing zpool (afte
o kern/169480  fs         [zfs] ZFS stalls on heavy I/O
o kern/169398  fs         [zfs] Can't remove file with permanent error
o kern/169339  fs         panic while " : > /etc/123"
o kern/169319  fs         [zfs] zfs resilver can't complete
o kern/168947  fs         [nfs] [zfs] .zfs/snapshot directory is messed up when 
o kern/168942  fs         [nfs] [hang] nfsd hangs after being restarted (not -HU
o kern/168158  fs         [zfs] incorrect parsing of sharenfs options in zfs (fs
o kern/167979  fs         [ufs] DIOCGDINFO ioctl does not work on 8.2 file syste
o kern/167977  fs         [smbfs] mount_smbfs results are differ when utf-8 or U
o kern/167688  fs         [fusefs] Incorrect signal handling with direct_io
o kern/167685  fs         [zfs] ZFS on USB drive prevents shutdown / reboot
o kern/167612  fs         [portalfs] The portal file system gets stuck inside po
o kern/167272  fs         [zfs] ZFS Disks reordering causes ZFS to pick the wron
o kern/167260  fs         [msdosfs] msdosfs disk was mounted the second time whe
o kern/167109  fs         [zfs] [panic] zfs diff kernel panic Fatal trap 9: gene
o kern/167105  fs         [nfs] mount_nfs can not handle source exports wiht mor
o kern/167067  fs         [zfs] [panic] ZFS panics the server
o kern/167065  fs         [zfs] boot fails when a spare is the boot disk
o kern/167048  fs         [nfs] [patch] RELEASE-9 crash when using ZFS+NULLFS+NF
o kern/166912  fs         [ufs] [panic] Panic after converting Softupdates to jo
o kern/166851  fs         [zfs] [hang] Copying directory from the mounted UFS di
o kern/166477  fs         [nfs] NFS data corruption.
o kern/165950  fs         [ffs] SU+J and fsck problem
o kern/165521  fs         [zfs] [hang] livelock on 1 Gig of RAM with zfs when 31
o kern/165392  fs         Multiple mkdir/rmdir fails with errno 31
o kern/165087  fs         [unionfs] lock violation in unionfs
o kern/164472  fs         [ufs] fsck -B panics on particular data inconsistency
o kern/164370  fs         [zfs] zfs destroy for snapshot fails on i386 and sparc
o kern/164261  fs         [nullfs] [patch] fix panic with NFS served from NULLFS
o kern/164256  fs         [zfs] device entry for volume is not created after zfs
o kern/164184  fs         [ufs] [panic] Kernel panic with ufs_makeinode
o kern/163801  fs         [md] [request] allow mfsBSD legacy installed in 'swap'
o kern/163770  fs         [zfs] [hang] LOR between zfs&syncer + vnlru leading to
o kern/163501  fs         [nfs] NFS exporting a dir and a subdir in that dir to 
o kern/162944  fs         [coda] Coda file system module looks broken in 9.0
o kern/162860  fs         [zfs] Cannot share ZFS filesystem to hosts with a hyph
o kern/162751  fs         [zfs] [panic] kernel panics during file operations
o kern/162591  fs         [nullfs] cross-filesystem nullfs does not work as expe
o kern/162519  fs         [zfs] "zpool import" relies on buggy realpath() behavi
o kern/161968  fs         [zfs] [hang] renaming snapshot with -r including a zvo
o kern/161864  fs         [ufs] removing journaling from UFS partition fails on 
o bin/161807   fs         [patch] add option for explicitly specifying metadata 
o kern/161579  fs         [smbfs] FreeBSD sometimes panics when an smb share is 
o kern/161533  fs         [zfs] [panic] zfs receive panic: system ioctl returnin
o kern/161438  fs         [zfs] [panic] recursed on non-recursive spa_namespace_
o kern/161424  fs         [nullfs] __getcwd() calls fail when used on nullfs mou
o kern/161280  fs         [zfs] Stack overflow in gptzfsboot
o kern/161205  fs         [nfs] [pfsync] [regression] [build] Bug report freebsd
o kern/161169  fs         [zfs] [panic] ZFS causes kernel panic in dbuf_dirty
o kern/161112  fs         [ufs] [lor] filesystem LOR in FreeBSD 9.0-BETA3
o kern/160893  fs         [zfs] [panic] 9.0-BETA2 kernel panic
o kern/160860  fs         [ufs] Random UFS root filesystem corruption with SU+J 
o kern/160801  fs         [zfs] zfsboot on 8.2-RELEASE fails to boot from root-o
o kern/160790  fs         [fusefs] [panic] VPUTX: negative ref count with FUSE
o kern/160777  fs         [zfs] [hang] RAID-Z3 causes fatal hang upon scrub/impo
o kern/160706  fs         [zfs] zfs bootloader fails when a non-root vdev exists
o kern/160591  fs         [zfs] Fail to boot on zfs root with degraded raidz2 [r
o kern/160410  fs         [smbfs] [hang] smbfs hangs when transferring large fil
o kern/160283  fs         [zfs] [patch] 'zfs list' does abort in make_dataset_ha
o kern/159930  fs         [ufs] [panic] kernel core
o kern/159402  fs         [zfs][loader] symlinks cause I/O errors
o kern/159357  fs         [zfs] ZFS MAXNAMELEN macro has confusing name (off-by-
o kern/159356  fs         [zfs] [patch] ZFS NAME_ERR_DISKLIKE check is Solaris-s
o kern/159351  fs         [nfs] [patch] - divide by zero in mountnfs()
o kern/159251  fs         [zfs] [request]: add FLETCHER4 as DEDUP hash option
o kern/159077  fs         [zfs] Can't cd .. with latest zfs version
o kern/159048  fs         [smbfs] smb mount corrupts large files
o kern/159045  fs         [zfs] [hang] ZFS scrub freezes system
o kern/158839  fs         [zfs] ZFS Bootloader Fails if there is a Dead Disk
o kern/158802  fs         amd(8) ICMP storm and unkillable process.
o kern/158231  fs         [nullfs] panic on unmounting nullfs mounted over ufs o
f kern/157929  fs         [nfs] NFS slow read
o kern/157399  fs         [zfs] trouble with: mdconfig force delete && zfs strip
o kern/157179  fs         [zfs] zfs/dbuf.c: panic: solaris assert: arc_buf_remov
o kern/156797  fs         [zfs] [panic] Double panic with FreeBSD 9-CURRENT and 
o kern/156781  fs         [zfs] zfs is losing the snapshot directory,
p kern/156545  fs         [ufs] mv could break UFS on SMP systems
o kern/156193  fs         [ufs] [hang] UFS snapshot hangs && deadlocks processes
o kern/156039  fs         [nullfs] [unionfs] nullfs + unionfs do not compose, re
o kern/155615  fs         [zfs] zfs v28 broken on sparc64 -current
o kern/155587  fs         [zfs] [panic] kernel panic with zfs
p kern/155411  fs         [regression] [8.2-release] [tmpfs]: mount: tmpfs : No 
o kern/155199  fs         [ext2fs] ext3fs mounted as ext2fs gives I/O errors
o bin/155104   fs         [zfs][patch] use /dev prefix by default when importing
o kern/154930  fs         [zfs] cannot delete/unlink file from full volume -> EN
o kern/154828  fs         [msdosfs] Unable to create directories on external USB
o kern/154491  fs         [smbfs] smb_co_lock: recursive lock for object 1
p kern/154228  fs         [md] md getting stuck in wdrain state
o kern/153996  fs         [zfs] zfs root mount error while kernel is not located
o kern/153753  fs         [zfs] ZFS v15 - grammatical error when attempting to u
o kern/153716  fs         [zfs] zpool scrub time remaining is incorrect
o kern/153695  fs         [patch] [zfs] Booting from zpool created on 4k-sector 
o kern/153680  fs         [xfs] 8.1 failing to mount XFS partitions
o kern/153418  fs         [zfs] [panic] Kernel Panic occurred writing to zfs vol
o kern/153351  fs         [zfs] locking directories/files in ZFS
o bin/153258   fs         [patch][zfs] creating ZVOLs requires `refreservation' 
s kern/153173  fs         [zfs] booting from a gzip-compressed dataset doesn't w
o bin/153142   fs         [zfs] ls -l outputs `ls: ./.zfs: Operation not support
o kern/153126  fs         [zfs] vdev failure, zpool=peegel type=vdev.too_small
o kern/152022  fs         [nfs] nfs service hangs with linux client [regression]
o kern/151942  fs         [zfs] panic during ls(1) zfs snapshot directory
o kern/151905  fs         [zfs] page fault under load in /sbin/zfs
o bin/151713   fs         [patch] Bug in growfs(8) with respect to 32-bit overfl
o kern/151648  fs         [zfs] disk wait bug
o kern/151629  fs         [fs] [patch] Skip empty directory entries during name 
o kern/151330  fs         [zfs] will unshare all zfs filesystem after execute a 
o kern/151326  fs         [nfs] nfs exports fail if netgroups contain duplicate 
o kern/151251  fs         [ufs] Can not create files on filesystem with heavy us
o kern/151226  fs         [zfs] can't delete zfs snapshot
o kern/150503  fs         [zfs] ZFS disks are UNAVAIL and corrupted after reboot
o kern/150501  fs         [zfs] ZFS vdev failure vdev.bad_label on amd64
o kern/150390  fs         [zfs] zfs deadlock when arcmsr reports drive faulted
o kern/150336  fs         [nfs] mountd/nfsd became confused; refused to reload n
o kern/149208  fs         mksnap_ffs(8) hang/deadlock
o kern/149173  fs         [patch] [zfs] make OpenSolaris <sys/nvpair.h> installa
o kern/149015  fs         [zfs] [patch] misc fixes for ZFS code to build on Glib
o kern/149014  fs         [zfs] [patch] declarations in ZFS libraries/utilities 
o kern/149013  fs         [zfs] [patch] make ZFS makefiles use the libraries fro
o kern/148504  fs         [zfs] ZFS' zpool does not allow replacing drives to be
o kern/148490  fs         [zfs]: zpool attach - resilver bidirectionally, and re
o kern/148368  fs         [zfs] ZFS hanging forever on 8.1-PRERELEASE
o kern/148138  fs         [zfs] zfs raidz pool commands freeze
o kern/147903  fs         [zfs] [panic] Kernel panics on faulty zfs device
o kern/147881  fs         [zfs] [patch] ZFS "sharenfs" doesn't allow different "
o kern/147420  fs         [ufs] [panic] ufs_dirbad, nullfs, jail panic (corrupt 
o kern/146941  fs         [zfs] [panic] Kernel Double Fault - Happens constantly
o kern/146786  fs         [zfs] zpool import hangs with checksum errors
o kern/146708  fs         [ufs] [panic] Kernel panic in softdep_disk_write_compl
o kern/146528  fs         [zfs] Severe memory leak in ZFS on i386
o kern/146502  fs         [nfs] FreeBSD 8 NFS Client Connection to Server
s kern/145712  fs         [zfs] cannot offline two drives in a raidz2 configurat
o kern/145411  fs         [xfs] [panic] Kernel panics shortly after mounting an 
f bin/145309   fs         bsdlabel: Editing disk label invalidates the whole dev
o kern/145272  fs         [zfs] [panic] Panic during boot when accessing zfs on 
o kern/145246  fs         [ufs] dirhash in 7.3 gratuitously frees hashes when it
o kern/145238  fs         [zfs] [panic] kernel panic on zpool clear tank
o kern/145229  fs         [zfs] Vast differences in ZFS ARC behavior between 8.0
o kern/145189  fs         [nfs] nfsd performs abysmally under load
o kern/144929  fs         [ufs] [lor] vfs_bio.c + ufs_dirhash.c
p kern/144447  fs         [zfs] sharenfs fsunshare() & fsshare_main() non functi
o kern/144416  fs         [panic] Kernel panic on online filesystem optimization
s kern/144415  fs         [zfs] [panic] kernel panics on boot after zfs crash
o kern/144234  fs         [zfs] Cannot boot machine with recent gptzfsboot code 
o kern/143825  fs         [nfs] [panic] Kernel panic on NFS client
o bin/143572   fs         [zfs] zpool(1): [patch] The verbose output from iostat
o kern/143212  fs         [nfs] NFSv4 client strange work ...
o kern/143184  fs         [zfs] [lor] zfs/bufwait LOR
o kern/142878  fs         [zfs] [vfs] lock order reversal
o kern/142597  fs         [ext2fs] ext2fs does not work on filesystems with real
o kern/142489  fs         [zfs] [lor] allproc/zfs LOR
o kern/142466  fs         Update 7.2 -> 8.0 on Raid 1 ends with screwed raid [re
o kern/142306  fs         [zfs] [panic] ZFS drive (from OSX Leopard) causes two 
o kern/142068  fs         [ufs] BSD labels are got deleted spontaneously
o kern/141897  fs         [msdosfs] [panic] Kernel panic. msdofs: file name leng
o kern/141463  fs         [nfs] [panic] Frequent kernel panics after upgrade fro
o kern/141305  fs         [zfs] FreeBSD ZFS+sendfile severe performance issues (
o kern/141091  fs         [patch] [nullfs] fix panics with DIAGNOSTIC enabled
o kern/141086  fs         [nfs] [panic] panic("nfs: bioread, not dir") on FreeBS
o kern/141010  fs         [zfs] "zfs scrub" fails when backed by files in UFS2
o kern/140888  fs         [zfs] boot fail from zfs root while the pool resilveri
o kern/140661  fs         [zfs] [patch] /boot/loader fails to work on a GPT/ZFS-
o kern/140640  fs         [zfs] snapshot crash
o kern/140068  fs         [smbfs] [patch] smbfs does not allow semicolon in file
o kern/139725  fs         [zfs] zdb(1) dumps core on i386 when examining zpool c
o kern/139715  fs         [zfs] vfs.numvnodes leak on busy zfs
p bin/139651   fs         [nfs] mount(8): read-only remount of NFS volume does n
o kern/139407  fs         [smbfs] [panic] smb mount causes system crash if remot
o kern/138662  fs         [panic] ffs_blkfree: freeing free block
o kern/138421  fs         [ufs] [patch] remove UFS label limitations
o kern/138202  fs         mount_msdosfs(1) see only 2Gb
o kern/136968  fs         [ufs] [lor] ufs/bufwait/ufs (open)
o kern/136945  fs         [ufs] [lor] filedesc structure/ufs (poll)
o kern/136944  fs         [ffs] [lor] bufwait/snaplk (fsync)
o kern/136873  fs         [ntfs] Missing directories/files on NTFS volume
o kern/136865  fs         [nfs] [patch] NFS exports atomic and on-the-fly atomic
p kern/136470  fs         [nfs] Cannot mount / in read-only, over NFS
o kern/135546  fs         [zfs] zfs.ko module doesn't ignore zpool.cache filenam
o kern/135469  fs         [ufs] [panic] kernel crash on md operation in ufs_dirb
o kern/135050  fs         [zfs] ZFS clears/hides disk errors on reboot
o kern/134491  fs         [zfs] Hot spares are rather cold...
o kern/133676  fs         [smbfs] [panic] umount -f'ing a vnode-based memory dis
p kern/133174  fs         [msdosfs] [patch] msdosfs must support multibyte inter
o kern/132960  fs         [ufs] [panic] panic:ffs_blkfree: freeing free frag
o kern/132397  fs         reboot causes filesystem corruption (failure to sync b
o kern/132331  fs         [ufs] [lor] LOR ufs and syncer
o kern/132237  fs         [msdosfs] msdosfs has problems to read MSDOS Floppy
o kern/132145  fs         [panic] File System Hard Crashes
o kern/131441  fs         [unionfs] [nullfs] unionfs and/or nullfs not combineab
o kern/131360  fs         [nfs] poor scaling behavior of the NFS server under lo
o kern/131342  fs         [nfs] mounting/unmounting of disks causes NFS to fail
o bin/131341   fs         makefs: error "Bad file descriptor"  on the mount poin
o kern/130920  fs         [msdosfs] cp(1) takes 100% CPU time while copying file
o kern/130210  fs         [nullfs] Error by check nullfs
o kern/129760  fs         [nfs] after 'umount -f' of a stale NFS share FreeBSD l
o kern/129488  fs         [smbfs] Kernel "bug" when using smbfs in smbfs_smb.c: 
o kern/129231  fs         [ufs] [patch] New UFS mount (norandom) option - mostly
o kern/129152  fs         [panic] non-userfriendly panic when trying to mount(8)
o kern/127787  fs         [lor] [ufs] Three LORs: vfslock/devfs/vfslock, ufs/vfs
o bin/127270   fs         fsck_msdosfs(8) may crash if BytesPerSec is zero
o kern/127029  fs         [panic] mount(8): trying to mount a write protected zi
o kern/126287  fs         [ufs] [panic] Kernel panics while mounting an UFS file
o kern/125895  fs         [ffs] [panic] kernel: panic: ffs_blkfree: freeing free
s kern/125738  fs         [zfs] [request] SHA256 acceleration in ZFS
o kern/123939  fs         [msdosfs] corrupts new files
o kern/122380  fs         [ffs] ffs_valloc:dup alloc (Soekris 4801/7.0/USB Flash
o bin/122172   fs         [fs]: amd(8) automount daemon dies on 6.3-STABLE i386,
o bin/121898   fs         [nullfs] pwd(1)/getcwd(2) fails with Permission denied
o bin/121072   fs         [smbfs] mount_smbfs(8) cannot normally convert the cha
o kern/120483  fs         [ntfs] [patch] NTFS filesystem locking changes
o kern/120482  fs         [ntfs] [patch] Sync style changes between NetBSD and F
o kern/118912  fs         [2tb] disk sizing/geometry problem with large array
o kern/118713  fs         [minidump] [patch] Display media size required for a k
o kern/118318  fs         [nfs] NFS server hangs under special circumstances
o bin/118249   fs         [ufs] mv(1): moving a directory changes its mtime
o kern/118126  fs         [nfs] [patch] Poor NFS server write performance
o kern/118107  fs         [ntfs] [panic] Kernel panic when accessing a file at N
o kern/117954  fs         [ufs] dirhash on very large directories blocks the mac
o bin/117315   fs         [smbfs] mount_smbfs(8) and related options can't mount
o kern/117158  fs         [zfs] zpool scrub causes panic if geli vdevs detach on
o bin/116980   fs         [msdosfs] [patch] mount_msdosfs(8) resets some flags f
o conf/116931  fs         lack of fsck_cd9660 prevents mounting iso images with 
o kern/116583  fs         [ffs] [hang] System freezes for short time when using 
o bin/115361   fs         [zfs] mount(8) gets into a state where it won't set/un
o kern/114955  fs         [cd9660] [patch] [request] support for mask,dirmask,ui
o kern/114847  fs         [ntfs] [patch] [request] dirmask support for NTFS ala 
o kern/114676  fs         [ufs] snapshot creation panics: snapacct_ufs2: bad blo
o bin/114468   fs         [patch] [request] add -d option to umount(8) to detach
o kern/113852  fs         [smbfs] smbfs does not properly implement DFS referral
o bin/113838   fs         [patch] [request] mount(8): add support for relative p
o bin/113049   fs         [patch] [request] make quot(8) use getopt(3) and show 
o kern/112658  fs         [smbfs] [patch] smbfs and caching problems (resolves b
o kern/111843  fs         [msdosfs] Long Names of files are incorrectly created 
o kern/111782  fs         [ufs] dump(8) fails horribly for large filesystems
s bin/111146   fs         [2tb] fsck(8) fails on 6T filesystem
o bin/107829   fs         [2TB] fdisk(8): invalid boundary checking in fdisk / w
o kern/106107  fs         [ufs] left-over fsck_snapshot after unfinished backgro
o kern/104406  fs         [ufs] Processes get stuck in "ufs" state under persist
o kern/104133  fs         [ext2fs] EXT2FS module corrupts EXT2/3 filesystems
o kern/103035  fs         [ntfs] Directories in NTFS mounted disc images appear 
o kern/101324  fs         [smbfs] smbfs sometimes not case sensitive when it's s
o kern/99290   fs         [ntfs] mount_ntfs ignorant of cluster sizes
s bin/97498    fs         [request] newfs(8) has no option to clear the first 12
o kern/97377   fs         [ntfs] [patch] syntax cleanup for ntfs_ihash.c
o kern/95222   fs         [cd9660] File sections on ISO9660 level 3 CDs ignored
o kern/94849   fs         [ufs] rename on UFS filesystem is not atomic
o bin/94810    fs         fsck(8) incorrectly reports 'file system marked clean'
o kern/94769   fs         [ufs] Multiple file deletions on multi-snapshotted fil
o kern/94733   fs         [smbfs] smbfs may cause double unlock
o kern/93942   fs         [vfs] [patch] panic: ufs_dirbad: bad dir (patch from D
o kern/92272   fs         [ffs] [hang] Filling a filesystem while creating a sna
o kern/91134   fs         [smbfs] [patch] Preserve access and modification time 
a kern/90815   fs         [smbfs] [patch] SMBFS with character conversions somet
o kern/88657   fs         [smbfs] windows client hang when browsing a samba shar
o kern/88555   fs         [panic] ffs_blkfree: freeing free frag on AMD 64
o bin/87966    fs         [patch] newfs(8): introduce -A flag for newfs to enabl
o kern/87859   fs         [smbfs] System reboot while umount smbfs.
o kern/86587   fs         [msdosfs] rm -r /PATH fails with lots of small files
o bin/85494    fs         fsck_ffs: unchecked use of cg_inosused macro etc.
o kern/80088   fs         [smbfs] Incorrect file time setting on NTFS mounted vi
o bin/74779    fs         Background-fsck checks one filesystem twice and omits 
o kern/73484   fs         [ntfs] Kernel panic when doing `ls` from the client si
o bin/73019    fs         [ufs] fsck_ufs(8) cannot alloc 607016868 bytes for ino
o kern/71774   fs         [ntfs] NTFS cannot "see" files on a WinXP filesystem
o bin/70600    fs         fsck(8) throws files away when it can't grow lost+foun
o kern/68978   fs         [panic] [ufs] crashes with failing hard disk, loose po
o kern/65920   fs         [nwfs] Mounted Netware filesystem behaves strange
o kern/65901   fs         [smbfs] [patch] smbfs fails fsx write/truncate-down/tr
o kern/61503   fs         [smbfs] mount_smbfs does not work as non-root
o kern/55617   fs         [smbfs] Accessing an nsmb-mounted drive via a smb expo
o kern/51685   fs         [hang] Unbounded inode allocation causes kernel to loc
o kern/36566   fs         [smbfs] System reboot with dead smb mount and umount
o bin/27687    fs         fsck(8) wrapper is not properly passing options to fsc
o kern/18874   fs         [2TB] 32bit NFS servers export wrong negative values t

298 problems total.


From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 11 17:19:22 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id A759C297
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 17:19:22 +0000 (UTC)
 (envelope-from cr@caltel.com)
Received: from mail2.caltel.com (mail2.caltel.com [66.102.145.6])
 by mx1.freebsd.org (Postfix) with ESMTP id 8D5EB19C
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 17:19:22 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqEEAEkRPlFCZpCq/2dsb2JhbABDxGCBc3SCKQEBAQMBAQEBNRYgCgYLCxgJFg8JAwIBAgEVAQkmDgUCBAEBAQEXAgSHbAYMvWWNXYE4FoMqA4hyiyWCPoEehEmLDoMqHDKBBQ
X-IPAS-Result: AqEEAEkRPlFCZpCq/2dsb2JhbABDxGCBc3SCKQEBAQMBAQEBNRYgCgYLCxgJFg8JAwIBAgEVAQkmDgUCBAEBAQEXAgSHbAYMvWWNXYE4FoMqA4hyiyWCPoEehEmLDoMqHDKBBQ
X-IronPort-AV: E=Sophos;i="4.84,825,1355126400"; 
   d="scan'208";a="842243"
Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local)
 ([66.102.144.170])
 by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA;
 11 Mar 2013 10:19:05 -0700
Message-ID: <513E1208.5020804@caltel.com>
Date: Mon, 11 Mar 2013 10:19:04 -0700
From: Cody Ritts <cr@caltel.com>
Organization: CalTel
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6;
 rv:17.0) Gecko/20130216 Thunderbird/17.0.3
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com>
 <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
 <513CE369.4030303@caltel.com>
 <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
 <1362951595.99445.2.camel@btw.pki2.com>
 <alpine.BSF.2.00.1303101807550.8481@wonkity.com>
 <CABXB=RTt-j0SGxktWMfLcgLAEN6Vi+f=psBuN0jQaJthk_3cbw@mail.gmail.com>
In-Reply-To: <CABXB=RTt-j0SGxktWMfLcgLAEN6Vi+f=psBuN0jQaJthk_3cbw@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 11 Mar 2013 17:19:22 -0000

Update --

fdisk WILL allow you to align without regard to drive geometry.

It can only be done in interactive mode:
http://lists.freebsd.org/pipermail/freebsd-geom/2011-May/004780.html

> fdisk -i /dev/ada0
> Do you want to change our idea of what BIOS thinks ? [n]
> The data for partition 1 is:
> Do you want to change it? [n] y
> Supply a decimal value for "sysid (165=FreeBSD)" [165]
> Supply a decimal value for "start" [63] 2048
> Supply a decimal value for "size" [125045361] 125043376
> Correct this automatically? [n]
> Explicitly specify beg/end address ? [n]
> Are we happy with this entry? [n] y
> Do you want to change the active partition? [n]
> Should we write new partition table? [n] y
>
> gpart show ada0
> =>       63  125045361  ada0  MBR  (59G)
>          63       1985        - free -  (992k)
>        2048  125043376     1  freebsd  [active]  (59G)
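
The numbers in the transcript above can be sanity-checked with a little
arithmetic. This is only a sketch: it assumes 512-byte sectors (not stated
in the gpart output, but consistent with its 59G figure), and the variable
and function names are made up for illustration.

```python
# Assumption: 512-byte logical sectors.
SECTOR_BYTES = 512

# From "gpart show ada0": usable area starts at sector 63 and spans 125045361 sectors.
first_usable = 63
usable_sectors = 125045361
disk_end = first_usable + usable_sectors  # first sector past the usable area

# Values typed into fdisk -i.
start = 2048
size = disk_end - start  # largest size that still fits after moving the start


def is_aligned(sector, boundary_bytes):
    """True if the byte offset of `sector` is a multiple of `boundary_bytes`."""
    return (sector * SECTOR_BYTES) % boundary_bytes == 0


free_gap = start - first_usable  # sectors left unused in front of the slice

print(size)                             # 125043376, the "size" given to fdisk
print(free_gap)                         # 1985, the "- free -" row in gpart show
print(is_aligned(start, 1024 * 1024))   # True: sector 2048 sits on a 1 MiB boundary
print(is_aligned(63, 4096))             # False: the default start is not 4 KiB aligned
```

So the slice starts exactly at 1 MiB and runs to the end of the usable area,
at the cost of the 1985 free sectors shown by gpart.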

Thanks,

Cody



On 3/10/13 9:19 PM, J David wrote:
> On Sun, Mar 10, 2013 at 8:39 PM, Warren Block <wblock@wonkity.com> wrote:
>
>> If FreeBSD had a way to turn off the strict enforcement of CHS values for
>> MBR, it would make that unnecessary.
>>
>
> The solution to that for a drive of this size *might* be to use fdisk
> instead of gpart to do the partitioning, and just flat out lie to it about
> the geometry when it asks you if you want to use something other than what
> it read.  (Which was also a lie anyway.)
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>

From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 11 18:09:23 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id AAFA810E
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 18:09:23 +0000 (UTC)
 (envelope-from cr@caltel.com)
Received: from mail2.caltel.com (mail2.caltel.com [66.102.145.6])
 by mx1.freebsd.org (Postfix) with ESMTP id 9647465E
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 18:09:23 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AmYKAA8dPlFCZpCq/2dsb2JhbABDhkSBIAG8fYF1dIIpAQEEAThABgsLGAkWDwkDAgECAUUTCAEBiAkGDL5ZjxUWgyoDiHKNY4EehEmLDoMqHA
X-IPAS-Result: AmYKAA8dPlFCZpCq/2dsb2JhbABDhkSBIAG8fYF1dIIpAQEEAThABgsLGAkWDwkDAgECAUUTCAEBiAkGDL5ZjxUWgyoDiHKNY4EehEmLDoMqHA
X-IronPort-AV: E=Sophos;i="4.84,825,1355126400"; 
   d="scan'208";a="845956"
Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local)
 ([66.102.144.170])
 by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA;
 11 Mar 2013 11:09:23 -0700
Message-ID: <513E1DD2.7030609@caltel.com>
Date: Mon, 11 Mar 2013 11:09:22 -0700
From: Cody Ritts <cr@caltel.com>
Organization: CalTel
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6;
 rv:17.0) Gecko/20130216 Thunderbird/17.0.3
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com> <513D0E90.5090105@platinum.linux.pl>
In-Reply-To: <513D0E90.5090105@platinum.linux.pl>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 11 Mar 2013 18:09:23 -0000

On 3/10/13 3:52 PM, Adam Nowacki wrote:
> I don't think zfsboot is aware of BSD disklabel (offsets other than 0
> won't boot). Is there any reason you are using BSD disklabel and not two
> partition MBR?

The reason is because every example I saw used labels.
I just tried it, and it does not boot.
I get:

FreeBSD/x86 ZFS enabled bootstrap loader. Revision 1.1
ZFS: can't find pool by guid.


> I also don't think there is any merit in aligning to 1MiB. Most ZFS IOs
> will be aligned to sector size (ashift). Unless ZFS pool is created with
> higher ashift then the 63 sector offset is as good as any.

Aligning to the Erase block:

http://blog.nuclex-games.com/2009/12/aligning-an-ssd-on-linux/
Also I will be forcing ashift to 12 using the gnop trick.

If you still feel that is not necessary, I would be interested in 
knowing why?

Thanks,

Cody

From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 11 18:20:31 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id AB5D47FE
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 18:20:31 +0000 (UTC)
 (envelope-from uros.gruber@gmail.com)
Received: from mail-ia0-x235.google.com (mail-ia0-x235.google.com
 [IPv6:2607:f8b0:4001:c02::235])
 by mx1.freebsd.org (Postfix) with ESMTP id 860D071C
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 18:20:31 +0000 (UTC)
Received: by mail-ia0-f181.google.com with SMTP id w33so3906719iag.26
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 11:20:31 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:date:message-id:subject:from:to
 :content-type; bh=uiSgPQrUHHgLifiVj8B6cgFVzWa5ev0Ra48PcuEVLjM=;
 b=GFKrMg+FyF932q89YzDgIFneeuErMazi4gSnj7gPrulR0qxBSB8Zjk3pXC5CLi3obG
 sMa8oTtDqTnpftFaGanZntdwcYP63NwgLSLjoeFtoGu+PzNZTdhFp4GN+ytYRXhw3p1g
 cFnNpufujDQ7dwHJCMSDnIg5bhG2t88z/UC07NZru15aWvCagEvFeEtdWoOhxk6r2WbI
 NnXl5yLQwfpMwvp9biRvGM3LCUf1a+Xl7ov9EUlif1wIriY2eDK5Oc/QzfZ54by8q1Su
 /Tl+uz4lu7JNeyen04dE1fLgUuFK55HSPAjtoY7Xg9fPSvpit5mD5J+n/+rzkSt7q9dg
 StKw==
MIME-Version: 1.0
X-Received: by 10.42.189.199 with SMTP id df7mr9365292icb.16.1363026029590;
 Mon, 11 Mar 2013 11:20:29 -0700 (PDT)
Received: by 10.64.26.166 with HTTP; Mon, 11 Mar 2013 11:20:29 -0700 (PDT)
Date: Mon, 11 Mar 2013 19:20:29 +0100
Message-ID: <CAHGMo95Giii7ttxbNVEL1BKez5BGhktqBqamuuSd91n2yReiWw@mail.gmail.com>
Subject: zfs hang with umount
From: =?UTF-8?B?VXJvxaEgR3J1YmVy?= <uros.gruber@gmail.com>
To: freebsd-fs@freebsd.org
Content-Type: text/plain; charset=UTF-8
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 11 Mar 2013 18:20:31 -0000

Hi,

I don't know what causes this, but while stopping one of my jails I also
ran zfs inherit mountpoint on that jail's fs. The jail was in the stopping
state at that moment. The process then hung in D state. Later I was doing
some other work on the server, and during a "zfs unmount -f" of that fs
the server crashed. Now every time I try to unmount that fs, the process
hangs. I've managed to send & receive this fs to another fs and mount it
successfully.

Before I reboot and try to destroy this fs, here is output of procstat
-k PID (zfs umount zroot/myfs)

  PID    TID COMM             TDNAME           KSTACK
 3937 100559 zfs              -                mi_switch
sleepq_timedwait _sleep zfs_zget zfs_get_data zil_commit
zfs_freebsd_write VOP_WRITE_APV vnode_pager_generic_putpages
vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush
vm_object_page_clean vm_object_terminate vnode_destroy_vobject
zfs_freebsd_reclaim vgonel vflush

Is there anything I can check, or is this a known bug?

Server is running on 9.1-RELEASE

regards

Uros

From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 11 19:52:39 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id A196CF48
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 19:52:39 +0000 (UTC)
 (envelope-from nowakpl@platinum.linux.pl)
Received: from platinum.linux.pl (platinum.edu.pl [81.161.192.4])
 by mx1.freebsd.org (Postfix) with ESMTP id 664AFC42
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 19:52:39 +0000 (UTC)
Received: by platinum.linux.pl (Postfix, from userid 87)
 id 8520547E11; Mon, 11 Mar 2013 20:52:37 +0100 (CET)
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on platinum.linux.pl
X-Spam-Level: 
X-Spam-Status: No, score=-1.3 required=3.0 tests=ALL_TRUSTED,AWL
 autolearn=disabled version=3.3.2
Received: from [10.255.1.2] (unknown [83.151.38.73])
 by platinum.linux.pl (Postfix) with ESMTPA id 277D847DE8
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 20:52:37 +0100 (CET)
Message-ID: <513E35EC.4080309@platinum.linux.pl>
Date: Mon, 11 Mar 2013 20:52:12 +0100
From: Adam Nowacki <nowakpl@platinum.linux.pl>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:17.0) Gecko/20130215 Thunderbird/17.0.3
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com> <513D0E90.5090105@platinum.linux.pl>
 <513E1DD2.7030609@caltel.com>
In-Reply-To: <513E1DD2.7030609@caltel.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 11 Mar 2013 19:52:39 -0000

On 2013-03-11 19:09, Cody Ritts wrote:
> On 3/10/13 3:52 PM, Adam Nowacki wrote:
>> I don't think zfsboot is aware of BSD disklabel (offsets other than 0
>> won't boot). Is there any reason you are using BSD disklabel and not two
>> partition MBR?
>
> The reason is because every example I saw used labels.
> I just tried it, and it does not boot.
> I get:
>
> FreeBSD/x86 ZFS enabled bootstrap loader. Revision 1.1
> ZFS: can't find pool by guid.

Then I guess zfsloader requires BSD disklabel for MBR (but zfsboot still 
has to be at offset 0 and 1024 sectors relative to MBR partition as it 
doesn't read the BSD disklabel).

>> I also don't think there is any merit in aligning to 1MiB. Most ZFS IOs
>> will be aligned to sector size (ashift). Unless ZFS pool is created with
>> higher ashift then the 63 sector offset is as good as any.
>
> Aligning to the Erase block:
>
> http://blog.nuclex-games.com/2009/12/aligning-an-ssd-on-linux/
> Also I will be forcing ashift to 12 using the gnop trick.
>
> If you still feel that is not necessary, I would be interested in
> knowing why?

The mapping between sectors and physical flash pages/blocks is not fixed 
and will change on each write or internal garbage collection. 
http://www.devwhy.com/blog/2009/8/4/from-write-down-to-the-flash-chips.html 
seems to explain this nicely. Aligning to more than the page size offers no 
benefit, since a page is the biggest contiguous chunk of data that remains 
contiguous all the way to physical flash.

If your SSD has page size of 4KiB then align to that. This is sector 504 
on FreeBSD (due to the multiple of 63 issue). ZFS pool will have to be 
created with ashift=12.
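
For reference, the sector 504 figure falls out of a least-common-multiple
calculation. The sketch below assumes 512-byte sectors and that partition
starts are forced onto multiples of 63 sectors (the "multiple of 63 issue"
mentioned above); the function name is hypothetical.

```python
from math import gcd

# Assumptions: 512-byte sectors, partition starts restricted to multiples of 63.
SECTOR_BYTES = 512
TRACK_SECTORS = 63


def first_aligned_start(page_bytes):
    """Smallest non-zero sector that is a multiple of 63 and page-aligned."""
    sectors_per_page = page_bytes // SECTOR_BYTES
    # Least common multiple of the two sector counts.
    return TRACK_SECTORS * sectors_per_page // gcd(TRACK_SECTORS, sectors_per_page)


print(first_aligned_start(4096))  # 504 (= 63 * 8) for a 4 KiB page
print(first_aligned_start(8192))  # 1008 (= 63 * 16) for an 8 KiB page
```

Because 63 is odd, it shares no factor with any power-of-two page size, so
the first usable start is always 63 times the number of sectors per page.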


From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 11 19:55:49 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id A2E724CC;
 Mon, 11 Mar 2013 19:55:49 +0000 (UTC) (envelope-from marck@rinet.ru)
Received: from woozle.rinet.ru (woozle.rinet.ru [195.54.192.68])
 by mx1.freebsd.org (Postfix) with ESMTP id 322F1CE1;
 Mon, 11 Mar 2013 19:55:48 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
 by woozle.rinet.ru (8.14.5/8.14.5) with ESMTP id r2BJtmj7040409;
 Mon, 11 Mar 2013 23:55:48 +0400 (MSK) (envelope-from marck@rinet.ru)
Date: Mon, 11 Mar 2013 23:55:48 +0400 (MSK)
From: Dmitry Morozovsky <marck@rinet.ru>
To: araujo@FreeBSD.org
Subject: Re: carp on stable/9: is there a way to keep jumbo?
In-Reply-To: <CAOfEmZip1wPzxp2tVyppDsRs_HEncme=2+DjLDyhXW_LswiPxw@mail.gmail.com>
Message-ID: <alpine.BSF.2.00.1303112339390.9408@woozle.rinet.ru>
References: <alpine.BSF.2.00.1303050228050.32868@woozle.rinet.ru>
 <CAOfEmZip1wPzxp2tVyppDsRs_HEncme=2+DjLDyhXW_LswiPxw@mail.gmail.com>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
X-NCC-RegID: ru.rinet
X-OpenPGP-Key-ID: 6B691B03
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7
 (woozle.rinet.ru [0.0.0.0]); Mon, 11 Mar 2013 23:55:48 +0400 (MSK)
Cc: freebsd-fs@FreeBSD.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 11 Mar 2013 19:55:49 -0000

On Tue, 5 Mar 2013, Marcelo Araujo wrote:

> > yes, I know glebius@ overhauled carp in -current, but I'm a bit nervous to
> > deploy bleeding edge system on a NAS/SAN ;)
> >
> > So, my question is about current state of carp in stable/9: building HA
> > pair I
> > found that carp interfaces lose jumbo capabilities:
> >
> >
> Hello Dmitry,
> 
> I made a patch for 9.1-RELEASE, it is totally based on glebius@ work, or
> partially :). I'm using it nowadays and it just works pretty fine for me.
> 
> I didn't test with JUMBO frame, but you can give a try and let us know if
> it works or not.
> 
> PATCH: http://people.freebsd.org/~araujo/carpdev/

I've finally managed to apply this :)

It seems your patch is in places cluttered with $FreeBSD$ changes, which leads to 
4 .rej's for me (none except ./sys/netinet/ip_carp.c.rej is significant, 
but they may cause problems in future merging).

Only the buildworld tests have finished for me so far; more to test later
and/or tomorrow.

Thank you!

-- 
Sincerely,
D.Marck                                     [DM5020, MCK-RIPE, DM3-RIPN]
[ FreeBSD committer:                                 marck@FreeBSD.org ]
------------------------------------------------------------------------
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- marck@rinet.ru ***
------------------------------------------------------------------------

From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 11 21:22:32 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 9514F940
 for <freebsd-fs@FreeBSD.org>; Mon, 11 Mar 2013 21:22:32 +0000 (UTC)
 (envelope-from avg@FreeBSD.org)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
 by mx1.freebsd.org (Postfix) with ESMTP id E80912A0
 for <freebsd-fs@FreeBSD.org>; Mon, 11 Mar 2013 21:22:31 +0000 (UTC)
Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua
 [212.40.38.100])
 by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id XAA08615;
 Mon, 11 Mar 2013 23:22:22 +0200 (EET) (envelope-from avg@FreeBSD.org)
Received: from localhost ([127.0.0.1])
 by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD))
 id 1UFAAo-0005M9-6s; Mon, 11 Mar 2013 23:22:22 +0200
Message-ID: <513E4B0B.3090807@FreeBSD.org>
Date: Mon, 11 Mar 2013 23:22:19 +0200
From: Andriy Gapon <avg@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:17.0) Gecko/20130220 Thunderbird/17.0.3
MIME-Version: 1.0
To: =?windows-1252?Q?Uro=9A_Gruber?= <uros.gruber@gmail.com>
Subject: Re: zfs hang with umount
References: <CAHGMo95Giii7ttxbNVEL1BKez5BGhktqBqamuuSd91n2yReiWw@mail.gmail.com>
In-Reply-To: <CAHGMo95Giii7ttxbNVEL1BKez5BGhktqBqamuuSd91n2yReiWw@mail.gmail.com>
X-Enigmail-Version: 1.5.1
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 8bit
Cc: freebsd-fs@FreeBSD.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 11 Mar 2013 21:22:32 -0000

on 11/03/2013 20:20 Uroš Gruber said the following:
> Hi,
> 
> I don't know what causes this, but while stopping one of my jails I also
> ran zfs inherit mountpoint on that jail's fs. The jail was in the stopping
> state at that moment. The process then hung in D state. Later I was doing
> some other work on the server, and during a "zfs unmount -f" of that fs
> the server crashed. Now every time I try to unmount that fs, the process
> hangs. I've managed to send & receive this fs to another fs and mount it
> successfully.
> 
> Before I reboot and try to destroy this fs, here is output of procstat
> -k PID (zfs umount zroot/myfs)
> 
>   PID    TID COMM             TDNAME           KSTACK
>  3937 100559 zfs              -                mi_switch
> sleepq_timedwait _sleep zfs_zget zfs_get_data zil_commit
> zfs_freebsd_write VOP_WRITE_APV vnode_pager_generic_putpages
> vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush
> vm_object_page_clean vm_object_terminate vnode_destroy_vobject
> zfs_freebsd_reclaim vgonel vflush
> 
> Is there anything I can check, or is this a known bug?
> 
> Server is running on 9.1-RELEASE

This should be fixed in stable/9.

-- 
Andriy Gapon

From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 11 21:31:50 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 5EE81BBA
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 21:31:50 +0000 (UTC)
 (envelope-from cr@caltel.com)
Received: from mail2.caltel.com (mail2.caltel.com [66.102.145.6])
 by mx1.freebsd.org (Postfix) with ESMTP id 4319C320
 for <freebsd-fs@freebsd.org>; Mon, 11 Mar 2013 21:31:49 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Ap8EAP5LPlFCZpCq/2dsb2JhbABDxGaBdXSCKQEBBThAEQsYCRYPCQMCAQIBRRMIAQGIDwy/YI8VFoMqA4hyjWOBHoRJiw6DKhw
X-IPAS-Result: Ap8EAP5LPlFCZpCq/2dsb2JhbABDxGaBdXSCKQEBBThAEQsYCRYPCQMCAQIBRRMIAQGIDwy/YI8VFoMqA4hyjWOBHoRJiw6DKhw
X-IronPort-AV: E=Sophos;i="4.84,825,1355126400"; 
   d="scan'208";a="853759"
Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local)
 ([66.102.144.170])
 by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA;
 11 Mar 2013 14:31:50 -0700
Message-ID: <513E4D45.1020804@caltel.com>
Date: Mon, 11 Mar 2013 14:31:49 -0700
From: Cody Ritts <cr@caltel.com>
Organization: CalTel
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6;
 rv:17.0) Gecko/20130307 Thunderbird/17.0.4
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com> <513D0E90.5090105@platinum.linux.pl>
 <513E1DD2.7030609@caltel.com> <513E35EC.4080309@platinum.linux.pl>
In-Reply-To: <513E35EC.4080309@platinum.linux.pl>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 11 Mar 2013 21:31:50 -0000

On 3/11/13 12:52 PM, Adam Nowacki wrote:
>>> I also don't think there is any merit in aligning to 1MiB. Most ZFS IOs
>>> will be aligned to sector size (ashift). Unless ZFS pool is created with
>>> higher ashift then the 63 sector offset is as good as any.
>>
>> Aligning to the Erase block:
>>
>> http://blog.nuclex-games.com/2009/12/aligning-an-ssd-on-linux/
>> Also I will be forcing ashift to 12 using the gnop trick.
>>
>> If you still feel that is not necessary, I would be interested in
>> knowing why?
>
> The mapping between sectors and physical flash pages/blocks is not fixed
> and will change on each write or internal garbage collection.
> http://www.devwhy.com/blog/2009/8/4/from-write-down-to-the-flash-chips.html
> seems to explain this nicely. Aligning to more than the page size offers no
> benefit, since a page is the biggest contiguous chunk of data that remains
> contiguous all the way to physical flash.
>
> If your SSD has page size of 4KiB then align to that. This is sector 504
> on FreeBSD (due to the multiple of 63 issue). ZFS pool will have to be
> created with ashift=12.

Hmmmm...  I see the point you are making, and there is so much that I 
don't know about ZFS, SSDs and ATA.

There is a commenter on this pull request who certainly seems to agree with you:
https://github.com/zfsonlinux/zfs/pull/924

But the vast majority of pages claim that aligning the partition 
boundaries to multiples of the erase block is really important.  (Not 
that more pages makes it correct.)

But if you are right, and aligning to the erase block is pointless 
because the SSD doesn't care, then it should not hurt if I do add an 
offset, other than that I will lose a few MB of space.

It is certainly a good point you make but I just don't have the time to 
learn everything I need to know to make an educated decision for myself. 
  So if I can satisfy the majority with no detriment, I will just do 
that so I can get this thing into production.

Thanks

Cody



From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 11 21:43:07 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 020F8183;
 Mon, 11 Mar 2013 21:43:07 +0000 (UTC)
 (envelope-from uros.gruber@gmail.com)
Received: from mail-ie0-x233.google.com (mail-ie0-x233.google.com
 [IPv6:2607:f8b0:4001:c03::233])
 by mx1.freebsd.org (Postfix) with ESMTP id C2B413CA;
 Mon, 11 Mar 2013 21:43:06 +0000 (UTC)
Received: by mail-ie0-f179.google.com with SMTP id k11so5403811iea.38
 for <multiple recipients>; Mon, 11 Mar 2013 14:43:06 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type:content-transfer-encoding;
 bh=fH6dR7rX5IiZs/qX4LkoHNlucOp2Ohd/7CvgwtYXdhs=;
 b=EIowviBv7PunTSDYT8MCUiDC2Pdj01YncbPf4JpOb/QcRMshcPvv2MWOdsyKZTSzKJ
 S0YagyS46IYjxzIRtWhH9+xzvxaU0O5xNuEewv79PD09nimT6S0k5X5fl31ktipQZqTa
 fW5QkusJ9qXMaaX9OMEgP9MhnYxMdMKfp0aLtiMNMXxEQm299bUctFY4Ps1tH0MV+ZCg
 VeabmDw052yV0Uwj2zVAPT9SjEWO8pO+zQFpjy26I9INw4jhvbu0bSo64qQ2n05pqNk4
 IVjCDTMgH7ICDaUfP+oSl/B6keGNoZwk1QFRjf/F8nwcP0RJjotwVn4/+IREfk6qm72a
 AyWA==
MIME-Version: 1.0
X-Received: by 10.50.151.179 with SMTP id ur19mr8954487igb.79.1363038186181;
 Mon, 11 Mar 2013 14:43:06 -0700 (PDT)
Received: by 10.64.26.166 with HTTP; Mon, 11 Mar 2013 14:43:06 -0700 (PDT)
In-Reply-To: <513E4B0B.3090807@FreeBSD.org>
References: <CAHGMo95Giii7ttxbNVEL1BKez5BGhktqBqamuuSd91n2yReiWw@mail.gmail.com>
 <513E4B0B.3090807@FreeBSD.org>
Date: Mon, 11 Mar 2013 22:43:06 +0100
Message-ID: <CAHGMo95Sx_5EEvj9-8m_Eck0D16ricbLrfPTeMbUV0=xwP4zVg@mail.gmail.com>
Subject: Re: zfs hang with umount
From: =?UTF-8?B?VXJvxaEgR3J1YmVy?= <uros.gruber@gmail.com>
To: Andriy Gapon <avg@freebsd.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 11 Mar 2013 21:43:07 -0000

Hi Andriy,

Can you tell me more about this, or what the cause was, so I can
avoid it until 9.2 is released? I don't want to jump to 9-STABLE
and maybe run into more problems with other stuff.

regards

Uros

On 11 March 2013 22:22, Andriy Gapon <avg@freebsd.org> wrote:
> on 11/03/2013 20:20 Uroš Gruber said the following:
>> Hi,
>>
>> I don't know what causes this, but while stopping one of my jails I also
>> ran zfs inherit mountpoint on that jail's fs. The jail was in the stopping
>> state at that moment. The process then hung in D state. Later I was doing
>> some other work on the server, and during a "zfs unmount -f" of that fs
>> the server crashed. Now every time I try to unmount that fs, the process
>> hangs. I've managed to send & receive this fs to another fs and mount it
>> successfully.
>>
>> Before I reboot and try to destroy this fs, here is output of procstat
>> -k PID (zfs umount zroot/myfs)
>>
>>   PID    TID COMM             TDNAME           KSTACK
>>  3937 100559 zfs              -                mi_switch
>> sleepq_timedwait _sleep zfs_zget zfs_get_data zil_commit
>> zfs_freebsd_write VOP_WRITE_APV vnode_pager_generic_putpages
>> vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush
>> vm_object_page_clean vm_object_terminate vnode_destroy_vobject
>> zfs_freebsd_reclaim vgonel vflush
>>
>> Is there anything I can check, or is this a known bug?
>>
>> Server is running on 9.1-RELEASE
>
> This should be fixed in stable/9.
>
> --
> Andriy Gapon

From owner-freebsd-fs@FreeBSD.ORG  Tue Mar 12 01:45:12 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 7AA797CC
 for <freebsd-fs@freebsd.org>; Tue, 12 Mar 2013 01:45:12 +0000 (UTC)
 (envelope-from bryan-lists@shatow.net)
Received: from secure.xzibition.com (secure.xzibition.com [173.160.118.92])
 by mx1.freebsd.org (Postfix) with ESMTP id 1ACA3EF3
 for <freebsd-fs@freebsd.org>; Tue, 12 Mar 2013 01:45:11 +0000 (UTC)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=shatow.net; h=message-id
 :date:from:mime-version:to:cc:subject:references:in-reply-to
 :content-type:content-transfer-encoding; q=dns; s=sweb; b=dd3SC0
 0ldUj9fu7yVJvyBYNsmXExBU/QVFm52Sogjt9b8b3Mz3Kok/0qQXuWxNJKjieBiD
 tbxkJpfRtgc/jmB4oaz4TD6MDFn5gxQkjxjBBK4sJ8MP10hu8cfP/8IJSwl/Ym7u
 GklMSx+CxMaLw7Deu8bP0hQY5pxMk9ver98nY=
DKIM-Signature: v=1; a=rsa-sha256; c=simple; d=shatow.net; h=message-id
 :date:from:mime-version:to:cc:subject:references:in-reply-to
 :content-type:content-transfer-encoding; s=sweb; bh=zTkqveSmzYP+
 g9DgEtIio0jBZql2HEJjltPTJX9nQV8=; b=pv3QGvpnTIeNw+lXJXsk34wKXJnz
 bvvohOCEbdNxwo75c+ykqmiP0Ge65rPnoqrVMcUVsiPNdjff4llIouOWJ2QgS2NE
 3lYOuqJCNh1i36cszir1Wr9CGegJdowz6+G/A7CdHlfV1wHw/x4mb2QZOaYFKvU0
 BIqxuy1kxuccppg=
Received: (qmail 38312 invoked from network); 11 Mar 2013 20:38:29 -0500
Received: from unknown (HELO ?10.10.0.24?) (bryan@shatow.net@10.10.0.24)
 by sweb.xzibition.com with ESMTPA; 11 Mar 2013 20:38:29 -0500
Message-ID: <513E8711.2080909@shatow.net>
Date: Mon, 11 Mar 2013 20:38:25 -0500
From: Bryan Drewery <bryan-lists@shatow.net>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:17.0) Gecko/20130215 Thunderbird/17.0.3
MIME-Version: 1.0
To: =?UTF-8?B?VXJvxaEgR3J1YmVy?= <uros.gruber@gmail.com>
Subject: Re: zfs hang with umount
References: <CAHGMo95Giii7ttxbNVEL1BKez5BGhktqBqamuuSd91n2yReiWw@mail.gmail.com>
 <513E4B0B.3090807@FreeBSD.org>
 <CAHGMo95Sx_5EEvj9-8m_Eck0D16ricbLrfPTeMbUV0=xwP4zVg@mail.gmail.com>
In-Reply-To: <CAHGMo95Sx_5EEvj9-8m_Eck0D16ricbLrfPTeMbUV0=xwP4zVg@mail.gmail.com>
X-Enigmail-Version: 1.5.1
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 12 Mar 2013 01:45:12 -0000

On 3/11/2013 4:43 PM, Uroš Gruber wrote:
> Hi Andriy,
> 
> Can you tell me more about this, or what the cause was, so I can
> avoid it until 9.2 is released? I don't want to jump to 9-STABLE
> and maybe run into more problems with other stuff.
> 


This can very easily happen with `umount -f`. My understanding is it
just takes a file descriptor being open to an unlinked file. Best to
avoid -f if possible.

If you run into one of these deadlocks you'll need to restart, possibly
with reboot(8).
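
To make the condition concrete, here is a minimal userland sketch of "a
file descriptor open to an unlinked file". It only shows how such a file
comes to exist on a POSIX system; it does not reproduce the deadlock, and
it runs against a temporary file rather than a ZFS dataset.

```python
import os
import tempfile

# Create a scratch file and keep a descriptor open on it.
fd, path = tempfile.mkstemp()

# Remove the only name; the descriptor stays valid on POSIX systems.
os.unlink(path)
still_listed = os.path.exists(path)  # the name is gone from the directory

# The file's data remains reachable through the open descriptor.
os.write(fd, b"still writable")
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 64)

print(still_listed)
print(data)

# The storage is released only once the last descriptor is closed.
os.close(fd)
```

A process holding a descriptor like this keeps the file system busy, which
is typically why a plain umount is refused and -f looks tempting.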

> regards
> 
> Uros
> 
> On 11 March 2013 22:22, Andriy Gapon <avg@freebsd.org> wrote:
>> on 11/03/2013 20:20 Uroš Gruber said the following:
>>> Hi,
>>>
>>> I don't know what causes this, but while stopping one of my jails I also
>>> ran zfs inherit mountpoint on that jail's fs. The jail was in the stopping
>>> state at that moment. The process then hung in D state. Later I was doing
>>> some other work on the server, and during a "zfs unmount -f" of that fs
>>> the server crashed. Now every time I try to unmount that fs, the process
>>> hangs. I've managed to send & receive this fs to another fs and mount it
>>> successfully.
>>>
>>> Before I reboot and try to destroy this fs, here is output of procstat
>>> -k PID (zfs umount zroot/myfs)
>>>
>>>   PID    TID COMM             TDNAME           KSTACK
>>>  3937 100559 zfs              -                mi_switch
>>> sleepq_timedwait _sleep zfs_zget zfs_get_data zil_commit
>>> zfs_freebsd_write VOP_WRITE_APV vnode_pager_generic_putpages
>>> vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush
>>> vm_object_page_clean vm_object_terminate vnode_destroy_vobject
>>> zfs_freebsd_reclaim vgonel vflush
>>>
>>> Is there anything I can check, or is this a known bug?
>>>
>>> Server is running on 9.1-RELEASE
>>
>> This should be fixed in stable/9.
>>
>> --
>> Andriy Gapon


-- 
Regards,
Bryan Drewery
bdrewery@freenode/EFNet

From owner-freebsd-fs@FreeBSD.ORG  Tue Mar 12 02:10:39 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 21F83B76
 for <freebsd-fs@freebsd.org>; Tue, 12 Mar 2013 02:10:39 +0000 (UTC)
 (envelope-from lstewart@freebsd.org)
Received: from lauren.room52.net (lauren.room52.net [210.50.193.198])
 by mx1.freebsd.org (Postfix) with ESMTP id 860D4FA0
 for <freebsd-fs@freebsd.org>; Tue, 12 Mar 2013 02:10:38 +0000 (UTC)
Received: from lstewart.caia.swin.edu.au (lstewart.caia.swin.edu.au
 [136.186.229.95])
 by lauren.room52.net (Postfix) with ESMTPSA id 7A1627E81E
 for <freebsd-fs@freebsd.org>; Tue, 12 Mar 2013 13:10:30 +1100 (EST)
Message-ID: <513E8E95.6010802@freebsd.org>
Date: Tue, 12 Mar 2013 13:10:29 +1100
From: Lawrence Stewart <lstewart@freebsd.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:17.0) Gecko/20130213 Thunderbird/17.0.2
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: ZFS triggered 9-STABLE r246646 panic "vdrop: holdcnt 0"
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=0.0 required=5.0 tests=UNPARSEABLE_RELAY
 autolearn=unavailable version=3.3.2
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on lauren.room52.net
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 12 Mar 2013 02:10:39 -0000

Hi all,

I got this panic yesterday. I haven't seen it before (or since), but I
have the crashdump and kernel here if there's additional information I
can provide that would be useful in finding the cause.

The machine runs ZFS exclusively and was under quite heavy CPU and IO
load at the time of the crash as I was compiling in a VirtualBox VM and
on the host itself, as well as running a full KDE desktop environment.
I'm fairly certain the machine was not swapping at the time of the crash.

lstewart@lstewart> uname -a
FreeBSD lstewart 9.1-STABLE FreeBSD 9.1-STABLE #8 r246646M: Mon Feb 11
14:57:13 EST 2013
root@lstewart:/usr/obj/usr/src/sys/LSTEWART-DESKTOP  amd64

lstewart@lstewart> sudo kgdb /boot/kernel/kernel /var/crash/vmcore.0

[...]

(kgdb) bt
#0  doadump (textdump=<value optimized out>) at pcpu.h:229
#1  0xffffffff808e5824 in kern_reboot (howto=260) at
/usr/src/sys/kern/kern_shutdown.c:448
#2  0xffffffff808e5d27 in panic (fmt=0x1 <Address 0x1 out of bounds>) at
/usr/src/sys/kern/kern_shutdown.c:636
#3  0xffffffff8097a71e in vdropl (vp=<value optimized out>) at
/usr/src/sys/kern/vfs_subr.c:2465
#4  0xffffffff80b4da2b in vm_page_alloc (object=0xffffffff8132c000,
pindex=143696, req=32) at /usr/src/sys/vm/vm_page.c:1569
#5  0xffffffff80b3f312 in kmem_back (map=0xfffffe00020000e8,
addr=18446743524542296064, size=131072, flags=705200752)
    at /usr/src/sys/vm/vm_kern.c:361
#6  0xffffffff80b3fc8b in kmem_malloc (map=0xfffffe00020000e8,
size=131072, flags=2) at /usr/src/sys/vm/vm_kern.c:312
#7  0xffffffff80b3685a in uma_large_malloc (size=131072, wait=2) at
/usr/src/sys/vm/uma_core.c:3068
#8  0xffffffff808d0539 in malloc (size=131072, mtp=0xffffffff817f4ce0,
flags=2) at /usr/src/sys/kern/kern_malloc.c:492
#9  0xffffffff816696e2 in zio_write_bp_init (zio=0xfffffe016b70a000)
    at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1060
#10 0xffffffff81668e23 in zio_execute (zio=0xfffffe016b70a000)
    at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1256
#11 0xffffffff80928474 in taskqueue_run_locked
(queue=0xfffffe0010484280) at /usr/src/sys/kern/subr_taskqueue.c:312
#12 0xffffffff80929426 in taskqueue_thread_loop (arg=<value optimized
out>) at /usr/src/sys/kern/subr_taskqueue.c:501
#13 0xffffffff808b67af in fork_exit (callout=0xffffffff809293e0
<taskqueue_thread_loop>, arg=0xfffffe00103869d0,
    frame=0xffffff823df70b00) at /usr/src/sys/kern/kern_fork.c:988
#14 0xffffffff80c4ddee in fork_trampoline () at
/usr/src/sys/amd64/amd64/exception.S:602
#15 0x0000000000000000 in ?? ()

(kgdb) frame 4
#4  0xffffffff80b4da2b in vm_page_alloc (object=0xffffffff8132c000,
pindex=143696, req=32) at /usr/src/sys/vm/vm_page.c:1569
1569                    vdrop(vp);

(kgdb) p *vp
$3 = {v_type = VREG, v_tag = 0xffffffff816f7842 "zfs", v_op =
0xffffffff816ff7a0, v_data = 0xfffffe00784e42e0,
  v_mount = 0xfffffe0010890000, v_nmntvnodes = {tqe_next =
0xfffffe00a95281f8, tqe_prev = 0xfffffe0091d09220}, v_un = {
    vu_mount = 0x0, vu_socket = 0x0, vu_cdev = 0x0, vu_fifoinfo = 0x0},
v_hashlist = {le_next = 0x0, le_prev = 0x0},
  v_hash = 19896209, v_cache_src = {lh_first = 0x0}, v_cache_dst =
{tqh_first = 0x0, tqh_last = 0xfffffe012f979258},
  v_cache_dd = 0x0, v_cstart = 0, v_lasta = 0, v_lastw = 0, v_clen = 0,
v_lock = {lock_object = {
      lo_name = 0xffffffff816f7842 "zfs", lo_flags = 91947008, lo_data =
0, lo_witness = 0x0}, lk_lock = 1, lk_exslpfail = 0,
    lk_timo = 51, lk_pri = 96}, v_interlock = {lock_object = {lo_name =
0xffffffff80ec2790 "vnode interlock", lo_flags = 16973824,
      lo_data = 0, lo_witness = 0x0}, mtx_lock = 18446741874964127744},
v_vnlock = 0xfffffe012f979290, v_holdcnt = 0,
  v_usecount = 0, v_iflag = 256, v_vflag = 0, v_writecount = 0,
v_actfreelist = {tqe_next = 0xfffffe00a95281f8,
    tqe_prev = 0xfffffe0091d09308}, v_bufobj = {bo_mtx = {lock_object =
{lo_name = 0xffffffff80ec27a0 "bufobj interlock",
        lo_flags = 16973824, lo_data = 0, lo_witness = 0x0}, mtx_lock =
4}, bo_clean = {bv_hd = {tqh_first = 0x0,
        tqh_last = 0xfffffe012f979338}, bv_root = 0x0, bv_cnt = 0},
bo_dirty = {bv_hd = {tqh_first = 0x0,
        tqh_last = 0xfffffe012f979358}, bv_root = 0x0, bv_cnt = 0},
bo_numoutput = 0, bo_flag = 0, bo_ops = 0xffffffff81253920,
    bo_bsize = 131072, bo_object = 0xfffffe0070ba5910, bo_synclist =
{le_next = 0x0, le_prev = 0x0},
    bo_private = 0xfffffe012f9791f8, __bo_vnode = 0xfffffe012f9791f8},
v_pollinfo = 0x0, v_label = 0x0, v_lockf = 0x0, v_rl = {
    rl_waiters = {tqh_first = 0x0, tqh_last = 0xfffffe012f9793d8},
rl_currdep = 0x0}}

(kgdb) p *object
$6 = {mtx = {lock_object = {lo_name = 0xffffffff80ee61ad "vm object",
lo_flags = 21168128, lo_data = 0, lo_witness = 0x0},
    mtx_lock = 18446741874964127744}, object_list = {tqe_next =
0xffffffff8132bcc0, tqe_prev = 0xffffffff8132bf20}, shadow_head = {
    lh_first = 0x0}, shadow_list = {le_next = 0x0, le_prev = 0x0}, memq
= {tqh_first = 0xfffffe021eebb880,
    tqh_last = 0xfffffe022a0882f8}, root = 0xfffffe022a0882e8, size =
134217727, generation = 1, ref_count = 2659,
  shadow_count = 0, memattr = 6 '\006', type = 4 '\004', flags = 4096,
pg_color = 0, pad1 = 0, resident_page_count = 124507,
  backing_object = 0x0, backing_object_offset = 0, pager_object_list =
{tqe_next = 0x0, tqe_prev = 0x0}, rvq = {
    lh_first = 0xfffffe021df8a5c0}, cache = 0x0, handle = 0x0, un_pager
= {vnp = {vnp_size = 0, writemappings = 0}, devp = {
      devp_pglist = {tqh_first = 0x0, tqh_last = 0x0}, ops = 0x0}, sgp =
{sgp_pglist = {tqh_first = 0x0, tqh_last = 0x0}}, swp = {
      swp_bcount = 0}}, cred = 0x0, charge = 0, paging_in_progress = 0}


Cheers,
Lawrence

From owner-freebsd-fs@FreeBSD.ORG  Tue Mar 12 06:53:39 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@smarthost.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 42EF6162;
 Tue, 12 Mar 2013 06:53:39 +0000 (UTC)
 (envelope-from linimon@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
 [IPv6:2001:1900:2254:206c::16:87])
 by mx1.freebsd.org (Postfix) with ESMTP id 1D38CB5C;
 Tue, 12 Mar 2013 06:53:39 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
 by freefall.freebsd.org (8.14.6/8.14.6) with ESMTP id r2C6rc7U068066;
 Tue, 12 Mar 2013 06:53:38 GMT
 (envelope-from linimon@freefall.freebsd.org)
Received: (from linimon@localhost)
 by freefall.freebsd.org (8.14.6/8.14.6/Submit) id r2C6rcWH068065;
 Tue, 12 Mar 2013 06:53:38 GMT (envelope-from linimon)
Date: Tue, 12 Mar 2013 06:53:38 GMT
Message-Id: <201303120653.r2C6rcWH068065@freefall.freebsd.org>
To: linimon@FreeBSD.org, freebsd-bugs@FreeBSD.org, freebsd-fs@FreeBSD.org
From: linimon@FreeBSD.org
Subject: Re: kern/176857: [softupdates] [panic] 9.1-RELEASE/amd64/GENERIC
 panic in softdepflush/remove_from_journal
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 12 Mar 2013 06:53:39 -0000

Old Synopsis: [panic] [suj] 9.1-RELEASE/amd64/GENERIC panic in softdepflush/remove_from_journal
New Synopsis: [softupdates] [panic] 9.1-RELEASE/amd64/GENERIC panic in softdepflush/remove_from_journal

Responsible-Changed-From-To: freebsd-bugs->freebsd-fs
Responsible-Changed-By: linimon
Responsible-Changed-When: Tue Mar 12 06:52:30 UTC 2013
Responsible-Changed-Why: 
Change the tag for consistency, and assign.

http://www.freebsd.org/cgi/query-pr.cgi?pr=176857

From owner-freebsd-fs@FreeBSD.ORG  Tue Mar 12 09:33:47 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 92174901
 for <freebsd-fs@freebsd.org>; Tue, 12 Mar 2013 09:33:47 +0000 (UTC)
 (envelope-from peter.maloney@brockmann-consult.de)
Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.17.10])
 by mx1.freebsd.org (Postfix) with ESMTP id 13D4A677
 for <freebsd-fs@freebsd.org>; Tue, 12 Mar 2013 09:33:47 +0000 (UTC)
Received: from [10.3.0.26] ([141.4.215.32])
 by mrelayeu.kundenserver.de (node=mreu4) with ESMTP (Nemesis)
 id 0M2CHo-1V4pkU08m4-00s8Es; Tue, 12 Mar 2013 10:33:46 +0100
Message-ID: <513EF679.2080402@brockmann-consult.de>
Date: Tue, 12 Mar 2013 10:33:45 +0100
From: Peter Maloney <peter.maloney@brockmann-consult.de>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:17.0) Gecko/17.0 Thunderbird/17.0
MIME-Version: 1.0
To: Cody Ritts <cr@caltel.com>
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com>
In-Reply-To: <513C1629.50501@caltel.com>
X-Enigmail-Version: 1.5
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Provags-ID: V02:K0:kmknN+RWGhdQeaB54jmzQVrRkNXnDp2QGvsdMJSOmxZ
 8jysFi1+xfsUx1Qmm2trYvsI9jnIjEny0Bd1z5RPXB5mfNS75J
 vGOqJ7jmvy71ypPR6Cp8Cjx1ucLYrG24uGX9WNv4WWxAJcrXU3
 KI8bx6E2Pd+nFsdefHRdLSwt+PLi7aYDWKIjSV2I32kFjVWWu3
 l41lWySaPKEt4jSYuAnrz7WNTxDz5xb8SqFKMP2H7rhfn9Wshe
 fTbv1ce5nWkGppmoU5JLqQwuODdAw6haly5nj54irQNr7hUv/E
 zNK7lrndHYu3pUltGz3R7vAbIPyn+GfkJ37E+aTpEEkX1LPza3
 YipuMvgCLGj8ESTF7fJnwCxs1DiKqST5U3VED5Vo7
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 12 Mar 2013 09:33:47 -0000

On 2013-03-10 06:12, Cody Ritts wrote:
> I think I remember reading that freebsd-zfs had to be the first slice (I
> cannot remember where I read that).  And it apparently does not think
> an offset is funny.

For the gptzfsboot boot loader (and I don't know which others), the
bootable zfs slice has to be in the same pool as the first zfs slice found.

Or in other words, the first ZFS slice found by the bootloader must be
the cache, log, or data vdev of the bootable pool.

I learned this the hard way ;)  www.freebsd.org/cgi/query-pr.cgi?pr=160706

eg.

this would fail:
slice 1 - freebsd-boot
slice 2 - zfs /tank
slice 3 - zfs root with /boot

this would also fail:
slice 1 - freebsd-boot
slice 2 - zfs /tank L2ARC cache
slice 3 - zfs root with /boot

this would probably work (fits the rule, but I didn't test it):
slice 1 - freebsd-boot
slice 2 - zfs root L2ARC cache
slice 3 - zfs /tank
slice 4 - zfs root with /boot

and this will definitely work:
slice 1 - freebsd-boot
slice 2 - zfs root with /boot
slice 3 - zfs /tank

Above examples are with gptzfsboot loader; not sure if you need a
freebsd-boot for others.
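
The rule can be written down as a small check. This is only a sketch of
the rule as I stated it, not of what gptzfsboot actually does
internally. The tuple layout, the function names and the pool name
"root" are made up for illustration.

```python
def first_zfs_pool(slices):
    """Return the pool of the first ZFS slice, or None if there is none."""
    for kind, pool in slices:
        if kind == "zfs":
            return pool
    return None


def fits_rule(slices, boot_pool):
    """True if the first ZFS slice belongs to the pool we want to boot from.

    The vdev role (data, cache or log) is deliberately ignored, because
    the rule above says any of them counts.
    """
    return first_zfs_pool(slices) == boot_pool


BOOT = ("freebsd-boot", None)

# The four layouts from the examples above, as (type, pool) pairs.
fails_tank_first = [BOOT, ("zfs", "tank"), ("zfs", "root")]
fails_tank_cache_first = [BOOT, ("zfs", "tank"), ("zfs", "root")]
probably_works = [BOOT, ("zfs", "root"), ("zfs", "tank"), ("zfs", "root")]
definitely_works = [BOOT, ("zfs", "root"), ("zfs", "tank")]

results = [
    fits_rule(layout, "root")
    for layout in (fails_tank_first, fails_tank_cache_first,
                   probably_works, definitely_works)
]
```

The third layout passes the check only because the rule says so; as
noted above, I did not test it on real hardware.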

And FYI in possibly all BIOS machines (non-EFI), the boot slice probably
needs to be before a certain sector... not sure which number, maybe
2.2TB/2.0TiB. I always just put it first. On my FreeBSD zfs machines, I
put it at sector 34 which is badly aligned but means the next regular
one can start at 2048, which saves me a whole megabyte!


-- 

--------------------------------------------
Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.maloney@brockmann-consult.de
Internet: http://www.brockmann-consult.de
--------------------------------------------


From owner-freebsd-fs@FreeBSD.ORG  Tue Mar 12 10:25:40 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 205BF58C
 for <freebsd-fs@freebsd.org>; Tue, 12 Mar 2013 10:25:40 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail02.syd.optusnet.com.au (mail02.syd.optusnet.com.au
 [211.29.132.183]) by mx1.freebsd.org (Postfix) with ESMTP id 90BE68FA
 for <freebsd-fs@freebsd.org>; Tue, 12 Mar 2013 10:25:39 +0000 (UTC)
Received: from c211-30-173-106.carlnfd1.nsw.optusnet.com.au
 (c211-30-173-106.carlnfd1.nsw.optusnet.com.au [211.30.173.106])
 by mail02.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r2CAPPvi019168
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Tue, 12 Mar 2013 21:25:27 +1100
Date: Tue, 12 Mar 2013 21:25:25 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Cody Ritts <cr@caltel.com>
Subject: Re: Aligning MBR for ZFS boot help
In-Reply-To: <513E1208.5020804@caltel.com>
Message-ID: <20130312203745.A1130@besplex.bde.org>
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com> <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
 <513CE369.4030303@caltel.com> <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
 <1362951595.99445.2.camel@btw.pki2.com>
 <alpine.BSF.2.00.1303101807550.8481@wonkity.com>
 <CABXB=RTt-j0SGxktWMfLcgLAEN6Vi+f=psBuN0jQaJthk_3cbw@mail.gmail.com>
 <513E1208.5020804@caltel.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.0 cv=bNdOu4CZ c=1 sm=1 a=u3bVZBOdoLwA:10
 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=cUKNXEIY390A:10
 a=6I5d2MoRAAAA:8 a=d1Asgjw0WGCMbu39xngA:9 a=CjuIK1q_8ugA:10
 a=3JYNrmlC3cAA:10 a=ApFyF_lCYB5S5Om7:21 a=2tyWdEw6j5jF6zq2:21
 a=TEtd8y5WR3g2ypngnwZWYw==:117
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 12 Mar 2013 10:25:40 -0000

On Mon, 11 Mar 2013, Cody Ritts wrote:

> Update --
>
> fdisk WILL allow you to align without regards to drive geometry
>
> It can only be done in interactive mode:
> http://lists.freebsd.org/pipermail/freebsd-geom/2011-May/004780.html

It can be set in all modes.  At least according to the man page:

@ CONFIGURATION FILE
@      When the -f option is given, a disk's slice table can be written using
@      values from a configfile.  The syntax of this file is very simple; each
@      line is either a comment or a specification, as follows:
@ 
@      # comment ...
@              Lines beginning with a # are comments and are ignored.
@ 
@      g spec1 spec2 spec3
@              Set the BIOS geometry used in slice calculations.  There must be
@              three values specified, with a letter preceding each number:
@ 
@              cnum    Set the number of cylinders to num.
@ 
@              hnum    Set the number of heads to num.
@ 
@              snum    Set the number of sectors/track to num.
@ 
@              These specs can occur in any order, as the leading letter deter-
@              mines which value is which; however, all three must be specified.
@ 
@              This line must occur before any lines that specify slice informa-
@              tion.
@ 
@              It is an error if the following is not true:
@ 
@                    1 <= number of cylinders
@                    1 <= number of heads <= 256

Using 256 risks stepping on BIOS bugs or bugs in other OS's.

The default for all large disks is 255.  But this should only be used if
you aren't trying to align things to a power of 2 boundary, since it is
not a power of 2, so using it makes the calculations more complicated
and/or requires skipping something like 8 full fake cylinders of size
63*255 sectors each to reach a fake cylinder starting on a 4K boundary.
(This assumes a sector size of 512.)  The old SCSI default should be used.
IIRC, it is 32 sectors and 64 heads.  32 is the largest power of 2 less
than the limit of 63, and 64 is a large though not maximal power of 2
less than the limit of 256.  This makes the fake cylinder size 1 MB.
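
The arithmetic behind the last paragraph can be checked with a short
Python sketch. It assumes 512-byte sectors and a 4K alignment target, as
in the text; the function names are mine.

```python
from math import gcd

SECTOR_BYTES = 512                     # sector size assumed in the text
ALIGN_SECTORS = 4096 // SECTOR_BYTES   # a 4K boundary is 8 sectors


def cylinder_sectors(heads, sectors_per_track):
    """Size of one fake cylinder in sectors."""
    return heads * sectors_per_track


def cylinders_to_alignment(heads, sectors_per_track, align=ALIGN_SECTORS):
    """Smallest number of whole cylinders whose total size is a multiple
    of `align` sectors."""
    cyl = cylinder_sectors(heads, sectors_per_track)
    return align // gcd(cyl, align)


default_skip = cylinders_to_alignment(255, 63)   # 63*255 = 16065 is odd
scsi_skip = cylinders_to_alignment(64, 32)       # 32*64 = 2048 sectors
scsi_cyl_bytes = cylinder_sectors(64, 32) * SECTOR_BYTES
```

With 255 heads and 63 sectors, default_skip comes out as 8 cylinders;
with 64 heads and 32 sectors every cylinder boundary is already aligned,
and scsi_cyl_bytes is exactly 1 MB (2^20 bytes).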

@                    1 <= number of sectors/track < 64
@ 
@              The number of cylinders should be less than or equal to 1024, but

Nah, the number of cylinders shouldn't be less than or equal to 1024, since
if that is all it is then the maximum disk size with 512-byte sectors is
63*256*1024*512 = 7.87 GB = 8.46 disk manufacturers' GB.  You can't buy a
new hard disk that small.
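
For anyone who wants to check the numbers, here is the calculation as a
small Python sketch, assuming 512-byte sectors.  The function name is
mine.  The 16-head case is the ATA fake geometry discussed elsewhere in
this message.

```python
def chs_capacity_bytes(cylinders, heads, sectors_per_track, sector_bytes=512):
    """Largest size addressable with the given fake CHS geometry."""
    return cylinders * heads * sectors_per_track * sector_bytes


limit_256_heads = chs_capacity_bytes(1024, 256, 63)
limit_16_heads = chs_capacity_bytes(1024, 16, 63)

binary_gb = limit_256_heads / 2**30     # power-of-2 gigabytes
decimal_gb = limit_256_heads / 10**9    # disk manufacturers' gigabytes
decimal_mb_16 = limit_16_heads / 10**6  # disk manufacturers' megabytes
```

binary_gb is 7.875 and decimal_gb is about 8.46, which are the two
figures quoted above; with only 16 heads the same 1024 cylinders cover
about 528 disk manufacturers' MB.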

However, fdisk only uses the number of cylinders for initializing defaults
for partition sizes.  It can be set to almost any garbage value if you
don't use the defaults.

@              this is not enforced, although a warning will be printed.  Note

Indeed, it still prints bogus warnings.

@              that bootable FreeBSD slices (the ``/'' file system) must lie
@              completely within the first 1024 cylinders; if this is not true,
@              booting may fail.  Non-bootable slices do not have this restric-
@              tion.

Note that this is not true, except it only says "may fail".  Booting from
(fake) cylinders above 1024 was implemented in the FreeBSD boot loader on
26 June 2000.

@ 
@              Example (all of these are equivalent), for a disk with 1019
@              cylinders, 39 heads, and 63 sectors:
@ 
@                    g       c1019   h39     s63
@                    g       h39     c1019   s63
@                    g       s63     h39     c1019

>> fdisk -i /dev/ada0
>> Do you want to change our idea of what BIOS thinks ? [n]

Note that fdisk has no idea what the BIOS thinks.  The numbers here
are what FreeBSD thinks.  FreeBSD used to try to determine what the
BIOS thinks, but this was broken by GEOM.  GEOM just uses whatever
the disk says is its "firmware" geometry.  But the ATA standard
specifies that for disks larger than the magic 8.46 GB number mentioned
above, the fake geometry is always 63 sectors and only 16 heads.
Thus:
- the default geometry in fdisk and presumably in gpart is wrong if the
   BIOS doesn't use it.  Some BIOSes default to 240 heads.  Some BIOSes
   allow you to choose between several fake geometries.  Some BIOSes allow
   you to specify the precise geometry.
- with only 16 heads, the 1024-cylinder limit is reached at a disk size
   of only 528 disk manufacturers' MB, so fdisk's bogus warnings about this
   occur for disks less than 20 years old instead of only for disks less
   than 14 years old.

Last time I looked, Linux fdisk[s] worked better on FreeBSD than FreeBSD
fdisk, partly because they don't depend on special ioctls, so that they
know that they don't know the BIOS geometry.  Specifying the geometry is
so routine that it is a command-line parameter in all Linux fdisks in
FreeBSD ports (at least in old versions).

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Tue Mar 12 10:49:36 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 7BC79971
 for <freebsd-fs@freebsd.org>; Tue, 12 Mar 2013 10:49:36 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au
 [211.29.132.184]) by mx1.freebsd.org (Postfix) with ESMTP id 0B058AEF
 for <freebsd-fs@freebsd.org>; Tue, 12 Mar 2013 10:49:35 +0000 (UTC)
Received: from c211-30-173-106.carlnfd1.nsw.optusnet.com.au
 (c211-30-173-106.carlnfd1.nsw.optusnet.com.au [211.30.173.106])
 by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r2CAnMZW022890
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Tue, 12 Mar 2013 21:49:24 +1100
Date: Tue, 12 Mar 2013 21:49:22 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Bruce Evans <brde@optusnet.com.au>
Subject: Re: Aligning MBR for ZFS boot help
In-Reply-To: <20130312203745.A1130@besplex.bde.org>
Message-ID: <20130312213522.R1412@besplex.bde.org>
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com> <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
 <513CE369.4030303@caltel.com> <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
 <1362951595.99445.2.camel@btw.pki2.com>
 <alpine.BSF.2.00.1303101807550.8481@wonkity.com>
 <CABXB=RTt-j0SGxktWMfLcgLAEN6Vi+f=psBuN0jQaJthk_3cbw@mail.gmail.com>
 <513E1208.5020804@caltel.com> <20130312203745.A1130@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.0 cv=DdhPMYRW c=1 sm=1 a=u3bVZBOdoLwA:10
 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=cUKNXEIY390A:10
 a=GRtgF4SnNIrABPWBEScA:9 a=CjuIK1q_8ugA:10 a=TEtd8y5WR3g2ypngnwZWYw==:117
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 12 Mar 2013 10:49:36 -0000

On Tue, 12 Mar 2013, Bruce Evans wrote:

> Last time I looked, Linux fdisk[s] worked better on FreeBSD than FreeBSD
> fdisk, partly because they don't depend on special ioctls, so that they
> know that they don't know the BIOS geometry.  Specifying the geometry is
> so routine that it is a command-line parameter in all Linux fdisks in
> FreeBSD ports (at least in old versions).

Just tried an old version of them on my version of an old version of
FreeBSD.  They worked not so well:

- fdisk-linux: worked OK except for bogus warnings about > 1024 cylinders
   and a not so bogus warning about a slice not ending on a cylinder
   boundary.  The slice just ends at the end of the disk, and since the
   cylinders are fake that happens not to be a cylinder boundary.  The
   disk has a firmware fake geometry of 63 sectors 16 heads and mumble
   cylinders, and the disk manufacturer throws away sectors at the end
   to make it end on a cylinder boundary with these fake cylinders, but
   I use different fake cylinders.  FreeBSD, Linux and WinXP don't care
   about the slice not ending on a cylinder boundary.
- sfdisk-linux: refused to start without write permission
- cfdisk-linux: refused to start due to 1 slice not ending on a cylinder
   boundary.  With its -z workaround for not starting, it starts but is
   useless since it doesn't display the existing partitions.

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Tue Mar 12 20:24:49 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 7384C94
 for <freebsd-fs@freebsd.org>; Tue, 12 Mar 2013 20:24:49 +0000 (UTC)
 (envelope-from cr@caltel.com)
Received: from mail2.caltel.com (mail2.caltel.com [66.102.145.6])
 by mx1.freebsd.org (Postfix) with ESMTP id 52D2C62C
 for <freebsd-fs@freebsd.org>; Tue, 12 Mar 2013 20:24:49 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Ap8EAF2OP1FCZpCq/2dsb2JhbABDxGqBYXSCKQEBAQMBAQI1RgsLGAklDwIXLxMIAQGICgYMsVSPco8UFoMqA4hziyWCPoEfhEmLDoMqHA
X-IPAS-Result: Ap8EAF2OP1FCZpCq/2dsb2JhbABDxGqBYXSCKQEBAQMBAQI1RgsLGAklDwIXLxMIAQGICgYMsVSPco8UFoMqA4hziyWCPoEfhEmLDoMqHA
X-IronPort-AV: E=Sophos;i="4.84,833,1355126400"; 
   d="scan'208";a="908682"
Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local)
 ([66.102.144.170])
 by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA;
 12 Mar 2013 13:24:37 -0700
Message-ID: <513F8F04.60206@caltel.com>
Date: Tue, 12 Mar 2013 13:24:36 -0700
From: Cody Ritts <cr@caltel.com>
Organization: CalTel
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6;
 rv:17.0) Gecko/20130307 Thunderbird/17.0.4
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com>
 <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
 <513CE369.4030303@caltel.com>
 <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
 <1362951595.99445.2.camel@btw.pki2.com>
 <alpine.BSF.2.00.1303101807550.8481@wonkity.com>
 <CABXB=RTt-j0SGxktWMfLcgLAEN6Vi+f=psBuN0jQaJthk_3cbw@mail.gmail.com>
 <513E1208.5020804@caltel.com> <20130312203745.A1130@besplex.bde.org>
In-Reply-To: <20130312203745.A1130@besplex.bde.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 12 Mar 2013 20:24:49 -0000

On 3/12/13 3:25 AM, Bruce Evans wrote:
 >> Update --
 >>
 >> fdisk WILL allow you to align without regards to drive geometry
 >>
 >> It can only be done in interactive mode:
 >> http://lists.freebsd.org/pipermail/freebsd-geom/2011-May/004780.html
 >
 > It can be set in all modes.  At least according to the man page:


In interactive mode, you can simply set the start and size of your
partition, be done with it, and boot away.  If you then export that config
file and re-run it, you are back to being aligned to CHS.

I would imagine that adjusting your CHS ~correctly~ in the config file
will allow you to do it, but I have not found myself motivated to
really learn about adjusting those values.  I will have to pick that up
someday I suppose.

For informational purposes here is
1) partition w/ offset
2) show results
3) export/import config
4) show results with adjusted offset

> root@:/root # fdisk -i ada0
> Do you want to change our idea of what BIOS thinks ? [n]
> Do you want to change it? [n] y
> Supply a decimal value for "sysid (165=FreeBSD)" [165]
> Supply a decimal value for "start" [63] 4096
> Supply a decimal value for "size" [125045361] 125041328
> Correct this automatically? [n]
> Explicitly specify beg/end address ? [n]
> Are we happy with this entry? [n] y
> Do you want to change it? [n]
> Do you want to change it? [n]
> Do you want to change it? [n]
> Do you want to change the active partition? [n]
> Should we write new partition table? [n] y
>
> root@:/root # gpart show ada0
> =>       63  125045361  ada0  MBR  (59G)
>          63       4033        - free -  (2M)
>        4096  125041328     1  freebsd  [active]  (59G)
>
> root@:/root # fdisk -p ada0
> # /dev/ada0
> g c124053 h16 s63
> p 1 0xa5 4096 125041328
> a 1
>
> root@:/root # fdisk -p ada0 > command
>
> root@:/root # fdisk -f command ada0
> ******* Working on device /dev/ada0 *******
> fdisk: WARNING line 2: number of cylinders (124053) may be out-of-range
>     (must be within 1-1024 for normal BIOS operation, unless the entire disk
>     is dedicated to FreeBSD)
> fdisk: WARNING: adjusting start offset of partition 1
>     from 4096 to 4158, to fall on a head boundary
> fdisk: WARNING: adjusting size of partition 1 from 125041328 to 125041266
>     to end on a cylinder boundary
>
> root@:/root # fdisk -p ada0
> # /dev/ada0
> g c124053 h16 s63
> p 1 0xa5 4158 125041266
> a 1
>
> root@:/root # gpart show ada0
> =>       63  125045361  ada0  MBR  (59G)
>          63       4095        - free -  (2M)
>        4158  125041266     1  freebsd  [active]  (59G)
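
The two adjustments in the transcript above can be reproduced with a
little arithmetic.  This Python sketch is my reading of the warning
messages (start rounded up to a head boundary, end rounded down to a
cylinder boundary), not a copy of fdisk's code; the function name is
made up.

```python
def adjust_to_chs(start, size, heads, sectors_per_track):
    """Model of the adjustment fdisk reported when run with -f.

    The start is rounded up to a multiple of sectors/track (a head
    boundary) and the end is rounded down to a multiple of
    heads * sectors/track (a cylinder boundary).
    """
    spt = sectors_per_track
    cyl = heads * spt
    new_start = -(-start // spt) * spt          # round up to a head boundary
    new_end = (new_start + size) // cyl * cyl   # round down to a cylinder boundary
    return new_start, new_end - new_start


# Geometry from the exported config: g c124053 h16 s63
chs_start, chs_size = adjust_to_chs(4096, 125041328, heads=16, sectors_per_track=63)

# The power-of-2 geometry Bruce suggested (64 heads, 32 sectors)
p2_start, p2_size = adjust_to_chs(4096, 125041328, heads=64, sectors_per_track=32)
```

With h16 s63 this gives 4158 and 125041266, matching the warnings.  With
h64 s32 the start stays at 4096 under this model, though the size would
still be trimmed so that the slice ends on a 2048-sector boundary.  I
have not tried that geometry on the real disk.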

Thanks,

Cody

From owner-freebsd-fs@FreeBSD.ORG  Wed Mar 13 10:30:27 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id B5D2D691
 for <freebsd-fs@FreeBSD.org>; Wed, 13 Mar 2013 10:30:27 +0000 (UTC)
 (envelope-from girgen@FreeBSD.org)
Received: from melon.pingpong.net (melon.pingpong.net [79.136.116.200])
 by mx1.freebsd.org (Postfix) with ESMTP id 5A448622
 for <freebsd-fs@FreeBSD.org>; Wed, 13 Mar 2013 10:30:27 +0000 (UTC)
Received: from girgBook.local
 (c-2754e155.1525-1-64736c12.cust.bredbandsbolaget.se [85.225.84.39])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (No client certificate requested)
 by melon.pingpong.net (Postfix) with ESMTPSA id 4ED341431A
 for <freebsd-fs@FreeBSD.org>; Wed, 13 Mar 2013 11:23:13 +0100 (CET)
Message-ID: <51405391.1020006@FreeBSD.org>
Date: Wed, 13 Mar 2013 11:23:13 +0100
From: Palle Girgensohn <girgen@FreeBSD.org>
User-Agent: Postbox 3.0.7 (Macintosh/20130119)
MIME-Version: 1.0
To: freebsd-fs@FreeBSD.org
Subject: leaking lots of unreferenced inodes (pg_xlog files?), maybe after
 moving tables and indexes to tablespace on different volume
References: <513FCA39.7030709@FreeBSD.org>
In-Reply-To: <513FCA39.7030709@FreeBSD.org>
X-Enigmail-Version: 1.2.3
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 13 Mar 2013 10:30:27 -0000

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi!

Running postgresql-9.2.2 on FreeBSD 9.1 amd64 using vanilla ufs file system.

I have the postgresql base/ on the /usr disk, and a separate volume /opt
where the default tablespace resides. This means that the amount of data
on the /usr disk should be stable. This is not the case: the disk usage
grows linearly (it seems to leave many inodes unreferenced).

The discrepancy between df and du is now huge:

# du -sxh /usr; df -h /usr
4,6G	/usr
Filesystem     Size    Used   Avail Capacity  Mounted on
/dev/da0s1f    104G     88G    8.0G    92%    /usr

4,6G vs 88GB, that must be more than a rounding error?

Strange thing is I cannot find any open files among the missing.

# lsof /usr| awk '{print $9}'|xargs ls -l > /dev/null

returns no errors (a missing file would make ls report an error). If
there were open files not referenced in any directory, they should be
found.

Next thing is fsck, and yes, there are plenty of unreferenced files.

I ran fsck while the system was running (i.e. read-only) to get a grip on
the amount of lost inodes:

fsck /usr | awk '{print $1}'|cut -f 2 -d=| perl -e '$i = 0; while (<>) {
$i += $_;}; print $i / 1024 / 1024; print "\n";'
85223.3530330658

~85 GB gone, that's 80% of the disk, and it accounts for all the missing
space.
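
The perl summation above can also be sketched with awk alone. This is only
a sketch: it assumes that fsck reports each unreferenced inode's size as a
SIZE=<bytes> field, which is what the awk/cut pipeline above implies. The
function name sum_unref_mib is made up for illustration.

```shell
#!/bin/sh
# Sum every SIZE=<bytes> field read from stdin and print the total in MiB.
# Assumption: sizes appear as whitespace-separated SIZE=<n> fields.
sum_unref_mib() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^SIZE=[0-9]+$/)
                total += substr($i, 6)    # drop the leading "SIZE="
    } END { printf "%.1f\n", total / 1048576 }'
}

# Usage (read-only check on a live system, as above):
#   fsck /usr | sum_unref_mib
```

Matching on the SIZE= prefix, rather than taking the first field of every
line, keeps unrelated fsck output from being added to the total.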

The MTIMEs of the inodes are pretty evenly spread over the time since the
machine was updated to FreeBSD 9.1, rebooted, and PostgreSQL was updated
to 9.2. All was done at the same time, so I can't really tell who's to
blame, but this is the only server, out of a dozen that were updated to
exactly the same versions, that has this problem. All other servers have
their /usr disk usage stable (since all data resides on a separate
tablespace).

The unreferenced inodes are almost exclusively around 16 MB in size, so
they most certainly all are postgresql pg_xlog files. This means all
files are lost from the same portion of code in the database engine.

How could it possibly be able to leave unreferenced inodes around like
this at such a scale? Is the culprit a combination of postgresql and
file system code? Both were updated.

pg_xlog checkpoints seem to start every five minutes and run for about
150 seconds each:

Mar 13 00:39:08 dbserver postgres[5298]: [48-1] db=,user= LOG:
checkpoint starting: time
Mar 13 00:41:38 dbserver postgres[5298]: [49-1] db=,user= LOG:
checkpoint complete: wrote 2542 buffers (0.3%); 0 transaction log
file(s) added, 0 removed, 1 recycled; write=149.667 s, sync=0.101 s,
total=149.770 s; sync files=628, longest=0.021 s, average=0.000 s
Mar 13 00:44:08 dbserver postgres[5298]: [50-1] db=,user= LOG:
checkpoint starting: time
Mar 13 00:46:38 dbserver postgres[5298]: [51-1] db=,user= LOG:
checkpoint complete: wrote 3996 buffers (0.4%); 0 transaction log
file(s) added, 0 removed, 1 recycled; write=149.438 s, sync=0.111 s,
total=149.551 s; sync files=823, longest=0.006 s, average=0.000 s
Mar 13 00:49:08 dbserver postgres[5298]: [52-1] db=,user= LOG:
checkpoint starting: time
Mar 13 00:51:38 dbserver postgres[5298]: [53-1] db=,user= LOG:
checkpoint complete: wrote 13736 buffers (1.4%); 0 transaction log
file(s) added, 0 removed, 2 recycled; write=149.958 s, sync=0.311 s,
total=150.271 s; sync files=1335, longest=0.079 s, average=0.000 s
Mar 13 00:54:08 dbserver postgres[5298]: [54-1] db=,user= LOG:
checkpoint starting: time
Mar 13 00:56:38 dbserver postgres[5298]: [55-1] db=,user= LOG:
checkpoint complete: wrote 14638 buffers (1.5%); 0 transaction log
file(s) added, 0 removed, 17 recycled; write=149.330 s, sync=0.271 s,
total=149.603 s; sync files=1363, longest=0.017 s, average=0.000 s
Mar 13 00:59:08 dbserver postgres[5298]: [56-1] db=,user= LOG:
checkpoint starting: time
Mar 13 01:01:38 dbserver postgres[5298]: [57-1] db=,user= LOG:
checkpoint complete: wrote 8035 buffers (0.8%); 0 transaction log
file(s) added, 0 removed, 21 recycled; write=149.285 s, sync=0.146 s,
total=149.433 s; sync files=1160, longest=0.003 s, average=0.000 s
Mar 13 01:04:08 dbserver postgres[5298]: [58-1] db=,user= LOG:
checkpoint starting: time
Mar 13 01:06:37 dbserver postgres[5298]: [59-1] db=,user= LOG:
checkpoint complete: wrote 2156 buffers (0.2%); 0 transaction log
file(s) added, 0 removed, 9 recycled; write=149.402 s, sync=0.057 s,
total=149.461 s; sync files=610, longest=0.000 s, average=0.000 s
Mar 13 01:09:08 dbserver postgres[5298]: [60-1] db=,user= LOG:
checkpoint starting: time


I'm pretty certain that unmounting the file system and running fsck will
regain the lost space, but will it stop there?

Stopping postgresql briefly did not help, I tried that. That would have
helped if the files were open, but they're not. It seems that postgresql
did the right thing, and FreeBSD failed to unreference the files.

The server has about 30 databases and ~127 concurrent connections (not
all being active simultaneously, though), so it is fair to say it is
pretty active, but nothing extreme.

Hardware is HP DL360, using their HT Smart Array P410i.

Any ideas how to debug this? Or shall I just reboot, fsck, hope the
problem will go away, and when it does, forget about it?

Thanks,
Palle
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJRQFORAAoJEIhV+7FrxBJDzVUIAJHU011JDxLxj8/xg05Gwhgq
XK3xB+0N0NSUQ50yhcRKLINz/j/XfeS0ZxlH+MstaPA9y0r1JUXMxkb/uTUvGBiy
jutk3eVe0cati9cVZbJkRU5FxEgmQ0fg0GOMl3RQAErkh5achj+klWvN7PnwGjTs
O3L9RgckKuxTJffk52GAS05qY/TKR6f08kdX3I2cFtqw3tyTyrXU0JPdk2snuPhv
H40xV46zgtWMFDvZLt61MryQ7/JotVQwU78scUB+zxrf8KKM9V0mM7pk0pIbG4Qw
NJBpZJ5gjbl4x+dkQrtZdL65yq88hACYwo9D+83Ct4ig8tgcQ7ViNHWxJqknK7Q=
=3ZZs
-----END PGP SIGNATURE-----

From owner-freebsd-fs@FreeBSD.ORG  Wed Mar 13 13:34:15 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 0D2792D7
 for <freebsd-fs@freebsd.org>; Wed, 13 Mar 2013 13:34:15 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail12.syd.optusnet.com.au (mail12.syd.optusnet.com.au
 [211.29.132.193]) by mx1.freebsd.org (Postfix) with ESMTP id 8ADA9338
 for <freebsd-fs@freebsd.org>; Wed, 13 Mar 2013 13:34:13 +0000 (UTC)
Received: from c211-30-173-106.carlnfd1.nsw.optusnet.com.au
 (c211-30-173-106.carlnfd1.nsw.optusnet.com.au [211.30.173.106])
 by mail12.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r2DDXxtC032494
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Thu, 14 Mar 2013 00:34:01 +1100
Date: Thu, 14 Mar 2013 00:33:59 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Cody Ritts <cr@caltel.com>
Subject: Re: Aligning MBR for ZFS boot help
In-Reply-To: <513F8F04.60206@caltel.com>
Message-ID: <20130313232247.B1078@besplex.bde.org>
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com> <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
 <513CE369.4030303@caltel.com> <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
 <1362951595.99445.2.camel@btw.pki2.com>
 <alpine.BSF.2.00.1303101807550.8481@wonkity.com>
 <CABXB=RTt-j0SGxktWMfLcgLAEN6Vi+f=psBuN0jQaJthk_3cbw@mail.gmail.com>
 <513E1208.5020804@caltel.com> <20130312203745.A1130@besplex.bde.org>
 <513F8F04.60206@caltel.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.0 cv=Q4OKePKa c=1 sm=1 a=u3bVZBOdoLwA:10
 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=cUKNXEIY390A:10
 a=6I5d2MoRAAAA:8 a=ikclS7t6qE9F1bxxU1YA:9 a=CjuIK1q_8ugA:10
 a=3JYNrmlC3cAA:10 a=z_RuTD8k6CXfqtkX:21 a=7nNRCSddQ2EWGmUR:21
 a=TEtd8y5WR3g2ypngnwZWYw==:117
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 13 Mar 2013 13:34:15 -0000

On Tue, 12 Mar 2013, Cody Ritts wrote:

> On 3/12/13 3:25 AM, Bruce Evans wrote:
>>> Update --
>>>
>>> fdisk WILL allow you to align without regards to drive geometry
>>>
>>> It can only be done in interactive mode:
>>> http://lists.freebsd.org/pipermail/freebsd-geom/2011-May/004780.html
>>
>> It can be set in all modes.  At least according to the man page:
>
> In interactive mode, you can simply set the start and size of your partition,
> be done with it, and boot away.

That will usually give nonsense beginning and ending CHS values.  This will
usually prevent BIOSes and non-broken versions of FreeBSD from detecting
the geometry.  It would also prevent booting on BIOSes that use the CHS
values; this is not usually a problem now.  So this misconfiguration will
mainly mess up printing of the partition table in utilities that print
the CHS values, cause warnings in utilities that want the CHS values
to be correct, and propagate the misconfiguration.

> If you then export that config file, and 
> re-run it, you are back to being aligned to CHS.

Only if the config file is broken.

> I would imagine that adjusting your CHS ~correctly~ in the config file will 
> allow you to do it, but I have not found myself motivated to really learn 
> about adjusting those values.  I will have to pick that up someday I suppose.

Yes this is necessary and very easy.  Just type in the same geometry that
you need to type in at the start of interactive fdisk to avoid the above
problems.

> For informational purposes here is
> 1) partition w/ offset
> 2) show results
> 3) export/import config
> 4) show results with adjusted offset
>
>> root@:/root # fdisk -i ada0
>> Do you want to change our idea of what BIOS thinks ? [n]

You do want to change this.  Use something like 32 sectors and 64 heads
(1MB cylinders).

>> Do you want to change it? [n] y
>> Supply a decimal value for "sysid (165=FreeBSD)" [165]
>> Supply a decimal value for "start" [63] 4096
>> Supply a decimal value for "size" [125045361] 125041328
>> Correct this automatically? [n]

After changing "what BIOS thinks" to something like the above, fdisk
shouldn't see anything to correct.  The start of 4096 is a multiple
of 32*64 = 2048.
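
The arithmetic behind this can be checked in the shell.  A minimal sketch,
assuming 512-byte sectors; the helper names are made up for illustration:

```shell
#!/bin/sh
# Sectors per cylinder for a geometry given as heads and sectors per track.
cyl_sectors() {
    echo $(( $1 * $2 ))
}

# Succeeds if sector offset $1 falls on a cylinder boundary for a
# geometry of $2 heads and $3 sectors per track.
on_cyl_boundary() {
    [ $(( $1 % ($2 * $3) )) -eq 0 ]
}

cyl_sectors 64 32                         # sectors per cylinder
echo $(( $(cyl_sectors 64 32) * 512 ))    # bytes per cylinder
on_cyl_boundary 4096 64 32 && echo "4096: aligned for 64 heads, 32 sectors"
on_cyl_boundary 4096 16 63 || echo "4096: not aligned for 16 heads, 63 sectors"
```

With 64 heads and 32 sectors a cylinder is 2048 sectors, or 1048576 bytes,
so any start that is a multiple of 2048 sectors is 1MB-aligned.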

I just noticed that despite being too chatty, "what BIOS thinks"
has bad grammar.

>> Explicitly specify beg/end address ? [n]
>> Are we happy with this entry? [n] y
>> Do you want to change it? [n]
>> Do you want to change it? [n]
>> Do you want to change it? [n]
>> Do you want to change the active partition? [n]
>> Should we write new partition table? [n] y
>> 
>> root@:/root # gpart show ada0
>> =>       63  125045361  ada0  MBR  (59G)
>>          63       4033        - free -  (2M)
>>        4096  125041328     1  freebsd  [active]  (59G)

Looks like it got a bogus 63 from the same place as fdisk (from the
"firmware" geometry).

The free area isn't really 4033 sectors starting at 63, but 4095 sectors
starting at 1 (for non-broken BIOSes and OSes).  I used to start
partitions at offset 1, but now use 63 for portability.  However, 63
isn't portable either.  The BIOS or another OS might have a different
idea of the geometry.

>> root@:/root # fdisk -p ada0
>> # /dev/ada0
>> g c124053 h16 s63
>> p 1 0xa5 4096 125041328
>> a 1

This shows fdisk -p using the same garbage default geometry as interactive
fdisk.  So fdisk -p output is not directly usable.

There is no place in the partition table to store the geometry directly.
It can sometimes be determined indirectly, but neither the kernel nor
fdisk does so.  Old kernels did so, and fdisk depended on this.
However, fdisk shouldn't depend on this, so that fdisk can work on
images of partition tables.  Linux fdisk (fdisk-linux in ports) does
this correctly.  Not using OS-specific ioctls for this also improves
portability.  The kernel support for this was mainly to ensure that
all FreeBSD utilities got a consistent view of the geometry, back when
consistent views mattered.  It's still confusing when the views are
different.

>> root@:/root # fdisk -p ada0 > command
>> 
>> root@:/root # fdisk -f command ada0
>> ******* Working on device /dev/ada0 *******
>> fdisk: WARNING line 2: number of cylinders (124053) may be out-of-range
>>     (must be within 1-1024 for normal BIOS operation, unless the entire disk
>>     is dedicated to FreeBSD)
>> fdisk: WARNING: adjusting start offset of partition 1
>>     from 4096 to 4158, to fall on a head boundary
>> fdisk: WARNING: adjusting size of partition 1 from 125041328 to 125041266
>>     to end on a cylinder boundary

This is broken.  The non-interactive version should not adjust anything.
Even the interactive version defaults to not adjusting.

>> root@:/root # fdisk -p ada0
>> # /dev/ada0
>> g c124053 h16 s63
>> p 1 0xa5 4158 125041266
>> a 1
>> 
>> root@:/root # gpart show ada0
>> =>       63  125045361  ada0  MBR  (59G)
>>          63       4095        - free -  (2M)
>>        4158  125041266     1  freebsd  [active]  (59G)

So the bugs of fdisk -p producing wrong geometry, not editing its output
to fix this, and the broken adjustment done by fdisk -f, result in the
partition offsets being corrupted by fdisk -f.
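
The adjusted numbers can be reproduced with plain rounding.  This is a
sketch of arithmetic that matches the warnings quoted above, not code
taken from fdisk; the function name and the exact rounding rules are
assumptions:

```shell
#!/bin/sh
# Usage: adjust_partition START SIZE HEADS SECTORS_PER_TRACK
# Rounds the start up to a head (track) boundary and the end down to a
# cylinder boundary, then prints the resulting "start size".
adjust_partition() {
    start=$1 size=$2 heads=$3 spt=$4
    end=$(( start + size ))
    start=$(( (start + spt - 1) / spt * spt ))       # next multiple of spt
    end=$(( end / (heads * spt) * (heads * spt) ))   # previous cylinder boundary
    echo "$start $(( end - start ))"
}

adjust_partition 4096 125041328 16 63    # geometry as printed by fdisk -p
```

With 16 heads and 63 sectors this prints 4158 125041266, the values that
fdisk -f wrote: 4096 is not a multiple of 63, so the start moves up by 62
sectors and the size shrinks by the same amount.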

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Wed Mar 13 14:12:35 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 4F97ACBB
 for <freebsd-fs@freebsd.org>; Wed, 13 Mar 2013 14:12:35 +0000 (UTC)
 (envelope-from darksoul@darkbsd.org)
Received: from denrei.darkbsd.org (denrei.darkbsd.org [91.121.179.66])
 by mx1.freebsd.org (Postfix) with ESMTP id 01F9D8AB
 for <freebsd-fs@freebsd.org>; Wed, 13 Mar 2013 14:12:34 +0000 (UTC)
Received: from denrei.darkbsd.org (localhost [127.0.0.1])
 by denrei.darkbsd.org (Postfix) with ESMTP id E65A4C81
 for <freebsd-fs@freebsd.org>; Wed, 13 Mar 2013 15:12:27 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=darkbsd.org; h=message-id
 :date:from:mime-version:to:subject:references:in-reply-to
 :content-type; s=selector1; bh=Jol/lfuPfs0++ZatAgIlehTA/wU=; b=q
 7mKZoze8PyB2g3wrLbL8JiZFvoh0V4Rle1QQ9qYoxISyrszKBDdjOR/0dbF12fge
 s46LkEVHpzflLryTeORZueBwcRegVeCe+1C2e4Y8GP/3OHbOPb2LU8KLfHVLZQS4
 kvGrhfcuBoFwPd8/kKzrxs3P9cWYo3o2I+mXknMrM8=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=darkbsd.org; h=message-id
 :date:from:mime-version:to:subject:references:in-reply-to
 :content-type; q=dns; s=selector1; b=HqbJAvYKTRkBCFu/P0aKA2WoY7l
 5YpZsO5WIq9xiCeGT9v7vgwjYW35eWg07RXRLmH1UpVyRRzgKvvgUWfjp13shu+b
 wGhdcZS14ha67K3QoQrqneun6+owtptaTf+9r7ifpQGAhTMDkzAiwgJJobyUmH5F
 kURYbb2Fai3gYEzs=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=darkbsd.org; h=
 content-type:content-type:in-reply-to:references:subject:subject
 :mime-version:user-agent:from:from:date:date:message-id:received
 :received; s=selector1; t=1363183944; bh=4Nk4GeV3hp9KFCJNPlTRPK/
 ZEUAJrPISFLxKtkwNUo0=; b=mBEe01WWLPWEVEUw69EED8gDHckEvwozyswywpA
 vHpr9KseP0EfZWjDyMprw15VK5aM14xgH7ZtyHxRlVoFYXRTTCcIB0ESBxX/9MHo
 9S1CFqP0LOUbqyOpcR5kM9EJuMm4XG98+tSuEEhAe3VEz/bQSFOKExRU5W3/bHrb
 1Y3Q=
X-Virus-Scanned: amavisd-new at darkbsd.org
Received: from denrei.darkbsd.org ([127.0.0.1])
 by denrei.darkbsd.org (denrei.darkbsd.org [127.0.0.1]) (amavisd-new,
 port 10026) with ESMTP id 36xaddxBmxIn for <freebsd-fs@freebsd.org>;
 Wed, 13 Mar 2013 15:12:24 +0100 (CET)
Received: from [IPv6:2001:470:24:42d::42] (archer.yomi.darkbsd.org
 [IPv6:2001:470:24:42d::42])
 (Authenticated sender: darksoul@darkbsd.org)
 by denrei.darkbsd.org (Postfix) with ESMTPSA id 566A7C80
 for <freebsd-fs@freebsd.org>; Wed, 13 Mar 2013 15:12:22 +0100 (CET)
Message-ID: <51408940.9000609@darkbsd.org>
Date: Wed, 13 Mar 2013 23:12:16 +0900
From: DarkSoul <darksoul@darkbsd.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:17.0) Gecko/20130106 Thunderbird/17.0.2
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Re: Panic loop on ZFS with 9.1-RELEASE
References: <513B58B6.2090903@darkbsd.org> <513B6E1E.6080805@darkbsd.org>
 <513B7555.1010701@darkbsd.org>
In-Reply-To: <513B7555.1010701@darkbsd.org>
X-Enigmail-Version: 1.4.6
Content-Type: multipart/signed; micalg=pgp-sha256;
 protocol="application/pgp-signature";
 boundary="------------enig2AE1FCAA1F9E474431CFBC3A"
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 13 Mar 2013 14:12:35 -0000

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)
--------------enig2AE1FCAA1F9E474431CFBC3A
Content-Type: text/plain; charset=ISO-2022-JP
Content-Transfer-Encoding: 7bit

Just a very quick heads up.

I still haven't succeeded in importing the pool readwrite, but I have
succeeded in importing it readonly.

This has been confirmed as a bug by the ZFS illumos ML people.
Description :
You can't import readonly a pool that has cache devices, because the
import will try to send write IOs to auxiliary vdevs, and hit an
assert() call, thus provoking a panic.

Workaround :
Destroy cache devices before zpool import -o readonly=on -f <pool>.

Cheers,

On 03/10/2013 02:45 AM, Stephane LAPIE wrote:
> Pinpoint analysis of the zpool on the broken vdev gives the following
> information :
>
> # zdb -AAA -e -mm prana 1 33
>
> Metaslabs:
>     vdev          1
>     metaslabs   145   offset                spacemap          free     
>     ---------------   -------------------   ---------------   -------------
>     metaslab     33   offset  21000000000   spacemap    303   free    11.9G
> WARNING: zfs: allocating allocated segment(offset=2335563722752 size=1024)
>
> Assertion failed: sm->sm_space == space (0x2f927f400 == 0x2f927f800),
> file
> /usr/storage/tech/eirei-no-za.yomi.darkbsd.org/usr/src/cddl/lib/libzpool/../../../sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c,
> line 353.
> pid 51 (zdb), uid 0: exited on signal 6 (core dumped)
> Abort trap (core dumped)
>
> Just in case, root vdev 1 is made of the following devices :
> children[1]:
>     type: 'raidz'
>     id: 1
>     guid: 1078755695237588414
>     nparity: 1
>     metaslab_array: 175
>     metaslab_shift: 36
>     ashift: 9
>     asize: 10001970626560
>     is_log: 0
>     children[0]:
>         type: 'disk'
>         id: 0
>         guid: 12900041001921590764
>         path: '/dev/da10'
>         phys_path: '/dev/da10'
>         whole_disk: 0
>         DTL: 4127
>     children[1]:
>         type: 'disk'
>         id: 1
>         guid: 7211789756938666186
>         path: '/dev/da3'
>         phys_path: '/dev/da3'
>         whole_disk: 1
>         DTL: 4119
>     children[2]:
>         type: 'disk'
>         id: 2
>         guid: 12094368820342087236
>         path: '/dev/da5'
>         phys_path: '/dev/da5'
>         whole_disk: 1
>         DTL: 212
>     children[3]:
>         type: 'disk'
>         id: 3
>         guid: 6868867539761908697
>         path: '/dev/da4'
>         phys_path: '/dev/da4'
>         whole_disk: 0
>         DTL: 4173
>     children[4]:
>         type: 'disk'
>         id: 4
>         guid: 3091570768700552191
>         path: '/dev/da6'
>         phys_path: '/dev/da6'
>         whole_disk: 0
>         DTL: 4182
>
> At this point I am nearly considering ripping these out and zpool
> importing while ignoring missing devices... :/
>
> On 03/10/2013 02:15 AM, Stephane LAPIE wrote:
>> Posting a quick update.
>>
>> I ran a "zdb -emm" command to figure out what was going on, and it blew
>> up in my face with an abort trap here :
>> - vdev 0 has 145 metaslabs, which are cleared without any problems.
>> - vdev 1 has 145 metaslabs, but fails in the middle :
>> metaslab     32   offset  20000000000   spacemap    289   free    1.64G
>>                   segments      19509   maxsize   41.7M   freepct    2%
>> metaslab     33   offset  21000000000   spacemap    303   free    11.9G
>> error: zfs: allocating allocated segment(offset=2335563722752 size=1024)
>> Abort trap(core dumped)
>>
>> Converting offset 2335563722752 from earlier kernel panic messages gives
>> : 21fca723000, which matches the broken metaslab found by zdb.
>>
>> Is there anything I can do at this point, using zdb?
>> It just sounds surrealistic I have ONE broken metaslab (seemingly?) and
>> that I can't recover anything...
>>
>> Cheers,
>>
>> On 03/10/2013 12:43 AM, Stephane LAPIE wrote:
>>> Hello list,
>>>
>>> I currently am faced with a sudden death case I can't understand at all,
>>> and I would greatly appreciate any explanation or assistance :(
>>>
>>> Here is my current kernel version :
>>> FreeBSD  9.1-STABLE FreeBSD 9.1-STABLE #5 r245055: Thu Jan 17 13:12:59
>>> JST 2013
>>> darksoul@eirei-no-za.yomi.darkbsd.org:/usr/obj/usr/storage/tech/eirei-no-za.yomi.darkbsd.org/usr/src/sys/DARK-2012KERN 
>>> amd64
>>> (Kernel is basically a lightened GENERIC kernel without VESA options and
>>> unneeded controllers removed)
>>>
>>> The pool is a set of 3x raidz1 (5 drives), + 2 cache devices + mirrored
>>> transaction log
>>>
>>> Booting and trying to import the pool is met with :
>>> Solaris(panic): zfs: panic: allocating allocated
>>> segment(offset=2335563722752 size=1024)
>>>
>>> Booting single mode on my emergency flash card with a base OS and zpool
>>> import -o readonly=on is met with :
>>> panic: solaris assert: zio->io_type != ZIO_TYPE_WRITE ||
>>> spa_writeable(spa), file:
>>> /usr/storage/tech/eirei-no-za.yomi.darkbsd.org/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c,
>>> line: 2461
>>>
>>> I tried zpool import -F -f, zpool import -F -f -m after removing the
>>> mirrored transaction log devices, but after 40s of trying to import, it
>>> just blows up.
>>>
>>> I am currently running "zdb -emm" as per the procedure suggested here :
>>> http://simplex.swordsaint.net/?p=199 if only to get some debug information.
>>>
>>> Thanks in advance for your time.
>>>
>>> Cheers,
>>>
>>>
>>> -- 
>>> Stephane LAPIE, EPITA SRS, Promo 2005
>>> "Even when they have digital readouts, I can't understand them."
>>> --MegaTokyo
>> -- 
>> Stephane LAPIE, EPITA SRS, Promo 2005
>> "Even when they have digital readouts, I can't understand them."
>> --MegaTokyo
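
The offset arithmetic quoted above can be checked in the shell. A minimal
sketch, assuming metaslabs start at multiples of 2^metaslab_shift (which is
consistent with the zdb listing, where metaslab 32 is at 20000000000 and
metaslab 33 at 21000000000); the function name is made up for illustration:

```shell
#!/bin/sh
# Index of the metaslab containing byte offset $1, for metaslab_shift $2.
metaslab_index() {
    echo $(( $1 >> $2 ))
}

offset=2335563722752    # from the panic message
shift_bits=36           # metaslab_shift of vdev 1

printf '%x\n' "$offset"                    # offset in hex
idx=$(metaslab_index "$offset" "$shift_bits")
echo "$idx"                                # metaslab number
printf '%x\n' $(( idx << shift_bits ))     # start of that metaslab in hex
echo $(( 0x2f927f800 - 0x2f927f400 ))      # gap between the asserted values
```

This prints 21fca723000, 33, 21000000000 and 1024. The last figure, the
difference between the two space values in the failed assertion, happens to
equal the size=1024 of the segment named in the warning.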

-- 
Stephane LAPIE, EPITA SRS, Promo 2005
"Even when they have digital readouts, I can't understand them."
--MegaTokyo


--------------enig2AE1FCAA1F9E474431CFBC3A
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iF4EAREIAAYFAlFAiUEACgkQDJ4OK7D3FWQcogEAs/t505xibhP4EWsRGiAF8+qO
NDV/kBSdgU7Cd/UB118BALK603KdwiW4fxn/NnGBZa4T0k5NhUWUrwQ/YgjgUZWO
=6itv
-----END PGP SIGNATURE-----

--------------enig2AE1FCAA1F9E474431CFBC3A--

From owner-freebsd-fs@FreeBSD.ORG  Wed Mar 13 16:52:34 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 037127ED;
 Wed, 13 Mar 2013 16:52:34 +0000 (UTC)
 (envelope-from mckusick@mckusick.com)
Received: from chez.mckusick.com (chez.mckusick.com
 [IPv6:2001:5a8:4:7e72:4a5b:39ff:fe12:452])
 by mx1.freebsd.org (Postfix) with ESMTP id A38E874F;
 Wed, 13 Mar 2013 16:52:33 +0000 (UTC)
Received: from chez.mckusick.com (localhost [127.0.0.1])
 by chez.mckusick.com (8.14.3/8.14.3) with ESMTP id r2DGqSr4051899;
 Wed, 13 Mar 2013 09:52:29 -0700 (PDT)
 (envelope-from mckusick@chez.mckusick.com)
Message-Id: <201303131652.r2DGqSr4051899@chez.mckusick.com>
To: Palle Girgensohn <girgen@freebsd.org>
Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?),
 maybe after moving tables and indexes to tablespace on different
 volume 
In-reply-to: <51405391.1020006@FreeBSD.org> 
Date: Wed, 13 Mar 2013 09:52:28 -0700
From: Kirk McKusick <mckusick@mckusick.com>
X-Spam-Status: No, score=0.0 required=5.0 tests=MISSING_MID, UNPARSEABLE_RELAY
 autolearn=failed version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on chez.mckusick.com
Cc: freebsd-fs@freebsd.org, Jeff Roberson <jroberson@jroberson.net>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 13 Mar 2013 16:52:34 -0000

Thanks for your report. It is certainly unlike anything that we
have seen reported before. 

Are you running your /usr filesystem with (the default) journalled
soft updates? You can check this by running the `mount' command 
with no arguments.

Rather than rebooting your system, it would be most helpful if you
could instead shut it down to single user. Then do the following:

Create a transcript of your session by running `script'. Once
running in the session run these commands:

Run `mount' to show your filesystem configuration.
Run `df -hi /usr' to see whether the inodes are still missing.
Verify that you can cleanly unmount /usr (e.g., that the unmount
  does not hang and does not complain).
Remount /usr and run `df -hi' to see whether the inodes are
  still missing.
Unmount /usr again and run `fsck_ffs -p -f -d /usr'. If the fsck_ffs
fails with an unexpected inconsistency, you can run `fsck_ffs -y -d /usr'
to force it to clean up. When you have the filesystem successfully
cleaned up, type `exit' to get out of the script session and mail
me the transcript of the session (typescript).
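
The sequence above could be collected into a small script to run from
inside the `script' session. This is only a sketch: the DRY_RUN guard and
the run helper are made up for illustration, and by default nothing is
executed; the commands are only printed.

```shell
#!/bin/sh
# Print each command; execute it only when DRY_RUN=0 is set explicitly.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-1}" = "0" ] && "$@"
    return 0
}

diagnose_usr() {
    run mount                     # show the filesystem configuration
    run df -hi /usr               # are the inodes still missing?
    run umount /usr               # should neither hang nor complain
    run mount /usr
    run df -hi                    # still missing after the remount?
    run umount /usr
    run fsck_ffs -p -f -d /usr    # if this fails: fsck_ffs -y -d /usr
}

diagnose_usr
```

Each step still needs a human eye on its output, so stepping through the
printed commands by hand is probably the safer way to use this.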

Thanks for your help in tracking this down.

	Kirk McKusick

----- Original Message:

Date: Wed, 13 Mar 2013 11:23:13 +0100
From: Palle Girgensohn <girgen@freebsd.org>
To: freebsd-fs@freebsd.org
Subject: leaking lots of unreferenced inodes (pg_xlog files?), maybe after
 moving tables and indexes to tablespace on different volume

Hi!

Running postgresql-9.2.2 on FreeBSD 9.1 amd64 using vanilla ufs file system.

I have the postgresql base/ on the /usr disk, and a separate volume /opt
where the default tablespace resides. This means that the amount of data
on the /usr disk should be stable. This is not the case: the disk usage
grows linearly (it seems to leave many inodes unreferenced).

The discrepancy between df and du is now huge:

# du -sxh /usr; df -h /usr
4,6G	/usr
Filesystem     Size    Used   Avail Capacity  Mounted on
/dev/da0s1f    104G     88G    8.0G    92%    /usr

4,6G vs 88GB, that must be more than a rounding error?

Strange thing is I cannot find any open files among the missing.

# lsof /usr| awk '{print $9}'|xargs ls -l > /dev/null

returns no errors (a missing file would make ls report an error). If
there were open files not referenced in any directory, they should be
found.

Next thing is fsck, and yes, there are plenty of unreferenced files.

I ran fsck while the system was running (i.e. read-only) to get a grip on
the amount of lost inodes:

fsck /usr | awk '{print $1}'|cut -f 2 -d=| perl -e '$i = 0; while (<>) {
$i += $_;}; print $i / 1024 / 1024; print "\n";'
85223.3530330658

~85 GB gone, that's 80% of the disk, and it accounts for all the missing
space.

The MTIMEs of the inodes are pretty evenly spread over the time since the
machine was updated to FreeBSD 9.1, rebooted, and PostgreSQL was updated
to 9.2. All was done at the same time, so I can't really tell who's to
blame, but this is the only server, out of a dozen that were updated to
exactly the same versions, that has this problem. All other servers have
their /usr disk usage stable (since all data resides on a separate
tablespace).

The unreferenced inodes are almost exclusively around 16 MB in size, so
they most certainly all are postgresql pg_xlog files. This means all
files are lost from the same portion of code in the database engine.

How could it possibly be able to leave unreferenced inodes around like
this at such a scale? Is the culprit a combination of postgresql and
file system code? Both were updated.

pg_xlog checkpoints seem to start every five minutes and run for about
150 seconds each:

Mar 13 00:39:08 dbserver postgres[5298]: [48-1] db=,user= LOG:
checkpoint starting: time
Mar 13 00:41:38 dbserver postgres[5298]: [49-1] db=,user= LOG:
checkpoint complete: wrote 2542 buffers (0.3%); 0 transaction log
file(s) added, 0 removed, 1 recycled; write=149.667 s, sync=0.101 s,
total=149.770 s; sync files=628, longest=0.021 s, average=0.000 s
Mar 13 00:44:08 dbserver postgres[5298]: [50-1] db=,user= LOG:
checkpoint starting: time
Mar 13 00:46:38 dbserver postgres[5298]: [51-1] db=,user= LOG:
checkpoint complete: wrote 3996 buffers (0.4%); 0 transaction log
file(s) added, 0 removed, 1 recycled; write=149.438 s, sync=0.111 s,
total=149.551 s; sync files=823, longest=0.006 s, average=0.000 s
Mar 13 00:49:08 dbserver postgres[5298]: [52-1] db=,user= LOG:
checkpoint starting: time
Mar 13 00:51:38 dbserver postgres[5298]: [53-1] db=,user= LOG:
checkpoint complete: wrote 13736 buffers (1.4%); 0 transaction log
file(s) added, 0 removed, 2 recycled; write=149.958 s, sync=0.311 s,
total=150.271 s; sync files=1335, longest=0.079 s, average=0.000 s
Mar 13 00:54:08 dbserver postgres[5298]: [54-1] db=,user= LOG:
checkpoint starting: time
Mar 13 00:56:38 dbserver postgres[5298]: [55-1] db=,user= LOG:
checkpoint complete: wrote 14638 buffers (1.5%); 0 transaction log
file(s) added, 0 removed, 17 recycled; write=149.330 s, sync=0.271 s,
total=149.603 s; sync files=1363, longest=0.017 s, average=0.000 s
Mar 13 00:59:08 dbserver postgres[5298]: [56-1] db=,user= LOG:
checkpoint starting: time
Mar 13 01:01:38 dbserver postgres[5298]: [57-1] db=,user= LOG:
checkpoint complete: wrote 8035 buffers (0.8%); 0 transaction log
file(s) added, 0 removed, 21 recycled; write=149.285 s, sync=0.146 s,
total=149.433 s; sync files=1160, longest=0.003 s, average=0.000 s
Mar 13 01:04:08 dbserver postgres[5298]: [58-1] db=,user= LOG:
checkpoint starting: time
Mar 13 01:06:37 dbserver postgres[5298]: [59-1] db=,user= LOG:
checkpoint complete: wrote 2156 buffers (0.2%); 0 transaction log
file(s) added, 0 removed, 9 recycled; write=149.402 s, sync=0.057 s,
total=149.461 s; sync files=610, longest=0.000 s, average=0.000 s
Mar 13 01:09:08 dbserver postgres[5298]: [60-1] db=,user= LOG:
checkpoint starting: time


I'm pretty certain that unmounting the file system and running fsck will
regain the lost space, but will it stop there?

Stopping postgresql briefly did not help, I tried that. That would have
helped if the files were open, but they're not. It seems that postgresql
did the right thing, and FreeBSD failed to unreference the files.

The server has about 30 databases and ~127 concurrent connections (not
all being active simultaneously, though), so it is fair to say it is
pretty active, but nothing extreme.

Hardware is an HP DL360, using their HP Smart Array P410i.

Any ideas how to debug this? Or shall I just reboot, fsck, hope the
problem will go away, and when it does, forget about it?

Thanks,
Palle
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJRQFORAAoJEIhV+7FrxBJDzVUIAJHU011JDxLxj8/xg05Gwhgq
XK3xB+0N0NSUQ50yhcRKLINz/j/XfeS0ZxlH+MstaPA9y0r1JUXMxkb/uTUvGBiy
jutk3eVe0cati9cVZbJkRU5FxEgmQ0fg0GOMl3RQAErkh5achj+klWvN7PnwGjTs
O3L9RgckKuxTJffk52GAS05qY/TKR6f08kdX3I2cFtqw3tyTyrXU0JPdk2snuPhv
H40xV46zgtWMFDvZLt61MryQ7/JotVQwU78scUB+zxrf8KKM9V0mM7pk0pIbG4Qw
NJBpZJ5gjbl4x+dkQrtZdL65yq88hACYwo9D+83Ct4ig8tgcQ7ViNHWxJqknK7Q=
=3ZZs
-----END PGP SIGNATURE-----
_______________________________________________
freebsd-fs@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"

From owner-freebsd-fs@FreeBSD.ORG  Wed Mar 13 19:20:35 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 22B1476F;
 Wed, 13 Mar 2013 19:20:35 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net
 [IPv6:2001:470:1f10:75::2])
 by mx1.freebsd.org (Postfix) with ESMTP id AF1F3129;
 Wed, 13 Mar 2013 19:20:33 +0000 (UTC)
Received: from jhbbsd.localnet (unknown [209.249.190.124])
 by bigwig.baldwin.cx (Postfix) with ESMTPSA id E70C5B999;
 Wed, 13 Mar 2013 15:20:32 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: fs@freebsd.org
Subject: Deadlock in the NFS client
Date: Wed, 13 Mar 2013 13:56:37 -0400
User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p25; KDE/4.5.5; amd64; ; )
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="us-ascii"
Content-Transfer-Encoding: 7bit
Message-Id: <201303131356.37919.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
 (bigwig.baldwin.cx); Wed, 13 Mar 2013 15:20:33 -0400 (EDT)
Cc: Rick Macklem <rmacklem@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 13 Mar 2013 19:20:35 -0000

I ran into a machine that had a deadlock among certain files on a given NFS 
mount today.  I'm not sure how best to resolve it, though it seems like 
perhaps there is a bug with how the pool of nfsiod threads is managed.  
Anyway, more details on the actual hang below.  This was on 8.x with the
old NFS client, but I don't see anything in HEAD that would fix this.

First note that the system was idle so it had dropped down to only one
nfsiod thread.

The nfsiod thread is hung on a vnode lock:

(kgdb) proc 36927
[Switching to thread 150 (Thread 100679)]#0  sched_switch (
    td=0xffffff0320de88c0, newtd=0xffffff0003521460, flags=Variable "flags" is 
not available.
)
    at /usr/src/sys/kern/sched_ule.c:1898
1898                    cpuid = PCPU_GET(cpuid);
(kgdb) where
#0  sched_switch (td=0xffffff0320de88c0, newtd=0xffffff0003521460, 
flags=Variable "flags" is not available.
)
    at /usr/src/sys/kern/sched_ule.c:1898
#1  0xffffffff80407953 in mi_switch (flags=260, newtd=0x0)
    at /usr/src/sys/kern/kern_synch.c:449
#2  0xffffffff8043e342 in sleepq_wait (wchan=0xffffff0358bbb7f8, pri=96)
    at /usr/src/sys/kern/subr_sleepqueue.c:629
#3  0xffffffff803e5755 in __lockmgr_args (lk=0xffffff0358bbb7f8, flags=524544, 
    ilk=0xffffff0358bbb820, wmesg=Variable "wmesg" is not available.
) at /usr/src/sys/kern/kern_lock.c:220
#4  0xffffffff80489219 in vop_stdlock (ap=Variable "ap" is not available.
) at lockmgr.h:94
#5  0xffffffff80697322 in VOP_LOCK1_APV (vop=0xffffffff80892b00, 
    a=0xffffff847ac10600) at vnode_if.c:1988
#6  0xffffffff804a8bb7 in _vn_lock (vp=0xffffff0358bbb760, flags=524288, 
    file=0xffffffff806fa421 "/usr/src/sys/kern/vfs_subr.c", line=2138)
    at vnode_if.h:859
#7  0xffffffff8049b680 in vget (vp=0xffffff0358bbb760, flags=524544, 
    td=0xffffff0320de88c0) at /usr/src/sys/kern/vfs_subr.c:2138
#8  0xffffffff8048d4aa in vfs_hash_get (mp=0xffffff004a3a0000, hash=227722108, 
    flags=Variable "flags" is not available.
) at /usr/src/sys/kern/vfs_hash.c:81
#9  0xffffffff805631f6 in nfs_nget (mntp=0xffffff004a3a0000, 
    fhp=0xffffff03771eed56, fhsize=32, npp=0xffffff847ac10a40, flags=524288)
    at /usr/src/sys/nfsclient/nfs_node.c:120
#10 0xffffffff80570229 in nfs_readdirplusrpc (vp=0xffffff0179902760, 
    uiop=0xffffff847ac10ad0, cred=0xffffff005587c300)
    at /usr/src/sys/nfsclient/nfs_vnops.c:2636
---Type <return> to continue, or q <return> to quit---
#11 0xffffffff8055f144 in nfs_doio (vp=0xffffff0179902760, 
    bp=0xffffff83e05c5860, cr=0xffffff005587c300, td=Variable "td" is not 
available.
)
    at /usr/src/sys/nfsclient/nfs_bio.c:1600
#12 0xffffffff8056770a in nfssvc_iod (instance=Variable "instance" is not 
available.
)
    at /usr/src/sys/nfsclient/nfs_nfsiod.c:303
#13 0xffffffff803d0c2f in fork_exit (callout=0xffffffff805674b0 <nfssvc_iod>, 
    arg=0xffffffff809266e0, frame=0xffffff847ac10c40)
    at /usr/src/sys/kern/kern_fork.c:861

Thread stuck in getblk for that vnode (holds shared lock on this vnode):

(kgdb) proc 36902
[Switching to thread 149 (Thread 101543)]#0  sched_switch (
    td=0xffffff0378d8a8c0, newtd=0xffffff0003521460, flags=Variable "flags" is 
not available.
)
    at /usr/src/sys/kern/sched_ule.c:1898
1898                    cpuid = PCPU_GET(cpuid);
(kgdb) where
#0  sched_switch (td=0xffffff0378d8a8c0, newtd=0xffffff0003521460, 
flags=Variable "flags" is not available.
)
    at /usr/src/sys/kern/sched_ule.c:1898
#1  0xffffffff80407953 in mi_switch (flags=260, newtd=0x0)
    at /usr/src/sys/kern/kern_synch.c:449
#2  0xffffffff8043e342 in sleepq_wait (wchan=0xffffff83e11bc1c0, pri=96)
    at /usr/src/sys/kern/subr_sleepqueue.c:629
#3  0xffffffff803e5755 in __lockmgr_args (lk=0xffffff83e11bc1c0, flags=530688, 
    ilk=0xffffff0358bbb878, wmesg=Variable "wmesg" is not available.
) at /usr/src/sys/kern/kern_lock.c:220
#4  0xffffffff80483aeb in getblk (vp=0xffffff0358bbb760, blkno=3, size=32768, 
    slpflag=0, slptimeo=0, flags=0) at lockmgr.h:94
#5  0xffffffff8055e963 in nfs_getcacheblk (vp=0xffffff0358bbb760, bn=3, 
    size=32768, td=0xffffff0378d8a8c0) at 
/usr/src/sys/nfsclient/nfs_bio.c:1259
#6  0xffffffff805627d9 in nfs_bioread (vp=0xffffff0358bbb760, 
    uio=0xffffff847bcf0ad0, ioflag=Variable "ioflag" is not available.
) at /usr/src/sys/nfsclient/nfs_bio.c:530
#7  0xffffffff806956a4 in VOP_READ_APV (vop=0xffffffff808a29a0, 
    a=0xffffff847bcf09c0) at vnode_if.c:887
#8  0xffffffff804a9e27 in vn_read (fp=0xffffff03b6a506e0, 
    uio=0xffffff847bcf0ad0, active_cred=Variable "active_cred" is not 
available.
) at vnode_if.h:384
#9  0xffffffff80444fb1 in dofileread (td=0xffffff0378d8a8c0, fd=3, 
    fp=0xffffff03b6a506e0, auio=0xffffff847bcf0ad0, offset=Variable "offset" 
is not available.
) at file.h:242

The buffer is locked by LK_KERNPROC:

(kgdb) bprint bp
0xffffff83e11bc128: BIO_READ flags (ASYNC|VMIO)
error = 0, bufsize = 32768, bcount = 32768, b_resid = 0
bufobj = 0xffffff0358bbb878, data = 0xffffff8435a05000, blkno = d, dep = 0x0
 lock type bufwait: EXCL by LK_KERNPROC with exclusive waiters pending

And this buffer is queued as the first pending buffer on the mount waiting
for service by nfsiod:

(kgdb) set $nmp = (struct nfsmount *)vp->v_mount->mnt_data
(kgdb) p $nmp->nm_bufq.tqh_first
$24 = (struct buf *) 0xffffff83e11bc128
(kgdb) p bp
$25 = (struct buf *) 0xffffff83e11bc128

So, the first process is waiting for a block from an NFS directory.  That 
block is in queue to be completed as an async I/O by the nfsiod thread pool.  
However, the lone nfsiod thread in the pool is waiting to exclusively lock the 
original NFS directory to update its attributes, so it cannot service the 
async I/O request.
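
The circular wait described above can be written down as a small wait-for
graph. The following is only an illustrative sketch in Python, not kernel
code: the node names are labels taken from the kgdb output, and
find_cycle() is a hypothetical helper that assumes each party waits on at
most one other.

```python
def find_cycle(waits_for):
    """Return the nodes forming a cycle in a wait-for graph, or None."""
    for start in waits_for:
        path, node = [], start
        # Follow the single outgoing edge until it ends or repeats.
        while node is not None and node not in path:
            path.append(node)
            node = waits_for.get(node)
        if node is not None:
            return path[path.index(node):]
    return None

# Each key waits for (or is held up by) its value.
waits_for = {
    "reader (pid 36902)": "buffer 0xffffff83e11bc128",
    "buffer 0xffffff83e11bc128": "nfsiod (pid 36927)",
    "nfsiod (pid 36927)": "vnode 0xffffff0358bbb760",
    "vnode 0xffffff0358bbb760": "reader (pid 36902)",
}

cycle = find_cycle(waits_for)
print(" -> ".join(cycle + cycle[:1]))
```

With a second nfsiod thread in the pool the buffer would no longer depend
on the blocked thread, and the cycle would be broken.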

-- 
John Baldwin

From owner-freebsd-fs@FreeBSD.ORG  Wed Mar 13 21:45:58 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 703CE795
 for <freebsd-fs@freebsd.org>; Wed, 13 Mar 2013 21:45:58 +0000 (UTC)
 (envelope-from cr@caltel.com)
Received: from mail1.caltel.com (mail1.caltel.com [66.102.144.6])
 by mx1.freebsd.org (Postfix) with ESMTP id 53895E70
 for <freebsd-fs@freebsd.org>; Wed, 13 Mar 2013 21:45:58 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqAEAFzyQFFCZpCq/2dsb2JhbABDxFaBb3SCKgEBAQMBAQI1QAYLCxgJFg8JAwIBAgEWLxMIAQGICgYMwxuNYIE3g0ADiHSLJoI+gR+ES4sYgyoc
X-IPAS-Result: AqAEAFzyQFFCZpCq/2dsb2JhbABDxFaBb3SCKgEBAQMBAQI1QAYLCxgJFg8JAwIBAgEWLxMIAQGICgYMwxuNYIE3g0ADiHSLJoI+gR+ES4sYgyoc
X-IronPort-AV: E=Sophos;i="4.84,840,1355126400"; d="scan'208";a="17130627"
Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local)
 ([66.102.144.170])
 by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA;
 13 Mar 2013 14:45:23 -0700
Message-ID: <5140F373.1010907@caltel.com>
Date: Wed, 13 Mar 2013 14:45:23 -0700
From: Cody Ritts <cr@caltel.com>
Organization: CalTel
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6;
 rv:17.0) Gecko/20130307 Thunderbird/17.0.4
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com>
 <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
 <513CE369.4030303@caltel.com>
 <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
 <1362951595.99445.2.camel@btw.pki2.com>
 <alpine.BSF.2.00.1303101807550.8481@wonkity.com>
 <CABXB=RTt-j0SGxktWMfLcgLAEN6Vi+f=psBuN0jQaJthk_3cbw@mail.gmail.com>
 <513E1208.5020804@caltel.com> <20130312203745.A1130@besplex.bde.org>
 <513F8F04.60206@caltel.com> <20130313232247.B1078@besplex.bde.org>
In-Reply-To: <20130313232247.B1078@besplex.bde.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 13 Mar 2013 21:45:58 -0000

Holy crap, there is so much baggage with CHS, I had no idea we were 
still dragging it along to this extent.  I really appreciate you 
emphasizing its importance, I was ready to just blow off the incorrect 
geometry and be happy with my hacked sector start.

So here is my drive...
> root@:/root # diskinfo -v ada0
> ada0
> 	512         	# sectorsize
> 	64023257088 	# mediasize in bytes (59G)
> 	125045424   	# mediasize in sectors
> 	0           	# stripesize
> 	0           	# stripeoffset
> 	124053      	# Cylinders according to firmware.
> 	16          	# Heads according to firmware.
> 	63          	# Sectors according to firmware.


So, if I now want to create an aligned single partition, here are the 
steps I think I should be taking:

Sectors should be < 64
Heads should be  < 256
for OLD OLD stuff, cylinders should be < 1024

If you want boundaries on a power of 2, then the number of sectors and 
heads should also be a power of 2.


So, would all of these be potential valid values?

s32  h128
512*32*128 = 2097152B = 2MB cylinder
s32  h64
512*32*64  = 1048576B = 1MB cylinder
s16  h128
512*16*128 = 1048576B = 1MB cylinder
s4  h4
512*4*4   = 8192B  = 8K cylinder


I am assuming that once I know my cylinder size, I just divide the total 
size of my hard drive to come up with cylinder count?

s4  h4
64023257088 / 8192 = 7815339c
(8k is the largest power of 2 that the drive will evenly divide into)

s32  h64
64023257088 / 1048576 = 61057.3359375
Round down to 61057.
(does the cylinder need to end on the end of the disk?)

So, here is what i calculated:
c61057 h64 s32

I want an offset of 2M, so the file system should be reduced to 61055M
   ((61055 * 1024 * 1024) / 512 = 125040640s)
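
The arithmetic above can be checked with a short Python sketch. It assumes
512-byte sectors and the media size reported by diskinfo; geometry() is a
hypothetical helper name, and the 8K case uses 4 heads and 4 sectors, as
in the 512*4*4 calculation.

```python
SECTOR = 512
MEDIA_BYTES = 64023257088  # "mediasize in bytes" from diskinfo -v ada0

def geometry(heads, sectors, media_bytes=MEDIA_BYTES):
    """Return (cylinder size in bytes, whole cylinders, leftover bytes)."""
    cyl_bytes = SECTOR * sectors * heads
    cylinders, leftover = divmod(media_bytes, cyl_bytes)
    return cyl_bytes, cylinders, leftover

for heads, sectors in [(128, 32), (64, 32), (128, 16), (4, 4)]:
    cyl_bytes, cylinders, leftover = geometry(heads, sectors)
    print(f"h{heads} s{sectors}: {cyl_bytes} B/cyl, "
          f"{cylinders} cylinders, {leftover // SECTOR} sectors left over")

# Partition size with a 2-cylinder (2M) offset on the h64 s32 geometry.
_, cylinders, _ = geometry(64, 32)
part_sectors = (cylinders - 2) * 64 * 32
print(part_sectors)
```

For h64 s32 the leftover is 688 sectors, which is the same 688-sector free
area that gpart reports at the end of the disk.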


Here are the commands that I ran:

> cat << EOF > command
> g c61057 h64 s32
> p 1 0xa5 4096 125040640
> a 1
> EOF
> root@:/root # fdisk -f command ada0
> ******* Working on device /dev/ada0 *******
> fdisk: WARNING line 1: number of cylinders (61057) may be out-of-range
>     (must be within 1-1024 for normal BIOS operation, unless the entire disk
>     is dedicated to FreeBSD)
> root@:/root # fdisk -p ada0
> # /dev/ada0
> g c124053 h16 s63
> p 1 0xa5 4096 125040640
> a 1
Note: fdisk -p goes back to the firmware geometry when exporting.

> root@:/root # gpart show ada0
> =>       63  125045361  ada0  MBR  (59G)
>          63       4033        - free -  (2M)
>        4096  125040640     1  freebsd  [active]  (59G)
>   125044736        688        - free -  (344k)
> root@:/root # gpart delete -i 1 ada0
> root@:/root # gpart add -t freebsd -b 4096 -s 125040640 ada0
> ada0s1 added
> root@:/root # gpart show ada0
> =>       63  125045361  ada0  MBR  (59G)
>          63       4095        - free -  (2M)
>        4158  125040573     1  freebsd  (59G)
>   125044731        693        - free -  (346k)
gpart does not care
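
The start and size that gpart chose are consistent with rounding to the
63-sector track of the firmware geometry. This is inferred from the
numbers only, not from the gpart source, so treat the Python sketch below
as an assumption; align_to_track() is a hypothetical name.

```python
def align_to_track(start, size, track=63):
    """Round start up and end down to a multiple of the track size."""
    end = start + size
    new_start = -(-start // track) * track  # ceiling to a track boundary
    new_end = (end // track) * track        # floor to a track boundary
    return new_start, new_end - new_start

# Requested: gpart add -t freebsd -b 4096 -s 125040640 ada0
new_start, new_size = align_to_track(4096, 125040640)
print(new_start, new_size)
```

The result matches the start of 4158 and size of 125040573 in the gpart
show output above.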

> root@:/root # fdisk -f command ada0
> ******* Working on device /dev/ada0 *******
> fdisk: WARNING line 1: number of cylinders (61057) may be out-of-range
>     (must be within 1-1024 for normal BIOS operation, unless the entire disk
>     is dedicated to FreeBSD)
> root@:/root # fdisk ada0
> ******* Working on device /dev/ada0 *******
> parameters extracted from in-core disklabel are:
> cylinders=124053 heads=16 sectors/track=63 (1008 blks/cyl)
>
> Figures below won't work with BIOS for partitions not in cyl 1
> parameters to be used for BIOS calculations are:
> cylinders=124053 heads=16 sectors/track=63 (1008 blks/cyl)
>
> Media sector size is 512
> Warning: BIOS sector numbering starts with sector 1
> Information from DOS bootblock is:
> The data for partition 1 is:
> sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
>     start 4096, size 125040640 (61055 Meg), flag 80 (active)
> 	beg: cyl 2/ head 0/ sector 1;
> 	end: cyl 640/ head 63/ sector 32
> The data for partition 2 is:
> <UNUSED>
> The data for partition 3 is:
> <UNUSED>
> The data for partition 4 is:
> <UNUSED>


So, setting the geom simply does this:
>> sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
>>     start 4096, size 125040640 (61055 Meg), flag 80 (active)
>> 	beg: cyl 2/ head 0/ sector 1;
>> 	end: cyl 640/ head 63/ sector 32
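
As a cross-check, the beg/end values can be reproduced from the start and
size with the usual LBA-to-CHS conversion. This Python sketch assumes the
h64 s32 geometry from the command file; lba_to_chs() is a hypothetical
helper. That the end cylinder is kept modulo 1024 (the MBR field is 10
bits wide) is an inference from the 640 printed above.

```python
def lba_to_chs(lba, heads, sectors):
    """Convert a 0-based LBA to (cylinder, head, sector); sectors are 1-based."""
    cyl, rem = divmod(lba, heads * sectors)
    head, sec = divmod(rem, sectors)
    return cyl, head, sec + 1

start, size = 4096, 125040640
beg = lba_to_chs(start, 64, 32)
end = lba_to_chs(start + size - 1, 64, 32)
end_cyl_10bit = end[0] % 1024  # what fits in the 10-bit cylinder field

print(beg)            # first sector of the partition
print(end)            # last sector of the partition
print(end_cyl_10bit)  # cylinder number as it appears in the fdisk output
```

So the partition really ends on cylinder 61056 under this geometry, and
only the low 10 bits of that number show up in the table.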


I cannot set geom in my bios, nor does it show me what it thinks geom 
is.  Obviously anything that only supports 1024 cylinders will not think 
it is very funny.


I feel like I am missing some part of this puzzle, or is that all there 
is to correcting geom for proper alignment on an MBR?

So, by setting those CHS values I am:
   making the partition table more compatible with other operating 
systems and BIOSes?
   and giving some utilities the CHS stuff they need to function right?



Thanks,

Cody


On 3/13/13 6:33 AM, Bruce Evans wrote:
> On Tue, 12 Mar 2013, Cody Ritts wrote:
>
>> On 3/12/13 3:25 AM, Bruce Evans wrote:
>>>> Update --
>>>>
>>>> fdisk WILL allow you to align without regards to drive geometry
>>>>
>>>> It can only be done in interactive mode:
>>>> http://lists.freebsd.org/pipermail/freebsd-geom/2011-May/004780.html
>>>
>>> It can be set in all modes.  At least according to the man page:
>>
>> In interactive mode, you can simply set the start and size of your
>> partition, be done with it, and boot away.
>
> That will usually give nonsense beginning and ending CHS values.  This will
> usually prevent BIOSes and non-broken versions of FreeBSD from detecting
> the geometry.  It would also prevent booting on BIOSes that use the CHS
> values; this is not usually a problem now.  So this misconfiguration will
> mainly mess up printing of the partition table in utilities that print
> the CHS values, and cause warnings in utilities that want the CHS values
> to be correct, and propagate the misconfiguration.
>
>> If you then export that config file, and re-run it, you are back to
>> being aligned to CHS.
>
> Only if the config file is broken.
>
>> I would imagine that adjusting your CHS ~correctly~ in the config file
>> will allow you to do it, but I have not found myself motivated to
>> really learn about adjusting those values.  I will have to pick that
>> up someday I suppose.
>
> Yes this is necessary and very easy.  Just type in the same geometry that
> you need to type in at the start of interactive fdisk to avoid the above
> problems.
>
>> For informational purposes here is
>> 1) partition w/ offset
>> 2) show results
>> 3) export/import config
>> 4) show results with adjusted offset
>>
>>> root@:/root # fdisk -i ada0
>>> Do you want to change our idea of what BIOS thinks ? [n]
>
> You do want to change this.  Use something like 32 sectors 64 heads
> (1MB cylinders).
>
>>> Do you want to change it? [n] y
>>> Supply a decimal value for "sysid (165=FreeBSD)" [165]
>>> Supply a decimal value for "start" [63] 4096
>>> Supply a decimal value for "size" [125045361] 125041328
>>> Correct this automatically? [n]
>
> After changing "what BIOS thinks" to something like the above, fdisk
> shouldn't see anything to correct.  The start of 4096 is a multiple
> of 32*64 = 2048.
>
> I just noticed that despite being too chatty, "what BIOS thinks"
> has bad grammar.
>
>>> Explicitly specify beg/end address ? [n]
>>> Are we happy with this entry? [n] y
>>> Do you want to change it? [n]
>>> Do you want to change it? [n]
>>> Do you want to change it? [n]
>>> Do you want to change the active partition? [n]
>>> Should we write new partition table? [n] y
>>>
>>> root@:/root # gpart show ada0
>>> =>       63  125045361  ada0  MBR  (59G)
>>>          63       4033        - free -  (2M)
>>>        4096  125041328     1  freebsd  [active]  (59G)
>
> Looks like it got a bogus 63 from the same place as fdisk (from the
> "firmware" geometry).
>
> The free area isn't really 4033 blocks starting at 63, but 4095 blocks
> starting at 1 (for non-broken BIOSes and OSes).  I used to start
> partitions at offset 1, but now use 63 for portability.  However, 63
> isn't portable either.  The BIOS or another OS might have a different
> idea of the geometry.
>
>>> root@:/root # fdisk -p ada0
>>> # /dev/ada0
>>> g c124053 h16 s63
>>> p 1 0xa5 4096 125041328
>>> a 1
>
> This shows fdisk -p using the same garbage default geometry as interactive
> fdisk.  So fdisk -p output is not directly usable.
>
> There is no place in the partition table to store the geometry directly.
> It can sometimes be determined indirectly, but neither the kernel nor
> fdisk does so.  Old kernels did so, and fdisk depended on this.
> However, fdisk shouldn't depend on this, so that fdisk can work on
> images of partition tables.  Linux fdisk (fdisk-linux in ports) does
> this correctly.  Not using OS-specific ioctls for this also improves
> portability.  The kernel support for this was mainly to ensure that
> all FreeBSD utilities got a consistent view of the geometry, back when
> consistent views mattered.  It's still confusing when the views are
> different.
>
>>> root@:/root # fdisk -p ada0 > command
>>>
>>> root@:/root # fdisk -f command ada0
>>> ******* Working on device /dev/ada0 *******
>>> fdisk: WARNING line 2: number of cylinders (124053) may be out-of-range
>>>     (must be within 1-1024 for normal BIOS operation, unless the
>>> entire disk
>>>     is dedicated to FreeBSD)
>>> fdisk: WARNING: adjusting start offset of partition 1
>>>     from 4096 to 4158, to fall on a head boundary
>>> fdisk: WARNING: adjusting size of partition 1 from 125041328 to
>>> 125041266
>>>     to end on a cylinder boundary
>
> This is broken.  The non-interactive version should not adjust anything.
> Even the interactive version defaults to not adjusting.
>
>>> root@:/root # fdisk -p ada0
>>> # /dev/ada0
>>> g c124053 h16 s63
>>> p 1 0xa5 4158 125041266
>>> a 1
>>>
>>> root@:/root # gpart show ada0
>>> =>       63  125045361  ada0  MBR  (59G)
>>>          63       4095        - free -  (2M)
>>>        4158  125041266     1  freebsd  [active]  (59G)
>
> So the bugs of fdisk -p producing wrong geometry, not editing its output
> to fix this, and the broken adjustment done by fdisk -f, result in the
> partition offsets being corrupted by fdisk -f.
>
> Bruce
>

From owner-freebsd-fs@FreeBSD.ORG  Wed Mar 13 23:33:37 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 64B8B38F;
 Wed, 13 Mar 2013 23:33:37 +0000 (UTC)
 (envelope-from rmacklem@uoguelph.ca)
Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca
 [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id E59E1DD2;
 Wed, 13 Mar 2013 23:33:36 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqEEABoMQVGDaFvO/2dsb2JhbAA7CIgkuV6CXYFwdIIqAQEEASMEUgUWDgoCAg0ZAlkGiCEGr26SQxeBI4w4gQE0B4ItgRMDlliRAoMmIIFs
X-IronPort-AV: E=Sophos;i="4.84,840,1355115600"; d="scan'208";a="21123471"
Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca)
 ([131.104.91.206])
 by esa-jnhn.mail.uoguelph.ca with ESMTP; 13 Mar 2013 19:33:35 -0400
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1])
 by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 6C4EDB402D;
 Wed, 13 Mar 2013 19:33:35 -0400 (EDT)
Date: Wed, 13 Mar 2013 19:33:35 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: John Baldwin <jhb@freebsd.org>
Message-ID: <492562517.3880600.1363217615412.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <201303131356.37919.jhb@freebsd.org>
Subject: Re: Deadlock in the NFS client
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [172.17.91.203]
X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692)
Cc: Rick Macklem <rmacklem@freebsd.org>, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 13 Mar 2013 23:33:37 -0000

John Baldwin wrote:
> I ran into a machine that had a deadlock among certain files on a
> given NFS
> mount today. I'm not sure how best to resolve it, though it seems like
> perhaps there is a bug with how the pool of nfsiod threads is managed.
> Anyway, more details on the actual hang below. This was on 8.x with
> the
> old NFS client, but I don't see anything in HEAD that would fix this.
> 
> First note that the system was idle so it had dropped down to only one
> nfsiod thread.
> 
Hmm, I see the problem and I'm a bit surprised it doesn't bite more often.
It seems to me that this snippet of code from nfs_asyncio() makes too
weak an assumption:
	/*
	 * If none are free, we may already have an iod working on this mount
	 * point.  If so, it will process our request.
	 */
	if (!gotiod) {
		if (nmp->nm_bufqiods > 0) {
			NFS_DPF(ASYNCIO,
		("nfs_asyncio: %d iods are already processing mount %p\n",
				 nmp->nm_bufqiods, nmp));
			gotiod = TRUE;
		}
	}
It assumes that, since an nfsiod thread is processing some buffer for the
mount, it will become available to do this one, which isn't true for your
deadlock.

I think the simple fix would be to recode nfs_asyncio() so that it only
returns 0 if it finds an AVAILABLE nfsiod thread that it has assigned to
do the I/O, getting rid of the above.
The problem with doing this is that it may result in a lot more synchronous
I/O (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe more
synchronous I/O could be avoided by allowing nfs_asyncio() to create a new thread
even if the total is above nfs_iodmax. (I think this would require the fixed array
to be replaced with a linked list and might result in a large number of nfsiod
threads.) Maybe just having a large nfs_iodmax would be an adequate compromise?
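
To make the proposed rule concrete, here is a toy model in Python. It is
not the kernel code, only a sketch of the decision "return 0 only when an
idle nfsiod has been assigned"; assign_async_io() and the dictionary
layout are made up for illustration.

```python
EIO = 5  # errno value, used here to mean "caller must do the I/O itself"

def assign_async_io(iods, mount):
    """Hand the request to an idle iod, or return EIO if none is idle."""
    for iod in iods:
        if iod["mount"] is None:   # idle thread found
            iod["mount"] = mount   # assign it to this mount
            return 0
    # A busy iod on the same mount is NOT counted as available, since it
    # may itself be blocked, as in the deadlock reported above.
    return EIO

iods = [{"mount": "mnt_a"}]        # the lone nfsiod is already busy
rc_busy = assign_async_io(iods, "mnt_a")

iods.append({"mount": None})       # a second, idle nfsiod exists
rc_idle = assign_async_io(iods, "mnt_a")

print(rc_busy, rc_idle)
```

The cost of this rule is visible in the first call: with no idle thread
the request falls back to synchronous I/O in the caller.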

Does having a large # of nfsiod threads cause any serious problem for most
systems these days?

I'd be tempted to recode nfs_asyncio() as above and then, instead of nfs_iodmin
and nfs_iodmax, I'd simply have:
- a fixed number of nfsiod threads (this could be a tunable, with the
       understanding that it should be large for good performance)

rick

> The nfsiod thread is hung on a vnode lock:
> 
> (kgdb) proc 36927
> [Switching to thread 150 (Thread 100679)]#0 sched_switch (
> td=0xffffff0320de88c0, newtd=0xffffff0003521460, flags=Variable
> "flags" is
> not available.
> )
> at /usr/src/sys/kern/sched_ule.c:1898
> 1898 cpuid = PCPU_GET(cpuid);
> (kgdb) where
> #0 sched_switch (td=0xffffff0320de88c0, newtd=0xffffff0003521460,
> flags=Variable "flags" is not available.
> )
> at /usr/src/sys/kern/sched_ule.c:1898
> #1 0xffffffff80407953 in mi_switch (flags=260, newtd=0x0)
> at /usr/src/sys/kern/kern_synch.c:449
> #2 0xffffffff8043e342 in sleepq_wait (wchan=0xffffff0358bbb7f8,
> pri=96)
> at /usr/src/sys/kern/subr_sleepqueue.c:629
> #3 0xffffffff803e5755 in __lockmgr_args (lk=0xffffff0358bbb7f8,
> flags=524544,
> ilk=0xffffff0358bbb820, wmesg=Variable "wmesg" is not available.
> ) at /usr/src/sys/kern/kern_lock.c:220
> #4 0xffffffff80489219 in vop_stdlock (ap=Variable "ap" is not
> available.
> ) at lockmgr.h:94
> #5 0xffffffff80697322 in VOP_LOCK1_APV (vop=0xffffffff80892b00,
> a=0xffffff847ac10600) at vnode_if.c:1988
> #6 0xffffffff804a8bb7 in _vn_lock (vp=0xffffff0358bbb760,
> flags=524288,
> file=0xffffffff806fa421 "/usr/src/sys/kern/vfs_subr.c", line=2138)
> at vnode_if.h:859
> #7 0xffffffff8049b680 in vget (vp=0xffffff0358bbb760, flags=524544,
> td=0xffffff0320de88c0) at /usr/src/sys/kern/vfs_subr.c:2138
> #8 0xffffffff8048d4aa in vfs_hash_get (mp=0xffffff004a3a0000,
> hash=227722108,
> flags=Variable "flags" is not available.
> ) at /usr/src/sys/kern/vfs_hash.c:81
> #9 0xffffffff805631f6 in nfs_nget (mntp=0xffffff004a3a0000,
> fhp=0xffffff03771eed56, fhsize=32, npp=0xffffff847ac10a40,
> flags=524288)
> at /usr/src/sys/nfsclient/nfs_node.c:120
> #10 0xffffffff80570229 in nfs_readdirplusrpc (vp=0xffffff0179902760,
> uiop=0xffffff847ac10ad0, cred=0xffffff005587c300)
> at /usr/src/sys/nfsclient/nfs_vnops.c:2636
> ---Type <return> to continue, or q <return> to quit---
> #11 0xffffffff8055f144 in nfs_doio (vp=0xffffff0179902760,
> bp=0xffffff83e05c5860, cr=0xffffff005587c300, td=Variable "td" is not
> available.
> )
> at /usr/src/sys/nfsclient/nfs_bio.c:1600
> #12 0xffffffff8056770a in nfssvc_iod (instance=Variable "instance" is
> not
> available.
> )
> at /usr/src/sys/nfsclient/nfs_nfsiod.c:303
> #13 0xffffffff803d0c2f in fork_exit (callout=0xffffffff805674b0
> <nfssvc_iod>,
> arg=0xffffffff809266e0, frame=0xffffff847ac10c40)
> at /usr/src/sys/kern/kern_fork.c:861
> 
> Thread stuck in getblk for that vnode (holds shared lock on this
> vnode):
> 
> (kgdb) proc 36902
> [Switching to thread 149 (Thread 101543)]#0 sched_switch (
> td=0xffffff0378d8a8c0, newtd=0xffffff0003521460, flags=Variable
> "flags" is
> not available.
> )
> at /usr/src/sys/kern/sched_ule.c:1898
> 1898 cpuid = PCPU_GET(cpuid);
> (kgdb) where
> #0 sched_switch (td=0xffffff0378d8a8c0, newtd=0xffffff0003521460,
> flags=Variable "flags" is not available.
> )
> at /usr/src/sys/kern/sched_ule.c:1898
> #1 0xffffffff80407953 in mi_switch (flags=260, newtd=0x0)
> at /usr/src/sys/kern/kern_synch.c:449
> #2 0xffffffff8043e342 in sleepq_wait (wchan=0xffffff83e11bc1c0,
> pri=96)
> at /usr/src/sys/kern/subr_sleepqueue.c:629
> #3 0xffffffff803e5755 in __lockmgr_args (lk=0xffffff83e11bc1c0,
> flags=530688,
> ilk=0xffffff0358bbb878, wmesg=Variable "wmesg" is not available.
> ) at /usr/src/sys/kern/kern_lock.c:220
> #4 0xffffffff80483aeb in getblk (vp=0xffffff0358bbb760, blkno=3,
> size=32768,
> slpflag=0, slptimeo=0, flags=0) at lockmgr.h:94
> #5 0xffffffff8055e963 in nfs_getcacheblk (vp=0xffffff0358bbb760, bn=3,
> size=32768, td=0xffffff0378d8a8c0) at
> /usr/src/sys/nfsclient/nfs_bio.c:1259
> #6 0xffffffff805627d9 in nfs_bioread (vp=0xffffff0358bbb760,
> uio=0xffffff847bcf0ad0, ioflag=Variable "ioflag" is not available.
> ) at /usr/src/sys/nfsclient/nfs_bio.c:530
> #7 0xffffffff806956a4 in VOP_READ_APV (vop=0xffffffff808a29a0,
> a=0xffffff847bcf09c0) at vnode_if.c:887
> #8 0xffffffff804a9e27 in vn_read (fp=0xffffff03b6a506e0,
> uio=0xffffff847bcf0ad0, active_cred=Variable "active_cred" is not
> available.
> ) at vnode_if.h:384
> #9 0xffffffff80444fb1 in dofileread (td=0xffffff0378d8a8c0, fd=3,
> fp=0xffffff03b6a506e0, auio=0xffffff847bcf0ad0, offset=Variable
> "offset"
> is not available.
> ) at file.h:242
> 
> The buffer is locked by LK_KERNPROC:
> 
> (kgdb) bprint bp
> 0xffffff83e11bc128: BIO_READ flags (ASYNC|VMIO)
> error = 0, bufsize = 32768, bcount = 32768, b_resid = 0
> bufobj = 0xffffff0358bbb878, data = 0xffffff8435a05000, blkno = d, dep
> = 0x0
> lock type bufwait: EXCL by LK_KERNPROC with exclusive waiters pending
> 
> And this buffer is queued as the first pending buffer on the mount
> waiting
> for service by nfsiod:
> 
> (kgdb) set $nmp = (struct nfsmount *)vp->v_mount->mnt_data
> (kgdb) p $nmp->nm_bufq.tqh_first
> $24 = (struct buf *) 0xffffff83e11bc128
> (kgdb) p bp
> $25 = (struct buf *) 0xffffff83e11bc128
> 
> So, the first process is waiting for a block from an NFS directory.
> That
> block is in queue to be completed as an async I/O by the nfsiod thread
> pool.
> However, the lone nfsiod thread in the pool is waiting to exclusively
> lock the
> original NFS directory to update its attributes, so it cannot service
> the
> async I/O request.
> 
> --
> John Baldwin

From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 01:20:30 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 4B5DB987;
 Thu, 14 Mar 2013 01:20:30 +0000 (UTC)
 (envelope-from rmacklem@uoguelph.ca)
Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca
 [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id E1B22D42;
 Thu, 14 Mar 2013 01:20:29 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqAEAHEkQVGDaFvO/2dsb2JhbABDiCi8PIF0dIIqAQEFIwRSGw4KAgINGQJZBognrzGSVIEjjTk0B4ItgRMDlliRAoMmIIFs
X-IronPort-AV: E=Sophos;i="4.84,840,1355115600"; d="scan'208";a="21133806"
Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca)
 ([131.104.91.206])
 by esa-jnhn.mail.uoguelph.ca with ESMTP; 13 Mar 2013 21:20:28 -0400
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1])
 by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 829BAB4023;
 Wed, 13 Mar 2013 21:20:28 -0400 (EDT)
Date: Wed, 13 Mar 2013 21:20:28 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: John Baldwin <jhb@freebsd.org>
Message-ID: <1040319431.3883577.1363224028494.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <201303131356.37919.jhb@freebsd.org>
Subject: Re: Deadlock in the NFS client
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [172.17.91.203]
X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692)
Cc: Rick Macklem <rmacklem@freebsd.org>, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 01:20:30 -0000

I wrote:
> Does having a large # of nfsiod threads cause any serious problem for most
> systems these days?
>
> I'd be tempted to recode nfs_asyncio() as above and then, instead of nfs_iodmin
> and nfs_iodmax, I'd simply have:
> - a fixed number of nfsiod threads (this could be a tunable, with the
>        understanding that it should be large for good performance)

I'm probably getting ahead of myself here, since changing nfs_asyncio() may or
may not fix the deadlock, but I thought I'd comment further on the above.

It may be possible to add a new nfs_iod_target (the desired # of nfsiod threads)
and adjust it dynamically, based on how often nfs_asyncio() returns EIO versus 0
(the ratio #EIO/#0):
  --> when there are too many EIO returns, increase nfs_iod_target
  --> when there are very few EIO returns, decrease nfs_iod_target
- Use nfs_iodmin and nfs_iodmax as the limits for nfs_iod_target. By default,
  set nfs_iodmax much larger than it currently is and set nfs_iod_target to
  the current default value of nfs_iodmax.
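
A minimal sketch of that feedback loop, written as plain C so it can be
compiled outside the kernel. Everything in it is hypothetical: struct iod_ctl,
the iod_ctl_*() functions and the 10%/1% thresholds are made up for
illustration. Only nfs_iodmin, nfs_iodmax, nfs_iod_target and nfs_asyncio()
come from the discussion above. It counts EIO and 0 returns over a window and
moves the target by one thread per window, clamped to the limits.

```c
#include <assert.h>

/* Hypothetical controller state; none of these names exist in the tree. */
struct iod_ctl {
	int iod_min;		/* lower limit, the role of nfs_iodmin */
	int iod_max;		/* upper limit, the role of nfs_iodmax */
	int iod_target;		/* desired # of nfsiod threads */
	unsigned n_eio;		/* nfs_asyncio() returned EIO in this window */
	unsigned n_ok;		/* nfs_asyncio() returned 0 in this window */
};

void
iod_ctl_init(struct iod_ctl *c, int min, int max, int start)
{
	assert(min <= start && start <= max);
	c->iod_min = min;
	c->iod_max = max;
	c->iod_target = start;
	c->n_eio = c->n_ok = 0;
}

/* Record one nfs_asyncio() outcome; failed != 0 stands for an EIO return. */
void
iod_ctl_record(struct iod_ctl *c, int failed)
{
	if (failed)
		c->n_eio++;
	else
		c->n_ok++;
}

/*
 * Close the current window and adjust the target by at most one thread.
 * Assumed thresholds: grow when more than 10% of the calls returned EIO,
 * shrink when fewer than 1% did. An empty window leaves the target alone.
 */
int
iod_ctl_adjust(struct iod_ctl *c)
{
	unsigned total = c->n_eio + c->n_ok;

	if (total > 0) {
		if (c->n_eio * 10 > total) {
			if (c->iod_target < c->iod_max)
				c->iod_target++;
		} else if (c->n_eio * 100 < total) {
			if (c->iod_target > c->iod_min)
				c->iod_target--;
		}
	}
	c->n_eio = c->n_ok = 0;
	return (c->iod_target);
}
```

The caller would record each nfs_asyncio() result and call iod_ctl_adjust()
periodically. How long a window should be, and whether one thread per step
reacts fast enough, would need measuring.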

rick


From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 03:45:07 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 99794467;
 Thu, 14 Mar 2013 03:45:07 +0000 (UTC)
 (envelope-from rmacklem@uoguelph.ca)
Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca
 [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 0C359F9D;
 Thu, 14 Mar 2013 03:45:06 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqEEAOBGQVGDaFvO/2dsb2JhbAA7CIgxuV+CXYF7dIIqAQEEASMEUgUWDgoCAg0ZAlkGiCEGrxuSVYEjjDiBATQHgi2BEwOWWJECgyYggWw
X-IronPort-AV: E=Sophos;i="4.84,842,1355115600"; d="scan'208";a="18907384"
Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca)
 ([131.104.91.206])
 by esa-annu.net.uoguelph.ca with ESMTP; 13 Mar 2013 23:45:05 -0400
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1])
 by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id AEA36B4022;
 Wed, 13 Mar 2013 23:45:05 -0400 (EDT)
Date: Wed, 13 Mar 2013 23:45:05 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: John Baldwin <jhb@freebsd.org>
Message-ID: <1881919310.3887914.1363232705678.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <201303131356.37919.jhb@freebsd.org>
Subject: Re: Deadlock in the NFS client
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [172.17.91.201]
X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692)
Cc: Rick Macklem <rmacklem@freebsd.org>, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 03:45:07 -0000

John Baldwin wrote:
> I ran into a machine that had a deadlock among certain files on a
> given NFS
> mount today. I'm not sure how best to resolve it, though it seems like
> perhaps there is a bug with how the pool of nfsiod threads is managed.
> Anyway, more details on the actual hang below. This was on 8.x with
> the
> old NFS client, but I don't see anything in HEAD that would fix this.
> 
> First note that the system was idle so it had dropped down to only one
> nfsiod thread.
> 
Oh, and one more thing...
- I think this is specific to readdirplus, since I think it's the only
  RPC done by the nfsiod threads that tries to lock vnodes.

Readdirplus isn't the default for mounts, and I think the deadlock also
requires a shortage of nfsiod threads plus reading a directory and its
subdirectory concurrently. That would explain why it hasn't been seen
more often, I think.
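
To make the cycle concrete, here is a toy wait-for model in C. It is not
kernel code: on_wait_cycle() and NO_WAIT are made-up names, and it assumes
each blocked thread waits on exactly one other thread. With a single nfsiod,
the reader waits on that nfsiod (its buffer is queued for it) while the nfsiod
waits on the reader (which holds the vnode lock), so the chain loops back.
With a second, idle nfsiod available to service the buffer, it does not.

```c
#include <stdbool.h>
#include <stddef.h>

#define NO_WAIT (-1)	/* thread is runnable, not blocked on anyone */

/*
 * waits_for[i] is the index of the thread that thread i is blocked on,
 * or NO_WAIT. Returns true when following the chain from 'start' leads
 * back to 'start', i.e. 'start' is part of a wait-for cycle.
 */
bool
on_wait_cycle(const int *waits_for, size_t n, size_t start)
{
	size_t cur = start;

	for (size_t steps = 0; steps < n; steps++) {
		int next = waits_for[cur];

		if (next == NO_WAIT)
			return (false);	/* chain ends at a runnable thread */
		if ((size_t)next == start)
			return (true);	/* came back to where we began */
		cur = (size_t)next;
	}
	return (false);	/* 'start' itself is not on a cycle */
}
```

For example, with threads { reader, nfsiod0 } and waits_for = { 1, 0 } the
reader is on a cycle. Adding an idle nfsiod1 that the reader waits on instead,
waits_for = { 2, 0, NO_WAIT }, breaks it.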

rick

> The nfsiod thread is hung on a vnode lock:
> 
> (kgdb) proc 36927
> [Switching to thread 150 (Thread 100679)]#0 sched_switch (
> td=0xffffff0320de88c0, newtd=0xffffff0003521460, flags=Variable
> "flags" is
> not available.
> )
> at /usr/src/sys/kern/sched_ule.c:1898
> 1898 cpuid = PCPU_GET(cpuid);
> (kgdb) where
> #0 sched_switch (td=0xffffff0320de88c0, newtd=0xffffff0003521460,
> flags=Variable "flags" is not available.
> )
> at /usr/src/sys/kern/sched_ule.c:1898
> #1 0xffffffff80407953 in mi_switch (flags=260, newtd=0x0)
> at /usr/src/sys/kern/kern_synch.c:449
> #2 0xffffffff8043e342 in sleepq_wait (wchan=0xffffff0358bbb7f8,
> pri=96)
> at /usr/src/sys/kern/subr_sleepqueue.c:629
> #3 0xffffffff803e5755 in __lockmgr_args (lk=0xffffff0358bbb7f8,
> flags=524544,
> ilk=0xffffff0358bbb820, wmesg=Variable "wmesg" is not available.
> ) at /usr/src/sys/kern/kern_lock.c:220
> #4 0xffffffff80489219 in vop_stdlock (ap=Variable "ap" is not
> available.
> ) at lockmgr.h:94
> #5 0xffffffff80697322 in VOP_LOCK1_APV (vop=0xffffffff80892b00,
> a=0xffffff847ac10600) at vnode_if.c:1988
> #6 0xffffffff804a8bb7 in _vn_lock (vp=0xffffff0358bbb760,
> flags=524288,
> file=0xffffffff806fa421 "/usr/src/sys/kern/vfs_subr.c", line=2138)
> at vnode_if.h:859
> #7 0xffffffff8049b680 in vget (vp=0xffffff0358bbb760, flags=524544,
> td=0xffffff0320de88c0) at /usr/src/sys/kern/vfs_subr.c:2138
> #8 0xffffffff8048d4aa in vfs_hash_get (mp=0xffffff004a3a0000,
> hash=227722108,
> flags=Variable "flags" is not available.
> ) at /usr/src/sys/kern/vfs_hash.c:81
> #9 0xffffffff805631f6 in nfs_nget (mntp=0xffffff004a3a0000,
> fhp=0xffffff03771eed56, fhsize=32, npp=0xffffff847ac10a40,
> flags=524288)
> at /usr/src/sys/nfsclient/nfs_node.c:120
> #10 0xffffffff80570229 in nfs_readdirplusrpc (vp=0xffffff0179902760,
> uiop=0xffffff847ac10ad0, cred=0xffffff005587c300)
> at /usr/src/sys/nfsclient/nfs_vnops.c:2636
> ---Type <return> to continue, or q <return> to quit---
> #11 0xffffffff8055f144 in nfs_doio (vp=0xffffff0179902760,
> bp=0xffffff83e05c5860, cr=0xffffff005587c300, td=Variable "td" is not
> available.
> )
> at /usr/src/sys/nfsclient/nfs_bio.c:1600
> #12 0xffffffff8056770a in nfssvc_iod (instance=Variable "instance" is
> not
> available.
> )
> at /usr/src/sys/nfsclient/nfs_nfsiod.c:303
> #13 0xffffffff803d0c2f in fork_exit (callout=0xffffffff805674b0
> <nfssvc_iod>,
> arg=0xffffffff809266e0, frame=0xffffff847ac10c40)
> at /usr/src/sys/kern/kern_fork.c:861
> 
> Thread stuck in getblk for that vnode (holds shared lock on this
> vnode):
> 
> (kgdb) proc 36902
> [Switching to thread 149 (Thread 101543)]#0 sched_switch (
> td=0xffffff0378d8a8c0, newtd=0xffffff0003521460, flags=Variable
> "flags" is
> not available.
> )
> at /usr/src/sys/kern/sched_ule.c:1898
> 1898 cpuid = PCPU_GET(cpuid);
> (kgdb) where
> #0 sched_switch (td=0xffffff0378d8a8c0, newtd=0xffffff0003521460,
> flags=Variable "flags" is not available.
> )
> at /usr/src/sys/kern/sched_ule.c:1898
> #1 0xffffffff80407953 in mi_switch (flags=260, newtd=0x0)
> at /usr/src/sys/kern/kern_synch.c:449
> #2 0xffffffff8043e342 in sleepq_wait (wchan=0xffffff83e11bc1c0,
> pri=96)
> at /usr/src/sys/kern/subr_sleepqueue.c:629
> #3 0xffffffff803e5755 in __lockmgr_args (lk=0xffffff83e11bc1c0,
> flags=530688,
> ilk=0xffffff0358bbb878, wmesg=Variable "wmesg" is not available.
> ) at /usr/src/sys/kern/kern_lock.c:220
> #4 0xffffffff80483aeb in getblk (vp=0xffffff0358bbb760, blkno=3,
> size=32768,
> slpflag=0, slptimeo=0, flags=0) at lockmgr.h:94
> #5 0xffffffff8055e963 in nfs_getcacheblk (vp=0xffffff0358bbb760, bn=3,
> size=32768, td=0xffffff0378d8a8c0) at
> /usr/src/sys/nfsclient/nfs_bio.c:1259
> #6 0xffffffff805627d9 in nfs_bioread (vp=0xffffff0358bbb760,
> uio=0xffffff847bcf0ad0, ioflag=Variable "ioflag" is not available.
> ) at /usr/src/sys/nfsclient/nfs_bio.c:530
> #7 0xffffffff806956a4 in VOP_READ_APV (vop=0xffffffff808a29a0,
> a=0xffffff847bcf09c0) at vnode_if.c:887
> #8 0xffffffff804a9e27 in vn_read (fp=0xffffff03b6a506e0,
> uio=0xffffff847bcf0ad0, active_cred=Variable "active_cred" is not
> available.
> ) at vnode_if.h:384
> #9 0xffffffff80444fb1 in dofileread (td=0xffffff0378d8a8c0, fd=3,
> fp=0xffffff03b6a506e0, auio=0xffffff847bcf0ad0, offset=Variable
> "offset"
> is not available.
> ) at file.h:242
> 
> The buffer is locked by LK_KERNPROC:
> 
> (kgdb) bprint bp
> 0xffffff83e11bc128: BIO_READ flags (ASYNC|VMIO)
> error = 0, bufsize = 32768, bcount = 32768, b_resid = 0
> bufobj = 0xffffff0358bbb878, data = 0xffffff8435a05000, blkno = d, dep
> = 0x0
> lock type bufwait: EXCL by LK_KERNPROC with exclusive waiters pending
> 
> And this buffer is queued as the first pending buffer on the mount
> waiting
> for service by nfsiod:
> 
> (kgdb) set $nmp = (struct nfsmount *)vp->v_mount->mnt_data
> (kgdb) p $nmp->nm_bufq.tqh_first
> $24 = (struct buf *) 0xffffff83e11bc128
> (kgdb) p bp
> $25 = (struct buf *) 0xffffff83e11bc128
> 
> So, the first process is waiting for a block from an NFS directory.
> That
> block is in queue to be completed as an async I/O by the nfsiod thread
> pool.
> However, the lone nfsiod thread in the pool is waiting to exclusively
> lock the
> original NFS directory to update its attributes, so it cannot service
> the
> async I/O request.
> 
> --
> John Baldwin

From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 07:35:04 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 5B654CB2
 for <freebsd-fs@freebsd.org>; Thu, 14 Mar 2013 07:35:04 +0000 (UTC)
 (envelope-from phantom@phantom.su)
Received: from relay13.nicmail.ru (relay13.nicmail.ru [195.208.6.7])
 by mx1.freebsd.org (Postfix) with ESMTP id E570D93F
 for <freebsd-fs@freebsd.org>; Thu, 14 Mar 2013 07:35:03 +0000 (UTC)
Received: from [109.70.25.119] (port=54701 helo=nicmail.ru)
 by f17.mail.nic.ru with esmtp (Exim 5.55)
 (envelope-from <phantom@phantom.su>) id 1UG2b9-000Kmb-5c
 for freebsd-fs@freebsd.org; Thu, 14 Mar 2013 11:29:11 +0400
Received: from [194.85.198.26] (account phantom@phantom.su HELO
 phantom-mobile.node)
 by fcgp05.nicmail.ru (CommuniGate Pro SMTP 5.2.3)
 with ESMTPSA id 178408978 for freebsd-fs@freebsd.org;
 Thu, 14 Mar 2013 11:29:11 +0400
Message-ID: <51417C47.8010304@phantom.su>
Date: Thu, 14 Mar 2013 11:29:11 +0400
From: Noskov Ilia <phantom@phantom.su>
User-Agent: Mozilla/5.0 (X11; Linux i686;
 rv:17.0) Gecko/20130215 Thunderbird/17.0.3
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Re: should vn_fullpath1() ever return a path with "." in it?
References: <1208475167.3432384.1362099531469.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <1208475167.3432384.1362099531469.JavaMail.root@erie.cs.uoguelph.ca>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: phantom@phantom.su
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 07:35:04 -0000

On 03/01/2013 04:58 AM, Rick Macklem wrote:
> Kostik Belousov wrote:
>> On Wed, Feb 27, 2013 at 09:59:22PM -0500, Rick Macklem wrote:
>>> Hi,
>>>
>>> Sergey Kandaurov reported a problem where getcwd() returns a
>>> path with "/./" imbedded in it for an NFSv4 mount. This is
>>> caused by a mount point crossing on the server when at the
>>> server's root because vn_fullpath1() uses VV_ROOT to spot
>>> mount point crossings.
>>>
>>> The current workaround is to use the sysctls:
>>> debug.disablegetcwd=1
>>> debug.disablefullpath=1
>>>
>>> However, it would be nice to fix this when vn_fullpath1()
>>> is being used.
>>>
>>> A simple fix is to have vn_fullpath1() fail when it finds
>>> "." as a directory match in the path. When vn_fullpath1()
>>> fails, the syscalls fail and that allows the libc algorithm
>>> to be used (which works for this case because it doesn't
>>> depend on VV_ROOT being set, etc).
>>>
>>> So, I am wondering if a patch (I have attached one) that
>>> makes vn_fullpath1() fail when it matches "." will break
>>> anything else? (I don't think so, since the code checks
>>> for VV_ROOT in the loop above the check for a match of
>>> ".", but I am not sure?)
>>>
>>> Thanks for any input w.r.t. this, rick
>>
>>> --- kern/vfs_cache.c.sav 2013-02-27 20:44:42.000000000 -0500
>>> +++ kern/vfs_cache.c 2013-02-27 21:10:39.000000000 -0500
>>> @@ -1333,6 +1333,20 @@ vn_fullpath1(struct thread *td, struct v
>>>   			    startvp, NULL, 0, 0);
>>>   			break;
>>>   		}
>>> + if (buf[buflen] == '.' && (buf[buflen + 1] == '\0' ||
>>> + buf[buflen + 1] == '/')) {
>>> + /*
>>> + * Fail if it matched ".". This should only happen
>>> + * for NFSv4 mounts that cross server mount points.
>>> + */
>>> + CACHE_RUNLOCK();
>>> + vrele(vp);
>>> + numfullpathfail1++;
>>> + error = ENOENT;
>>> + SDT_PROBE(vfs, namecache, fullpath, return,
>>> + error, vp, NULL, 0, 0);
>>> + break;
>>> + }
>>>   		buf[--buflen] = '/';
>>>   		slash_prefixed = 1;
>>>   	}
>>
>> I do not quite understand this. Did the dvp (parent) vnode returned by
>> VOP_VPTOCNP() equal to vp (child) vnode in the case of the "." name ?
>> It must be, for the correct operation, but also it should cause the
>> almost
>> infinite loop in the vn_fullpath1(). The loop is not really infinite
>> due
>> to a limited size of the buffer where the infinite amount of "./" is
>> placed.
>>
>> Anyway, I think we should do better than this patch, even if it is
>> legitimate. I think that the better place to check the condition is
>> the
>> default implementation of VOP_VPTOCNP(). Am I right that this is where
>> it broke for you ?
>>
>> diff --git a/sys/kern/vfs_default.c b/sys/kern/vfs_default.c
>> index 00d064e..1dd0185 100644
>> --- a/sys/kern/vfs_default.c
>> +++ b/sys/kern/vfs_default.c
>> @@ -856,8 +856,12 @@ vop_stdvptocnp(struct vop_vptocnp_args *ap)
>> error = ENOMEM;
>> goto out;
>> }
>> - bcopy(dp->d_name, buf + i, dp->d_namlen);
>> - error = 0;
>> + if (dp->d_namlen == 1 && dp->d_name[0] == '.') {
>> + error = ENOENT;
>> + } else {
>> + bcopy(dp->d_name, buf + i, dp->d_namlen);
>> + error = 0;
>> + }
>> goto out;
>> }
>> } while (len > 0 || !eofflag);
>
> Yes, this patch fixes the problem too. If you think it is safe to
> do this, I can commit the patch in mid-April. Maybe Sergey can
> test it?
>
> Thanks yet again, rick

Hi, Rick.
I see strange behavior on the NFS client after applying this patch:

sysctl debug.disablecwd=0
sysctl debug.disablefullpath=0

# mount -v -t nfs
192.168.168.1:/pool on /home (nfs, noatime, nfsv4acls, fsid 
02ff003a3a000000)
# ls /home | wc -l
     4946
# cd /home/user6308/.ro
# time pwd
/home/user6308/.ro
0.008u 0.269s 0:08.47 3.0%	4+157k 0+0io 0pf+0w
# ktrace -t+ -i pwd


ktrace.out is big (1MB). Should I attach it?



A small piece of trace:
  19527 pwd      CALL 
mmap(0,0x400000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0xffffffff,0)
  19527 pwd      RET   mmap 34376515584/0x801000000
  19527 pwd      CALL  __getcwd(0x801006400,0x400)
  19527 pwd      NAMI  ".."
  19527 pwd      NAMI  ".."
  19527 pwd      RET   __getcwd -1 errno 2 No such file or directory
  19527 pwd      CALL  stat(0x800947a14,0x7fffffffd940)
  19527 pwd      NAMI  "/"
  19527 pwd      STRU  struct stat {dev=98, ino=2, mode=drwxr-xr-x , 
nlink=19, uid=0, gid=0, rdev=2120, atime=1363244893, stime=1362653279, 
ctime=1362653279, birthtime=1200836451, size=1024, blksize=16384, 
blocks=4, flags=0x0 }
  19527 pwd      RET   stat 0
  19527 pwd      CALL  lstat(0x80094779c,0x7fffffffd940)
  19527 pwd      NAMI  "."
  19527 pwd      STRU  struct stat {dev=1230702064, ino=145, 
mode=drwxr-xr-x , nlink=2, uid=0, gid=0, rdev=4294967295, 
atime=1363244672.246785874, stime=1363244792.864201338, 
ctime=1363244792.864201338, birthtime=-1, size=3, blksize=4096, 
blocks=3, flags=0x0 }
  19527 pwd      RET   lstat 0
  19527 pwd      CALL  openat(0xffffff9c,0x80094779b,0x100000,0x2)
  19527 pwd      NAMI  ".."
  19527 pwd      RET   openat 3
  19527 pwd      CALL  fstat(0x3,0x7fffffffd880)
  19527 pwd      STRU  struct stat {dev=1230702064, ino=4, 
mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, 
atime=1363244665.232140704, stime=1363010116.496298252, 
ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, 
blocks=3, flags=0x0 }
  19527 pwd      RET   fstat 0
  19527 pwd      CALL  fcntl(0x3,F_SETFD,FD_CLOEXEC)
  19527 pwd      RET   fcntl 0
  19527 pwd      CALL  fstatfs(0x3,0x7fffffffd660)
  19527 pwd      RET   fstatfs 0
  19527 pwd      CALL  fstat(0x3,0x7fffffffd940)
  19527 pwd      STRU  struct stat {dev=1230702064, ino=4, 
mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, 
atime=1363244665.232140704, stime=1363010116.496298252, 
ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, 
blocks=3, flags=0x0 }
  19527 pwd      RET   fstat 0
  19527 pwd      CALL  getdirentries(0x3,0x801018000,0x1000,0x8010160a8)
  19527 pwd      RET   getdirentries 4096/0x1000
  19527 pwd      CALL  fstat(0x3,0x7fffffffd940)
  19527 pwd      STRU  struct stat {dev=1230702064, ino=4, 
mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, 
atime=1363244665.232140704, stime=1363010116.496298252, 
ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, 
blocks=3, flags=0x0 }
  19527 pwd      RET   fstat 0
  19527 pwd      CALL  openat(0x3,0x80094779b,0x100000,0)
  19527 pwd      NAMI  ".."
  19527 pwd      RET   openat 4
[..............................]
  19527 pwd      CALL  madvise(0x801016000,0x1000,MADV_FREE)
  19527 pwd      RET   madvise 0
  19527 pwd      CALL  madvise(0x801018000,0x2000,MADV_FREE)
  19527 pwd      RET   madvise 0
  19527 pwd      CALL  close(0x3)
  19527 pwd      RET   close 0
  19527 pwd      CALL  fstat(0x4,0x7fffffffd880)
  19527 pwd      STRU  struct stat {dev=973143810, ino=4, 
mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295, 
atime=1363244767.460164771, stime=1363172100.380266923, 
ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096, 
blocks=713, flags=0x0 }
  19527 pwd      RET   fstat 0
  19527 pwd      CALL  fcntl(0x4,F_SETFD,FD_CLOEXEC)
  19527 pwd      RET   fcntl 0
  19527 pwd      CALL  fstatfs(0x4,0x7fffffffd660)
  19527 pwd      RET   fstatfs 0
  19527 pwd      CALL  fstat(0x4,0x7fffffffd940)
  19527 pwd      STRU  struct stat {dev=973143810, ino=4, 
mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295, 
atime=1363244767.460164771, stime=1363172100.380266923, 
ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096, 
blocks=713, flags=0x0 }
  19527 pwd      RET   fstat 0
  19527 pwd      CALL  getdirentries(0x4,0x801018000,0x1000,0x8010160a8)
  19527 pwd      RET   getdirentries 4096/0x1000
  19527 pwd      CALL  fstatat(0x4,0x801018030,0x7fffffffd940,0x200)
  19527 pwd      NAMI  "user6158"
  19527 pwd      STRU  struct stat {dev=1774902232, ino=4, 
mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, 
atime=1363009687.040357529, stime=1363010116.496298252, 
ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, 
blocks=3, flags=0x0 }
  19527 pwd      RET   fstatat 0
  19527 pwd      CALL  fstatat(0x4,0x80101804c,0x7fffffffd940,0x200)
  19527 pwd      NAMI  "user2289"
  19527 pwd      STRU  struct stat {dev=1988229825, ino=4, 
mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, 
atime=1363009687.040357529, stime=1363010116.496298252, 
ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, 
blocks=3, flags=0x0 }
  19527 pwd      RET   fstatat 0
  19527 pwd      CALL  fstatat(0x4,0x801018068,0x7fffffffd940,0x200)
  19527 pwd      NAMI  "user4761"
  19527 pwd      STRU  struct stat {dev=2438657130, ino=4, 
mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, 
atime=1363009687.040357529, stime=1363010116.496298252, 
ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, 
blocks=3, flags=0x0 }
  19527 pwd      RET   fstatat 0
  19527 pwd      CALL  fstatat(0x4,0x801018084,0x7fffffffd940,0x200)
  19527 pwd      NAMI  "user6055"
[.........................................]

and so on: it then stats every directory in /home
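
As far as I understand, this is the userland fallback at work once __getcwd
fails with ENOENT. Below is a simplified C sketch of one step of such a
fallback, not the actual libc source; name_in_parent() is a made-up helper.
To learn the name of a directory, it scans the parent and stats entries until
one matches the (st_dev, st_ino) pair. When every entry sits on its own
filesystem, as the differing dev values in the trace suggest, each entry has
to be stat'ed, which for /home here means thousands of calls.

```c
#define _POSIX_C_SOURCE 200809L

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Find the name under which the directory identified by (dev, ino)
 * appears in parent_path. Copies the name into 'out' and returns 0 on
 * success, or returns -1 when no entry matches or the parent cannot be read.
 */
int
name_in_parent(const char *parent_path, dev_t dev, ino_t ino,
    char *out, size_t outlen)
{
	DIR *d = opendir(parent_path);
	struct dirent *de;
	struct stat sb;
	int found = -1;

	if (d == NULL)
		return (-1);
	while ((de = readdir(d)) != NULL) {
		if (strcmp(de->d_name, ".") == 0 ||
		    strcmp(de->d_name, "..") == 0)
			continue;
		/* One stat per entry: the fstatat() calls seen in the trace. */
		if (fstatat(dirfd(d), de->d_name, &sb,
		    AT_SYMLINK_NOFOLLOW) != 0)
			continue;
		if (sb.st_dev == dev && sb.st_ino == ino) {
			snprintf(out, outlen, "%s", de->d_name);
			found = 0;
			break;
		}
	}
	closedir(d);
	return (found);
}
```

A full getcwd() would repeat this for each ".." up to the root and join the
names, so the cost is dominated by the largest directory on the way up.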



>
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>




-- 
      Best Regards,

      Ilia Noskov
      Regional Network Information Center (RU-CENTER)
      phone: +7 495 737-0601
      fax: +7 495 737-0602
      http://www.nic.ru

From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 09:09:01 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 0C715441
 for <freebsd-fs@freebsd.org>; Thu, 14 Mar 2013 09:09:01 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
 by mx1.freebsd.org (Postfix) with ESMTP id 77016D70
 for <freebsd-fs@freebsd.org>; Thu, 14 Mar 2013 09:09:00 +0000 (UTC)
Received: from tom.home (kostik@localhost [127.0.0.1])
 by kib.kiev.ua (8.14.6/8.14.6) with ESMTP id r2E98oUH035939;
 Thu, 14 Mar 2013 11:08:50 +0200 (EET)
 (envelope-from kostikbel@gmail.com)
DKIM-Filter: OpenDKIM Filter v2.8.0 kib.kiev.ua r2E98oUH035939
Received: (from kostik@localhost)
 by tom.home (8.14.6/8.14.6/Submit) id r2E98l4X035938;
 Thu, 14 Mar 2013 11:08:47 +0200 (EET)
 (envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com
 using -f
Date: Thu, 14 Mar 2013 11:08:47 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Noskov Ilia <phantom@phantom.su>
Subject: Re: should vn_fullpath1() ever return a path with "." in it?
Message-ID: <20130314090847.GH3794@kib.kiev.ua>
References: <1208475167.3432384.1362099531469.JavaMail.root@erie.cs.uoguelph.ca>
 <51417C47.8010304@phantom.su>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="0y/VvS0T6GrSShsQ"
Content-Disposition: inline
In-Reply-To: <51417C47.8010304@phantom.su>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
 version=3.3.2
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 09:09:01 -0000


--0y/VvS0T6GrSShsQ
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Mar 14, 2013 at 11:29:11AM +0400, Noskov Ilia wrote:
> Strange behavior on nfs-client after apply this patch:
>=20
> sysctl debug.disablecwd=3D0
> sysctl debug.disablefullpath=3D0
>=20
> # mount -v -t nfs
> 192.168.168.1:/pool on /home (nfs, noatime, nfsv4acls, fsid=20
> 02ff003a3a000000)
> # ls /home | wc -l
>      4946
> # cd /home/user6308/.ro
> # time pwd
> /home/user6308/.ro
> 0.008u 0.269s 0:08.47 3.0%	4+157k 0+0io 0pf+0w
> # ktrace -t+ -i pwd
>=20
>=20
> ktrace.out is big (1MB). Attach or not?
>=20
>=20
>=20
> A small piece of trace:
>   19527 pwd      CALL=20
> mmap(0,0x400000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0x=
ffffffff,0)
>   19527 pwd      RET   mmap 34376515584/0x801000000
>   19527 pwd      CALL  __getcwd(0x801006400,0x400)
>   19527 pwd      NAMI  ".."
>   19527 pwd      NAMI  ".."
>   19527 pwd      RET   __getcwd -1 errno 2 No such file or directory
>   19527 pwd      CALL  stat(0x800947a14,0x7fffffffd940)
>   19527 pwd      NAMI  "/"
>   19527 pwd      STRU  struct stat {dev=3D98, ino=3D2, mode=3Ddrwxr-xr-x =
,=20
> nlink=3D19, uid=3D0, gid=3D0, rdev=3D2120, atime=3D1363244893, stime=3D13=
62653279,=20
> ctime=3D1362653279, birthtime=3D1200836451, size=3D1024, blksize=3D16384,=
=20
> blocks=3D4, flags=3D0x0 }
>   19527 pwd      RET   stat 0
>   19527 pwd      CALL  lstat(0x80094779c,0x7fffffffd940)
>   19527 pwd      NAMI  "."
>   19527 pwd      STRU  struct stat {dev=3D1230702064, ino=3D145,=20
> mode=3Ddrwxr-xr-x , nlink=3D2, uid=3D0, gid=3D0, rdev=3D4294967295,=20
> atime=3D1363244672.246785874, stime=3D1363244792.864201338,=20
> ctime=3D1363244792.864201338, birthtime=3D-1, size=3D3, blksize=3D4096,=
=20
> blocks=3D3, flags=3D0x0 }
>   19527 pwd      RET   lstat 0
>   19527 pwd      CALL  openat(0xffffff9c,0x80094779b,0x100000,0x2)
>   19527 pwd      NAMI  ".."
>   19527 pwd      RET   openat 3
>   19527 pwd      CALL  fstat(0x3,0x7fffffffd880)
>   19527 pwd      STRU  struct stat {dev=3D1230702064, ino=3D4,=20
> mode=3Ddrwxr-xr-x , nlink=3D9, uid=3D0, gid=3D0, rdev=3D4294967295,=20
> atime=3D1363244665.232140704, stime=3D1363010116.496298252,=20
> ctime=3D1363010116.496298252, birthtime=3D-1, size=3D14, blksize=3D4096,=
=20
> blocks=3D3, flags=3D0x0 }
>   19527 pwd      RET   fstat 0
>   19527 pwd      CALL  fcntl(0x3,F_SETFD,FD_CLOEXEC)
>   19527 pwd      RET   fcntl 0
>   19527 pwd      CALL  fstatfs(0x3,0x7fffffffd660)
>   19527 pwd      RET   fstatfs 0
>   19527 pwd      CALL  fstat(0x3,0x7fffffffd940)
>   19527 pwd      STRU  struct stat {dev=3D1230702064, ino=3D4,=20
> mode=3Ddrwxr-xr-x , nlink=3D9, uid=3D0, gid=3D0, rdev=3D4294967295,=20
> atime=3D1363244665.232140704, stime=3D1363010116.496298252,=20
> ctime=3D1363010116.496298252, birthtime=3D-1, size=3D14, blksize=3D4096,=
=20
> blocks=3D3, flags=3D0x0 }
>   19527 pwd      RET   fstat 0
>   19527 pwd      CALL  getdirentries(0x3,0x801018000,0x1000,0x8010160a8)
>   19527 pwd      RET   getdirentries 4096/0x1000
>   19527 pwd      CALL  fstat(0x3,0x7fffffffd940)
>   19527 pwd      STRU  struct stat {dev=3D1230702064, ino=3D4,=20
> mode=3Ddrwxr-xr-x , nlink=3D9, uid=3D0, gid=3D0, rdev=3D4294967295,=20
> atime=3D1363244665.232140704, stime=3D1363010116.496298252,=20
> ctime=3D1363010116.496298252, birthtime=3D-1, size=3D14, blksize=3D4096,=
=20
> blocks=3D3, flags=3D0x0 }
>   19527 pwd      RET   fstat 0
>   19527 pwd      CALL  openat(0x3,0x80094779b,0x100000,0)
>   19527 pwd      NAMI  ".."
>   19527 pwd      RET   openat 4
> [..............................]
>   19527 pwd      CALL  madvise(0x801016000,0x1000,MADV_FREE)
>   19527 pwd      RET   madvise 0
>   19527 pwd      CALL  madvise(0x801018000,0x2000,MADV_FREE)
>   19527 pwd      RET   madvise 0
>   19527 pwd      CALL  close(0x3)
>   19527 pwd      RET   close 0
>   19527 pwd      CALL  fstat(0x4,0x7fffffffd880)
>   19527 pwd      STRU  struct stat {dev=3D973143810, ino=3D4,=20
> mode=3Ddrwxr-xr-x , nlink=3D4948, uid=3D0, gid=3D0, rdev=3D4294967295,=20
> atime=3D1363244767.460164771, stime=3D1363172100.380266923,=20
> ctime=3D1363172100.380266923, birthtime=3D-1, size=3D4948, blksize=3D4096=
,=20
> blocks=3D713, flags=3D0x0 }
>   19527 pwd      RET   fstat 0
>   19527 pwd      CALL  fcntl(0x4,F_SETFD,FD_CLOEXEC)
>   19527 pwd      RET   fcntl 0
>   19527 pwd      CALL  fstatfs(0x4,0x7fffffffd660)
>   19527 pwd      RET   fstatfs 0
>   19527 pwd      CALL  fstat(0x4,0x7fffffffd940)
>   19527 pwd      STRU  struct stat {dev=3D973143810, ino=3D4,=20
> mode=3Ddrwxr-xr-x , nlink=3D4948, uid=3D0, gid=3D0, rdev=3D4294967295,=20
> atime=3D1363244767.460164771, stime=3D1363172100.380266923,=20
> ctime=3D1363172100.380266923, birthtime=3D-1, size=3D4948, blksize=3D4096=
,=20
> blocks=3D713, flags=3D0x0 }
>   19527 pwd      RET   fstat 0
>   19527 pwd      CALL  getdirentries(0x4,0x801018000,0x1000,0x8010160a8)
>   19527 pwd      RET   getdirentries 4096/0x1000
>   19527 pwd      CALL  fstatat(0x4,0x801018030,0x7fffffffd940,0x200)
>   19527 pwd      NAMI  "user6158"
>   19527 pwd      STRU  struct stat {dev=1774902232, ino=4,
> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> atime=1363009687.040357529, stime=1363010116.496298252,
> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> blocks=3, flags=0x0 }
>   19527 pwd      RET   fstatat 0
>   19527 pwd      CALL  fstatat(0x4,0x80101804c,0x7fffffffd940,0x200)
>   19527 pwd      NAMI  "user2289"
>   19527 pwd      STRU  struct stat {dev=1988229825, ino=4,
> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> atime=1363009687.040357529, stime=1363010116.496298252,
> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> blocks=3, flags=0x0 }
>   19527 pwd      RET   fstatat 0
>   19527 pwd      CALL  fstatat(0x4,0x801018068,0x7fffffffd940,0x200)
>   19527 pwd      NAMI  "user4761"
>   19527 pwd      STRU  struct stat {dev=2438657130, ino=4,
> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> atime=1363009687.040357529, stime=1363010116.496298252,
> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> blocks=3, flags=0x0 }
>   19527 pwd      RET   fstatat 0
>   19527 pwd      CALL  fstatat(0x4,0x801018084,0x7fffffffd940,0x200)
>   19527 pwd      NAMI  "user6055"
> [.........................................]
>
> and next get stat of all directories in /home

A slightly different version of the patch was committed as r247560.

The situation could only happen if the parent directory contains the "."
entry with an inode number equal to the inode number of the subdirectory.
Can you confirm that this is the case for you?

--0y/VvS0T6GrSShsQ
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iQIcBAEBAgAGBQJRQZOfAAoJEJDCuSvBvK1Bu9IP/iRFexlxYkdp5x5FeZuzAt7A
3Bu4LKb8V3eaMEniTOIRQff8eO8GsA3Ti3s9F3ZHN6BiCriy5DrWeuxNJtAa6YmA
GnU7r16aROpzJ8Z0CtTQoabKPNjCdqy9LG6jeiFlhemaG2tgpz2N6BX/MuJbrW0F
CJIsw+QozZ/koqgXpFwvkb+kjXydGm41YZJxEkdtrIecampWzWunJD19cjD3sSrC
rEYggRinQK/EQUrRBIidxWu9qKrW+WrZy4ePP7jjuIjr9//vYXIQnJre+BiYdqjL
Ihwi5fJ91wOU1vNr/7VEN/MOyISqZjXkINdvtKOWWClX/ahC06JcWoPECNVM8S3/
F3eUNyECyxkSKNnHTRseAqVZBOUpE7ulr6fSxxGiVB1SCYZwmfXq7IODV0mazDI9
w/03KBCSca0HScX5j2cYrKooS9tthdZiAp2f4LRfHef1fnaesInfDmDHkff4pxs4
QsjdlYLHe/ke0XFzZsR8zcdtdX6HmdzcCJLjMERvEyZ+8KPqb75/ANZb4aLK+UE1
FUYec0QDJ/DsPqbucMbOr4uXiZHHGrwYP4yETATlh9QTxLBZSSNgK50qCEvTf1ek
metx0YXNscOtTg+ZFY5qJtu4g8J/dMkWlUExzQ7FYIN9vfaGH5TRlfq+Qd0iGuRL
nluNA4JJ4YVEg27W1ZbP
=+vHn
-----END PGP SIGNATURE-----

--0y/VvS0T6GrSShsQ--

From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 09:27:37 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id C61645E9;
 Thu, 14 Mar 2013 09:27:37 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
 by mx1.freebsd.org (Postfix) with ESMTP id 213F4E0B;
 Thu, 14 Mar 2013 09:27:36 +0000 (UTC)
Received: from tom.home (kostik@localhost [127.0.0.1])
 by kib.kiev.ua (8.14.6/8.14.6) with ESMTP id r2E9RSba039426;
 Thu, 14 Mar 2013 11:27:28 +0200 (EET)
 (envelope-from kostikbel@gmail.com)
DKIM-Filter: OpenDKIM Filter v2.8.0 kib.kiev.ua r2E9RSba039426
Received: (from kostik@localhost)
 by tom.home (8.14.6/8.14.6/Submit) id r2E9RS7p039425;
 Thu, 14 Mar 2013 11:27:28 +0200 (EET)
 (envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com
 using -f
Date: Thu, 14 Mar 2013 11:27:28 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Rick Macklem <rmacklem@uoguelph.ca>
Subject: Re: Deadlock in the NFS client
Message-ID: <20130314092728.GI3794@kib.kiev.ua>
References: <201303131356.37919.jhb@freebsd.org>
 <492562517.3880600.1363217615412.JavaMail.root@erie.cs.uoguelph.ca>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="A+YNfBJfL1GjoVjN"
Content-Disposition: inline
In-Reply-To: <492562517.3880600.1363217615412.JavaMail.root@erie.cs.uoguelph.ca>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
 version=3.3.2
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home
Cc: Rick Macklem <rmacklem@freebsd.org>, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 09:27:37 -0000


--A+YNfBJfL1GjoVjN
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote:
> John Baldwin wrote:
> > I ran into a machine that had a deadlock among certain files on a
> > given NFS
> > mount today. I'm not sure how best to resolve it, though it seems like
> > perhaps there is a bug with how the pool of nfsiod threads is managed.
> > Anyway, more details on the actual hang below. This was on 8.x with
> > the
> > old NFS client, but I don't see anything in HEAD that would fix this.
> >=20
> > First note that the system was idle so it had dropped down to only one
> > nfsiod thread.
> >=20
> Hmm, I see the problem and I'm a bit surprised it doesn't bite more often.
> It seems to me that this snippet of code from nfs_asyncio() makes too
> weak an assumption:
> 	/*
> 	 * If none are free, we may already have an iod working on this mount
> 	 * point.  If so, it will process our request.
> 	 */
> 	if (!gotiod) {
> 		if (nmp->nm_bufqiods > 0) {
> 			NFS_DPF(ASYNCIO,
> 		("nfs_asyncio: %d iods are already processing mount %p\n",
> 				 nmp->nm_bufqiods, nmp));
> 			gotiod = TRUE;
> 		}
> 	}
> It assumes that, since an nfsiod thread is processing some buffer for the
> mount, it will become available to do this one, which isn't true for your
> deadlock.
>
> I think the simple fix would be to recode nfs_asyncio() so that
> it only returns 0 if it finds an AVAILABLE nfsiod thread that it
> has assigned to do the I/O, getting rid of the above. The problem
> with doing this is that it may result in a lot more synchronous I/O
> (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe more
> synchronous I/O could be avoided by allowing nfs_asyncio() to create a
> new thread even if the total is above nfs_iodmax. (I think this would
> require the fixed array to be replaced with a linked list and might
> result in a large number of nfsiod threads.) Maybe just having a large
> nfs_iodmax would be an adequate compromise?
>
> Does having a large # of nfsiod threads cause any serious problem for
> most systems these days?
>
> I'd be tempted to recode nfs_asyncio() as above and then, instead
> of nfs_iodmin and nfs_iodmax, I'd simply have: - a fixed number of
> nfsiod threads (this could be a tunable, with the understanding that
> it should be large for good performance)
>

I do not see how this would solve the deadlock itself. The proposal would
only allow the system to survive slightly longer after the deadlock appeared.
And, I think that allowing an unbounded number of nfsiod threads is also
fatal.

The issue there is the LOR (lock order reversal) between the buffer lock
and the vnode lock. The buffer lock must always be taken after the vnode
lock. The problematic nfsiod thread, which locks the vnode, violates this
rule: despite the LK_KERNPROC ownership of the buffer lock, it is the
thread which de facto owns the buffer (only that thread can unlock it).

A possible solution would be to pass LK_NOWAIT to nfs_nget() from
nfs_readdirplusrpc(). From my reading of the code, nfs_nget() should
be capable of correctly handling the lock failure. And EBUSY would
result in doit = 0, which should be fine too.

It is possible that EBUSY should be reset to 0, though.
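
The shape of that suggestion can be shown with a small userspace sketch.
This is not the kernel code: two threading.Lock objects stand in for the
vnode lock and the buffer lock, and load_attrs_while_holding_buffer is a
hypothetical name. The point is only that a thread which already holds the
buffer lock never sleeps on the vnode lock; it tries once and skips the
optional work on failure, roughly as LK_NOWAIT followed by doit = 0 would.

```python
import threading

vnode_lock = threading.Lock()   # stands in for the vnode lock (ordered first)
buffer_lock = threading.Lock()  # stands in for the buffer lock (ordered second)

def load_attrs_while_holding_buffer():
    """Call with buffer_lock held; return True if the optional work ran."""
    # Sleeping on vnode_lock here would reverse the lock order and could
    # deadlock, so only attempt it without blocking (the LK_NOWAIT analogue).
    if not vnode_lock.acquire(blocking=False):
        return False            # analogue of EBUSY -> doit = 0: skip, not fail
    try:
        return True             # the optional attribute loading would go here
    finally:
        vnode_lock.release()
```

The caller treats a False return as "nothing extra was loaded" rather than
as an error, which corresponds to resetting EBUSY to 0.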

--A+YNfBJfL1GjoVjN
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iQIcBAEBAgAGBQJRQZf/AAoJEJDCuSvBvK1BK+QP/019GONiMGOEZgy9jnRFk2aR
8hUfCDdJLiNy4e3Wa2gw89Yr9TGSiOaa6YRLKwSWQ9I39yFPOOeoM6kC/QD0oQqu
qlxdalKYbiJOR61ufnqIQCRsDufbKPD2IfkoTzEYiPCsZLEAu+yV0c/0g09mCMb5
+KIr9ku72CVkLba1BHBA+9CiAb1VFa234iQDc+t792e62ttPJPP7xhTylNaME3Y7
QWqFZjcG6PFfeQDOVkhWUGRO4m6Ak5peEpLXE1po0+sgfcnrZmgw4crgLzmIKKZl
vdQ3UetqWflaTCnP3L9B28j0+H/CS53VS9sndST8xYXPADMlnuoLLGkiBpsrfRMj
vQZHz7sV6+qNXxN2LZJBgHQPuio4zghyxP6+4j57BmCJfWm6gR2pqcHip9xDBI6j
hXbkL1nVPWRH6iIIeRVs9RWXrwaa+upNIX9+aSgKWRXitIH3gRL7Gjg57wI6wxaI
CznTVt8geuVz4C1jtNcR0BfZ5i0zjwBH66hLcX86HvfYGY26HmQPB3kOTfmaYpdp
KIhYxof5gEbyr3Zgl0MwEhAxLHbVyLBybkz6ENMtbBDOhZJ2pGFawIx7j9ngK+9D
g282I/j4S8mGPdSI4uNFPLPuGyYMcjUFTBZwDostsw1pzCy+QCN3gZQWuWdVfIYe
ZrYLzGYi4hGE22kBdBNV
=YA8y
-----END PGP SIGNATURE-----

--A+YNfBJfL1GjoVjN--

From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 09:41:34 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 2901B8FF
 for <freebsd-fs@FreeBSD.org>; Thu, 14 Mar 2013 09:41:34 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail13.syd.optusnet.com.au (mail13.syd.optusnet.com.au
 [211.29.132.194]) by mx1.freebsd.org (Postfix) with ESMTP id 8FF9CE93
 for <freebsd-fs@FreeBSD.org>; Thu, 14 Mar 2013 09:41:33 +0000 (UTC)
Received: from c211-30-173-106.carlnfd1.nsw.optusnet.com.au
 (c211-30-173-106.carlnfd1.nsw.optusnet.com.au [211.30.173.106])
 by mail13.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r2E9fIVt003177
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Thu, 14 Mar 2013 20:41:21 +1100
Date: Thu, 14 Mar 2013 20:41:18 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Cody Ritts <cr@caltel.com>
Subject: Re: Aligning MBR for ZFS boot help
In-Reply-To: <5140F373.1010907@caltel.com>
Message-ID: <20130314195715.Y909@besplex.bde.org>
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com> <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
 <513CE369.4030303@caltel.com> <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
 <1362951595.99445.2.camel@btw.pki2.com>
 <alpine.BSF.2.00.1303101807550.8481@wonkity.com>
 <CABXB=RTt-j0SGxktWMfLcgLAEN6Vi+f=psBuN0jQaJthk_3cbw@mail.gmail.com>
 <513E1208.5020804@caltel.com> <20130312203745.A1130@besplex.bde.org>
 <513F8F04.60206@caltel.com> <20130313232247.B1078@besplex.bde.org>
 <5140F373.1010907@caltel.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.0 cv=JMpjKL2b c=1 sm=1 a=u3bVZBOdoLwA:10
 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=cUKNXEIY390A:10
 a=s9DMC9WtY5DSvmTa96MA:9 a=CjuIK1q_8ugA:10 a=N-DlQLPxIT-iTUNV:21
 a=8ReUPQTKasrRMr7F:21 a=TEtd8y5WR3g2ypngnwZWYw==:117
Cc: freebsd-fs@FreeBSD.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 09:41:34 -0000

On Wed, 13 Mar 2013, Cody Ritts wrote:

> So, if I now want to create an aligned single partition, here are the steps I 
> think I should be taking:
>
> Sectors should be < 64
> Heads should be  < 256
> for OLD OLD stuff, cylinders should be < 1024

No.

Sectors _must_ be < 64
Heads _must_ be < 257
Heads should be < 256
for OLD OLD stuff, cylinders should be < 1024

> if you want boundaries on a power of 2, those the number of sectors and heads 
> should also be a power of 2.
>
> So, would all of these be potential valid values?
>
> s32  h128
> 512*32*128 = 2097152B = 2MB cylinder
> s32  h64
> 512*32*64  = 1048576B = 1MB cylinder
> s16  h128
> 512*16*128 = 1048576B = 1MB cylinder
> s4  h8
       h4
> 512*4*4   = 8192B  = 8K cylinder

Yes, provided the BIOS agrees.  If the BIOS config says that CHS is
something or other, better use or change that.  Also, s4 h4 would
probably cause problems with older BIOSes if they actually use CHS,
by causing cylinder numbers to exceed 64K.  64K cylinders of size 8KB
is just 512MB.  BIOSes might use 16-bit cylinder numbers for translating
from CHS to linear even if they don't really use CHS.
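
The arithmetic behind these sizes can be checked with a short sketch. It
assumes 512-byte sectors as in the listing above; cylinder_bytes and
max_addressable_bytes are hypothetical helper names, and the 16-bit
cylinder limit is the one mentioned in the previous paragraph.

```python
SECTOR_BYTES = 512

def cylinder_bytes(heads, sectors):
    """Size of one fake cylinder for a heads x sectors-per-track geometry."""
    return SECTOR_BYTES * heads * sectors

def max_addressable_bytes(heads, sectors, cylinder_bits=16):
    """Capacity reachable when cylinder numbers are limited to cylinder_bits."""
    return (1 << cylinder_bits) * cylinder_bytes(heads, sectors)

# The geometries from the quoted list, as (heads, sectors) pairs.
for heads, sectors in [(128, 32), (64, 32), (128, 16), (4, 4)]:
    print(f"h{heads} s{sectors}: cylinder {cylinder_bytes(heads, sectors)} B, "
          f"16-bit limit {max_addressable_bytes(heads, sectors) // 2**20} MiB")
```

For the 8 KB cylinder this gives a limit of 512 MiB, which is why such a
small geometry is a problem on anything that keeps cylinder numbers in 16
bits.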

> I am assuming that once I know my cylinder size, I just divide the total size 
> of my hard drive to come up with cylinder count?

Usually I don't bother changing the cylinder count when I change the number
of heads and sectors, since I know that it is not really used by fdisk
(except possibly for default partition sizes which I never use).

> s4  h8
> 64023257088 / 8192 = 7815339c
> (8k is the largest power of 2 that the drive will evenly divide into)
>
> s32  h64
> 64023257088 / 1048576 = 61057.3359375
> Round down to 61057.
> (does the cylinder need to end on the end of the disk?)

If the cylinder count were used, then it should be set to the rounded-down
value.  It is usually safe to make the last partition end at the end of
the disk and not at the end of the fake cylinder given by rounding.
I sometimes use the part beyond the end of the fake cylinder for
a normal partition and sometimes leave it free.
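
A minimal sketch of the rounding, using the disk size quoted above and
assuming 512-byte sectors; whole_cylinders is a hypothetical helper name.
It returns the rounded-down cylinder count together with the number of
sectors left beyond the last whole fake cylinder.

```python
DISK_BYTES = 64023257088   # disk size quoted in the thread
SECTOR_BYTES = 512

def whole_cylinders(disk_bytes, heads, sectors):
    """Return (cylinders rounded down, sectors left past the last cylinder)."""
    sectors_per_cylinder = heads * sectors
    total_sectors = disk_bytes // SECTOR_BYTES
    return divmod(total_sectors, sectors_per_cylinder)

print(whole_cylinders(DISK_BYTES, 64, 32))  # h64 s32
print(whole_cylinders(DISK_BYTES, 4, 4))    # 8 KB cylinders
```

For h64 s32 this gives 61057 cylinders with 688 sectors (344 KB) left over,
and for the 8 KB cylinder it divides evenly into 7815339 cylinders.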

> So, here is what i calculated:
> c61057 h64 s32
>
> I want an offset of 2M, file system should be reduced to 61055M
>  (61055 * 1024 * 1024)/512 = 125040640s)
>
> Here are the commands that I ran:
>
>> cat << EOF > command
>> g c61057 h64 s32
>> p 1 0xa5 4096 125040640
>> a 1
>> EOF
>> root@:/root # fdisk -f command ada0
>> ******* Working on device /dev/ada0 *******
>> fdisk: WARNING line 1: number of cylinders (61057) may be out-of-range
>>     (must be within 1-1024 for normal BIOS operation, unless the entire 
>> disk
>>     is dedicated to FreeBSD)
>> root@:/root # fdisk -p ada0
>> # /dev/ada0
>> g c124053 h16 s63
>> p 1 0xa5 4096 125040640
>> a 1
> note, it auto goes back when exporting

Seems reasonable, but I didn't check the details.  Anything that avoids
the message about the automatic broken adjustment is probably OK.

>> root@:/root # gpart show ada0
>> =>       63  125045361  ada0  MBR  (59G)
>>          63       4033        - free -  (2M)
>>        4096  125040640     1  freebsd  [active]  (59G)
>>   125044736        688        - free -  (344k)
>> root@:/root # gpart delete -i 1 ada0
>> root@:/root # gpart add -t freebsd -b 4096 -s 125040640 ada0
>> ada0s1 added
>> root@:/root # gpart show ada0
>> =>       63  125045361  ada0  MBR  (59G)
>>          63       4095        - free -  (2M)
>>        4158  125040573     1  freebsd  (59G)
>>   125044731        693        - free -  (346k)
> gpart does not care

I don't know anything about gpart, and if it always does the wrong adjustment
then I don't want to know.

>> root@:/root # fdisk -f command ada0
>> ******* Working on device /dev/ada0 *******
>> fdisk: WARNING line 1: number of cylinders (61057) may be out-of-range
>>     (must be within 1-1024 for normal BIOS operation, unless the entire 
>> disk
>>     is dedicated to FreeBSD)
>> root@:/root # fdisk ada0
>> ******* Working on device /dev/ada0 *******
>> parameters extracted from in-core disklabel are:
>> cylinders=124053 heads=16 sectors/track=63 (1008 blks/cyl)
>> 
>> Figures below won't work with BIOS for partitions not in cyl 1
>> parameters to be used for BIOS calculations are:
>> cylinders=124053 heads=16 sectors/track=63 (1008 blks/cyl)

It regressed to the broken default as usual.  This is dangerous when
modifying partition tables that have a geometry differing from the
default.  You can still edit everything without changing the geometry
if you are careful.

>> Media sector size is 512
>> Warning: BIOS sector numbering starts with sector 1
>> Information from DOS bootblock is:
>> The data for partition 1 is:
>> sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
>>     start 4096, size 125040640 (61055 Meg), flag 80 (active)
>> 	beg: cyl 2/ head 0/ sector 1;
>> 	end: cyl 640/ head 63/ sector 32

E.g., suppose you just want to change the sysid here.  Type it in.
Accept the defaults for the start and size so that these don't
change.  Then fdisk will default to making a mess of the CHS
values (if the default is wrong).  This can be recovered from
by typing in all the old values.
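
The CHS values can also be re-derived from the start and size, which is a
quick way to check what the old values should be before typing them back
in. A minimal sketch, assuming the h64 s32 geometry set earlier; lba_to_chs
is a hypothetical helper name. It also assumes that a cylinder number too
large for the 10-bit field of an MBR entry is reduced modulo 1024. That is
consistent with the "cyl 640" printed above, but it is an inference from
the output and has not been checked against the fdisk source.

```python
def lba_to_chs(lba, heads, sectors, cylinder_bits=10):
    """Convert a 0-based LBA to (cylinder, head, sector); sector is 1-based."""
    cylinder, rest = divmod(lba, heads * sectors)
    head, sector0 = divmod(rest, sectors)
    # Assumption: an oversized cylinder wraps modulo 2**cylinder_bits.
    return cylinder % (1 << cylinder_bits), head, sector0 + 1

start, size = 4096, 125040640          # partition 1 from the listing
print(lba_to_chs(start, 64, 32))             # first sector of the partition
print(lba_to_chs(start + size - 1, 64, 32))  # last sector of the partition
```

With these inputs the two results are (2, 0, 1) and (640, 63, 32), the same
beg/end values as in the listing.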

>> The data for partition 2 is:
>> <UNUSED>
>> The data for partition 3 is:
>> <UNUSED>
>> The data for partition 4 is:
>> <UNUSED>
>
> So, setting the geom simply does this:
>>> sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
>>>     start 4096, size 125040640 (61055 Meg), flag 80 (active)
>>> 	beg: cyl 2/ head 0/ sector 1;
>>> 	end: cyl 640/ head 63/ sector 32
>
> I cannot set geom in my bios, nor does not show me what it thinks geom is. 
> Obviously anything that only supports 1024 cylinders will not think it is 
> very funny.

Probably many newer BIOSes do this.

> I feel like I am missing some part of this puzzle, or is that all there is to 
> this to correct geom for proper alignment on an MBR?

I don't like the looks of gpart, but it has a -a option for alignment.

> So, by setting those CHS values I am:
>  making the partition table more compatible with other operating systems and 
> BIOSes?
>  and giving some utilities the CHS stuff they need to function right?

It's not completely clear that S=32 H=64 is portable, but it is what most
old SCSI BIOSes used.

Also, if the disk already has some partitions with a certain geometry, use
the same geometry for other partitions and don't use fdisk's defaults if
they differ.

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 10:20:49 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 7361C2D1
 for <freebsd-fs@freebsd.org>; Thu, 14 Mar 2013 10:20:49 +0000 (UTC)
 (envelope-from phantom@phantom.su)
Received: from relay08.nicmail.ru (relay08.nicmail.ru [195.208.6.4])
 by mx1.freebsd.org (Postfix) with ESMTP id C73EDC3
 for <freebsd-fs@freebsd.org>; Thu, 14 Mar 2013 10:20:48 +0000 (UTC)
Received: from [109.70.25.145] (port=43055 helo=nicmail.ru)
 by f06.mail.nic.ru with esmtp (Exim 5.55)
 (envelope-from <phantom@phantom.su>)
 id 1UG57D-000MaL-01; Thu, 14 Mar 2013 14:10:27 +0400
Received: from [194.85.198.26] (account phantom@phantom.su HELO
 phantom-mobile.node)
 by fcgp09.nicmail.ru (CommuniGate Pro SMTP 5.2.3)
 with ESMTPSA id 99207934; Thu, 14 Mar 2013 14:10:26 +0400
Message-ID: <5141A212.9050909@phantom.su>
Date: Thu, 14 Mar 2013 14:10:26 +0400
From: Ilia Noskov <phantom@phantom.su>
User-Agent: Mozilla/5.0 (X11; Linux i686;
 rv:17.0) Gecko/20130215 Thunderbird/17.0.3
MIME-Version: 1.0
To: Konstantin Belousov <kostikbel@gmail.com>
Subject: Re: should vn_fullpath1() ever return a path with "." in it?
References: <1208475167.3432384.1362099531469.JavaMail.root@erie.cs.uoguelph.ca>
 <51417C47.8010304@phantom.su> <20130314090847.GH3794@kib.kiev.ua>
In-Reply-To: <20130314090847.GH3794@kib.kiev.ua>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: phantom@phantom.su
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 10:20:49 -0000

On 03/14/2013 01:08 PM, Konstantin Belousov wrote:
> On Thu, Mar 14, 2013 at 11:29:11AM +0400, Noskov Ilia wrote:
>> Strange behavior on nfs-client after apply this patch:
>>
>> sysctl debug.disablecwd=0
>> sysctl debug.disablefullpath=0
>>
>> # mount -v -t nfs
>> 192.168.168.1:/pool on /home (nfs, noatime, nfsv4acls, fsid
>> 02ff003a3a000000)
>> # ls /home | wc -l
>>       4946
>> # cd /home/user6308/.ro
>> # time pwd
>> /home/user6308/.ro
>> 0.008u 0.269s 0:08.47 3.0%	4+157k 0+0io 0pf+0w
>> # ktrace -t+ -i pwd
>>
>>
>> ktrace.out is big (1MB). Attach or not?
>>
>>
>>
>> A small piece of trace:
>>    19527 pwd      CALL
>> mmap(0,0x400000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0xffffffff,0)
>>    19527 pwd      RET   mmap 34376515584/0x801000000
>>    19527 pwd      CALL  __getcwd(0x801006400,0x400)
>>    19527 pwd      NAMI  ".."
>>    19527 pwd      NAMI  ".."
>>    19527 pwd      RET   __getcwd -1 errno 2 No such file or directory
>>    19527 pwd      CALL  stat(0x800947a14,0x7fffffffd940)
>>    19527 pwd      NAMI  "/"
>>    19527 pwd      STRU  struct stat {dev=98, ino=2, mode=drwxr-xr-x ,
>> nlink=19, uid=0, gid=0, rdev=2120, atime=1363244893, stime=1362653279,
>> ctime=1362653279, birthtime=1200836451, size=1024, blksize=16384,
>> blocks=4, flags=0x0 }
>>    19527 pwd      RET   stat 0
>>    19527 pwd      CALL  lstat(0x80094779c,0x7fffffffd940)
>>    19527 pwd      NAMI  "."
>>    19527 pwd      STRU  struct stat {dev=1230702064, ino=145,
>> mode=drwxr-xr-x , nlink=2, uid=0, gid=0, rdev=4294967295,
>> atime=1363244672.246785874, stime=1363244792.864201338,
>> ctime=1363244792.864201338, birthtime=-1, size=3, blksize=4096,
>> blocks=3, flags=0x0 }
>>    19527 pwd      RET   lstat 0
>>    19527 pwd      CALL  openat(0xffffff9c,0x80094779b,0x100000,0x2)
>>    19527 pwd      NAMI  ".."
>>    19527 pwd      RET   openat 3
>>    19527 pwd      CALL  fstat(0x3,0x7fffffffd880)
>>    19527 pwd      STRU  struct stat {dev=1230702064, ino=4,
>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
>> atime=1363244665.232140704, stime=1363010116.496298252,
>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
>> blocks=3, flags=0x0 }
>>    19527 pwd      RET   fstat 0
>>    19527 pwd      CALL  fcntl(0x3,F_SETFD,FD_CLOEXEC)
>>    19527 pwd      RET   fcntl 0
>>    19527 pwd      CALL  fstatfs(0x3,0x7fffffffd660)
>>    19527 pwd      RET   fstatfs 0
>>    19527 pwd      CALL  fstat(0x3,0x7fffffffd940)
>>    19527 pwd      STRU  struct stat {dev=1230702064, ino=4,
>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
>> atime=1363244665.232140704, stime=1363010116.496298252,
>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
>> blocks=3, flags=0x0 }
>>    19527 pwd      RET   fstat 0
>>    19527 pwd      CALL  getdirentries(0x3,0x801018000,0x1000,0x8010160a8)
>>    19527 pwd      RET   getdirentries 4096/0x1000
>>    19527 pwd      CALL  fstat(0x3,0x7fffffffd940)
>>    19527 pwd      STRU  struct stat {dev=1230702064, ino=4,
>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
>> atime=1363244665.232140704, stime=1363010116.496298252,
>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
>> blocks=3, flags=0x0 }
>>    19527 pwd      RET   fstat 0
>>    19527 pwd      CALL  openat(0x3,0x80094779b,0x100000,0)
>>    19527 pwd      NAMI  ".."
>>    19527 pwd      RET   openat 4
>> [..............................]
>>    19527 pwd      CALL  madvise(0x801016000,0x1000,MADV_FREE)
>>    19527 pwd      RET   madvise 0
>>    19527 pwd      CALL  madvise(0x801018000,0x2000,MADV_FREE)
>>    19527 pwd      RET   madvise 0
>>    19527 pwd      CALL  close(0x3)
>>    19527 pwd      RET   close 0
>>    19527 pwd      CALL  fstat(0x4,0x7fffffffd880)
>>    19527 pwd      STRU  struct stat {dev=973143810, ino=4,
>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295,
>> atime=1363244767.460164771, stime=1363172100.380266923,
>> ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096,
>> blocks=713, flags=0x0 }
>>    19527 pwd      RET   fstat 0
>>    19527 pwd      CALL  fcntl(0x4,F_SETFD,FD_CLOEXEC)
>>    19527 pwd      RET   fcntl 0
>>    19527 pwd      CALL  fstatfs(0x4,0x7fffffffd660)
>>    19527 pwd      RET   fstatfs 0
>>    19527 pwd      CALL  fstat(0x4,0x7fffffffd940)
>>    19527 pwd      STRU  struct stat {dev=973143810, ino=4,
>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295,
>> atime=1363244767.460164771, stime=1363172100.380266923,
>> ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096,
>> blocks=713, flags=0x0 }
>>    19527 pwd      RET   fstat 0
>>    19527 pwd      CALL  getdirentries(0x4,0x801018000,0x1000,0x8010160a8)
>>    19527 pwd      RET   getdirentries 4096/0x1000
>>    19527 pwd      CALL  fstatat(0x4,0x801018030,0x7fffffffd940,0x200)
>>    19527 pwd      NAMI  "user6158"
>>    19527 pwd      STRU  struct stat {dev=1774902232, ino=4,
>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
>> atime=1363009687.040357529, stime=1363010116.496298252,
>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
>> blocks=3, flags=0x0 }
>>    19527 pwd      RET   fstatat 0
>>    19527 pwd      CALL  fstatat(0x4,0x80101804c,0x7fffffffd940,0x200)
>>    19527 pwd      NAMI  "user2289"
>>    19527 pwd      STRU  struct stat {dev=1988229825, ino=4,
>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
>> atime=1363009687.040357529, stime=1363010116.496298252,
>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
>> blocks=3, flags=0x0 }
>>    19527 pwd      RET   fstatat 0
>>    19527 pwd      CALL  fstatat(0x4,0x801018068,0x7fffffffd940,0x200)
>>    19527 pwd      NAMI  "user4761"
>>    19527 pwd      STRU  struct stat {dev=2438657130, ino=4,
>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
>> atime=1363009687.040357529, stime=1363010116.496298252,
>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
>> blocks=3, flags=0x0 }
>>    19527 pwd      RET   fstatat 0
>>    19527 pwd      CALL  fstatat(0x4,0x801018084,0x7fffffffd940,0x200)
>>    19527 pwd      NAMI  "user6055"
>> [.........................................]
>>
>> and next get stat of all directories in /home
>
> Slightly different version of the patch was committed as r247560.
>
> The situation could only happen if the parent directory contains the "."
> entry with inode number equal to the inode number of the subdirectory.
> Can you confirm that this is your case ?
>

Yes, it is.
I'll try again on the latest snapshot. Thanks!



-- 
      Best Regards,

      Ilia Noskov
      Regional Network Information Center (RU-CENTER)
      phone: +7 495 737-0601
      fax: +7 495 737-0602
      http://www.nic.ru

From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 11:47:20 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 51F58599
 for <freebsd-fs@FreeBSD.org>; Thu, 14 Mar 2013 11:47:20 +0000 (UTC)
 (envelope-from peter.maloney@brockmann-consult.de)
Received: from moutng.kundenserver.de (moutng.kundenserver.de
 [212.227.126.171])
 by mx1.freebsd.org (Postfix) with ESMTP id D683D6DA
 for <freebsd-fs@FreeBSD.org>; Thu, 14 Mar 2013 11:47:19 +0000 (UTC)
Received: from [10.3.0.26] ([141.4.215.32])
 by mrelayeu.kundenserver.de (node=mreu4) with ESMTP (Nemesis)
 id 0M31EZ-1UZ1BQ27yp-00sxHw; Thu, 14 Mar 2013 12:47:03 +0100
Message-ID: <5141B8B6.4010209@brockmann-consult.de>
Date: Thu, 14 Mar 2013 12:47:02 +0100
From: Peter Maloney <peter.maloney@brockmann-consult.de>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:17.0) Gecko/17.0 Thunderbird/17.0
MIME-Version: 1.0
To: Bruce Evans <brde@optusnet.com.au>
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com>
 <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
 <513CE369.4030303@caltel.com>
 <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
 <1362951595.99445.2.camel@btw.pki2.com>
 <alpine.BSF.2.00.1303101807550.8481@wonkity.com>
 <CABXB=RTt-j0SGxktWMfLcgLAEN6Vi+f=psBuN0jQaJthk_3cbw@mail.gmail.com>
 <513E1208.5020804@caltel.com> <20130312203745.A1130@besplex.bde.org>
 <513F8F04.60206@caltel.com> <20130313232247.B1078@besplex.bde.org>
 <5140F373.1010907@caltel.com> <20130314195715.Y909@besplex.bde.org>
In-Reply-To: <20130314195715.Y909@besplex.bde.org>
X-Enigmail-Version: 1.5
X-Provags-ID: V02:K0:C4/pe3TKhgPHEqVsWCFKHk7t77FrJqAAOpRdszHRylk
 Oze/q0yot56nEVdpSobrAswTpXLbeJBzppXtzLCFrn1ehj8H/k
 qU9v3JX0tM3hERgT7hq+5h3KBKTA0SBzPmsIxfgfoEU4Icm2VX
 BJNRTZLRCYCxX6cJl/6Bp0+fN8VqgEjIE9yekNQeqkUEOi9yS3
 Cgb16UwtrnQfAvquYWQZXUY0MCQKhThWDA6RHm7pBayHHIRsYj
 aMFR9X1Jg3huY2fd9bLIBbB8oIKtIHpoISQQ0YuSpTNkCnhHh/
 0ZVXm6QMjKHfnLOd08C3UODvW02LmdXf8ZyzO9h1Om57tOpOoG
 ZSJZZ3+6rmR64lZEXgF7xDfIS7EPnMO/AgEz6qv1C
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: freebsd-fs@FreeBSD.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 11:47:20 -0000

On 2013-03-14 10:41, Bruce Evans wrote:
> On Wed, 13 Mar 2013, Cody Ritts wrote:
>
>> So, by setting those CHS values I am:
>>  making the partition table more compatible with other operating
>> systems and BIOSes?
>>  and giving some utilities the CHS stuff they need to function right?
>
> It's not completely clear that S=32 H=64 is portable, but it is what most
> old SCSI BIOSes used.
>
> Also, if the disk already has some partitions with a certain geometry,
> use
> the same geometry for other partitions and don't use fdisk's defaults if
> they differ.
>
> Bruce

Oh man... I thought yeah that -a 1 or -a 2048 should work, but it
doesn't. And then I thought I'd be extra crafty and use dd to directly
write the partition table myself and send that as a solution to you
guys, but even that fails!


Here's writing a 63 alignment mbr to the disk, just to prove dd can do this:

# gdd if=mbr.img of=/dev/md10 bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 16.8709 s, 0.0 kB/s

# gpart show md10
=>     63  4194241  md10  MBR  (2.0G)
       63    40950     1  freebsd  (20M)
    41013  4153291        - free -  (2G)

Here's changing the start sector on the first partition to 2048 ;)
Writing to the device works with bs=512, but not bs=1, so we use a file
and bs=1 to do our edits, and then bs=512 to the disk.

# gdd if=<(echo -ne "\x00\x08" ) of=mbr.img bs=1 seek=454
2+0 records in
2+0 records out
2 bytes (2 B) copied, 0.000112023 s, 17.9 kB/s
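
For reference, the same edit can be sketched in Python on the image file
instead of with dd. This assumes the standard MBR layout: partition table
at byte 446, 16-byte entries, and a 32-bit little-endian start LBA at
offset 8 of each entry, hence 446 + 8 = 454 for the first one.
set_start_lba is a hypothetical helper name. Like the dd command, it leaves
the CHS and size fields of the entry untouched.

```python
import struct

PART_TABLE_OFFSET = 446   # first partition entry in a 512-byte MBR
ENTRY_BYTES = 16          # size of one partition entry
LBA_START_FIELD = 8       # offset of the 4-byte start LBA within an entry

def set_start_lba(mbr, index, start_lba):
    """Return a copy of a 512-byte MBR with entry `index` (0-3) moved."""
    if len(mbr) != 512:
        raise ValueError("an MBR sector is exactly 512 bytes")
    if not 0 <= index <= 3:
        raise ValueError("an MBR has four primary entries, numbered 0-3")
    patched = bytearray(mbr)
    offset = PART_TABLE_OFFSET + ENTRY_BYTES * index + LBA_START_FIELD
    struct.pack_into("<I", patched, offset, start_lba)  # little-endian u32
    return bytes(patched)
```

Reading mbr.img, calling set_start_lba(data, 0, 2048) and writing the
result back would produce the bytes 00 08 00 00 at offset 454. The echo
above writes only the low two bytes, which is enough here because the upper
two bytes of the old value 63 are already zero.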

Here's writing the new 2048 aligned mbr to the disk:

# gdd if=mbr.img of=/dev/md10 bs=1 count=1
gdd: writing `/dev/md10': *Invalid argument*
1+0 records in
0+0 records out
0 bytes (0 B) copied, 21.0247 s, 0.0 kB/s

:O
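
A possible explanation for the failure, offered as a guess rather than a confirmed diagnosis: the last command uses bs=1, which the remark above already says the device rejects, and dd truncates its output file at the seek offset unless conv=notrunc is given, so mbr.img may have shrunk to 456 bytes after the edit. The sketch below works on a scratch file only (the name scratch-mbr.img is made up for this example) and shows the same edit with conv=notrunc. Offset 454 is 446 + 8, the little-endian start-LBA field of the first MBR partition entry.

```shell
#!/bin/sh
# Sketch only: operates on a scratch file, never on a real device.
img=scratch-mbr.img   # hypothetical file name

# Create a zero-filled 512-byte stand-in for a saved MBR sector.
dd if=/dev/zero of="$img" bs=512 count=1 2>/dev/null

# Write start LBA 2048 (0x00000800) as four little-endian bytes at offset 454.
# conv=notrunc keeps dd from truncating the file right after the patched bytes.
printf '\000\010\000\000' | dd of="$img" bs=1 seek=454 conv=notrunc 2>/dev/null

# The image must still be exactly one sector long.
size=$(wc -c < "$img" | tr -d ' ')
echo "size: $size"

# Dump the four patched bytes in hex.
start=$(od -An -tx1 -j454 -N4 "$img" | tr -d ' \n')
echo "start LBA bytes: $start"

# Remove the scratch file again.
rm -f "$img"
```

Writing a patched sector back to the device would then use a whole-sector block size (bs=512 count=1), as in the first gdd command above.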

From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 11:53:27 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id D3E3D8F5
 for <freebsd-fs@FreeBSD.org>; Thu, 14 Mar 2013 11:53:27 +0000 (UTC)
 (envelope-from peter.maloney@brockmann-consult.de)
Received: from moutng.kundenserver.de (moutng.kundenserver.de
 [212.227.126.187])
 by mx1.freebsd.org (Postfix) with ESMTP id 7A7CC73B
 for <freebsd-fs@FreeBSD.org>; Thu, 14 Mar 2013 11:53:27 +0000 (UTC)
Received: from [10.3.0.26] ([141.4.215.32])
 by mrelayeu.kundenserver.de (node=mreu1) with ESMTP (Nemesis)
 id 0MURRH-1U7cHA1ROl-00R428; Thu, 14 Mar 2013 12:53:15 +0100
Message-ID: <5141BA2A.9080904@brockmann-consult.de>
Date: Thu, 14 Mar 2013 12:53:14 +0100
From: Peter Maloney <peter.maloney@brockmann-consult.de>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:17.0) Gecko/17.0 Thunderbird/17.0
MIME-Version: 1.0
To: Bruce Evans <brde@optusnet.com.au>
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com>
 <alpine.BSF.2.00.1303101006490.5989@wonkity.com>
 <513CD9AB.5080903@caltel.com>
 <alpine.BSF.2.00.1303101326530.7218@wonkity.com>
 <513CE369.4030303@caltel.com>
 <alpine.BSF.2.00.1303101349540.7637@wonkity.com>
 <1362951595.99445.2.camel@btw.pki2.com>
 <alpine.BSF.2.00.1303101807550.8481@wonkity.com>
 <CABXB=RTt-j0SGxktWMfLcgLAEN6Vi+f=psBuN0jQaJthk_3cbw@mail.gmail.com>
 <513E1208.5020804@caltel.com> <20130312203745.A1130@besplex.bde.org>
 <513F8F04.60206@caltel.com> <20130313232247.B1078@besplex.bde.org>
 <5140F373.1010907@caltel.com> <20130314195715.Y909@besplex.bde.org>
 <5141B8B6.4010209@brockmann-consult.de>
In-Reply-To: <5141B8B6.4010209@brockmann-consult.de>
X-Enigmail-Version: 1.5
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Provags-ID: V02:K0:iidejYT6c3uT7Wc+/2P8aXqCpJesMuK8sHf15MFU3+Q
 AM5plOsrCOEiO2cpPkHan3AdFyGG4niJpP7GjP0o2Y8UgmeCTN
 5VyhCA1Q7aUG32GO4ZEjF6LT8CH4FSW6LhhBhMycjQPKrqa2k7
 /MGQ+5xcbjJBzgR+wD5Zi5ERfSwprFVB+/CV8STL+Jub2F0q88
 P2l8Cu31wusKYNGvLY2BrvHlalVzZFhM6td1Z+S+Lu63oLM/Va
 1YsVIiwOC0ebObryTVg1PiAjZWhhZ0rfbGxxlzk64GdfizYo6A
 6HzdUqIIAbXTh59AfdYLXrkEziGTmiFWH6w0NKfI2kK1lMqhO4
 KDnSf7TGn1aeMXpVObB0M4kcT1eod9RNiLCF3RYoP
Cc: freebsd-fs@FreeBSD.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 11:53:27 -0000

On 2013-03-14 12:47, Peter Maloney wrote:
> On 2013-03-14 10:41, Bruce Evans wrote:
>> On Wed, 13 Mar 2013, Cody Ritts wrote:
>>
>>> So, by setting those CHS values I am:
>>>  making the partition table more compatible with other operating
>>> systems and BIOSes?
>>>  and giving some utilities the CHS stuff they need to function right?
>> It's not completely clear that S=32 H=64 is portable, but it is what most
>> old SCSI BIOSes used.
>>
>> Also, if the disk already has some partitions with a certain geometry,
>> use
>> the same geometry for other partitions and don't use fdisk's defaults if
>> they differ.
>>
>> Bruce
> Oh man... I thought yeah that -a 1 or -a 2048 should work, but it
> doesn't. And then I thought I'd be extra crafty and use dd to directly
> write the partition table myself and send that as a solution to you
> guys, but even that fails!
>
>
> Here's writing a 63 alignment mbr to the disk, just to prove dd can do this:
>
> # gdd if=mbr.img of=/dev/md10 bs=512 count=1
> 1+0 records in
> 1+0 records out
> 512 bytes (512 B) copied, 16.8709 s, 0.0 kB/s
>
> # gpart show md10
> =>     63  4194241  md10  MBR  (2.0G)
>        63    40950     1  freebsd  (20M)
>     41013  4153291        - free -  (2G)
>
> Here's changing the start sector on the first partition to 2048 ;)
> Writing to the device works with bs=512, but not bs=1, so we use a file
> and bs=1 to do our edits, and then bs=512 to the disk.
>
> # gdd if=<(echo -ne "\x00\x08" ) of=mbr.img bs=1 seek=454
> 2+0 records in
> 2+0 records out
> 2 bytes (2 B) copied, 0.000112023 s, 17.9 kB/s
>
> Here's writing the new 2048 aligned mbr to the disk:
>
> # gdd if=mbr.img of=/dev/md10 bs=1 count=1
> gdd: writing `/dev/md10': *Invalid argument*
> 1+0 records in
> 0+0 records out
> 0 bytes (0 B) copied, 21.0247 s, 0.0 kB/s
>
> :O


Oh, and I almost forgot the most important part... the solution!

The solution is to align to 129024 sectors instead (129024 = 63 * 2048),
which satisfies modern 512/1024/2048-sector alignment and also the crazy
old 63-sector track alignment.

# gpart add -t freebsd -a 129024 -s 1M md10
md10s1 added
# gpart add -t freebsd -a 129024 -s 1511M md10
md10s2 added
# gpart show md10
=>     63  4194241  md10  MBR  (2.0G)
       63   128961        - free -  (63M)
   129024     2016     1  freebsd  (1M)
   131040   127008        - free -  (62M)
   258048  2967552     2  freebsd  (1.4G)
  3225600   968704        - free -  (473M)

Above -s numbers are basically random for testing. So now let's check
that they are indeed aligned, with the modulus in bc. (Note that
strangely, % is only modulus if scale=0 in bc)

# bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
129024%2048
0
258048%2048
0
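
The bc session only checks the 2048 case. Here is a small sketch of the arithmetic behind 129024, assuming the old requirement is 63-sector track alignment: 63 and 2048 share no common factor, so their least common multiple is 63 * 2048 = 129024. The gcd and lcm helpers are written for this example and are not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# gcd and lcm are helper functions written for this example only.
gcd() {
    a=$1; b=$2
    # Euclid's algorithm: replace (a, b) with (b, a mod b) until b is zero.
    while [ "$b" -ne 0 ]; do
        t=$((a % b)); a=$b; b=$t
    done
    echo "$a"
}
lcm() {
    echo $(( $1 / $(gcd "$1" "$2") * $2 ))
}

align=$(lcm 63 2048)
echo "lcm(63, 2048) = $align"

# Check both partition starts from the gpart output against each alignment.
for start in 129024 258048; do
    for unit in 63 512 1024 2048; do
        echo "$start % $unit = $((start % unit))"
    done
done
```

Every remainder printed by the loop should be 0 if both partitions really start on a boundary that satisfies all four alignments.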


From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 14:57:29 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id A20B7799;
 Thu, 14 Mar 2013 14:57:29 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net
 [IPv6:2001:470:1f10:75::2])
 by mx1.freebsd.org (Postfix) with ESMTP id 70A2F266;
 Thu, 14 Mar 2013 14:57:29 +0000 (UTC)
Received: from jhbbsd.localnet (unknown [209.249.190.124])
 by bigwig.baldwin.cx (Postfix) with ESMTPSA id C4858B985;
 Thu, 14 Mar 2013 10:57:28 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Subject: Re: Deadlock in the NFS client
Date: Thu, 14 Mar 2013 10:57:13 -0400
User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p25; KDE/4.5.5; amd64; ; )
References: <201303131356.37919.jhb@freebsd.org>
 <492562517.3880600.1363217615412.JavaMail.root@erie.cs.uoguelph.ca>
 <20130314092728.GI3794@kib.kiev.ua>
In-Reply-To: <20130314092728.GI3794@kib.kiev.ua>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-15"
Content-Transfer-Encoding: 7bit
Message-Id: <201303141057.13609.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
 (bigwig.baldwin.cx); Thu, 14 Mar 2013 10:57:28 -0400 (EDT)
Cc: Rick Macklem <rmacklem@freebsd.org>, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 14:57:29 -0000

On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote:
> On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote:
> > John Baldwin wrote:
> > > I ran into a machine that had a deadlock among certain files on a
> > > given NFS
> > > mount today. I'm not sure how best to resolve it, though it seems like
> > > perhaps there is a bug with how the pool of nfsiod threads is managed.
> > > Anyway, more details on the actual hang below. This was on 8.x with
> > > the
> > > old NFS client, but I don't see anything in HEAD that would fix this.
> > > 
> > > First note that the system was idle so it had dropped down to only one
> > > nfsiod thread.
> > > 
> > Hmm, I see the problem and I'm a bit surprised it doesn't bite more often.
> > It seems to me that this snippet of code from nfs_asyncio() makes too
> > weak an assumption:
> > 	/*
> > 	 * If none are free, we may already have an iod working on this mount
> > 	 * point.  If so, it will process our request.
> > 	 */
> > 	if (!gotiod) {
> > 		if (nmp->nm_bufqiods > 0) {
> > 			NFS_DPF(ASYNCIO,
> > 		("nfs_asyncio: %d iods are already processing mount %p\n",
> > 				 nmp->nm_bufqiods, nmp));
> > 			gotiod = TRUE;
> > 		}
> > 	}
> > It assumes that, since an nfsiod thread is processing some buffer for the
> > mount, it will become available to do this one, which isn't true for your
> > deadlock.
> > 
> > I think the simple fix would be to recode nfs_asyncio() so that
> > it only returns 0 if it finds an AVAILABLE nfsiod thread that it
> > has assigned to do the I/O, getting rid of the above. The problem
> > with doing this is that it may result in a lot more synchronous I/O
> > (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe more
> > synchronous I/O could be avoided by allowing nfs_asyncio() to create a
> > new thread even if the total is above nfs_iodmax. (I think this would
> > require the fixed array to be replaced with a linked list and might
> > result in a large number of nfsiod threads.) Maybe just having a large
> > nfs_iodmax would be an adequate compromise?
> >
> > Does having a large # of nfsiod threads cause any serious problem for
> > most systems these days?
> >
> > I'd be tempted to recode nfs_asyncio() as above and then, instead
> > of nfs_iodmin and nfs_iodmax, I'd simply have: - a fixed number of
> > nfsiod threads (this could be a tunable, with the understanding that
> > it should be large for good performance)
> >
> 
> I do not see how this would solve the deadlock itself. The proposal would
> only allow system to survive slightly longer after the deadlock appeared.
> And, I think that allowing an unbounded number of nfsiod threads is also
> fatal.
> 
> The issue there is the LOR between buffer lock and vnode lock. Buffer lock
> always must come after the vnode lock. The problematic nfsiod thread, which
> locks the vnode, violates this rule, because despite the LK_KERNPROC
> ownership of the buffer lock, it is the thread which de facto owns the
> buffer (only the thread can unlock it).
> 
> A possible solution would be to pass LK_NOWAIT to nfs_nget() from the
> nfs_readdirplusrpc(). From my reading of the code, nfs_nget() should
> be capable of correctly handling the lock failure. And EBUSY would
> result in doit = 0, which should be fine too.
> 
> It is possible that EBUSY should be reset to 0, though.

Yes, thinking about this more, I do think the right answer is for readdirplus
to do this.  The only question I have is whether it should do this always, or
only from the nfsiod thread.  I believe you can't hit this deadlock in
the non-nfsiod case.

-- 
John Baldwin

From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 17:22:49 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 63914CEF;
 Thu, 14 Mar 2013 17:22:49 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
 by mx1.freebsd.org (Postfix) with ESMTP id C9DA1DD0;
 Thu, 14 Mar 2013 17:22:48 +0000 (UTC)
Received: from tom.home (kostik@localhost [127.0.0.1])
 by kib.kiev.ua (8.14.6/8.14.6) with ESMTP id r2EHMegE075450;
 Thu, 14 Mar 2013 19:22:40 +0200 (EET)
 (envelope-from kostikbel@gmail.com)
DKIM-Filter: OpenDKIM Filter v2.8.0 kib.kiev.ua r2EHMegE075450
Received: (from kostik@localhost)
 by tom.home (8.14.6/8.14.6/Submit) id r2EHMeDj075449;
 Thu, 14 Mar 2013 19:22:40 +0200 (EET)
 (envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com
 using -f
Date: Thu, 14 Mar 2013 19:22:39 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: John Baldwin <jhb@freebsd.org>
Subject: Re: Deadlock in the NFS client
Message-ID: <20130314172239.GL3794@kib.kiev.ua>
References: <201303131356.37919.jhb@freebsd.org>
 <492562517.3880600.1363217615412.JavaMail.root@erie.cs.uoguelph.ca>
 <20130314092728.GI3794@kib.kiev.ua>
 <201303141057.13609.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="f78grIEC2xXSYamv"
Content-Disposition: inline
In-Reply-To: <201303141057.13609.jhb@freebsd.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
 version=3.3.2
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home
Cc: Rick Macklem <rmacklem@freebsd.org>, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 17:22:49 -0000


--f78grIEC2xXSYamv
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: 7bit

On Thu, Mar 14, 2013 at 10:57:13AM -0400, John Baldwin wrote:
> On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote:
> > On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote:
> > > John Baldwin wrote:
> > > > I ran into a machine that had a deadlock among certain files on a
> > > > given NFS
> > > > mount today. I'm not sure how best to resolve it, though it seems like
> > > > perhaps there is a bug with how the pool of nfsiod threads is managed.
> > > > Anyway, more details on the actual hang below. This was on 8.x with
> > > > the
> > > > old NFS client, but I don't see anything in HEAD that would fix this.
> > > >
> > > > First note that the system was idle so it had dropped down to only one
> > > > nfsiod thread.
> > > >
> > > Hmm, I see the problem and I'm a bit surprised it doesn't bite more often.
> > > It seems to me that this snippet of code from nfs_asyncio() makes too
> > > weak an assumption:
> > > 	/*
> > > 	 * If none are free, we may already have an iod working on this mount
> > > 	 * point.  If so, it will process our request.
> > > 	 */
> > > 	if (!gotiod) {
> > > 		if (nmp->nm_bufqiods > 0) {
> > > 			NFS_DPF(ASYNCIO,
> > > 		("nfs_asyncio: %d iods are already processing mount %p\n",
> > > 				 nmp->nm_bufqiods, nmp));
> > > 			gotiod = TRUE;
> > > 		}
> > > 	}
> > > It assumes that, since an nfsiod thread is processing some buffer for the
> > > mount, it will become available to do this one, which isn't true for your
> > > deadlock.
> > >
> > > I think the simple fix would be to recode nfs_asyncio() so that
> > > it only returns 0 if it finds an AVAILABLE nfsiod thread that it
> > > has assigned to do the I/O, getting rid of the above. The problem
> > > with doing this is that it may result in a lot more synchronous I/O
> > > (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe more
> > > synchronous I/O could be avoided by allowing nfs_asyncio() to create a
> > > new thread even if the total is above nfs_iodmax. (I think this would
> > > require the fixed array to be replaced with a linked list and might
> > > result in a large number of nfsiod threads.) Maybe just having a large
> > > nfs_iodmax would be an adequate compromise?
> > >
> > > Does having a large # of nfsiod threads cause any serious problem for
> > > most systems these days?
> > >
> > > I'd be tempted to recode nfs_asyncio() as above and then, instead
> > > of nfs_iodmin and nfs_iodmax, I'd simply have: - a fixed number of
> > > nfsiod threads (this could be a tunable, with the understanding that
> > > it should be large for good performance)
> > >
> >
> > I do not see how this would solve the deadlock itself. The proposal would
> > only allow system to survive slightly longer after the deadlock appeared.
> > And, I think that allowing an unbounded number of nfsiod threads is also
> > fatal.
> >
> > The issue there is the LOR between buffer lock and vnode lock. Buffer lock
> > always must come after the vnode lock. The problematic nfsiod thread, which
> > locks the vnode, violates this rule, because despite the LK_KERNPROC
> > ownership of the buffer lock, it is the thread which de facto owns the
> > buffer (only the thread can unlock it).
> >
> > A possible solution would be to pass LK_NOWAIT to nfs_nget() from the
> > nfs_readdirplusrpc(). From my reading of the code, nfs_nget() should
> > be capable of correctly handling the lock failure. And EBUSY would
> > result in doit = 0, which should be fine too.
> >
> > It is possible that EBUSY should be reset to 0, though.
>
> Yes, thinking about this more, I do think the right answer is for
> readdirplus to do this. The only question I have is if it should do
> this always, or if it should do this only from the nfsiod thread. I
> believe you can't get this in the non-nfsiod case.

I agree that it looks as if the workaround is only needed for the nfsiod
thread. On the other hand, it is not immediately obvious how to detect that
the current thread is an nfsiod daemon. Probably a thread flag should be
set.

--f78grIEC2xXSYamv
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iQIcBAEBAgAGBQJRQgdfAAoJEJDCuSvBvK1BJ3AQAJv801LkZRl8NG2tgvt1jm41
o0IcdC0+frAC6yRZQsmwkINJEFhojf5cozWobyBOYzjoC8SKJIJ4JfC82b7tZIgz
3KnRd0cspkuTrM5WbAkKL/MHzZmHAZkNs4VJ2z/Ov+pqedlq+HlecYbH9PUxG7+e
ZFSVTIhSP17pHeLFamR4eVKuJxC9H723ca//h5tgAWHu/PBimHOWdT6URJ6C1N/A
QuTZqkKSJ8HNkO+DPU89h5wC1IpDXwni+YY5M5rfbc9eisogeQ3k3KW4Jv28oDEe
VpPIhSLRZwF/nL/0tn0ha1s62XnAyFYT3r7j1pnFTRssdR0/llB8A3y9vRjkXsaX
uRoKholt57JZ7NsbR+yE8CrWQxBePx5cTaxU7k/42eqwm0JGPMiHaq8DChGQuC/m
tWaxtva48A0jL37ND+w/mifl/Bmul1s+U6VNwuZ732SQyHCTabTqV7t5InPbRL12
JTc+OOui85MU3wvigVUCKxenp181Hx1No+QXKFVeesUU1NENZWlph5+DxTy/UynB
G0v73kO70q1Rs6hQfQPRburyO2t6TdDVTfzpJNeoFR7mPfQ77uRqPWMxaQD6wlPw
v7ZZqH2H1adJltP5tzgxGuRDSmjRxcnKvCgICCLlK3n/JneLSmiFDOeOwRqf7G0g
1OA/JWbKedASOThXMHrD
=P2wk
-----END PGP SIGNATURE-----

--f78grIEC2xXSYamv--

From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 18:13:40 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 0189BD39
 for <freebsd-fs@freebsd.org>; Thu, 14 Mar 2013 18:13:40 +0000 (UTC)
 (envelope-from fjwcash@gmail.com)
Received: from mail-qa0-f49.google.com (mail-qa0-f49.google.com
 [209.85.216.49]) by mx1.freebsd.org (Postfix) with ESMTP id BCBBB172
 for <freebsd-fs@freebsd.org>; Thu, 14 Mar 2013 18:13:39 +0000 (UTC)
Received: by mail-qa0-f49.google.com with SMTP id o13so1399854qaj.15
 for <freebsd-fs@freebsd.org>; Thu, 14 Mar 2013 11:13:38 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:date:message-id:subject:from:to
 :content-type; bh=GYplxzYfiI41tlYW79pzop4H+7jE/XGvwwwev1BwNtU=;
 b=IZDNwfAY+G0fzdBtGAG2Kf/ZQc238NtfwAyalEwiABkUObRexaue6b0maSH1e0y8fl
 CMfT/hfF70AFafIFfwZAN5nUb/FntherTWcnM5Mucu4G63NNQD/95CY67CS0FEBzh9n0
 8U1WFok6nvj2rQFJ7u4Kq6w8tz8Q2OM/goyE9VwB2BrZenSM/4A3TY2Wqz0FHt0Onc9z
 ZJR3Y/LNPoPXSw2ST5OjoBor5g0HUrXngYS5VNXEqtiifeOFPc2c6ePIBAn0fPl/h6zm
 9npvQVjOIxuZexn+Azn6zqPaPWocLestS42hZ4SF+NPtSvG7CApe8j+jXxJ5M4nwQoql
 Xotw==
MIME-Version: 1.0
X-Received: by 10.229.172.162 with SMTP id l34mr713340qcz.81.1363284818828;
 Thu, 14 Mar 2013 11:13:38 -0700 (PDT)
Received: by 10.49.50.67 with HTTP; Thu, 14 Mar 2013 11:13:38 -0700 (PDT)
Date: Thu, 14 Mar 2013 11:13:38 -0700
Message-ID: <CAOjFWZ6Q=Vs3P-kfGysLzSbw4CnfrJkMEka4AqfSrQJFZDP_qw@mail.gmail.com>
Subject: Strange slowdown when cache devices enabled in ZFS
From: Freddie Cash <fjwcash@gmail.com>
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Content-Type: text/plain; charset=UTF-8
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 18:13:40 -0000

3 storage systems are running this:
# uname -a
FreeBSD alphadrive.sd73.bc.ca 9.1-STABLE FreeBSD 9.1-STABLE #0 r245466M:
Fri Feb  1 09:38:24 PST 2013
root@alphadrive.sd73.bc.ca:/usr/obj/usr/src/sys/ZFSHOST
amd64

1 storage system is running this:
# uname -a
FreeBSD omegadrive.sd73.bc.ca 9.1-STABLE FreeBSD 9.1-STABLE #0 r247804M:
Mon Mar  4 10:27:26 PST 2013
root@omegadrive.sd73.bc.ca:/usr/obj/usr/src/sys/ZFSHOST
amd64

The last system has manually merged the ZFS "deadman" patch (r 247265 from
-CURRENT).

All 4 systems exhibit the same symptoms:  if a cache device is enabled in
the pool, the l2arc_feed_thread of zfskern will spin until it takes up 100%
of a CPU core, at which point all I/O to the pool stops.  "zpool iostat 1"
and "zpool iostat -v 1" show 0 reads and 0 writes to the pool.  "gstat -I
1s -f gpt" shows 0 activity to the pool disks.

If I remove the cache device from the pool, I/O starts up right away
(although it takes several minutes for the remove operation to complete).

During the "0 I/O period", any attempt to access the pool "hangs".  CTRL+T
shows either spa_namespace_lock or tx->tx_something or other (the one when
trying to write a transaction to disk).  And it will stay like that until
the cache device is removed.

Hardware is almost the same in all 4 boxes:

3x storage boxes:
alphadrive:
    SuperMicro H8DGi-F motherboard
    AMD Opteron 6128 CPU (8 cores at 2.0 GHz)
    64 GB of DDR3 ECC SDRAM in one box
    32 GB SSD for the OS and cache device (GPT partitioned)
    24x 2.0 TB WD and Seagate SATA harddrives (4x 6-drive raidz2 vdevs)
    SuperMicro AOC-USAS-8i SATA controller using mpt driver
    SuperMicro 4U chassis

betadrive:
    SuperMicro H8DGi-F motherboard
    AMD Opteron 6128 CPU (8 cores at 2.0 GHz)
    48 GB of DDR3 ECC SDRAM in one box
    32 GB SSD for the OS and cache device (GPT partitioned)
    16x 2.0 TB WD and Seagate SATA harddrives (3x 5-drive raidz2 vdevs +
spare)
    SuperMicro AOC-USAS2-8i SATA controller using mps driver
    SuperMicro 3U chassis

zuludrive:
    SuperMicro H8DGi-F motherboard
    AMD Opteron 6128 CPU (8 cores at 2.0 GHz)
    32 GB of DDR3 ECC SDRAM in one box
    32 GB SSD for the OS and cache device (GPT partitioned)
    24x 2.0 TB WD and Seagate SATA harddrives (4x 6-drive raidz2 vdevs)
    SuperMicro AOC-USAS2-8i SATA controller using mps driver
    SuperMicro 836 chassis


1x storage box:
omegadrive:
    SuperMicro H8DG6-F motherboard
    2x AMD Opteron 6128 CPU (8 cores at 2.0 GHz; 16 cores total)
    128 GB of DDR3 ECC SDRAM in one box
    2x 60 GB SSD for the OS (gmirror'd) and log devices (ZFS mirror)
    2x 120 GB SSD for cache devices
    45x 2.0 TB WD and Seagate SATA harddrives (7x 6-drive raidz2 vdevs + 3
spares)
    LSI 9211-8e SAS controllers using mps driver
    Onboard LSI 2008 SATA controller using mps driver for OS/log/cache
    SuperMicro 4U JBOD chassis
    SuperMicro 2U chassis for motherboard/OS

alphadrive, betadrive, and omegadrive all have dedup and lzjb compression
enabled.
zuludrive has lzjb compression enabled (no dedup).

alpha/beta/zulu do rsync backups every night from various local and remote
Linux and FreeBSD boxes, then ZFS send the snapshot to omegadrive during
the day.  The "0 I/O periods" occur most often and most quickly on
omegadrive when receiving snapshots, but will eventually occur on all
systems during the rsyncs.

Things I've tried:
  - limiting ARC to only 32 GB on each system
  - limiting L2ARC to 30 GB on each system
  - enabling the "deadman" patch in case it was I/O requests being lost by
the drives/controllers
  - changing primarycache between all and metadata
  - increasing arc_meta_limit to just shy of arc_max
  - removing cache devices completely

So far, only the last option works.  Without L2ARC, the systems are 100%
stable, and can push 200 MB/s of rsync writes and just shy of 500 MB/s of
ZFS recv (saturates gigabit link, bursts writes; usually hovers around
50-80 MB/s continuous writes).

I'm baffled.  An L2ARC is supposed to make things faster, especially when
using dedup as the DDT can be cached.

-- 
Freddie Cash
fjwcash@gmail.com

From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 18:44:56 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id C33DB47E
 for <freebsd-fs@freebsd.org>; Thu, 14 Mar 2013 18:44:56 +0000 (UTC)
 (envelope-from daniel@digsys.bg)
Received: from smtp-sofia.digsys.bg (smtp-sofia.digsys.bg [193.68.21.123])
 by mx1.freebsd.org (Postfix) with ESMTP id 183002B3
 for <freebsd-fs@freebsd.org>; Thu, 14 Mar 2013 18:44:55 +0000 (UTC)
Received: from [192.168.178.221] ([62.28.165.86]) (authenticated bits=0)
 by smtp-sofia.digsys.bg (8.14.6/8.14.6) with ESMTP id r2EIfxsq091921
 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NO);
 Thu, 14 Mar 2013 20:42:02 +0200 (EET)
 (envelope-from daniel@digsys.bg)
References: <CAOjFWZ6Q=Vs3P-kfGysLzSbw4CnfrJkMEka4AqfSrQJFZDP_qw@mail.gmail.com>
Mime-Version: 1.0 (1.0)
In-Reply-To: <CAOjFWZ6Q=Vs3P-kfGysLzSbw4CnfrJkMEka4AqfSrQJFZDP_qw@mail.gmail.com>
Content-Type: text/plain;
	charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Message-Id: <26456299-66A3-4CCF-9B9A-906D47EFFC93@digsys.bg>
X-Mailer: iPad Mail (10B146)
From: Daniel Kalchev <daniel@digsys.bg>
Subject: Re: Strange slowdown when cache devices enabled in ZFS
Date: Thu, 14 Mar 2013 18:41:58 +0000
To: Freddie Cash <fjwcash@gmail.com>
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 18:44:56 -0000

Just an idea - have you tried to increase the L2ARC fill rate? That might
move data faster from RAM to L2ARC. Don't remember the sysctls offhand.

Might be dedup kicking in and not able to get "swapped out" fast enough.

Daniel

From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 18:51:41 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 78193919
 for <freebsd-fs@freebsd.org>; Thu, 14 Mar 2013 18:51:41 +0000 (UTC)
 (envelope-from fjwcash@gmail.com)
Received: from mail-qc0-x235.google.com (mail-qc0-x235.google.com
 [IPv6:2607:f8b0:400d:c01::235])
 by mx1.freebsd.org (Postfix) with ESMTP id 3CB8B340
 for <freebsd-fs@freebsd.org>; Thu, 14 Mar 2013 18:51:41 +0000 (UTC)
Received: by mail-qc0-f181.google.com with SMTP id a22so1190787qcs.12
 for <freebsd-fs@freebsd.org>; Thu, 14 Mar 2013 11:51:40 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type;
 bh=NaEI9E5BSMcb4Dd2FVAfzktgk2sA+P7/cjvLmeDxVzM=;
 b=WjY7tSBYPWjDxE6bJMOWGG9G3nwcHPIAO9xnIl09cvK+6Zdgi+13rAK21apzHuwrYc
 JqRpKcO9qIKDzUYbhoQ3ZG62j8dwnAttvzUw3u982ZgaPDrNAoK3Vb1OJuBnPlOTXolc
 Jyf13lkv8FLKGczxR1mdGC7EI046XiiIr2TCA8nBOxQpwDxmYLVrd3rW/TLyDla1Oa0u
 zoQoC02uRGljKREUEyi/QfULTYalZMDpkAjA46swY2WSQE8tmfPskjwdUjXHmA9TEyah
 4jLkjCOHqDrZvQWr5t5GOJHJjxHZnbPtzm3NBFm/jVysty3riRjJY4DOxHbjfvk5OJz4
 UWyA==
MIME-Version: 1.0
X-Received: by 10.224.182.70 with SMTP id cb6mr3622266qab.80.1363287100731;
 Thu, 14 Mar 2013 11:51:40 -0700 (PDT)
Received: by 10.49.50.67 with HTTP; Thu, 14 Mar 2013 11:51:40 -0700 (PDT)
In-Reply-To: <26456299-66A3-4CCF-9B9A-906D47EFFC93@digsys.bg>
References: <CAOjFWZ6Q=Vs3P-kfGysLzSbw4CnfrJkMEka4AqfSrQJFZDP_qw@mail.gmail.com>
 <26456299-66A3-4CCF-9B9A-906D47EFFC93@digsys.bg>
Date: Thu, 14 Mar 2013 11:51:40 -0700
Message-ID: <CAOjFWZ718cGJUruysNYr+QuqX-RP2+dXAVTnMwnqFgz-+mkK9w@mail.gmail.com>
Subject: Re: Strange slowdown when cache devices enabled in ZFS
From: Freddie Cash <fjwcash@gmail.com>
To: Daniel Kalchev <daniel@digsys.bg>
Content-Type: text/plain; charset=UTF-8
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 18:51:41 -0000

I do have the following set in /boot/loader.conf:
# Set the L2ARC warmup writes to 160 MBps
vfs.zfs.l2arc_write_boost="160000000"
# Set the L2ARC writes to 320 MBps
vfs.zfs.l2arc_write_max="320000000"

Haven't tried setting them any higher than that, though.

During the "0 I/O periods", there's no I/O going to the cache devices
either, and they're anywhere from 50% full to almost 100% full (as shown in
"zpool list -v" and "zpool iostat" output).  ARC use is close to max,
though.



On Thu, Mar 14, 2013 at 11:41 AM, Daniel Kalchev <daniel@digsys.bg> wrote:

> Just an idea - have you tried to increase the L2ARC fill rate? That might
> move data faster from RAM to L2ARC. Don't remember the sysctls offhand.
>
> Might be dedup kicking in and not able to get "swapped out" fast enough.
>
> Daniel




-- 
Freddie Cash
fjwcash@gmail.com

From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 20:35:40 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id A1D262FB;
 Thu, 14 Mar 2013 20:35:40 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net
 [IPv6:2001:470:1f10:75::2])
 by mx1.freebsd.org (Postfix) with ESMTP id 4DE98A11;
 Thu, 14 Mar 2013 20:35:40 +0000 (UTC)
Received: from jhbbsd.localnet (unknown [209.249.190.124])
 by bigwig.baldwin.cx (Postfix) with ESMTPSA id A4A3EB943;
 Thu, 14 Mar 2013 16:35:39 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Subject: Re: Deadlock in the NFS client
Date: Thu, 14 Mar 2013 14:44:35 -0400
User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p25; KDE/4.5.5; amd64; ; )
References: <201303131356.37919.jhb@freebsd.org>
 <201303141057.13609.jhb@freebsd.org> <20130314172239.GL3794@kib.kiev.ua>
In-Reply-To: <20130314172239.GL3794@kib.kiev.ua>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-15"
Content-Transfer-Encoding: 7bit
Message-Id: <201303141444.35740.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
 (bigwig.baldwin.cx); Thu, 14 Mar 2013 16:35:39 -0400 (EDT)
Cc: Rick Macklem <rmacklem@freebsd.org>, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 20:35:40 -0000

On Thursday, March 14, 2013 1:22:39 pm Konstantin Belousov wrote:
> On Thu, Mar 14, 2013 at 10:57:13AM -0400, John Baldwin wrote:
> > On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote:
> > > On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote:
> > > > John Baldwin wrote:
> > > > > I ran into a machine that had a deadlock among certain files on a
> > > > > given NFS
> > > > > mount today. I'm not sure how best to resolve it, though it seems like
> > > > > perhaps there is a bug with how the pool of nfsiod threads is managed.
> > > > > Anyway, more details on the actual hang below. This was on 8.x with
> > > > > the
> > > > > old NFS client, but I don't see anything in HEAD that would fix this.
> > > > > 
> > > > > First note that the system was idle so it had dropped down to only one
> > > > > nfsiod thread.
> > > > > 
> > > > Hmm, I see the problem and I'm a bit surprised it doesn't bite more often.
> > > > It seems to me that this snippet of code from nfs_asyncio() makes too
> > > > weak an assumption:
> > > > 	/*
> > > > 	 * If none are free, we may already have an iod working on this mount
> > > > 	 * point.  If so, it will process our request.
> > > > 	 */
> > > > 	if (!gotiod) {
> > > > 		if (nmp->nm_bufqiods > 0) {
> > > > 			NFS_DPF(ASYNCIO,
> > > > 		("nfs_asyncio: %d iods are already processing mount %p\n",
> > > > 				 nmp->nm_bufqiods, nmp));
> > > > 			gotiod = TRUE;
> > > > 		}
> > > > 	}
> > > > It assumes that, since an nfsiod thread is processing some buffer for the
> > > > mount, it will become available to do this one, which isn't true for your
> > > > deadlock.
> > > > 
> > > > I think the simple fix would be to recode nfs_asyncio() so that
> > > > it only returns 0 if it finds an AVAILABLE nfsiod thread that it
> > > > has assigned to do the I/O, getting rid of the above. The problem
> > > > with doing this is that it may result in a lot more synchronous I/O
> > > > (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe more
> > > > synchronous I/O could be avoided by allowing nfs_asyncio() to create a
> > > > new thread even if the total is above nfs_iodmax. (I think this would
> > > > require the fixed array to be replaced with a linked list and might
> > > > result in a large number of nfsiod threads.) Maybe just having a large
> > > > nfs_iodmax would be an adequate compromise?
> > > >
> > > > Does having a large # of nfsiod threads cause any serious problem for
> > > > most systems these days?
> > > >
> > > > I'd be tempted to recode nfs_asyncio() as above and then, instead
> > > > of nfs_iodmin and nfs_iodmax, I'd simply have: - a fixed number of
> > > > nfsiod threads (this could be a tunable, with the understanding that
> > > > it should be large for good performance)
> > > >
> > > 
> > > I do not see how this would solve the deadlock itself. The proposal would
> > > only allow the system to survive slightly longer after the deadlock appeared.
> > > And, I think that allowing an unbounded number of nfsiod threads is also
> > > fatal.
> > > 
> > > The issue there is the LOR (lock order reversal) between buffer lock and
> > > vnode lock. Buffer lock always must come after the vnode lock. The
> > > problematic nfsiod thread, which locks the vnode, violates this rule,
> > > because despite the LK_KERNPROC ownership of the buffer lock, it is the
> > > thread which de facto owns the buffer (only the thread can unlock it).
> > > 
> > > A possible solution would be to pass LK_NOWAIT to nfs_nget() from the
> > > nfs_readdirplusrpc(). From my reading of the code, nfs_nget() should
> > > be capable of correctly handling the lock failure. And EBUSY would
> > > result in doit = 0, which should be fine too.
> > > 
> > > It is possible that EBUSY should be reset to 0, though.
> > 
> > Yes, thinking about this more, I do think the right answer is for
> > readdirplus to do this. The only question I have is if it should do
> > this always, or if it should do this only from the nfsiod thread. I
> > believe you can't get this in the non-nfsiod case.
> 
> I agree that it looks as if the workaround is only needed for the nfsiod
> thread. On the other hand, it is not immediately obvious how to detect that
> the current thread is an nfsiod daemon. Probably a thread flag should be
> set.

OTOH, updating the attributes from readdir+ is only an optimization anyway, so
just having it always do LK_NOWAIT is probably ok (and simple).  Currently I'm
trying to develop a test case to provoke this so I can test the fix, but no
luck on that yet.
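
A rough userland sketch of the "always LK_NOWAIT" idea, assuming a try-lock
can stand in for the non-blocking vnode lock. The names fake_vnode,
fake_trylock and readdirplus_update are hypothetical and are not the kernel's
nfs_nget() or lockmgr interfaces; a busy lock only means the attribute update
is skipped for that entry:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vnode; not the kernel's struct vnode. */
struct fake_vnode {
	atomic_flag	locked;		/* set while some thread holds the lock */
	int		attr_gen;	/* bumped each time attributes are refreshed */
};

/* Non-blocking acquire: returns true on success, false if already held. */
static bool
fake_trylock(struct fake_vnode *vp)
{
	return (!atomic_flag_test_and_set(&vp->locked));
}

static void
fake_unlock(struct fake_vnode *vp)
{
	atomic_flag_clear(&vp->locked);
}

/*
 * Refresh cached attributes for one directory entry, but never sleep
 * on the vnode lock.  Returns true if the attributes were updated and
 * false if the update was skipped because the lock was busy.
 */
static bool
readdirplus_update(struct fake_vnode *vp)
{
	assert(vp != NULL);
	if (!fake_trylock(vp))
		return (false);	/* busy: skip the optimization, not an error */
	vp->attr_gen++;
	fake_unlock(vp);
	return (true);
}
```

Since the caller never waits while it owns a buffer, the buffer-then-vnode
ordering can no longer turn into a wait cycle in this model.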

-- 
John Baldwin

From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 22:39:00 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 4D66E529;
 Thu, 14 Mar 2013 22:39:00 +0000 (UTC)
 (envelope-from rmacklem@uoguelph.ca)
Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca
 [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id B326BF7E;
 Thu, 14 Mar 2013 22:38:59 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqAEADxRQlGDaFvO/2dsb2JhbABDiC65eYJegX10gisBAQUjBFIbDgoRGQIEVQYuh3mvO5JVjVWBDRkbB4ItgRMDjzaDXYNFkQKDJiCBNzU
X-IronPort-AV: E=Sophos;i="4.84,848,1355115600"; d="scan'208";a="21301102"
Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca)
 ([131.104.91.206])
 by esa-jnhn.mail.uoguelph.ca with ESMTP; 14 Mar 2013 18:38:49 -0400
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1])
 by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id F0BC2B3F1B;
 Thu, 14 Mar 2013 18:38:48 -0400 (EDT)
Date: Thu, 14 Mar 2013 18:38:48 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: John Baldwin <jhb@freebsd.org>
Message-ID: <341067813.3923482.1363300728967.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <201303141444.35740.jhb@freebsd.org>
Subject: Re: Deadlock in the NFS client
MIME-Version: 1.0
Content-Type: multipart/mixed; 
 boundary="----=_Part_3923481_469795532.1363300728964"
X-Originating-IP: [172.17.91.201]
X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692)
Cc: Rick Macklem <rmacklem@freebsd.org>, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 22:39:00 -0000

------=_Part_3923481_469795532.1363300728964
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

John Baldwin wrote:
> On Thursday, March 14, 2013 1:22:39 pm Konstantin Belousov wrote:
> > On Thu, Mar 14, 2013 at 10:57:13AM -0400, John Baldwin wrote:
> > > On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote:
> > > > On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote:
> > > > > John Baldwin wrote:
> > > > > > I ran into a machine that had a deadlock among certain files
> > > > > > on a
> > > > > > given NFS
> > > > > > mount today. I'm not sure how best to resolve it, though it
> > > > > > seems like
> > > > > > perhaps there is a bug with how the pool of nfsiod threads
> > > > > > is managed.
> > > > > > Anyway, more details on the actual hang below. This was on
> > > > > > 8.x with
> > > > > > the
> > > > > > old NFS client, but I don't see anything in HEAD that would
> > > > > > fix this.
> > > > > >
> > > > > > First note that the system was idle so it had dropped down
> > > > > > to only one
> > > > > > nfsiod thread.
> > > > > >
> > > > > Hmm, I see the problem and I'm a bit surprised it doesn't bite
> > > > > more often.
> > > > > It seems to me that this snippet of code from nfs_asyncio()
> > > > > makes too
> > > > > weak an assumption:
> > > > > 	/*
> > > > > > 	 * If none are free, we may already have an iod working on this mount
> > > > > > 	 * point. If so, it will process our request.
> > > > > 	 */
> > > > > 	if (!gotiod) {
> > > > > 		if (nmp->nm_bufqiods > 0) {
> > > > > 			NFS_DPF(ASYNCIO,
> > > > > 		("nfs_asyncio: %d iods are already processing mount %p\n",
> > > > > 				 nmp->nm_bufqiods, nmp));
> > > > > 			gotiod = TRUE;
> > > > > 		}
> > > > > 	}
> > > > > It assumes that, since an nfsiod thread is processing some
> > > > > buffer for the
> > > > > mount, it will become available to do this one, which isn't
> > > > > true for your
> > > > > deadlock.
> > > > >
> > > > > I think the simple fix would be to recode nfs_asyncio() so
> > > > > that
> > > > > it only returns 0 if it finds an AVAILABLE nfsiod thread that
> > > > > it
> > > > > has assigned to do the I/O, getting rid of the above. The
> > > > > problem
> > > > > with doing this is that it may result in a lot more
> > > > > synchronous I/O
> > > > > (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe
> > > > > more
> > > > > synchronous I/O could be avoided by allowing nfs_asyncio() to
> > > > > create a
> > > > > new thread even if the total is above nfs_iodmax. (I think
> > > > > this would
> > > > > require the fixed array to be replaced with a linked list and
> > > > > might
> > > > > result in a large number of nfsiod threads.) Maybe just having
> > > > > a large
> > > > > nfs_iodmax would be an adequate compromise?
> > > > >
> > > > > Does having a large # of nfsiod threads cause any serious
> > > > > problem for
> > > > > most systems these days?
> > > > >
> > > > > I'd be tempted to recode nfs_asyncio() as above and then,
> > > > > instead
> > > > > of nfs_iodmin and nfs_iodmax, I'd simply have: - a fixed
> > > > > number of
> > > > > nfsiod threads (this could be a tunable, with the
> > > > > understanding that
> > > > > it should be large for good performance)
> > > > >
> > > >
> > > > I do not see how this would solve the deadlock itself. The
> > > > proposal would
> > > > only allow the system to survive slightly longer after the deadlock
> > > > appeared.
> > > > And, I think that allowing an unbounded number of nfsiod threads
> > > > is also
> > > > fatal.
> > > >
I should mention that what I was thinking of above was more than just
getting rid of the snippet of code. It would have involved handing a
buffer directly to an available nfsiod thread (no queuing on the mount
point). That way there would never be a thread blocked waiting for a
queued buffer.
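
Roughly, as a userland sketch of that direct hand-off (all names here are
hypothetical; this models only the decision logic, not the real nfsiod pool
or its locking): the request succeeds only when an idle worker takes the
buffer immediately, and otherwise the caller gets EIO and does the I/O itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of one nfsiod worker slot. */
struct fake_iod {
	int	 idle;	/* nonzero while the worker waits for work */
	void	*buf;	/* buffer handed directly to this worker */
};

/*
 * Hand bp to an idle worker if one exists.  Nothing is queued on the
 * mount point: either a worker owns the buffer on return (0), or the
 * caller must perform the I/O synchronously (EIO).
 */
static int
fake_asyncio(struct fake_iod *pool, size_t n, void *bp)
{
	size_t i;

	assert(pool != NULL && bp != NULL);
	for (i = 0; i < n; i++) {
		if (pool[i].idle) {
			pool[i].idle = 0;
			pool[i].buf = bp;
			return (0);
		}
	}
	return (EIO);
}
```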

However, when I was thinking about it a little after posting, I came
to a conclusion similar to (but even simpler than) what you've proposed.
(See below and attached patch.)

> > > > The issue there is the LOR between buffer lock and vnode lock.
> > > > Buffer lock
> > > > always must come after the vnode lock. The problematic nfsiod
> > > > thread, which
> > > > locks the vnode, violates this rule, because despite the
> > > > LK_KERNPROC
> > > > ownership of the buffer lock, it is the thread which de facto
> > > > owns the
> > > > buffer (only the thread can unlock it).
> > > >
> > > > A possible solution would be to pass LK_NOWAIT to nfs_nget()
> > > > from the
> > > > nfs_readdirplusrpc(). From my reading of the code, nfs_nget()
> > > > should
> > > > be capable of correctly handling the lock failure. And EBUSY
> > > > would
> > > > result in doit = 0, which should be fine too.
> > > >
> > > > It is possible that EBUSY should be reset to 0, though.
> > >
> > > Yes, thinking about this more, I do think the right answer is for
> > > readdirplus to do this. The only question I have is if it should
> > > do
> > > this always, or if it should do this only from the nfsiod thread.
> > > I
> > > believe you can't get this in the non-nfsiod case.
> >
> > I agree that it looks as if the workaround is only needed for the nfsiod
> > thread.
> > On the other hand, it is not immediately obvious how to detect that
> > the current thread is an nfsiod daemon. Probably a thread flag should be
> > set.
> 
> OTOH, updating the attributes from readdir+ is only an optimization
> anyway, so
> just having it always do LK_NOWAIT is probably ok (and simple).
> Currently I'm
> trying to develop a test case to provoke this so I can test the fix,
> but no
> luck on that yet.
> 
Well, when I was thinking about it a bit after the last email, I was
thinking "why bother having the nfsiod threads do readdirplus at all?".

The only reason to use the nfsiod threads is read-ahead. However, for
readdir this is problematic, since the read-ahead block can only be done
when it has the correct directory offset cookie. This implies that it
usually waits until the previous directory block has been read. In other
words, the read-ahead can't usually start until the previous block has
been read. As such, why bother having the nfsiod threads do it? They
might be better used for reads and writes.
(OpenBSD doesn't do read-aheads for directories. In fact, they don't
 even keep directory blocks in the buffer cache, although I'm not
 sure I'd suggest the latter.)
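
To illustrate the cookie dependency, here is a small sketch with made-up
names and a made-up block size of 100 "cookie units" (not the real NFS RPC
code). Each request needs the cookie returned by the previous reply, so the
requests are serialized no matter how many threads are available:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reply to one directory read. */
struct fake_dirblock {
	uint64_t	next_cookie;	/* where the following block starts */
	int		eof;		/* nonzero once the directory is exhausted */
};

/* Pretend RPC: reads the block starting at cookie, up to dir_end. */
static struct fake_dirblock
fake_readdir_rpc(uint64_t cookie, uint64_t dir_end)
{
	struct fake_dirblock blk;

	assert(cookie <= dir_end);
	blk.next_cookie = (dir_end - cookie > 100) ? cookie + 100 : dir_end;
	blk.eof = (blk.next_cookie == dir_end);
	return (blk);
}

/* Reads the whole directory and returns how many RPCs were needed. */
static int
fake_read_whole_dir(uint64_t dir_end)
{
	struct fake_dirblock blk;
	uint64_t cookie = 0;
	int rpcs = 0;

	do {
		/* Cannot be issued before the previous reply arrived. */
		blk = fake_readdir_rpc(cookie, dir_end);
		rpcs++;
		cookie = blk.next_cookie;
	} while (!blk.eof);
	return (rpcs);
}
```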

It would be nice to compare the performance with/without the attached
patch. It might turn out that not using the nfsiod threads for readdir
performs just as well or better?
 
Anyhow, the attached trivial patch (which stops readdirplus from being
done by the nfsiod threads) might be worth trying?
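
As a simplified userland model of the check the patch adds (the struct and
function names below are hypothetical; the real test in the attachment uses
bp->b_vp->v_type == VDIR and NFSMNT_RDIRPLUS in nm_flag), a directory buffer
on a readdirplus mount is refused with EIO so the caller does the RPC itself:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum fake_vtype { FAKE_VREG, FAKE_VDIR };

/* Hypothetical, stripped-down buffer and mount structures. */
struct fake_buf {
	enum fake_vtype	vtype;		/* type of the vnode the buffer belongs to */
};

struct fake_mount {
	bool		rdirplus;	/* mounted with the rdirplus option */
};

/*
 * Returns EIO when the request must not be given to an async thread,
 * which makes the caller fall back to synchronous I/O; 0 otherwise.
 */
static int
fake_asyncio_check(const struct fake_mount *nmp, const struct fake_buf *bp)
{
	assert(nmp != NULL && bp != NULL);
	if (bp->vtype == FAKE_VDIR && nmp->rdirplus)
		return (EIO);
	return (0);
}
```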

However, I don't see a problem with modifying readdirplus to do a
non-blocking vget(), except that it might make an already convoluted
function even more convoluted.

rick

> --
> John Baldwin

------=_Part_3923481_469795532.1363300728964
Content-Type: text/x-patch; name=nfsiod.patch
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename=nfsiod.patch

LS0tIGZzL25mc2NsaWVudC9uZnNfY2xiaW8uYy5zYXZ5CTIwMTMtMDMtMTQgMTc6NDk6MzIuMDAw
MDAwMDAwIC0wNDAwCisrKyBmcy9uZnNjbGllbnQvbmZzX2NsYmlvLmMJMjAxMy0wMy0xNCAxODow
Mjo1My4wMDAwMDAwMDAgLTA0MDAKQEAgLTE0MTUsMTAgKzE0MTUsMTggQEAgbmNsX2FzeW5jaW8o
c3RydWN0IG5mc21vdW50ICpubXAsIHN0cnVjdAogCSAqIENvbW1pdHMgYXJlIHVzdWFsbHkgc2hv
cnQgYW5kIHN3ZWV0IHNvIGxldHMgc2F2ZSBzb21lIGNwdSBhbmQKIAkgKiBsZWF2ZSB0aGUgYXN5
bmMgZGFlbW9ucyBmb3IgbW9yZSBpbXBvcnRhbnQgcnBjJ3MgKHN1Y2ggYXMgcmVhZHMKIAkgKiBh
bmQgd3JpdGVzKS4KKwkgKgorCSAqIFJlYWRkaXJwbHVzIFJQQ3MgZG8gdmdldCgpcyB0byBhY3F1
aXJlIHRoZSB2bm9kZXMgZm9yIGVudHJpZXMKKwkgKiBpbiB0aGUgZGlyZWN0b3J5IGluIG9yZGVy
IHRvIHVwZGF0ZSBhdHRyaWJ1dGVzLiBUaGlzIGNhbiBkZWFkbG9jaworCSAqIHdpdGggYW5vdGhl
ciB0aHJlYWQgdGhhdCBpcyB3YWl0aW5nIGZvciBhc3luYyBJL08gdG8gYmUgZG9uZSBieQorCSAq
IGFuIG5mc2lvZCB0aHJlYWQgd2hpbGUgaG9sZGluZyBhIGxvY2sgb24gb25lIG9mIHRoZXNlIHZu
b2Rlcy4KKwkgKiBUbyBhdm9pZCB0aGlzIGRlYWRsb2NrLCBkb24ndCBhbGxvdyB0aGUgYXN5bmMg
bmZzaW9kIHRocmVhZHMgdG8KKwkgKiBwZXJmb3JtIFJlYWRkaXJwbHVzIFJQQ3MuCiAJICovCiAJ
bXR4X2xvY2soJm5jbF9pb2RfbXV0ZXgpOwotCWlmIChicC0+Yl9pb2NtZCA9PSBCSU9fV1JJVEUg
JiYgKGJwLT5iX2ZsYWdzICYgQl9ORUVEQ09NTUlUKSAmJgotCSAgICAobm1wLT5ubV9idWZxaW9k
cyA+IG5jbF9udW1hc3luYyAvIDIpKSB7CisJaWYgKChicC0+Yl9pb2NtZCA9PSBCSU9fV1JJVEUg
JiYgKGJwLT5iX2ZsYWdzICYgQl9ORUVEQ09NTUlUKSAmJgorCSAgICAgKG5tcC0+bm1fYnVmcWlv
ZHMgPiBuY2xfbnVtYXN5bmMgLyAyKSkgfHwKKwkgICAgKGJwLT5iX3ZwLT52X3R5cGUgPT0gVkRJ
UiAmJiAobm1wLT5ubV9mbGFnICYgTkZTTU5UX1JESVJQTFVTKSkpIHsKIAkJbXR4X3VubG9jaygm
bmNsX2lvZF9tdXRleCk7CiAJCXJldHVybihFSU8pOwogCX0K
------=_Part_3923481_469795532.1363300728964--

From owner-freebsd-fs@FreeBSD.ORG  Thu Mar 14 23:23:22 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id C13988C9;
 Thu, 14 Mar 2013 23:23:22 +0000 (UTC)
 (envelope-from pawel@dawidek.net)
Received: from mail.dawidek.net (garage.dawidek.net [91.121.88.72])
 by mx1.freebsd.org (Postfix) with ESMTP id 7EDF41C0;
 Thu, 14 Mar 2013 23:23:22 +0000 (UTC)
Received: from localhost (89-73-195-149.dynamic.chello.pl [89.73.195.149])
 by mail.dawidek.net (Postfix) with ESMTPSA id 4BD9BB02;
 Fri, 15 Mar 2013 00:20:02 +0100 (CET)
Date: Fri, 15 Mar 2013 00:24:50 +0100
From: Pawel Jakub Dawidek <pjd@FreeBSD.org>
To: Bruce Evans <brde@optusnet.com.au>
Subject: Re: patches to add new stat(2) file flags
Message-ID: <20130314232449.GC1446@garage.freebsd.pl>
References: <20130307000533.GA38950@nargothrond.kdm.org>
 <20130307214649.X981@besplex.bde.org>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="p2kqVDKq5asng8Dg"
Content-Disposition: inline
In-Reply-To: <20130307214649.X981@besplex.bde.org>
X-OS: FreeBSD 10.0-CURRENT amd64
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: arch@FreeBSD.org, "Kenneth D. Merry" <ken@FreeBSD.org>, fs@FreeBSD.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Mar 2013 23:23:22 -0000


--p2kqVDKq5asng8Dg
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Mar 07, 2013 at 10:21:38PM +1100, Bruce Evans wrote:
> On Wed, 6 Mar 2013, Kenneth D. Merry wrote:
>=20
> > I have attached diffs against head for some additional stat(2) file fla=
gs.
> >
> > The primary purpose of these flags is to improve compatibility with CIF=
S,
> > both from the client and the server side.
> > ...
> > 	UF_IMMUTABLE:	Command line name: "uchg", "uimmutable"
> > 			ZFS name: XAT_READONLY, ZFS_READONLY
> > 			Windows: FILE_ATTRIBUTE_READONLY
> >
> > 			This flag means that the file may not be modified.
> > 			This is not a new flag, but where applicable it is
> > 			mapped to the Windows readonly bit.  ZFS and UFS
> > 			now both support the flag and enforce it.
> >
> > 			The behavior of this flag is compatible with MacOS X.
>=20
> This is incompatible with mapping the DOS read-only attribute to the
> non-writeable file permission in msdosfs.  msdosfs does this mainly to
> get at least one useful file permission, but the semantics are subtly
> different from all of file permissions, UF_IMMUTABLE and SF_IMMUTABLE.
> I think it should be a new flag.

I agree, especially that I saw some discussion recently on Illumos
mailing lists to not enforce this flag in ZFS, which would be confusing
to FreeBSD users if we forget to _not_ merge that change.

--=20
Pawel Jakub Dawidek                       http://www.wheelsystems.com
FreeBSD committer                         http://www.FreeBSD.org
Am I Evil? Yes, I Am!                     http://tupytaj.pl

--p2kqVDKq5asng8Dg
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iEYEARECAAYFAlFCXEEACgkQForvXbEpPzSCswCeLMmHONhIZDnAFFCZD+iv2Ghq
AygAn0fbIw2k8sJHl5Fv41sUqi4kIjY8
=Tb+w
-----END PGP SIGNATURE-----

--p2kqVDKq5asng8Dg--

From owner-freebsd-fs@FreeBSD.ORG  Fri Mar 15 01:45:37 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 1393D7B0;
 Fri, 15 Mar 2013 01:45:37 +0000 (UTC)
 (envelope-from rmacklem@uoguelph.ca)
Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca
 [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 7B2968B3;
 Fri, 15 Mar 2013 01:45:36 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqAEALh8QlGDaFvO/2dsb2JhbABDiDK5dIJeggF0gisBAQUjBFIbDgoRGQIEVQYuh3mvHpJOjmIZGweCLYETA482hyKRAoMmIIFs
X-IronPort-AV: E=Sophos;i="4.84,849,1355115600"; d="scan'208";a="21318975"
Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca)
 ([131.104.91.206])
 by esa-jnhn.mail.uoguelph.ca with ESMTP; 14 Mar 2013 21:45:17 -0400
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1])
 by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 502FCB4026;
 Thu, 14 Mar 2013 21:45:17 -0400 (EDT)
Date: Thu, 14 Mar 2013 21:45:17 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: John Baldwin <jhb@freebsd.org>
Message-ID: <2115520715.3927772.1363311917302.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <201303141444.35740.jhb@freebsd.org>
Subject: Re: Deadlock in the NFS client
MIME-Version: 1.0
Content-Type: multipart/mixed; 
 boundary="----=_Part_3927771_888202910.1363311917299"
X-Originating-IP: [172.17.91.202]
X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692)
Cc: Rick Macklem <rmacklem@freebsd.org>, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Mar 2013 01:45:37 -0000

------=_Part_3927771_888202910.1363311917299
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

John Baldwin wrote:
> On Thursday, March 14, 2013 1:22:39 pm Konstantin Belousov wrote:
> > On Thu, Mar 14, 2013 at 10:57:13AM -0400, John Baldwin wrote:
> > > On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote:
> > > > On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote:
> > > > > John Baldwin wrote:
> > > > > > I ran into a machine that had a deadlock among certain files
> > > > > > on a
> > > > > > given NFS
> > > > > > mount today. I'm not sure how best to resolve it, though it
> > > > > > seems like
> > > > > > perhaps there is a bug with how the pool of nfsiod threads
> > > > > > is managed.
> > > > > > Anyway, more details on the actual hang below. This was on
> > > > > > 8.x with
> > > > > > the
> > > > > > old NFS client, but I don't see anything in HEAD that would
> > > > > > fix this.
> > > > > >
> > > > > > First note that the system was idle so it had dropped down
> > > > > > to only one
> > > > > > nfsiod thread.
> > > > > >
> > > > > Hmm, I see the problem and I'm a bit surprised it doesn't bite
> > > > > more often.
> > > > > It seems to me that this snippet of code from nfs_asyncio()
> > > > > makes too
> > > > > weak an assumption:
> > > > > 	/*
> > > > > > 	 * If none are free, we may already have an iod working on this mount
> > > > > > 	 * point. If so, it will process our request.
> > > > > 	 */
> > > > > 	if (!gotiod) {
> > > > > 		if (nmp->nm_bufqiods > 0) {
> > > > > 			NFS_DPF(ASYNCIO,
> > > > > 		("nfs_asyncio: %d iods are already processing mount %p\n",
> > > > > 				 nmp->nm_bufqiods, nmp));
> > > > > 			gotiod = TRUE;
> > > > > 		}
> > > > > 	}
> > > > > It assumes that, since an nfsiod thread is processing some
> > > > > buffer for the
> > > > > mount, it will become available to do this one, which isn't
> > > > > true for your
> > > > > deadlock.
> > > > >
> > > > > I think the simple fix would be to recode nfs_asyncio() so
> > > > > that
> > > > > it only returns 0 if it finds an AVAILABLE nfsiod thread that
> > > > > it
> > > > > has assigned to do the I/O, getting rid of the above. The
> > > > > problem
> > > > > with doing this is that it may result in a lot more
> > > > > synchronous I/O
> > > > > (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe
> > > > > more
> > > > > synchronous I/O could be avoided by allowing nfs_asyncio() to
> > > > > create a
> > > > > new thread even if the total is above nfs_iodmax. (I think
> > > > > this would
> > > > > require the fixed array to be replaced with a linked list and
> > > > > might
> > > > > result in a large number of nfsiod threads.) Maybe just having
> > > > > a large
> > > > > nfs_iodmax would be an adequate compromise?
> > > > >
> > > > > Does having a large # of nfsiod threads cause any serious
> > > > > problem for
> > > > > most systems these days?
> > > > >
> > > > > I'd be tempted to recode nfs_asyncio() as above and then,
> > > > > instead
> > > > > of nfs_iodmin and nfs_iodmax, I'd simply have: - a fixed
> > > > > number of
> > > > > nfsiod threads (this could be a tunable, with the
> > > > > understanding that
> > > > > it should be large for good performance)
> > > > >
> > > >
> > > > I do not see how this would solve the deadlock itself. The
> > > > proposal would
> > > > only allow the system to survive slightly longer after the deadlock
> > > > appeared.
> > > > And, I think that allowing an unbounded number of nfsiod threads
> > > > is also
> > > > fatal.
> > > >
> > > > The issue there is the LOR between buffer lock and vnode lock.
> > > > Buffer lock
> > > > always must come after the vnode lock. The problematic nfsiod
> > > > thread, which
> > > > locks the vnode, violates this rule, because despite the
> > > > LK_KERNPROC
> > > > ownership of the buffer lock, it is the thread which de facto
> > > > owns the
> > > > buffer (only the thread can unlock it).
> > > >
> > > > A possible solution would be to pass LK_NOWAIT to nfs_nget()
> > > > from the
> > > > nfs_readdirplusrpc(). From my reading of the code, nfs_nget()
> > > > should
> > > > be capable of correctly handling the lock failure. And EBUSY
> > > > would
> > > > result in doit = 0, which should be fine too.
> > > >
> > > > It is possible that EBUSY should be reset to 0, though.
> > >
> > > Yes, thinking about this more, I do think the right answer is for
> > > readdirplus to do this. The only question I have is if it should
> > > do
> > > this always, or if it should do this only from the nfsiod thread.
> > > I
> > > believe you can't get this in the non-nfsiod case.
> >
> > I agree that it looks as if the workaround is only needed for the nfsiod
> > thread.
> > On the other hand, it is not immediately obvious how to detect that
> > the current thread is an nfsiod daemon. Probably a thread flag should be
> > set.
> 
> OTOH, updating the attributes from readdir+ is only an optimization
> anyway, so
> just having it always do LK_NOWAIT is probably ok (and simple).
> Currently I'm
> trying to develop a test case to provoke this so I can test the fix,
> but no
> luck on that yet.
> 
> --
> John Baldwin
When I commented out the readahead stuff for VDIR (patch attached), I
got much better performance for a:
# time ls -lR > /dev/null
run at the root of the nfs mount point (-o nfsv3,rdirplus).

However, I got about the same performance for the previous patch.
(The difference is that this one doesn't play with the buffer cache
 for the read-ahead attempt.)
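
The attached patch disables the block by wrapping it in "#ifdef notdef".
A toy illustration of that technique, with a hypothetical function that is
not taken from nfs_clbio.c: as long as notdef is never defined, the
read-ahead branch is simply not compiled.

```c
#include <assert.h>

/*
 * Counts the RPCs issued for one directory block read.  The
 * read-ahead branch is compiled out because notdef is undefined.
 */
static int
fake_dir_read_rpcs(int readahead)
{
	int rpcs = 1;		/* the block the caller asked for */

	assert(readahead >= 0);
#ifdef notdef
	if (readahead > 0)
		rpcs++;		/* speculative read of the next block */
#endif
	return (rpcs);
}
```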

My test environment is crappy (a laptop mounting itself), so it may
just be a side effect of this.

Maybe you could try this?

rick


------=_Part_3927771_888202910.1363311917299
Content-Type: text/x-patch; name=nfsiod2.patch
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename=nfsiod2.patch

LS0tIGZzL25mc2NsaWVudC9uZnNfY2xiaW8uYy5zYXZpdAkyMDEzLTAzLTE0IDIwOjQyOjUwLjAw
MDAwMDAwMCAtMDQwMAorKysgZnMvbmZzY2xpZW50L25mc19jbGJpby5jCTIwMTMtMDMtMTQgMjA6
NDM6NDUuMDAwMDAwMDAwIC0wNDAwCkBAIC02NTgsNiArNjU4LDcgQEAgbmNsX2Jpb3JlYWQoc3Ry
dWN0IHZub2RlICp2cCwgc3RydWN0IHVpbwogCQkgKiAoWW91IG5lZWQgdGhlIGN1cnJlbnQgYmxv
Y2sgZmlyc3QsIHNvIHRoYXQgeW91IGhhdmUgdGhlCiAJCSAqICBkaXJlY3Rvcnkgb2Zmc2V0IGNv
b2tpZSBvZiB0aGUgbmV4dCBibG9jay4pCiAJCSAqLworI2lmZGVmIG5vdGRlZgogCQlpZiAobm1w
LT5ubV9yZWFkYWhlYWQgPiAwICYmCiAJCSAgICAoYnAtPmJfZmxhZ3MgJiBCX0lOVkFMKSA9PSAw
ICYmCiAJCSAgICAobnAtPm5fZGlyZW9mb2Zmc2V0ID09IDAgfHwKQEAgLTY4MCw2ICs2ODEsNyBA
QCBuY2xfYmlvcmVhZChzdHJ1Y3Qgdm5vZGUgKnZwLCBzdHJ1Y3QgdWlvCiAJCQkgICAgfQogCQkJ
fQogCQl9CisjZW5kaWYKIAkJLyoKIAkJICogVW5saWtlIFZSRUcgZmlsZXMsIHdob3MgYnVmZmVy
IHNpemUgKCBicC0+Yl9iY291bnQgKSBpcwogCQkgKiBjaG9wcGVkIGZvciB0aGUgRU9GIGNvbmRp
dGlvbiwgd2UgY2Fubm90IHRlbGwgaG93IGxhcmdlCg==
------=_Part_3927771_888202910.1363311917299--

From owner-freebsd-fs@FreeBSD.ORG  Fri Mar 15 09:47:48 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 2849E385;
 Fri, 15 Mar 2013 09:47:48 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail28.syd.optusnet.com.au (mail28.syd.optusnet.com.au
 [211.29.133.169])
 by mx1.freebsd.org (Postfix) with ESMTP id A3007DCA;
 Fri, 15 Mar 2013 09:47:46 +0000 (UTC)
Received: from c211-30-173-106.carlnfd1.nsw.optusnet.com.au
 (c211-30-173-106.carlnfd1.nsw.optusnet.com.au [211.30.173.106])
 by mail28.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r2F9lYsg010340
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Fri, 15 Mar 2013 20:47:38 +1100
Date: Fri, 15 Mar 2013 20:47:34 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Subject: Re: patches to add new stat(2) file flags
In-Reply-To: <20130314232449.GC1446@garage.freebsd.pl>
Message-ID: <20130315184014.A902@besplex.bde.org>
References: <20130307000533.GA38950@nargothrond.kdm.org>
 <20130307214649.X981@besplex.bde.org>
 <20130314232449.GC1446@garage.freebsd.pl>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.0 cv=JMpjKL2b c=1 sm=1 a=n2O7wv11oSwA:10
 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=YOiZBDKP_E4A:10
 a=LVUDrmMsRTOz-s-3SHEA:9 a=CjuIK1q_8ugA:10 a=TEtd8y5WR3g2ypngnwZWYw==:117
Cc: arch@FreeBSD.org, "Kenneth D. Merry" <ken@FreeBSD.org>, fs@FreeBSD.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Mar 2013 09:47:48 -0000

On Fri, 15 Mar 2013, Pawel Jakub Dawidek wrote:

> On Thu, Mar 07, 2013 at 10:21:38PM +1100, Bruce Evans wrote:
>> On Wed, 6 Mar 2013, Kenneth D. Merry wrote:
>>
>>> I have attached diffs against head for some additional stat(2) file flags.
>>>
>>> The primary purpose of these flags is to improve compatibility with CIFS,
>>> both from the client and the server side.
>>> ...
>>> 	UF_IMMUTABLE:	Command line name: "uchg", "uimmutable"
>>> 			ZFS name: XAT_READONLY, ZFS_READONLY
>>> 			Windows: FILE_ATTRIBUTE_READONLY
>>>
>>> 			This flag means that the file may not be modified.
>>> 			This is not a new flag, but where applicable it is
>>> 			mapped to the Windows readonly bit.  ZFS and UFS
>>> 			now both support the flag and enforce it.
>>>
>>> 			The behavior of this flag is compatible with MacOS X.
>>
>> This is incompatible with mapping the DOS read-only attribute to the
>> non-writeable file permission in msdosfs.  msdosfs does this mainly to
>> get at least one useful file permission, but the semantics are subtly
>> different from all of file permissions, UF_IMMUTABLE and SF_IMMUTABLE.
>> I think it should be a new flag.
>
> I agree, especially that I saw some discussion recently on Illumos
> mailing lists to not enforce this flag in ZFS, which would be confusing
> to FreeBSD users if we forget to _not_ merge that change.

However, I now think the READONLY attribute would map well to
UF_IMMUTABLE in msdosfs, better than the current mapping of the READONLY
attribute to the inverse of the write permissions bits.  The permissions
bits are also controlled by the permissions bits of the mount point,
and this is the least worst way to control them for general files.
When this is mixed with control by the READONLY attribute (which
involves back-control of the READONLY attribute according to the
permissions bits), the behaviour is confusing and might lead to the
READONLY bit being set for too many files (e.g., for copies of man
pages, since man pages are installed with the bogus permissions
r--r--r-- although the owner (root) can write them (the r--r--r--
permissions only made sense when the owner was bin)).  If the READONLY
attribute is instead mapped only to UF_IMMUTABLE, its impact would
be smaller since there aren't so many files which have a native READONLY
attribute or a native UF_IMMUTABLE attribute.  The READONLY attribute
would interact badly with the permissions bits in a different way --
just like UF_IMMUTABLE interacts with them.  It is confusing when ls -l
shows writability for non-writable files.

Further testing of possible confusion from UF_IMMUTABLE on a rw-r--r--
uchg file on ffs showed that:
- eaccess(2) with flag W_OK used to work correctly, although this was not
   documented.  It used to return the documented errno EACCES, but its man
   page didn't say anything about immutable attributes and said that this
   error means that the permissions bits indicate no access (or search
   permission is denied).
- eaccess(2) with flag W_OK now returns the undocumented errno EPERM.
   Its man page doesn't seem to have changed significantly.  Documentation
   for ACLs also seems to be missing.  The old and new man pages point to
   more details in intro(2).  The fine details are missing there too.
   There is just the usual weaselish "appropriate privilege" used in a
   generic way for EPERM.  This can mean anything, but what it means is
   not documented in either man page.

Actually, eaccess() used to work correctly because I fixed it locally.
It seems to have always been broken in FreeBSD.  The current version is:

@ 	/*
@ 	 * If immutable bit set, nobody gets to write it.  "& ~VADMIN_PERMS"
@ 	 * is here, because without it, * it would be impossible for the owner
@ 	 * to remove the IMMUTABLE flag.
@ 	 */
@ 	if ((accmode & (VMODIFY_PERMS & ~VADMIN_PERMS)) &&
@ 	    (ip->i_flags & (IMMUTABLE | SF_SNAPSHOT)))
@ 		return (EPERM);

Bugs to be fixed here:
- the first sentence in the comment is banal and doesn't even echo the
   code (the code actually handles several immutable bits (obfuscated by
   the IMMUTABLE macro), and also the snapshot bit)
- the second sentence in the comment has a misplaced comment delimiter
   '*' in the middle of it.  It also doesn't fully echo the code, but is
   not banal.
- the "write" in the first sentence also doesn't even echo the code.  It
   used to echo the code when the code was simpler.  The code used to
   check only (accmode & VWRITE).  But immutability prevents much more
   than writing, and the code now handles that.
- wrong errno.
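
The shape of the check can be modelled in userland as follows. This is a
simplified sketch, not the kernel code: the X_* bit values are
illustrative stand-ins (the real definitions are in <sys/vnode.h>,
<sys/stat.h> and the ufs headers), VADMIN_PERMS is reduced to a single
bit, and the deny_errno parameter is hypothetical, added only so that
EPERM and EACCES can be compared.

```c
#include <errno.h>

/* Illustrative access-mode bits (stand-ins for VWRITE, VAPPEND, VADMIN). */
#define	X_VWRITE	0x01u
#define	X_VAPPEND	0x02u
#define	X_VADMIN	0x04u
#define	X_VMODIFY	(X_VWRITE | X_VAPPEND | X_VADMIN)

/* Illustrative file-flag bits (stand-ins for the chflags(2) flags). */
#define	X_UF_IMMUTABLE	0x01u
#define	X_SF_IMMUTABLE	0x02u
#define	X_SF_SNAPSHOT	0x04u
#define	X_IMMUTABLE	(X_UF_IMMUTABLE | X_SF_IMMUTABLE)

/*
 * Deny any modifying access to an immutable or snapshot file, except
 * the admin access that is needed to clear the flag again.
 * Returns deny_errno on denial, 0 if this check does not object.
 */
int
immutable_check(unsigned accmode, unsigned flags, int deny_errno)
{
	if ((accmode & (X_VMODIFY & ~X_VADMIN)) != 0 &&
	    (flags & (X_IMMUTABLE | X_SF_SNAPSHOT)) != 0)
		return (deny_errno);
	return (0);
}
```

With this model, immutable_check(X_VWRITE, X_UF_IMMUTABLE, EPERM) is
denied, while immutable_check(X_VADMIN, X_UF_IMMUTABLE, EPERM) passes,
which is the case the "& ~VADMIN_PERMS" term exists for.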

ext2fs still uses the old ffs code here (except it doesn't use IMMUTABLE
and checks explicitly for the only immutable flag that it supports).  It
duplicates the SF_SNAPSHOT check, but that is nonsense because ext2fs
doesn't support snapshots.

nandfs copies ffs for setattr, so it has immutability flags checks there,
but it just uses vaccess() for access(), so it is missing the above,
so the immutable flags checks are either nonsense where they are made
or missing here.

tmpfs uses the old ffs code here (except for mangling the style, but
it does remove the banal comment).

I couldn't see exactly what zfs does here, but it mostly returns EPERM
for immutable flags checks.

Fixing foofs_access() hopefully also fixes open(2), unlink(2), ...

Unfortunately, my fix is incompatible with dubious fixes that make the
man pages bug for bug compatible with the code.  POSIX of course doesn't
document EPERM for open(2) (except in the general weasel section about
appropriate privilege).  FreeBSD didn't document it either in the
version in which the above was fixed.  But now FreeBSD documents in
open.2 and other man pages that immutability gives EPERM, and the code
always had this bug.

The changes in the man pages have some style bugs: in open.2:
- a comma splice in the reference to chflags(2)
- this reference is only made in 1 of the descriptions of EPERM.
These style bugs were cloned to most or all man pages that are affected
by immutability or nounlink flags.

ACLs still seem to be unmentioned in all these man pages.  I don't use
them, so I don't know what happens for them.  However, the core vfs
function vaccess() is careful to always return EACCES and EPERM as
explicitly specified by POSIX.  This means EACCES for all cases except
VADMIN.  VADMIN/EPERM apply to chmod(), chown(), ... but shouldn't
apply to open(), unlink(), rename(), ...

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Fri Mar 15 11:34:39 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 76027EF6
 for <freebsd-fs@FreeBSD.org>; Fri, 15 Mar 2013 11:34:39 +0000 (UTC)
 (envelope-from avg@FreeBSD.org)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
 by mx1.freebsd.org (Postfix) with ESMTP id B1B23892
 for <freebsd-fs@FreeBSD.org>; Fri, 15 Mar 2013 11:34:38 +0000 (UTC)
Received: from odyssey.starpoint.kiev.ua (alpha-e.starpoint.kiev.ua
 [212.40.38.101])
 by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id NAA25999;
 Fri, 15 Mar 2013 13:34:29 +0200 (EET) (envelope-from avg@FreeBSD.org)
Message-ID: <51430744.6020004@FreeBSD.org>
Date: Fri, 15 Mar 2013 13:34:28 +0200
From: Andriy Gapon <avg@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:17.0) Gecko/20130313 Thunderbird/17.0.4
MIME-Version: 1.0
To: Freddie Cash <fjwcash@gmail.com>
Subject: Re: Strange slowdown when cache devices enabled in ZFS
References: <CAOjFWZ6Q=Vs3P-kfGysLzSbw4CnfrJkMEka4AqfSrQJFZDP_qw@mail.gmail.com>
In-Reply-To: <CAOjFWZ6Q=Vs3P-kfGysLzSbw4CnfrJkMEka4AqfSrQJFZDP_qw@mail.gmail.com>
X-Enigmail-Version: 1.5.1
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: FreeBSD Filesystems <freebsd-fs@FreeBSD.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Mar 2013 11:34:39 -0000

on 14/03/2013 20:13 Freddie Cash said the following:
> the l2arc_feed_thread of zfskern will spin until it takes up 100%
> of a CPU core

If you see a thread taking 100% where it shouldn't, then just profile it and
actually see what it's doing.

-- 
Andriy Gapon

From owner-freebsd-fs@FreeBSD.ORG  Fri Mar 15 13:12:57 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id D7DDDC9C
 for <freebsd-fs@freebsd.org>; Fri, 15 Mar 2013 13:12:57 +0000 (UTC)
 (envelope-from girgen@FreeBSD.org)
Received: from melon.pingpong.net (melon.pingpong.net [79.136.116.200])
 by mx1.freebsd.org (Postfix) with ESMTP id 77AE8DFC
 for <freebsd-fs@freebsd.org>; Fri, 15 Mar 2013 13:12:57 +0000 (UTC)
Received: from girgBook.local (citron2.pingpong.net [195.178.173.68])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (No client certificate requested)
 by melon.pingpong.net (Postfix) with ESMTPSA id 7A20F14D05;
 Fri, 15 Mar 2013 14:12:49 +0100 (CET)
Message-ID: <51431E50.2020109@FreeBSD.org>
Date: Fri, 15 Mar 2013 14:12:48 +0100
From: Palle Girgensohn <girgen@FreeBSD.org>
User-Agent: Postbox 3.0.7 (Macintosh/20130119)
MIME-Version: 1.0
To: Kirk McKusick <mckusick@mckusick.com>
Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?), maybe after
 moving tables and indexes to tablespace on different volume
References: <201303131652.r2DGqSr4051899@chez.mckusick.com>
In-Reply-To: <201303131652.r2DGqSr4051899@chez.mckusick.com>
X-Enigmail-Version: 1.2.3
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: freebsd-fs@freebsd.org, Jeff Roberson <jroberson@jroberson.net>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Mar 2013 13:12:57 -0000

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Kirk,

Thanks for your reply!

Kirk McKusick skrev:
> Thanks for your report. It is certainly unlike anything that we have
> seen reported before.
> 
> Are you running your /usr filesystem with (the default) journalled 
> soft updates? You can check this by running the `mount' command with
> no arguments.

yes, plain vanilla:

/dev/da0s1f on /usr (ufs, local, soft-updates)


> 
> Rather than rebooting your system, it would be most helpful if you 
> could instead shut it down to single user. Then do the following:

OK, we have planned downtime this evening. I will run the suggested
commands in a script session before rebooting.

Best regards,
Palle

> 
> Create a transcript of your session by running `script'. Once running
> in the session run these commands:
> 
> Run `mount' to show your filesystem configuration. Run `df -hi /usr'
> to see whether the inodes are still missing. Verify that you can
> cleanly unmount /usr (e.g., that the unmount does not hang and does
> not complain). Remount /usr and run `df -hi' to see whether the
> inodes are still missing. Unmount /usr again and run `fsck_ffs -p -f
> -d /usr'. If the fsck_ffs fails with an unexpected inconsistency, you
> can run `fsck_ffs -y -d /usr' to force it to clean up. When you have
> the filesystem successfully cleaned up, type `exit' to get out of the
> script session and mail me the transcript of the session
> (typescript).
> 
> Thanks for your help in tracking this down.
> 
> Kirk McKusick
> 
> ----- Original Message:
> 
> Date: Wed, 13 Mar 2013 11:23:13 +0100 From: Palle Girgensohn
> <girgen@freebsd.org> To: freebsd-fs@freebsd.org Subject: leaking lots
> of unreferenced inodes (pg_xlog files?), maybe after moving tables
> and indexes to tablespace on different volume
> 
> Hi!
> 
> Running postgresql-9.2.2 on FreeBSD 9.1 amd64 using vanilla ufs file
> system.
> 
> I have the postgresql base/ on the /usr disk, and a separate volume
> /opt where the default tablespace resides. This means that the amount
> of data on the /usr disk should be stable. This is not the case; the
> disk usage grows linearly (it seems to leave many inodes
> unreferenced).
> 
> The discrepancy between df and du is now huge:
> 
> # du -sxh /usr; df -h /usr
> 4,6G	/usr
> Filesystem     Size    Used   Avail Capacity  Mounted on
> /dev/da0s1f    104G     88G    8.0G    92%    /usr
> 
> 4,6G vs 88GB, that must be more than a rounding error?
> 
> Strange thing is I cannot find any open files among the missing.
> 
> # lsof /usr| awk '{print $9}'|xargs ls -l > /dev/null
> 
> returns no errors (a missing file would render an error with ls). If 
> there were open files not referenced in any directory, they should
> be found.
> 
> Next thing is fsck, and yes, there are plenty of unreferenced files.
> 
> I ran fsck while the system was running (i.e. read only) to get a grip
> on the amount of lost inodes:
> 
> fsck /usr | awk '{print $1}'|cut -f 2 -d=| perl -e '$i = 0; while
> (<>) { $i += $_;}; print $i / 1024 / 1024; print "\n";' 
> 85223.3530330658
> 
> ~85 GB gone, that's 80% of the disk, and it accounts for all the
> missing space.
> 
> MTIME for the inodes are pretty evenly spread over time since the 
> machine was updated to FreeBSD 9.1, rebooted, and PostgreSQL was
> updated to 9.2. All was done at the same time, so I can't really tell
> who's to blame, but this is the only server, out of a dozen that
> were updated to exactly the same versions, that has this problem.
> All other servers have their /usr disk usage stable (since all data
> resides on a separate tablespace).
> 
> The unreferenced inodes are almost exclusively around 16 MB in size,
> so they most certainly all are postgresql pg_xlog files. This means
> all files are lost from the same portion of code in the database
> engine.
> 
> How could it possibly be able to leave unreferenced inodes around
> like this at such a scale? Is the culprit a combination of postgresql
> and file system code? Both were updated.
> 
> pg_xlog checkpoints seem to happen approximately every three
> minutes:
> 
> Mar 13 00:39:08 dbserver postgres[5298]: [48-1] db=,user= LOG: checkpoint starting: time
> Mar 13 00:41:38 dbserver postgres[5298]: [49-1] db=,user= LOG: checkpoint complete: wrote 2542 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=149.667 s, sync=0.101 s, total=149.770 s; sync files=628, longest=0.021 s, average=0.000 s
> Mar 13 00:44:08 dbserver postgres[5298]: [50-1] db=,user= LOG: checkpoint starting: time
> Mar 13 00:46:38 dbserver postgres[5298]: [51-1] db=,user= LOG: checkpoint complete: wrote 3996 buffers (0.4%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=149.438 s, sync=0.111 s, total=149.551 s; sync files=823, longest=0.006 s, average=0.000 s
> Mar 13 00:49:08 dbserver postgres[5298]: [52-1] db=,user= LOG: checkpoint starting: time
> Mar 13 00:51:38 dbserver postgres[5298]: [53-1] db=,user= LOG: checkpoint complete: wrote 13736 buffers (1.4%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=149.958 s, sync=0.311 s, total=150.271 s; sync files=1335, longest=0.079 s, average=0.000 s
> Mar 13 00:54:08 dbserver postgres[5298]: [54-1] db=,user= LOG: checkpoint starting: time
> Mar 13 00:56:38 dbserver postgres[5298]: [55-1] db=,user= LOG: checkpoint complete: wrote 14638 buffers (1.5%); 0 transaction log file(s) added, 0 removed, 17 recycled; write=149.330 s, sync=0.271 s, total=149.603 s; sync files=1363, longest=0.017 s, average=0.000 s
> Mar 13 00:59:08 dbserver postgres[5298]: [56-1] db=,user= LOG: checkpoint starting: time
> Mar 13 01:01:38 dbserver postgres[5298]: [57-1] db=,user= LOG: checkpoint complete: wrote 8035 buffers (0.8%); 0 transaction log file(s) added, 0 removed, 21 recycled; write=149.285 s, sync=0.146 s, total=149.433 s; sync files=1160, longest=0.003 s, average=0.000 s
> Mar 13 01:04:08 dbserver postgres[5298]: [58-1] db=,user= LOG: checkpoint starting: time
> Mar 13 01:06:37 dbserver postgres[5298]: [59-1] db=,user= LOG: checkpoint complete: wrote 2156 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=149.402 s, sync=0.057 s, total=149.461 s; sync files=610, longest=0.000 s, average=0.000 s
> Mar 13 01:09:08 dbserver postgres[5298]: [60-1] db=,user= LOG: checkpoint starting: time
> 
> 
> I'm pretty certain that unmounting the file system and running fsck
> will regain the lost space, but will it stop there?
> 
> Stopping postgresql briefly did not help, I tried that. That would
> have helped if the files were open, but they're not. It seems
> postgresql did the right thing, and FreeBSD failed to unreference the
> files.
> 
> The server has about 30 databases and ~127 concurrent connections
> (not all being active simultaneously, though), so it is fair to say
> it is pretty active, but nothing extreme.
> 
> Hardware is HP DL360, using their HT Smart Array P410i.
> 
> Any ideas how to debug this? Or shall I just reboot, fsck, hope the 
> problem will go away, and when it does, forget about it?
> 
> Thanks, Palle -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2
> v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org Comment:
> Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> 
> iQEcBAEBAgAGBQJRQFORAAoJEIhV+7FrxBJDzVUIAJHU011JDxLxj8/xg05Gwhgq 
> XK3xB+0N0NSUQ50yhcRKLINz/j/XfeS0ZxlH+MstaPA9y0r1JUXMxkb/uTUvGBiy 
> jutk3eVe0cati9cVZbJkRU5FxEgmQ0fg0GOMl3RQAErkh5achj+klWvN7PnwGjTs 
> O3L9RgckKuxTJffk52GAS05qY/TKR6f08kdX3I2cFtqw3tyTyrXU0JPdk2snuPhv 
> H40xV46zgtWMFDvZLt61MryQ7/JotVQwU78scUB+zxrf8KKM9V0mM7pk0pIbG4Qw 
> NJBpZJ5gjbl4x+dkQrtZdL65yq88hACYwo9D+83Ct4ig8tgcQ7ViNHWxJqknK7Q= 
> =3ZZs -----END PGP SIGNATURE----- 
> _______________________________________________ 
> freebsd-fs@freebsd.org mailing list 
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs To unsubscribe,
> send any mail to "freebsd-fs-unsubscribe@freebsd.org"
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJRQx5QAAoJEIhV+7FrxBJDAcYIALvj4hiWoAN/wchrJIbfiXbY
XcPNIuqKFT1sRYWgdLZQ7e34zmvtmPfa0WW6/OFHbI5q+G/xciuZLhTl7EZ98IvD
jbBUR4SLLcrOvFNe35b43eOqr12okIboLg2fQx/jUbWQM19V/2/YaLobBDl2iv/v
gbD5ErL3yd0YBU1EFETho3hsL9fzbmSczQqhWWs0glD+aiHDQbtIAFVkC3IZSaLl
MNhqrzKsv4kEHXSylYRU2RbHYKNg55jQ1JHA5HKinqZbe7qLmyqr4dFVtbYEgGE9
DCh4/buO0/UIg+Te7WuD2XxMhfutgbGN6kOaTXk3NQhtgd5a/8I/yqfv6zGtglY=
=UFMm
-----END PGP SIGNATURE-----

From owner-freebsd-fs@FreeBSD.ORG  Fri Mar 15 13:21:27 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 79FCAE1E
 for <freebsd-fs@freebsd.org>; Fri, 15 Mar 2013 13:21:27 +0000 (UTC)
 (envelope-from phantom@phantom.su)
Received: from relay13.nicmail.ru (relay13.nicmail.ru [195.208.6.7])
 by mx1.freebsd.org (Postfix) with ESMTP id 0F0A8E72
 for <freebsd-fs@freebsd.org>; Fri, 15 Mar 2013 13:21:26 +0000 (UTC)
Received: from [109.70.25.39] (port=37155 helo=nicmail.ru)
 by f17.mail.nic.ru with esmtp (Exim 5.55)
 (envelope-from <phantom@phantom.su>)
 id 1UGUZT-0006Qg-3g; Fri, 15 Mar 2013 17:21:19 +0400
Received: from [194.85.198.26] (account phantom@phantom.su HELO
 phantom-mobile.node)
 by fcgp04.nicmail.ru (CommuniGate Pro SMTP 5.2.3)
 with ESMTPSA id 326736398; Fri, 15 Mar 2013 17:21:19 +0400
Message-ID: <5143204E.90003@phantom.su>
Date: Fri, 15 Mar 2013 17:21:18 +0400
From: Ilia Noskov <phantom@phantom.su>
User-Agent: Mozilla/5.0 (X11; Linux i686;
 rv:17.0) Gecko/20130215 Thunderbird/17.0.3
MIME-Version: 1.0
To: kostikbel@gmail.com
Subject: Re: should vn_fullpath1() ever return a path with "." in it?
References: <1208475167.3432384.1362099531469.JavaMail.root@erie.cs.uoguelph.ca>
 <51417C47.8010304@phantom.su> <20130314090847.GH3794@kib.kiev.ua>
 <5141A212.9050909@phantom.su>
In-Reply-To: <5141A212.9050909@phantom.su>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: phantom@phantom.su
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Mar 2013 13:21:27 -0000

On 03/14/2013 02:10 PM, Ilia Noskov wrote:
> On 03/14/2013 01:08 PM, Konstantin Belousov wrote:
>> On Thu, Mar 14, 2013 at 11:29:11AM +0400, Noskov Ilia wrote:
>>> Strange behavior on nfs-client after apply this patch:
>>>
>>> sysctl debug.disablecwd=0
>>> sysctl debug.disablefullpath=0
>>>
>>> # mount -v -t nfs
>>> 192.168.168.1:/pool on /home (nfs, noatime, nfsv4acls, fsid
>>> 02ff003a3a000000)
>>> # ls /home | wc -l
>>>       4946
>>> # cd /home/user6308/.ro
>>> # time pwd
>>> /home/user6308/.ro
>>> 0.008u 0.269s 0:08.47 3.0%    4+157k 0+0io 0pf+0w
>>> # ktrace -t+ -i pwd
>>>
>>>
>>> ktrace.out is big (1MB). Attach or not?
>>>
>>>
>>>
>>> A small piece of trace:
>>>    19527 pwd      CALL
>>> mmap(0,0x400000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0xffffffff,0)
>>>
>>>    19527 pwd      RET   mmap 34376515584/0x801000000
>>>    19527 pwd      CALL  __getcwd(0x801006400,0x400)
>>>    19527 pwd      NAMI  ".."
>>>    19527 pwd      NAMI  ".."
>>>    19527 pwd      RET   __getcwd -1 errno 2 No such file or directory
>>>    19527 pwd      CALL  stat(0x800947a14,0x7fffffffd940)
>>>    19527 pwd      NAMI  "/"
>>>    19527 pwd      STRU  struct stat {dev=98, ino=2, mode=drwxr-xr-x ,
>>> nlink=19, uid=0, gid=0, rdev=2120, atime=1363244893, stime=1362653279,
>>> ctime=1362653279, birthtime=1200836451, size=1024, blksize=16384,
>>> blocks=4, flags=0x0 }
>>>    19527 pwd      RET   stat 0
>>>    19527 pwd      CALL  lstat(0x80094779c,0x7fffffffd940)
>>>    19527 pwd      NAMI  "."
>>>    19527 pwd      STRU  struct stat {dev=1230702064, ino=145,
>>> mode=drwxr-xr-x , nlink=2, uid=0, gid=0, rdev=4294967295,
>>> atime=1363244672.246785874, stime=1363244792.864201338,
>>> ctime=1363244792.864201338, birthtime=-1, size=3, blksize=4096,
>>> blocks=3, flags=0x0 }
>>>    19527 pwd      RET   lstat 0
>>>    19527 pwd      CALL  openat(0xffffff9c,0x80094779b,0x100000,0x2)
>>>    19527 pwd      NAMI  ".."
>>>    19527 pwd      RET   openat 3
>>>    19527 pwd      CALL  fstat(0x3,0x7fffffffd880)
>>>    19527 pwd      STRU  struct stat {dev=1230702064, ino=4,
>>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
>>> atime=1363244665.232140704, stime=1363010116.496298252,
>>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
>>> blocks=3, flags=0x0 }
>>>    19527 pwd      RET   fstat 0
>>>    19527 pwd      CALL  fcntl(0x3,F_SETFD,FD_CLOEXEC)
>>>    19527 pwd      RET   fcntl 0
>>>    19527 pwd      CALL  fstatfs(0x3,0x7fffffffd660)
>>>    19527 pwd      RET   fstatfs 0
>>>    19527 pwd      CALL  fstat(0x3,0x7fffffffd940)
>>>    19527 pwd      STRU  struct stat {dev=1230702064, ino=4,
>>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
>>> atime=1363244665.232140704, stime=1363010116.496298252,
>>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
>>> blocks=3, flags=0x0 }
>>>    19527 pwd      RET   fstat 0
>>>    19527 pwd      CALL
>>> getdirentries(0x3,0x801018000,0x1000,0x8010160a8)
>>>    19527 pwd      RET   getdirentries 4096/0x1000
>>>    19527 pwd      CALL  fstat(0x3,0x7fffffffd940)
>>>    19527 pwd      STRU  struct stat {dev=1230702064, ino=4,
>>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
>>> atime=1363244665.232140704, stime=1363010116.496298252,
>>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
>>> blocks=3, flags=0x0 }
>>>    19527 pwd      RET   fstat 0
>>>    19527 pwd      CALL  openat(0x3,0x80094779b,0x100000,0)
>>>    19527 pwd      NAMI  ".."
>>>    19527 pwd      RET   openat 4
>>> [..............................]
>>>    19527 pwd      CALL  madvise(0x801016000,0x1000,MADV_FREE)
>>>    19527 pwd      RET   madvise 0
>>>    19527 pwd      CALL  madvise(0x801018000,0x2000,MADV_FREE)
>>>    19527 pwd      RET   madvise 0
>>>    19527 pwd      CALL  close(0x3)
>>>    19527 pwd      RET   close 0
>>>    19527 pwd      CALL  fstat(0x4,0x7fffffffd880)
>>>    19527 pwd      STRU  struct stat {dev=973143810, ino=4,
>>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295,
>>> atime=1363244767.460164771, stime=1363172100.380266923,
>>> ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096,
>>> blocks=713, flags=0x0 }
>>>    19527 pwd      RET   fstat 0
>>>    19527 pwd      CALL  fcntl(0x4,F_SETFD,FD_CLOEXEC)
>>>    19527 pwd      RET   fcntl 0
>>>    19527 pwd      CALL  fstatfs(0x4,0x7fffffffd660)
>>>    19527 pwd      RET   fstatfs 0
>>>    19527 pwd      CALL  fstat(0x4,0x7fffffffd940)
>>>    19527 pwd      STRU  struct stat {dev=973143810, ino=4,
>>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295,
>>> atime=1363244767.460164771, stime=1363172100.380266923,
>>> ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096,
>>> blocks=713, flags=0x0 }
>>>    19527 pwd      RET   fstat 0
>>>    19527 pwd      CALL
>>> getdirentries(0x4,0x801018000,0x1000,0x8010160a8)
>>>    19527 pwd      RET   getdirentries 4096/0x1000
>>>    19527 pwd      CALL  fstatat(0x4,0x801018030,0x7fffffffd940,0x200)
>>>    19527 pwd      NAMI  "user6158"
>>>    19527 pwd      STRU  struct stat {dev=1774902232, ino=4,
>>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
>>> atime=1363009687.040357529, stime=1363010116.496298252,
>>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
>>> blocks=3, flags=0x0 }
>>>    19527 pwd      RET   fstatat 0
>>>    19527 pwd      CALL  fstatat(0x4,0x80101804c,0x7fffffffd940,0x200)
>>>    19527 pwd      NAMI  "user2289"
>>>    19527 pwd      STRU  struct stat {dev=1988229825, ino=4,
>>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
>>> atime=1363009687.040357529, stime=1363010116.496298252,
>>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
>>> blocks=3, flags=0x0 }
>>>    19527 pwd      RET   fstatat 0
>>>    19527 pwd      CALL  fstatat(0x4,0x801018068,0x7fffffffd940,0x200)
>>>    19527 pwd      NAMI  "user4761"
>>>    19527 pwd      STRU  struct stat {dev=2438657130, ino=4,
>>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
>>> atime=1363009687.040357529, stime=1363010116.496298252,
>>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
>>> blocks=3, flags=0x0 }
>>>    19527 pwd      RET   fstatat 0
>>>    19527 pwd      CALL  fstatat(0x4,0x801018084,0x7fffffffd940,0x200)
>>>    19527 pwd      NAMI  "user6055"
>>> [.........................................]
>>>
>>> and next get stat of all directories in /home
>>
>> Slightly different version of the patch was committed as r247560.
>>
>> The situation could only happen if the parent directory contains the "."
>> entry with inode number equal to the inode number of the subdirectory.
>> Can you confirm that this is your case ?
>>
>
> Yes, it is.
> I'll try again on the latest snapshot. Thanks!
>

Yes.
On the latest r248313 the situation is similar: if the path contains ".",
the NFS client still stats every directory in /home.





-- 
      Best Regards,

      Ilia Noskov
      Regional Network Information Center (RU-CENTER)
      phone: +7 495 737-0601
      fax: +7 495 737-0602
      http://www.nic.ru

From owner-freebsd-fs@FreeBSD.ORG  Fri Mar 15 13:59:56 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 168FE199
 for <freebsd-fs@freebsd.org>; Fri, 15 Mar 2013 13:59:56 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
 by mx1.freebsd.org (Postfix) with ESMTP id 4E8B6258
 for <freebsd-fs@freebsd.org>; Fri, 15 Mar 2013 13:59:55 +0000 (UTC)
Received: from tom.home (kostik@localhost [127.0.0.1])
 by kib.kiev.ua (8.14.6/8.14.6) with ESMTP id r2FDxo98092642;
 Fri, 15 Mar 2013 15:59:50 +0200 (EET)
 (envelope-from kostikbel@gmail.com)
DKIM-Filter: OpenDKIM Filter v2.8.0 kib.kiev.ua r2FDxo98092642
Received: (from kostik@localhost)
 by tom.home (8.14.6/8.14.6/Submit) id r2FDxo9B092641;
 Fri, 15 Mar 2013 15:59:50 +0200 (EET)
 (envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com
 using -f
Date: Fri, 15 Mar 2013 15:59:50 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Ilia Noskov <phantom@phantom.su>
Subject: Re: should vn_fullpath1() ever return a path with "." in it?
Message-ID: <20130315135950.GU3794@kib.kiev.ua>
References: <1208475167.3432384.1362099531469.JavaMail.root@erie.cs.uoguelph.ca>
 <51417C47.8010304@phantom.su> <20130314090847.GH3794@kib.kiev.ua>
 <5141A212.9050909@phantom.su> <5143204E.90003@phantom.su>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="IcaJnnNV5xAPxpBT"
Content-Disposition: inline
In-Reply-To: <5143204E.90003@phantom.su>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
 version=3.3.2
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Mar 2013 13:59:56 -0000


--IcaJnnNV5xAPxpBT
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Mar 15, 2013 at 05:21:18PM +0400, Ilia Noskov wrote:
> On 03/14/2013 02:10 PM, Ilia Noskov wrote:
> > On 03/14/2013 01:08 PM, Konstantin Belousov wrote:
> >> On Thu, Mar 14, 2013 at 11:29:11AM +0400, Noskov Ilia wrote:
> >>> Strange behavior on nfs-client after apply this patch:
> >>>
> >>> sysctl debug.disablecwd=0
> >>> sysctl debug.disablefullpath=0
> >>>
> >>> # mount -v -t nfs
> >>> 192.168.168.1:/pool on /home (nfs, noatime, nfsv4acls, fsid
> >>> 02ff003a3a000000)
> >>> # ls /home | wc -l
> >>>       4946
> >>> # cd /home/user6308/.ro
> >>> # time pwd
> >>> /home/user6308/.ro
> >>> 0.008u 0.269s 0:08.47 3.0%    4+157k 0+0io 0pf+0w
> >>> # ktrace -t+ -i pwd
> >>>
> >>>
> >>> ktrace.out is big (1MB). Attach or not?
> >>>
> >>>
> >>>
> >>> A small piece of trace:
> >>>    19527 pwd      CALL
> >>> mmap(0,0x400000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0xffffffff,0)
> >>>
> >>>    19527 pwd      RET   mmap 34376515584/0x801000000
> >>>    19527 pwd      CALL  __getcwd(0x801006400,0x400)
> >>>    19527 pwd      NAMI  ".."
> >>>    19527 pwd      NAMI  ".."
> >>>    19527 pwd      RET   __getcwd -1 errno 2 No such file or directory
> >>>    19527 pwd      CALL  stat(0x800947a14,0x7fffffffd940)
> >>>    19527 pwd      NAMI  "/"
> >>>    19527 pwd      STRU  struct stat {dev=98, ino=2, mode=drwxr-xr-x ,
> >>> nlink=19, uid=0, gid=0, rdev=2120, atime=1363244893, stime=1362653279,
> >>> ctime=1362653279, birthtime=1200836451, size=1024, blksize=16384,
> >>> blocks=4, flags=0x0 }
> >>>    19527 pwd      RET   stat 0
> >>>    19527 pwd      CALL  lstat(0x80094779c,0x7fffffffd940)
> >>>    19527 pwd      NAMI  "."
> >>>    19527 pwd      STRU  struct stat {dev=1230702064, ino=145,
> >>> mode=drwxr-xr-x , nlink=2, uid=0, gid=0, rdev=4294967295,
> >>> atime=1363244672.246785874, stime=1363244792.864201338,
> >>> ctime=1363244792.864201338, birthtime=-1, size=3, blksize=4096,
> >>> blocks=3, flags=0x0 }
> >>>    19527 pwd      RET   lstat 0
> >>>    19527 pwd      CALL  openat(0xffffff9c,0x80094779b,0x100000,0x2)
> >>>    19527 pwd      NAMI  ".."
> >>>    19527 pwd      RET   openat 3
> >>>    19527 pwd      CALL  fstat(0x3,0x7fffffffd880)
> >>>    19527 pwd      STRU  struct stat {dev=1230702064, ino=4,
> >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> >>> atime=1363244665.232140704, stime=1363010116.496298252,
> >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> >>> blocks=3, flags=0x0 }
> >>>    19527 pwd      RET   fstat 0
> >>>    19527 pwd      CALL  fcntl(0x3,F_SETFD,FD_CLOEXEC)
> >>>    19527 pwd      RET   fcntl 0
> >>>    19527 pwd      CALL  fstatfs(0x3,0x7fffffffd660)
> >>>    19527 pwd      RET   fstatfs 0
> >>>    19527 pwd      CALL  fstat(0x3,0x7fffffffd940)
> >>>    19527 pwd      STRU  struct stat {dev=1230702064, ino=4,
> >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> >>> atime=1363244665.232140704, stime=1363010116.496298252,
> >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> >>> blocks=3, flags=0x0 }
> >>>    19527 pwd      RET   fstat 0
> >>>    19527 pwd      CALL
> >>> getdirentries(0x3,0x801018000,0x1000,0x8010160a8)
> >>>    19527 pwd      RET   getdirentries 4096/0x1000
> >>>    19527 pwd      CALL  fstat(0x3,0x7fffffffd940)
> >>>    19527 pwd      STRU  struct stat {dev=1230702064, ino=4,
> >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> >>> atime=1363244665.232140704, stime=1363010116.496298252,
> >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> >>> blocks=3, flags=0x0 }
> >>>    19527 pwd      RET   fstat 0
> >>>    19527 pwd      CALL  openat(0x3,0x80094779b,0x100000,0)
> >>>    19527 pwd      NAMI  ".."
> >>>    19527 pwd      RET   openat 4
> >>> [..............................]
> >>>    19527 pwd      CALL  madvise(0x801016000,0x1000,MADV_FREE)
> >>>    19527 pwd      RET   madvise 0
> >>>    19527 pwd      CALL  madvise(0x801018000,0x2000,MADV_FREE)
> >>>    19527 pwd      RET   madvise 0
> >>>    19527 pwd      CALL  close(0x3)
> >>>    19527 pwd      RET   close 0
> >>>    19527 pwd      CALL  fstat(0x4,0x7fffffffd880)
> >>>    19527 pwd      STRU  struct stat {dev=973143810, ino=4,
> >>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295,
> >>> atime=1363244767.460164771, stime=1363172100.380266923,
> >>> ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096,
> >>> blocks=713, flags=0x0 }
> >>>    19527 pwd      RET   fstat 0
> >>>    19527 pwd      CALL  fcntl(0x4,F_SETFD,FD_CLOEXEC)
> >>>    19527 pwd      RET   fcntl 0
> >>>    19527 pwd      CALL  fstatfs(0x4,0x7fffffffd660)
> >>>    19527 pwd      RET   fstatfs 0
> >>>    19527 pwd      CALL  fstat(0x4,0x7fffffffd940)
> >>>    19527 pwd      STRU  struct stat {dev=973143810, ino=4,
> >>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295,
> >>> atime=1363244767.460164771, stime=1363172100.380266923,
> >>> ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096,
> >>> blocks=713, flags=0x0 }
> >>>    19527 pwd      RET   fstat 0
> >>>    19527 pwd      CALL
> >>> getdirentries(0x4,0x801018000,0x1000,0x8010160a8)
> >>>    19527 pwd      RET   getdirentries 4096/0x1000
> >>>    19527 pwd      CALL  fstatat(0x4,0x801018030,0x7fffffffd940,0x200)
> >>>    19527 pwd      NAMI  "user6158"
> >>>    19527 pwd      STRU  struct stat {dev=1774902232, ino=4,
> >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> >>> atime=1363009687.040357529, stime=1363010116.496298252,
> >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> >>> blocks=3, flags=0x0 }
> >>>    19527 pwd      RET   fstatat 0
> >>>    19527 pwd      CALL  fstatat(0x4,0x80101804c,0x7fffffffd940,0x200)
> >>>    19527 pwd      NAMI  "user2289"
> >>>    19527 pwd      STRU  struct stat {dev=1988229825, ino=4,
> >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> >>> atime=1363009687.040357529, stime=1363010116.496298252,
> >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> >>> blocks=3, flags=0x0 }
> >>>    19527 pwd      RET   fstatat 0
> >>>    19527 pwd      CALL  fstatat(0x4,0x801018068,0x7fffffffd940,0x200)
> >>>    19527 pwd      NAMI  "user4761"
> >>>    19527 pwd      STRU  struct stat {dev=2438657130, ino=4,
> >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> >>> atime=1363009687.040357529, stime=1363010116.496298252,
> >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> >>> blocks=3, flags=0x0 }
> >>>    19527 pwd      RET   fstatat 0
> >>>    19527 pwd      CALL  fstatat(0x4,0x801018084,0x7fffffffd940,0x200)
> >>>    19527 pwd      NAMI  "user6055"
> >>> [.........................................]
> >>>
> >>> and next get stat of all directories in /home
> >>
> >> Slightly different version of the patch was committed as r247560.
> >>
> >> The situation could only happen if the parent directory contains the "."
> >> entry with inode number equal to the inode number of the subdirectory.
> >> Can you confirm that this is your case ?
> >>
> >
> > Yes, it is.
> > I'll try again on the latest snapshot. Thanks!
> >
>
> Yes.
> On latest r248313 similar situation - if path contains "." then
> nfsclient get stat of all directories in /home.

What path ?

Did you read the description of the situation when r248313 returns
ENOENT to delegate the resolving to usermode ? Can you confirm that
this is your situation ?

--IcaJnnNV5xAPxpBT
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iQIcBAEBAgAGBQJRQylVAAoJEJDCuSvBvK1BslAP/A28vT8tyv6iYpIKHO0Vthii
TKDhLNiCZMw/9udsddqgJfg2IbiqtNH+L/8KqHdpk5pkF7emfolQaeINY4AdxDvo
/0x6imNAGBbLkwIpyNGHJstLfqmF8FWBI9wu26wbIIV2Lv0darWkm1qzulJzC8B7
k1NIb5h2Y5SzwjG51ZOxXz5xikvpruDZnDIHKC/wnE+kqu1cy0kd/1aem8+3B5CT
J3yc95R+tMiYKkNssUxPgLaobSRF//k1H4uKjtYiEB+ceXoDuwaE7cXjmqrSHpnw
jgwdXZWE3b57pwv6Xs7bmAVLsWLTwqm0Qr9R7FcFcC8or9vNDz7dng6OayOS/zG/
tGYfCOldjDzznDQgF7gBDlUiqnBIy1pdwFA7asiwr89JiYv2055n67rxZGUJh590
enjhsCaKQSHONvskARmP+ETxqYHPnsUocku1ebJxkeQV0HQp/qHqnswhk3Sy4yqF
gNkKLyipgjUXsV1ryre3zWi8GIaEq4LRNi5BEOkBcPsZHohwZj2N02wnaUQcr6Bi
Ynup8pwerxOeBG1O9TiLMD/8VP6jM3mOD/UHtdE9cUFUACju+EMHhRtBo7yJdrXL
h/4GiK9Xvf1L3mHMNJv7Yg+f3xvEcGovP3pjYhNAjyusxntT0dHaJTMLgnA+Cf0Z
o5j6WYrqS/R0wjrofIOj
=Bk3F
-----END PGP SIGNATURE-----

--IcaJnnNV5xAPxpBT--

From owner-freebsd-fs@FreeBSD.ORG  Fri Mar 15 14:25:08 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 4DAD8A52
 for <freebsd-fs@freebsd.org>; Fri, 15 Mar 2013 14:25:08 +0000 (UTC)
 (envelope-from rmacklem@uoguelph.ca)
Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca
 [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id DE5373E4
 for <freebsd-fs@freebsd.org>; Fri, 15 Mar 2013 14:25:07 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqAEADYuQ1GDaFvO/2dsb2JhbABDiDG6AIJlgXt0gioBAQUjVhsOCgICDRkCWQaIJ7A1knOBI40+NAeCLYETA5ZbiWyHFoMmIIFs
X-IronPort-AV: E=Sophos;i="4.84,850,1355115600"; d="scan'208";a="19169934"
Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca)
 ([131.104.91.206])
 by esa-annu.net.uoguelph.ca with ESMTP; 15 Mar 2013 10:25:00 -0400
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1])
 by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id B45EBB4034;
 Fri, 15 Mar 2013 10:25:00 -0400 (EDT)
Date: Fri, 15 Mar 2013 10:25:00 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Konstantin Belousov <kostikbel@gmail.com>
Message-ID: <2081421885.3937873.1363357500724.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <20130315135950.GU3794@kib.kiev.ua>
Subject: Re: should vn_fullpath1() ever return a path with "." in it?
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [172.17.91.201]
X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692)
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Mar 2013 14:25:08 -0000

Kostik Belousov wrote:
> On Fri, Mar 15, 2013 at 05:21:18PM +0400, Ilia Noskov wrote:
> > On 03/14/2013 02:10 PM, Ilia Noskov wrote:
> > > On 03/14/2013 01:08 PM, Konstantin Belousov wrote:
> > >> On Thu, Mar 14, 2013 at 11:29:11AM +0400, Noskov Ilia wrote:
> > >>> Strange behavior on nfs-client after apply this patch:
> > >>>
> > >>> sysctl debug.disablecwd=0
> > >>> sysctl debug.disablefullpath=0
> > >>>
> > >>> # mount -v -t nfs
> > >>> 192.168.168.1:/pool on /home (nfs, noatime, nfsv4acls, fsid
> > >>> 02ff003a3a000000)
You don't mention how your server is configured. Does your V4: line
in /etc/exports have "/" as the root or "/pool"? See my comment at
the bottom for why this might matter.

Also, if you have a recent version of nfsstat, you can use "nfsstat -m"
to dump out exactly what options the mount is actually using. (Gives
a lot more info than the above.)

> > >>> # ls /home | wc -l
> > >>>       4946
> > >>> # cd /home/user6308/.ro
Is user6308 a separate file system from /home on the server?
(If so, I would expect the userland getcwd() to stat all the entries in
 /home to get the st_dev and st_ino fields of them all.)
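
In the quoted trace below, st_dev is 1230702064 for "." and for the first
"..", but 973143810 for the next ".." up, so the client at least sees two
different file systems there. A minimal sketch of that kind of check,
assuming a made-up helper name is_fs_root() (it is not a libc function) and
that comparing st_dev with the parent is good enough for illustration:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: returns 1 if `dir` looks like the root of a file
 * system (its st_dev differs from its parent's, or it is its own parent),
 * 0 if it does not, and -1 on error.
 */
static int
is_fs_root(const char *dir)
{
    struct stat self, parent;
    char path[PATH_MAX];
    int n;

    assert(dir != NULL);

    /* Build "<dir>/.." so the parent can be stat()ed. */
    n = snprintf(path, sizeof(path), "%s/..", dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return (-1);
    if (stat(dir, &self) == -1 || stat(path, &parent) == -1)
        return (-1);

    /* A different device number means a mount boundary was crossed. */
    if (self.st_dev != parent.st_dev)
        return (1);
    /* Same device and same inode: the directory is its own parent ("/"). */
    return (self.st_ino == parent.st_ino);
}
```

Calling is_fs_root("/home/user6308") on the client would be one way to
confirm the answer to the question above without reading the whole trace.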

> > >>> # time pwd
> > >>> /home/user6308/.ro
> > >>> 0.008u 0.269s 0:08.47 3.0% 4+157k 0+0io 0pf+0w
> > >>> # ktrace -t+ -i pwd
> > >>>
> > >>>
> > >>> ktrace.out is big (1MB). Attach or not?
> > >>>
> > >>>
> > >>>
> > >>> A small piece of trace:
> > >>>    19527 pwd CALL
> > >>> mmap(0,0x400000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0xffffffff,0)
> > >>>
> > >>>    19527 pwd RET mmap 34376515584/0x801000000
> > >>>    19527 pwd CALL __getcwd(0x801006400,0x400)
> > >>>    19527 pwd NAMI ".."
> > >>>    19527 pwd NAMI ".."
> > >>>    19527 pwd RET __getcwd -1 errno 2 No such file or directory
> > >>>    19527 pwd CALL stat(0x800947a14,0x7fffffffd940)
> > >>>    19527 pwd NAMI "/"
> > >>>    19527 pwd STRU struct stat {dev=98, ino=2, mode=drwxr-xr-x ,
> > >>> nlink=19, uid=0, gid=0, rdev=2120, atime=1363244893,
> > >>> stime=1362653279,
> > >>> ctime=1362653279, birthtime=1200836451, size=1024,
> > >>> blksize=16384,
> > >>> blocks=4, flags=0x0 }
> > >>>    19527 pwd RET stat 0
> > >>>    19527 pwd CALL lstat(0x80094779c,0x7fffffffd940)
> > >>>    19527 pwd NAMI "."
> > >>>    19527 pwd STRU struct stat {dev=1230702064, ino=145,
> > >>> mode=drwxr-xr-x , nlink=2, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363244672.246785874, stime=1363244792.864201338,
> > >>> ctime=1363244792.864201338, birthtime=-1, size=3, blksize=4096,
> > >>> blocks=3, flags=0x0 }
> > >>>    19527 pwd RET lstat 0
> > >>>    19527 pwd CALL openat(0xffffff9c,0x80094779b,0x100000,0x2)
> > >>>    19527 pwd NAMI ".."
> > >>>    19527 pwd RET openat 3
> > >>>    19527 pwd CALL fstat(0x3,0x7fffffffd880)
> > >>>    19527 pwd STRU struct stat {dev=1230702064, ino=4,
> > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363244665.232140704, stime=1363010116.496298252,
> > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> > >>> blocks=3, flags=0x0 }
> > >>>    19527 pwd RET fstat 0
> > >>>    19527 pwd CALL fcntl(0x3,F_SETFD,FD_CLOEXEC)
> > >>>    19527 pwd RET fcntl 0
> > >>>    19527 pwd CALL fstatfs(0x3,0x7fffffffd660)
> > >>>    19527 pwd RET fstatfs 0
> > >>>    19527 pwd CALL fstat(0x3,0x7fffffffd940)
> > >>>    19527 pwd STRU struct stat {dev=1230702064, ino=4,
> > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363244665.232140704, stime=1363010116.496298252,
> > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> > >>> blocks=3, flags=0x0 }
> > >>>    19527 pwd RET fstat 0
> > >>>    19527 pwd CALL
> > >>> getdirentries(0x3,0x801018000,0x1000,0x8010160a8)
> > >>>    19527 pwd RET getdirentries 4096/0x1000
> > >>>    19527 pwd CALL fstat(0x3,0x7fffffffd940)
> > >>>    19527 pwd STRU struct stat {dev=1230702064, ino=4,
> > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363244665.232140704, stime=1363010116.496298252,
> > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> > >>> blocks=3, flags=0x0 }
> > >>>    19527 pwd RET fstat 0
> > >>>    19527 pwd CALL openat(0x3,0x80094779b,0x100000,0)
> > >>>    19527 pwd NAMI ".."
> > >>>    19527 pwd RET openat 4
> > >>> [..............................]
> > >>>    19527 pwd CALL madvise(0x801016000,0x1000,MADV_FREE)
> > >>>    19527 pwd RET madvise 0
> > >>>    19527 pwd CALL madvise(0x801018000,0x2000,MADV_FREE)
> > >>>    19527 pwd RET madvise 0
> > >>>    19527 pwd CALL close(0x3)
> > >>>    19527 pwd RET close 0
> > >>>    19527 pwd CALL fstat(0x4,0x7fffffffd880)
> > >>>    19527 pwd STRU struct stat {dev=973143810, ino=4,
> > >>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363244767.460164771, stime=1363172100.380266923,
> > >>> ctime=1363172100.380266923, birthtime=-1, size=4948,
> > >>> blksize=4096,
> > >>> blocks=713, flags=0x0 }
> > >>>    19527 pwd RET fstat 0
> > >>>    19527 pwd CALL fcntl(0x4,F_SETFD,FD_CLOEXEC)
> > >>>    19527 pwd RET fcntl 0
> > >>>    19527 pwd CALL fstatfs(0x4,0x7fffffffd660)
> > >>>    19527 pwd RET fstatfs 0
> > >>>    19527 pwd CALL fstat(0x4,0x7fffffffd940)
> > >>>    19527 pwd STRU struct stat {dev=973143810, ino=4,
> > >>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363244767.460164771, stime=1363172100.380266923,
> > >>> ctime=1363172100.380266923, birthtime=-1, size=4948,
> > >>> blksize=4096,
> > >>> blocks=713, flags=0x0 }
> > >>>    19527 pwd RET fstat 0
> > >>>    19527 pwd CALL
> > >>> getdirentries(0x4,0x801018000,0x1000,0x8010160a8)
> > >>>    19527 pwd RET getdirentries 4096/0x1000
> > >>>    19527 pwd CALL fstatat(0x4,0x801018030,0x7fffffffd940,0x200)
> > >>>    19527 pwd NAMI "user6158"
> > >>>    19527 pwd STRU struct stat {dev=1774902232, ino=4,
> > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363009687.040357529, stime=1363010116.496298252,
> > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> > >>> blocks=3, flags=0x0 }
> > >>>    19527 pwd RET fstatat 0
> > >>>    19527 pwd CALL fstatat(0x4,0x80101804c,0x7fffffffd940,0x200)
> > >>>    19527 pwd NAMI "user2289"
> > >>>    19527 pwd STRU struct stat {dev=1988229825, ino=4,
> > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363009687.040357529, stime=1363010116.496298252,
> > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> > >>> blocks=3, flags=0x0 }
> > >>>    19527 pwd RET fstatat 0
> > >>>    19527 pwd CALL fstatat(0x4,0x801018068,0x7fffffffd940,0x200)
> > >>>    19527 pwd NAMI "user4761"
> > >>>    19527 pwd STRU struct stat {dev=2438657130, ino=4,
> > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363009687.040357529, stime=1363010116.496298252,
> > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> > >>> blocks=3, flags=0x0 }
> > >>>    19527 pwd RET fstatat 0
> > >>>    19527 pwd CALL fstatat(0x4,0x801018084,0x7fffffffd940,0x200)
> > >>>    19527 pwd NAMI "user6055"
> > >>> [.........................................]
> > >>>
> > >>> and next get stat of all directories in /home
> > >>
> > >> Slightly different version of the patch was committed as r247560.
> > >>
> > >> The situation could only happen if the parent directory contains
> > >> the "."
> > >> entry with inode number equal to the inode number of the
> > >> subdirectory.
> > >> Can you confirm that this is your case ?
> > >>
> > >
> > > Yes, it is.
> > > I'll try again on the latest snapshot. Thanks!
> > >
> >
> > Yes.
> > On latest r248313 similar situation - if path contains "." then
> > nfsclient get stat of all directories in /home.
> 
> What path ?
> 
> Did you read the description of the situation when r248313 returns
> ENOENT to delegate the resolving to usermode ? Can you confirm that
> this is your situation ?

I think the patch is doing the correct thing. When __getcwd() returns
ENOENT, the userland algorithm in getcwd() must stat() all entries in
the directory to figure out if it is a mount point using the st_ino
and st_dev fields. (You can look at it in the libc sources, if you'd
like. Just look for getcwd.c.) I think this is what Kostik is referring
to.
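
A minimal sketch of one step of that userland fallback, for illustration
only. It assumes a simplified version that always stats every entry of the
parent (the real getcwd.c in libc is more involved), and name_in_parent()
is a made-up helper name, not a libc function:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of libc: find the name that the directory
 * `child` has inside its parent by comparing st_dev and st_ino.
 * Returns 0 and fills `name` on success, -1 on failure.
 */
static int
name_in_parent(const char *child, char *name, size_t namelen)
{
    struct stat target, sb;
    struct dirent *dp;
    DIR *dirp;
    char parent[PATH_MAX];
    int n, found = -1;

    assert(child != NULL && name != NULL && namelen > 0);

    if (lstat(child, &target) == -1)
        return (-1);
    n = snprintf(parent, sizeof(parent), "%s/..", child);
    if (n < 0 || (size_t)n >= sizeof(parent))
        return (-1);
    if ((dirp = opendir(parent)) == NULL)
        return (-1);

    while ((dp = readdir(dirp)) != NULL) {
        if (strcmp(dp->d_name, ".") == 0 || strcmp(dp->d_name, "..") == 0)
            continue;
        /* One stat per entry: the slow part on a large NFS directory. */
        if (fstatat(dirfd(dirp), dp->d_name, &sb, AT_SYMLINK_NOFOLLOW) == -1)
            continue;
        if (sb.st_dev == target.st_dev && sb.st_ino == target.st_ino) {
            if (strlen(dp->d_name) < namelen) {
                strcpy(name, dp->d_name);
                found = 0;
            }
            break;
        }
    }
    closedir(dirp);
    return (found);
}
```

With about 4946 entries in /home, a loop like this means thousands of
fstatat() calls, which matches the run of fstatat() calls in the trace.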

If I understand your issue, it is that this takes a long time, since
/home is large. There are a couple of things that *might* reduce the
time this takes.

1 - If you are NFSv4 mounting "/pool", you could specify /pool as the
    root in your V4: /etc/exports line. Something like:
    V4: /pool ...
    Then you do the mount with 192.168.168.1:/.
    This will make the "/pool" a root point and might avoid the
    problem, but I am not sure.
2 - Adding rdirplus to the mount options will make it get attributes
    for entries in a directory when it does a readdir and cache them.
    This might speed things up.

rick

From owner-freebsd-fs@FreeBSD.ORG  Fri Mar 15 14:37:33 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 2EED2CCB
 for <freebsd-fs@freebsd.org>; Fri, 15 Mar 2013 14:37:33 +0000 (UTC)
 (envelope-from rmacklem@uoguelph.ca)
Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca
 [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id D66D463B
 for <freebsd-fs@freebsd.org>; Fri, 15 Mar 2013 14:37:32 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqAEAM8xQ1GDaFvO/2dsb2JhbABDiDG6AIJlgXt0gioBAQUjVhsOCgICDRkCWQaIJ7AfknSBI40+NAeCLYETA5ZbiWyHFoMmIIFs
X-IronPort-AV: E=Sophos;i="4.84,850,1355115600"; d="scan'208";a="21391005"
Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca)
 ([131.104.91.206])
 by esa-jnhn.mail.uoguelph.ca with ESMTP; 15 Mar 2013 10:37:31 -0400
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1])
 by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id BBBABB4036;
 Fri, 15 Mar 2013 10:37:31 -0400 (EDT)
Date: Fri, 15 Mar 2013 10:37:31 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Konstantin Belousov <kostikbel@gmail.com>
Message-ID: <800896676.3938590.1363358251751.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <20130315135950.GU3794@kib.kiev.ua>
Subject: Re: should vn_fullpath1() ever return a path with "." in it?
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [172.17.91.201]
X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692)
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Mar 2013 14:37:33 -0000

Kostik Belousov wrote:
> On Fri, Mar 15, 2013 at 05:21:18PM +0400, Ilia Noskov wrote:
> > On 03/14/2013 02:10 PM, Ilia Noskov wrote:
> > > On 03/14/2013 01:08 PM, Konstantin Belousov wrote:
> > >> On Thu, Mar 14, 2013 at 11:29:11AM +0400, Noskov Ilia wrote:
> > >>> Strange behavior on nfs-client after apply this patch:
> > >>>
> > >>> sysctl debug.disablecwd=0
> > >>> sysctl debug.disablefullpath=0
> > >>>
> > >>> # mount -v -t nfs
> > >>> 192.168.168.1:/pool on /home (nfs, noatime, nfsv4acls, fsid
> > >>> 02ff003a3a000000)
> > >>> # ls /home | wc -l
> > >>>       4946
> > >>> # cd /home/user6308/.ro
I forgot to mention in the previous post. If user6308 is a different
file system than /home, forget about my suggestion #1, because I
know it won't help for this case.

> > >>> # time pwd
> > >>> /home/user6308/.ro
> > >>> 0.008u 0.269s 0:08.47 3.0% 4+157k 0+0io 0pf+0w
> > >>> # ktrace -t+ -i pwd
> > >>>
> > >>>
> > >>> ktrace.out is big (1MB). Attach or not?
> > >>>
> > >>>
> > >>>
> > >>> A small piece of trace:
> > >>>    19527 pwd CALL
> > >>> mmap(0,0x400000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0xffffffff,0)
> > >>>
> > >>>    19527 pwd RET mmap 34376515584/0x801000000
> > >>>    19527 pwd CALL __getcwd(0x801006400,0x400)
> > >>>    19527 pwd NAMI ".."
> > >>>    19527 pwd NAMI ".."
> > >>>    19527 pwd RET __getcwd -1 errno 2 No such file or directory
> > >>>    19527 pwd CALL stat(0x800947a14,0x7fffffffd940)
> > >>>    19527 pwd NAMI "/"
> > >>>    19527 pwd STRU struct stat {dev=98, ino=2, mode=drwxr-xr-x ,
> > >>> nlink=19, uid=0, gid=0, rdev=2120, atime=1363244893,
> > >>> stime=1362653279,
> > >>> ctime=1362653279, birthtime=1200836451, size=1024,
> > >>> blksize=16384,
> > >>> blocks=4, flags=0x0 }
> > >>>    19527 pwd RET stat 0
> > >>>    19527 pwd CALL lstat(0x80094779c,0x7fffffffd940)
> > >>>    19527 pwd NAMI "."
> > >>>    19527 pwd STRU struct stat {dev=1230702064, ino=145,
> > >>> mode=drwxr-xr-x , nlink=2, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363244672.246785874, stime=1363244792.864201338,
> > >>> ctime=1363244792.864201338, birthtime=-1, size=3, blksize=4096,
> > >>> blocks=3, flags=0x0 }
> > >>>    19527 pwd RET lstat 0
> > >>>    19527 pwd CALL openat(0xffffff9c,0x80094779b,0x100000,0x2)
> > >>>    19527 pwd NAMI ".."
> > >>>    19527 pwd RET openat 3
> > >>>    19527 pwd CALL fstat(0x3,0x7fffffffd880)
> > >>>    19527 pwd STRU struct stat {dev=1230702064, ino=4,
> > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363244665.232140704, stime=1363010116.496298252,
> > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> > >>> blocks=3, flags=0x0 }
> > >>>    19527 pwd RET fstat 0
> > >>>    19527 pwd CALL fcntl(0x3,F_SETFD,FD_CLOEXEC)
> > >>>    19527 pwd RET fcntl 0
> > >>>    19527 pwd CALL fstatfs(0x3,0x7fffffffd660)
> > >>>    19527 pwd RET fstatfs 0
> > >>>    19527 pwd CALL fstat(0x3,0x7fffffffd940)
> > >>>    19527 pwd STRU struct stat {dev=1230702064, ino=4,
> > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363244665.232140704, stime=1363010116.496298252,
> > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> > >>> blocks=3, flags=0x0 }
> > >>>    19527 pwd RET fstat 0
> > >>>    19527 pwd CALL
> > >>> getdirentries(0x3,0x801018000,0x1000,0x8010160a8)
> > >>>    19527 pwd RET getdirentries 4096/0x1000
> > >>>    19527 pwd CALL fstat(0x3,0x7fffffffd940)
> > >>>    19527 pwd STRU struct stat {dev=1230702064, ino=4,
> > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363244665.232140704, stime=1363010116.496298252,
> > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> > >>> blocks=3, flags=0x0 }
> > >>>    19527 pwd RET fstat 0
> > >>>    19527 pwd CALL openat(0x3,0x80094779b,0x100000,0)
> > >>>    19527 pwd NAMI ".."
> > >>>    19527 pwd RET openat 4
> > >>> [..............................]
> > >>>    19527 pwd CALL madvise(0x801016000,0x1000,MADV_FREE)
> > >>>    19527 pwd RET madvise 0
> > >>>    19527 pwd CALL madvise(0x801018000,0x2000,MADV_FREE)
> > >>>    19527 pwd RET madvise 0
> > >>>    19527 pwd CALL close(0x3)
> > >>>    19527 pwd RET close 0
> > >>>    19527 pwd CALL fstat(0x4,0x7fffffffd880)
> > >>>    19527 pwd STRU struct stat {dev=973143810, ino=4,
> > >>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363244767.460164771, stime=1363172100.380266923,
> > >>> ctime=1363172100.380266923, birthtime=-1, size=4948,
> > >>> blksize=4096,
> > >>> blocks=713, flags=0x0 }
> > >>>    19527 pwd RET fstat 0
> > >>>    19527 pwd CALL fcntl(0x4,F_SETFD,FD_CLOEXEC)
> > >>>    19527 pwd RET fcntl 0
> > >>>    19527 pwd CALL fstatfs(0x4,0x7fffffffd660)
> > >>>    19527 pwd RET fstatfs 0
> > >>>    19527 pwd CALL fstat(0x4,0x7fffffffd940)
> > >>>    19527 pwd STRU struct stat {dev=973143810, ino=4,
> > >>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363244767.460164771, stime=1363172100.380266923,
> > >>> ctime=1363172100.380266923, birthtime=-1, size=4948,
> > >>> blksize=4096,
> > >>> blocks=713, flags=0x0 }
> > >>>    19527 pwd RET fstat 0
> > >>>    19527 pwd CALL
> > >>> getdirentries(0x4,0x801018000,0x1000,0x8010160a8)
> > >>>    19527 pwd RET getdirentries 4096/0x1000
> > >>>    19527 pwd CALL fstatat(0x4,0x801018030,0x7fffffffd940,0x200)
> > >>>    19527 pwd NAMI "user6158"
> > >>>    19527 pwd STRU struct stat {dev=1774902232, ino=4,
> > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363009687.040357529, stime=1363010116.496298252,
> > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> > >>> blocks=3, flags=0x0 }
> > >>>    19527 pwd RET fstatat 0
> > >>>    19527 pwd CALL fstatat(0x4,0x80101804c,0x7fffffffd940,0x200)
> > >>>    19527 pwd NAMI "user2289"
> > >>>    19527 pwd STRU struct stat {dev=1988229825, ino=4,
> > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363009687.040357529, stime=1363010116.496298252,
> > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> > >>> blocks=3, flags=0x0 }
> > >>>    19527 pwd RET fstatat 0
> > >>>    19527 pwd CALL fstatat(0x4,0x801018068,0x7fffffffd940,0x200)
> > >>>    19527 pwd NAMI "user4761"
> > >>>    19527 pwd STRU struct stat {dev=2438657130, ino=4,
> > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> > >>> atime=1363009687.040357529, stime=1363010116.496298252,
> > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> > >>> blocks=3, flags=0x0 }
> > >>>    19527 pwd RET fstatat 0
> > >>>    19527 pwd CALL fstatat(0x4,0x801018084,0x7fffffffd940,0x200)
> > >>>    19527 pwd NAMI "user6055"
> > >>> [.........................................]
> > >>>
> > >>> and next get stat of all directories in /home
> > >>
> > >> Slightly different version of the patch was committed as r247560.
> > >>
> > >> The situation could only happen if the parent directory contains
> > >> the "."
> > >> entry with inode number equal to the inode number of the
> > >> subdirectory.
> > >> Can you confirm that this is your case ?
> > >>
> > >
> > > Yes, it is.
> > > I'll try again on the latest snapshot. Thanks!
> > >
> >
> > Yes.
> > On latest r248313 similar situation - if path contains "." then
> > nfsclient get stat of all directories in /home.
> 
> What path ?
> 
> Did you read the description of the situation when r248313 returns
> ENOENT to delegate the resolving to usermode ? Can you confirm that
> this is your situation ?

From owner-freebsd-fs@FreeBSD.ORG  Fri Mar 15 14:58:58 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 128DF921;
 Fri, 15 Mar 2013 14:58:58 +0000 (UTC)
 (envelope-from fjwcash@gmail.com)
Received: from mail-qa0-f48.google.com (mail-qa0-f48.google.com
 [209.85.216.48]) by mx1.freebsd.org (Postfix) with ESMTP id B4C0275D;
 Fri, 15 Mar 2013 14:58:57 +0000 (UTC)
Received: by mail-qa0-f48.google.com with SMTP id j8so336636qah.0
 for <multiple recipients>; Fri, 15 Mar 2013 07:58:51 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type;
 bh=qkDuTVQJ6xqFXDMQGdzSS6hRckvP99eSBCbNyltoeoc=;
 b=WlOM50D7ZgsjK2u0UpZOHzNPDZutSZNir/iL8rVtDFYeBkMrCLmNDG89kOpMSc2BLt
 0VA+gQ7wEInUZdR1RvEs3OoOlxhOhdQXnAIpk0baTWSDrvneMMOENENvCdA4cqkLl+a0
 g8JxeuIYseV0CvtNyPsCTfqmmj1R6t1weFhtvMi6RH6XBrFBIo2uKLMDaJxrQmMgJ+4q
 u0C5zAo7athMplD69T3rHZTa7IxgwgYEQX3/ypbX1FZtPv5IKBEhK1oCrjySiV000Sez
 b332v3Gvgh0rseheluJvy0uc1pZP0jvdn1MCZ4gUibC+m5FzP04q0lALsoa7lta0o0y1
 CWVA==
MIME-Version: 1.0
X-Received: by 10.49.128.170 with SMTP id np10mr6042926qeb.37.1363359531070;
 Fri, 15 Mar 2013 07:58:51 -0700 (PDT)
Received: by 10.49.50.67 with HTTP; Fri, 15 Mar 2013 07:58:50 -0700 (PDT)
In-Reply-To: <51430744.6020004@FreeBSD.org>
References: <CAOjFWZ6Q=Vs3P-kfGysLzSbw4CnfrJkMEka4AqfSrQJFZDP_qw@mail.gmail.com>
 <51430744.6020004@FreeBSD.org>
Date: Fri, 15 Mar 2013 07:58:50 -0700
Message-ID: <CAOjFWZ5e2t0Y_KOxm+GhX+zXNPfOXb8HKF4uU+Q+N5eWQqLtdg@mail.gmail.com>
Subject: Re: Strange slowdown when cache devices enabled in ZFS
From: Freddie Cash <fjwcash@gmail.com>
To: Andriy Gapon <avg@freebsd.org>
Content-Type: text/plain; charset=UTF-8
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Mar 2013 14:58:58 -0000

How does one do that?  I've never done that before.

Point me to some docs, and I'll see what I can find out.


On Fri, Mar 15, 2013 at 4:34 AM, Andriy Gapon <avg@freebsd.org> wrote:

> on 14/03/2013 20:13 Freddie Cash said the following:
> > the l2arc_feed_thread of zfskern will spin until it takes up 100%
> > of a CPU core
>
> If you see a thread taking 100% where it shouldn't, then just profile it
> and
> actually see what it's doing.
>
> --
> Andriy Gapon
>



-- 
Freddie Cash
fjwcash@gmail.com

From owner-freebsd-fs@FreeBSD.ORG  Fri Mar 15 21:27:34 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 69F416CF
 for <freebsd-fs@freebsd.org>; Fri, 15 Mar 2013 21:27:34 +0000 (UTC)
 (envelope-from girgen@FreeBSD.org)
Received: from melon.pingpong.net (melon.pingpong.net [79.136.116.200])
 by mx1.freebsd.org (Postfix) with ESMTP id A0BB56D2
 for <freebsd-fs@freebsd.org>; Fri, 15 Mar 2013 21:27:33 +0000 (UTC)
Received: from girgBook.local
 (c-1754e155.1525-1-64736c12.cust.bredbandsbolaget.se [85.225.84.23])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (No client certificate requested)
 by melon.pingpong.net (Postfix) with ESMTPSA id C29BE14732;
 Fri, 15 Mar 2013 22:27:31 +0100 (CET)
Message-ID: <51439243.5020604@FreeBSD.org>
Date: Fri, 15 Mar 2013 22:27:31 +0100
From: Palle Girgensohn <girgen@FreeBSD.org>
User-Agent: Postbox 3.0.7 (Macintosh/20130119)
MIME-Version: 1.0
To: Kirk McKusick <mckusick@mckusick.com>
Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?), maybe after
 moving tables and indexes to tablespace on different volume
References: <201303131652.r2DGqSr4051899@chez.mckusick.com>
In-Reply-To: <201303131652.r2DGqSr4051899@chez.mckusick.com>
X-Enigmail-Version: 1.2.3
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: freebsd-fs@freebsd.org, Jeff Roberson <jroberson@jroberson.net>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Mar 2013 21:27:34 -0000

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Kirk McKusick skrev:
> Thanks for your report. It is certainly unlike anything that we have
> seen reported before.
> 
> Are you running your /usr filesystem with (the default) journalled 
> soft updates? You can check this by running the `mount' command with
> no arguments.
> 
> Rather than rebooting your system, it would be most helpful if you 
> could instead shut it down to single user. Then do the following:
> 
> Create a transcript of your session by running `script'. Once running
> in the session run these commands:
> 
> Run `mount' to show your filesystem configuration. Run `df -hi /usr'
> to see whether the inodes are still missing. Verify that you can
> cleanly unmount /usr (e.g., that the unmount does not hang and does
> not complain). Remount /usr and run `df -hi' to see whether the
> inodes are still missing. Unmount /usr again and run `fsck_ffs -p -f
> -d /usr'. If the fsck_ffs fails with an unexpected inconsistency, you
> can run `fsck_ffs -y -d /usr' to force it to clean up. When you have
> the filesystem successfully cleaned up, type `exit' to get out of the
> script session and mail me the transcript of the session
> (typescript).
> 
> Thanks for your help in tracking this down.
> 
> Kirk McKusick

Hi again,

A umount + mount was enough to reclaim the space.

Script started on Fri Mar 15 19:02:22 2013

# mount
/dev/da0s1a on / (ufs, local)
devfs on /dev (devfs, local, multilabel)
/dev/da0s1d on /tmp (ufs, local, soft-updates)
/dev/da0s1f on /usr (ufs, local, soft-updates)
/dev/da0s1e on /var (ufs, local, soft-updates)
/dev/da1s1d on /opt (ufs, local, soft-updates)
procfs on /proc (procfs, local)
fdescfs on /dev/fd (fdescfs)
# df -hi /usr
Filesystem     Size    Used   Avail Capacity iused ifree %iused  Mounted on
/dev/da0s1f    104G     88G    7.5G    92%    283k   13M    2%   /usr
# umount /usr
# mount /usr
# du-hi /usr
Filesystem     Size    Used   Avail Capacity iused ifree %iused  Mounted on
/dev/da0s1f    104G    4.7G     91G     5%    278k   13M    2%   /usr
# ^D
Script done on Fri Mar 15 19:09:26 2013
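
The df-versus-du gap seen in the transcript above can also be measured from a script. The following is a minimal sketch, not part of the original report: the function names are hypothetical, `os.statvfs` stands in for df, and a single-filesystem walk that counts each inode once stands in for `du -sx`. A large positive gap suggests blocks that are allocated but not reachable from any directory.

```python
import os


def df_used_bytes(path):
    """Bytes the filesystem itself reports as in use (roughly what df shows)."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize


def du_bytes(root):
    """Bytes reachable from root without crossing filesystems (roughly du -sx)."""
    root_dev = os.lstat(root).st_dev
    seen = set()
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories that live on another filesystem, like du -x.
        kept = []
        for d in dirnames:
            try:
                if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev:
                    kept.append(d)
            except OSError:
                pass
        dirnames[:] = kept
        for path in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
            try:
                st = os.lstat(path)
            except OSError:
                continue
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # count hard-linked inodes only once
            seen.add(key)
            total += st.st_blocks * 512  # st_blocks is in 512-byte units
    return total


def usage_gap(path):
    """df-style usage minus du-style usage, in bytes."""
    return df_used_bytes(path) - du_bytes(path)
```

Calling `usage_gap("/usr")` as root before and after the umount/mount would show whether the gap collapses, as it did in the transcript.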



But after a couple of hours back in production, following a power-off and
reboot (for other reasons: the remote console card, the iLO, had to be
replaced), an fsck indicates that it might still be losing file references
occasionally. Look at the unreferenced files with size ~ 11111686. This
is exactly how it looked before the unmount/remount, only there were
many, many more.

# fsck /usr
** /dev/da0s1f (NO WRITE)
** Last Mounted on /usr
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
UNALLOCATED  I=3157519  OWNER=pgsql MODE=100600
SIZE=0 MTIME=Mar 15 22:15 2013
FILE=/local/pgsql/data/base/16431/t79_3703656628

UNEXPECTED SOFT UPDATE INCONSISTENCY

REMOVE? no

UNALLOCATED  I=7301569  OWNER=pgsql MODE=100600
SIZE=0 MTIME=Mar 15 22:15 2013
FILE=/local/pgsql/data/base/2969955511/t109_3703656671

UNEXPECTED SOFT UPDATE INCONSISTENCY

REMOVE? no

** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
UNREF FILE I=3156555  OWNER=pgsql MODE=100600
SIZE=11100363 MTIME=Mar 15 20:42 2013
CLEAR? no

UNREF FILE I=3156714  OWNER=pgsql MODE=100600
SIZE=11126220 MTIME=Mar 15 20:08 2013
CLEAR? no

UNREF FILE I=3157415  OWNER=pgsql MODE=100600
SIZE=11101546 MTIME=Mar 15 20:34 2013
CLEAR? no

UNREF FILE I=3157512  OWNER=pgsql MODE=100600
SIZE=11100870 MTIME=Mar 15 20:39 2013
CLEAR? no

UNREF FILE I=3157518  OWNER=pgsql MODE=100600
SIZE=11100194 MTIME=Mar 15 20:59 2013
CLEAR? no

UNREF FILE I=3157520  OWNER=pgsql MODE=100600
SIZE=11098673 MTIME=Mar 15 21:58 2013
CLEAR? no

UNREF FILE I=3157544  OWNER=pgsql MODE=100600
SIZE=11107123 MTIME=Mar 15 21:13 2013
CLEAR? no

UNREF FILE I=3157547  OWNER=pgsql MODE=100600
SIZE=11110672 MTIME=Mar 15 21:54 2013
CLEAR? no

UNREF FILE I=3157554  OWNER=pgsql MODE=100600
SIZE=11111686 MTIME=Mar 15 22:12 2013
CLEAR? no

LINK COUNT FILE I=3157590  OWNER=pgsql MODE=0
SIZE=0 MTIME=Mar 15 22:15 2013  COUNT 0 SHOULD BE -1
ADJUST? no

UNREF FILE I=3157596  OWNER=pgsql MODE=100600
SIZE=11107968 MTIME=Mar 15 20:48 2013
CLEAR? no

UNREF FILE I=3157607  OWNER=pgsql MODE=100600
SIZE=11093096 MTIME=Mar 15 21:23 2013
CLEAR? no

LINK COUNT FILE I=7301564  OWNER=pgsql MODE=0
SIZE=0 MTIME=Mar 15 22:15 2013  COUNT 0 SHOULD BE -2
ADJUST? no

UNREF FILE  I=8485378  OWNER=pgsql MODE=100600
SIZE=0 MTIME=Mar 15 22:15 2013
RECONNECT? no


CLEAR? no

** Phase 5 - Check Cyl groups
SUMMARY INFORMATION BAD
SALVAGE? no

ALLOCATED FRAGS 3416608-3416735 MARKED FREE
BLK(S) MISSING IN BIT MAPS
SALVAGE? no

FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? no

278680 files, 2516552 used, 52158314 free (72842 frags, 6510684 blocks,
0.1% fragmentation)
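
To put a number on the space held by unreferenced files, the fsck output above can be summed by script. This is a minimal sketch that assumes the two-line record layout shown in this message (an `UNREF FILE` line followed by a `SIZE=` line); the function name is hypothetical. Unlike the awk/cut/perl pipeline in the original report, which adds up every `SIZE=` value, it counts only `UNREF FILE` records.

```python
import re

# Matches the SIZE=<bytes> field at the start of the line after "UNREF FILE".
SIZE_RE = re.compile(r"^SIZE=(\d+)\b")


def sum_unref_sizes(fsck_output):
    """Return the total size in bytes of all UNREF FILE records."""
    total = 0
    in_unref = False
    for raw in fsck_output.splitlines():
        line = raw.strip()
        if line.startswith("UNREF FILE"):
            in_unref = True
            continue
        if in_unref:
            match = SIZE_RE.match(line)
            if match:
                total += int(match.group(1))
            in_unref = False
    return total


sample = """\
UNREF FILE I=3156555  OWNER=pgsql MODE=100600
SIZE=11100363 MTIME=Mar 15 20:42 2013
CLEAR? no

UNREF FILE I=3156714  OWNER=pgsql MODE=100600
SIZE=11126220 MTIME=Mar 15 20:08 2013
CLEAR? no

LINK COUNT FILE I=3157590  OWNER=pgsql MODE=0
SIZE=0 MTIME=Mar 15 22:15 2013  COUNT 0 SHOULD BE -1
ADJUST? no
"""

print(sum_unref_sizes(sample))  # 22226583
```

Feeding it the saved output of a read-only `fsck /usr` run gives the leaked total in bytes.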


FreeBSD 9.1-RELEASE, amd64, GENERIC kernel.

Any ideas?

Best regards,
Palle

> 
> ----- Original Message:
> 
> Date: Wed, 13 Mar 2013 11:23:13 +0100
> From: Palle Girgensohn <girgen@freebsd.org>
> To: freebsd-fs@freebsd.org
> Subject: leaking lots of unreferenced inodes (pg_xlog files?), maybe
> after moving tables and indexes to tablespace on different volume
> 
> Hi!
> 
> Running postgresql-9.2.2 on FreeBSD 9.1 amd64 using vanilla ufs file
> system.
> 
> I have the postgresql base/ on the /usr disk, and a separate volume
> /opt where the default tablespace resides. This means that the amount
> of data on the /usr disk should be stable. This is not the case; the
> disk usage grows linearly (it seems to leave many inodes
> unreferenced).
> 
> The discrepancy between df and du is now huge:
> 
> # du -sxh /usr; df -h /usr
> 4,6G	/usr
> Filesystem     Size    Used   Avail Capacity  Mounted on
> /dev/da0s1f    104G     88G    8.0G    92%    /usr
> 
> 4,6G vs 88GB, that must be more than a rounding error?
> 
> Strange thing is I cannot find any open files among the missing.
> 
> # lsof /usr| awk '{print $9}'|xargs ls -l > /dev/null
> 
> returns no errors (a missing file would render an error with ls). If
> there were open files not referenced in any directory, they should
> be found.
> 
> Next thing is fsck, and yes, there are plenty of unreferenced files.
> 
> I ran fsck while the system was running (i.e. read-only) to get a grip
> on the amount of lost inodes:
> 
> fsck /usr | awk '{print $1}'|cut -f 2 -d=| perl -e '$i = 0; while
> (<>) { $i += $_;}; print $i / 1024 / 1024; print "\n";' 
> 85223.3530330658
> 
> ~85 GB gone, that's 80% of the disk, and it accounts for all the
> missing space.
> 
> The MTIMEs of the inodes are pretty evenly spread over time since the
> machine was updated to FreeBSD 9.1, rebooted, and PostgreSQL was
> updated to 9.2. All was done at the same time, so I can't really tell
> what's to blame, but this is the only server, out of a dozen that
> were updated to exactly the same versions, that has this problem.
> All other servers have their /usr disk usage stable (since all data
> resides on a separate tablespace).
> 
> The unreferenced inodes are almost exclusively around 16 MB in size,
> so they most certainly all are postgresql pg_xlog files. This means
> all files are lost from the same portion of code in the database
> engine.
> 
> How could it possibly be able to leave unreferenced inodes around
> like this at such a scale? Is the culprit a combination of postgresql
> and file system code? Both were updated.
> 
> pg_xlog checkpoints seem to start approximately every five minutes,
> each taking about 150 seconds:
> 
> Mar 13 00:39:08 dbserver postgres[5298]: [48-1] db=,user= LOG: checkpoint starting: time
> Mar 13 00:41:38 dbserver postgres[5298]: [49-1] db=,user= LOG: checkpoint complete: wrote 2542 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=149.667 s, sync=0.101 s, total=149.770 s; sync files=628, longest=0.021 s, average=0.000 s
> Mar 13 00:44:08 dbserver postgres[5298]: [50-1] db=,user= LOG: checkpoint starting: time
> Mar 13 00:46:38 dbserver postgres[5298]: [51-1] db=,user= LOG: checkpoint complete: wrote 3996 buffers (0.4%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=149.438 s, sync=0.111 s, total=149.551 s; sync files=823, longest=0.006 s, average=0.000 s
> Mar 13 00:49:08 dbserver postgres[5298]: [52-1] db=,user= LOG: checkpoint starting: time
> Mar 13 00:51:38 dbserver postgres[5298]: [53-1] db=,user= LOG: checkpoint complete: wrote 13736 buffers (1.4%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=149.958 s, sync=0.311 s, total=150.271 s; sync files=1335, longest=0.079 s, average=0.000 s
> Mar 13 00:54:08 dbserver postgres[5298]: [54-1] db=,user= LOG: checkpoint starting: time
> Mar 13 00:56:38 dbserver postgres[5298]: [55-1] db=,user= LOG: checkpoint complete: wrote 14638 buffers (1.5%); 0 transaction log file(s) added, 0 removed, 17 recycled; write=149.330 s, sync=0.271 s, total=149.603 s; sync files=1363, longest=0.017 s, average=0.000 s
> Mar 13 00:59:08 dbserver postgres[5298]: [56-1] db=,user= LOG: checkpoint starting: time
> Mar 13 01:01:38 dbserver postgres[5298]: [57-1] db=,user= LOG: checkpoint complete: wrote 8035 buffers (0.8%); 0 transaction log file(s) added, 0 removed, 21 recycled; write=149.285 s, sync=0.146 s, total=149.433 s; sync files=1160, longest=0.003 s, average=0.000 s
> Mar 13 01:04:08 dbserver postgres[5298]: [58-1] db=,user= LOG: checkpoint starting: time
> Mar 13 01:06:37 dbserver postgres[5298]: [59-1] db=,user= LOG: checkpoint complete: wrote 2156 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=149.402 s, sync=0.057 s, total=149.461 s; sync files=610, longest=0.000 s, average=0.000 s
> Mar 13 01:09:08 dbserver postgres[5298]: [60-1] db=,user= LOG: checkpoint starting: time
> 
> 
> I'm pretty certain that unmounting the file system and running fsck
> will regain the lost space, but will it stop there?
> 
> Stopping postgresql briefly did not help, I tried that. That would
> have helped if the files were open, but they're not. It seems that
> postgresql did the right thing, and FreeBSD failed to unreference the
> files.
> 
> The server has about 30 databases and ~127 concurrent connections
> (not all being active simultaneously, though), so it is fair to say
> it is pretty active, but nothing extreme.
> 
> Hardware is HP DL360, using their HT Smart Array P410i.
> 
> Any ideas how to debug this? Or shall I just reboot, fsck, hope the 
> problem will go away, and when it does, forget about it?
> 
> Thanks, Palle
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJRQ5JDAAoJEIhV+7FrxBJDuK8H/3gvtaZyKqNxbrQ+JkgGooit
AVs5i38j6ZjKoYOPTNrqD5zsqk76NE5hUmJ2HAj/EkEt5CnPkR0trVN/s95NQu1S
IY+iOlng9ImKHVvIEWKRap0WTeUu7BT2M+e6szOkOOo93xqS7E0U7tfwgkFXgjI2
MUcy7QxFz/Yfjyu7HrYDvJMCmCEL2e5SDRQoPXO/Qs4CRnE16d85nJtFJXuM8EgQ
j8ZZmmphRt9yxxLg6tAlm3Tscf2QqXL8G4ABHSf32dJYuO11/7Glz+svh4m/gj7B
YnlXuqOq7ESBMhwLpQqA78JOWfZiiF8B8aTQVlxm3GtjPWknm4rkK1XljWl8Zi8=
=kIKS
-----END PGP SIGNATURE-----

From owner-freebsd-fs@FreeBSD.ORG  Sat Mar 16 02:03:41 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 24892205;
 Sat, 16 Mar 2013 02:03:41 +0000 (UTC)
 (envelope-from rmacklem@uoguelph.ca)
Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca
 [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 8E80DA50;
 Sat, 16 Mar 2013 02:03:40 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqAEAFrSQ1GDaFvO/2dsb2JhbABDiDG6FYJlgX10gioBAQUjBFIbDgoCAg0ZAlkGLod5sHeSWoEjjT40B4ItgRMDlluRAoMmIIFs
X-IronPort-AV: E=Sophos;i="4.84,855,1355115600"; d="scan'208";a="21484410"
Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca)
 ([131.104.91.206])
 by esa-jnhn.mail.uoguelph.ca with ESMTP; 15 Mar 2013 22:03:39 -0400
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1])
 by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 14223B3F36;
 Fri, 15 Mar 2013 22:03:39 -0400 (EDT)
Date: Fri, 15 Mar 2013 22:03:39 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: John Baldwin <jhb@freebsd.org>
Message-ID: <88927360.3963361.1363399419023.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <201303141444.35740.jhb@freebsd.org>
Subject: Re: Deadlock in the NFS client
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [172.17.91.201]
X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692)
Cc: Rick Macklem <rmacklem@freebsd.org>, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 16 Mar 2013 02:03:41 -0000

John Baldwin wrote:
> On Thursday, March 14, 2013 1:22:39 pm Konstantin Belousov wrote:
> > On Thu, Mar 14, 2013 at 10:57:13AM -0400, John Baldwin wrote:
> > > On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote:
> > > > On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote:
> > > > > John Baldwin wrote:
> > > > > > I ran into a machine that had a deadlock among certain files
> > > > > > on a
> > > > > > given NFS
> > > > > > mount today. I'm not sure how best to resolve it, though it
> > > > > > seems like
> > > > > > perhaps there is a bug with how the pool of nfsiod threads
> > > > > > is managed.
> > > > > > Anyway, more details on the actual hang below. This was on
> > > > > > 8.x with
> > > > > > the
> > > > > > old NFS client, but I don't see anything in HEAD that would
> > > > > > fix this.
> > > > > >
> > > > > > First note that the system was idle so it had dropped down
> > > > > > to only one
> > > > > > nfsiod thread.
> > > > > >
> > > > > Hmm, I see the problem and I'm a bit surprised it doesn't bite
> > > > > more often.
> > > > > It seems to me that this snippet of code from nfs_asyncio()
> > > > > makes too
> > > > > weak an assumption:
> > > > > 	/*
> > > > > 	 * If none are free, we may already have an iod working on
> > > > > 	 this mount
> > > > > 	 * point. If so, it will process our request.
> > > > > 	 */
> > > > > 	if (!gotiod) {
> > > > > 		if (nmp->nm_bufqiods > 0) {
> > > > > 			NFS_DPF(ASYNCIO,
> > > > > 		("nfs_asyncio: %d iods are already processing mount %p\n",
> > > > > 				 nmp->nm_bufqiods, nmp));
> > > > > 			gotiod = TRUE;
> > > > > 		}
> > > > > 	}
> > > > > It assumes that, since an nfsiod thread is processing some
> > > > > buffer for the
> > > > > mount, it will become available to do this one, which isn't
> > > > > true for your
> > > > > deadlock.
> > > > >
> > > > > I think the simple fix would be to recode nfs_asyncio() so
> > > > > that
> > > > > it only returns 0 if it finds an AVAILABLE nfsiod thread that
> > > > > it
> > > > > has assigned to do the I/O, getting rid of the above. The
> > > > > problem
> > > > > with doing this is that it may result in a lot more
> > > > > synchronous I/O
> > > > > (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe
> > > > > more
> > > > > synchronous I/O could be avoided by allowing nfs_asyncio() to
> > > > > create a
> > > > > new thread even if the total is above nfs_iodmax. (I think
> > > > > this would
> > > > > require the fixed array to be replaced with a linked list and
> > > > > might
> > > > > result in a large number of nfsiod threads.) Maybe just having
> > > > > a large
> > > > > nfs_iodmax would be an adequate compromise?
> > > > >
> > > > > Does having a large # of nfsiod threads cause any serious
> > > > > problem for
> > > > > most systems these days?
> > > > >
> > > > > I'd be tempted to recode nfs_asyncio() as above and then,
> > > > > instead
> > > > > of nfs_iodmin and nfs_iodmax, I'd simply have: - a fixed
> > > > > number of
> > > > > nfsiod threads (this could be a tunable, with the
> > > > > understanding that
> > > > > it should be large for good performance)
> > > > >
> > > >
> > > > I do not see how this would solve the deadlock itself. The
> > > > proposal would
> > > > only allow the system to survive slightly longer after the deadlock
> > > > appeared.
> > > > And, I think that allowing an unbounded number of nfsiod threads
> > > > is also
> > > > fatal.
> > > >
> > > > The issue there is the LOR between buffer lock and vnode lock.
> > > > Buffer lock
> > > > always must come after the vnode lock. The problematic nfsiod
> > > > thread, which
> > > > locks the vnode, violates this rule, because despite the
> > > > LK_KERNPROC
> > > > ownership of the buffer lock, it is the thread which de facto
> > > > owns the
> > > > buffer (only the thread can unlock it).
> > > >
> > > > A possible solution would be to pass LK_NOWAIT to nfs_nget()
> > > > from the
> > > > nfs_readdirplusrpc(). From my reading of the code, nfs_nget()
> > > > should
> > > > be capable of correctly handling the lock failure. And EBUSY
> > > > would
> > > > result in doit = 0, which should be fine too.
> > > >
> > > > It is possible that EBUSY should be reset to 0, though.
> > >
> > > Yes, thinking about this more, I do think the right answer is for
> > > readdirplus to do this. The only question I have is if it should
> > > do
> > > this always, or if it should do this only from the nfsiod thread.
> > > I
> > > believe you can't get this in the non-nfsiod case.
> >
> > I agree that it looks as if the workaround is only needed for the
> > nfsiod thread.
> > On the other hand, it is not immediately obvious how to detect that
> > the current thread is an nfsiod daemon. Probably a thread flag should be
> > set.
> 
> OTOH, updating the attributes from readdir+ is only an optimization
> anyway, so
> just having it always do LK_NOWAIT is probably ok (and simple).
> Currently I'm
> trying to develop a test case to provoke this so I can test the fix,
> but no
> luck on that yet.
> 
> --
> John Baldwin
Just fyi, ignore my comment that the second version of the patch (the
one that stops the nfsiod threads from doing readdirplus) ran faster. It
was just that when I tested the 2nd patch, the server's caches were
primed. Oops.

However, so far the minimal testing I've done has been essentially
performance neutral between the unpatched and patched versions.

Hopefully John has a convenient way to do some performance testing,
since I won't be able to do much until the end of April.

rick
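
For readers following the thread, the LK_NOWAIT idea quoted above (treat the readdirplus attribute update as optional and skip it when the vnode lock is busy) can be illustrated in user space. This is only an analogy under stated assumptions, not the kernel change: a `threading.Lock` stands in for the vnode lock, and `refresh_attrs_nowait` is a hypothetical name.

```python
import threading


def refresh_attrs_nowait(vnode_lock, update):
    """Run update() under vnode_lock only if the lock is free right now.

    Returns True if the update ran, False if it was skipped. Skipping is
    the user-space analogue of treating EBUSY as "do not update this
    entry" instead of sleeping on a lock while holding another resource.
    """
    if not vnode_lock.acquire(blocking=False):
        return False  # lock busy: give up rather than wait
    try:
        update()
        return True
    finally:
        vnode_lock.release()
```

Because the caller never sleeps on the second lock, it cannot take part in a wait cycle; the cost is that some attribute updates are occasionally missed.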

From owner-freebsd-fs@FreeBSD.ORG  Sat Mar 16 04:01:34 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id E23D6477;
 Sat, 16 Mar 2013 04:01:34 +0000 (UTC)
 (envelope-from mckusick@mckusick.com)
Received: from chez.mckusick.com (chez.mckusick.com
 [IPv6:2001:5a8:4:7e72:4a5b:39ff:fe12:452])
 by mx1.freebsd.org (Postfix) with ESMTP id BC714E4F;
 Sat, 16 Mar 2013 04:01:34 +0000 (UTC)
Received: from chez.mckusick.com (localhost [127.0.0.1])
 by chez.mckusick.com (8.14.3/8.14.3) with ESMTP id r2G41Um7026132;
 Fri, 15 Mar 2013 21:01:30 -0700 (PDT)
 (envelope-from mckusick@chez.mckusick.com)
Message-Id: <201303160401.r2G41Um7026132@chez.mckusick.com>
To: Palle Girgensohn <girgen@FreeBSD.org>
Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?),
 maybe after moving tables and indexes to tablespace on different
 volume 
In-reply-to: <51439243.5020604@FreeBSD.org> 
Date: Fri, 15 Mar 2013 21:01:30 -0700
From: Kirk McKusick <mckusick@mckusick.com>
X-Spam-Status: No, score=0.0 required=5.0 tests=MISSING_MID, UNPARSEABLE_RELAY
 autolearn=failed version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on chez.mckusick.com
Cc: freebsd-fs@FreeBSD.org, Jeff Roberson <jroberson@jroberson.net>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 16 Mar 2013 04:01:34 -0000

I don't know how, but somehow something is holding references to
the removed files causing them to fail to be reclaimed.

Could you run your system for a while to build up a new set of
these files, then run a script with the `df -ih' as before. Then
run `vmstat -m', `sysctl debug', and `fstat -f /usr' both before
and after doing the umount/mount. Hopefully that will give us
some more clues as to what is happening.
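
A minimal sketch of how the before/after snapshots requested above could be captured to files for later comparison. The helper name `capture`, the output file naming, and the `DIAGNOSTICS` table are assumptions for illustration; the commands themselves are the ones named in this message.

```python
import subprocess
from pathlib import Path

# Commands named in the request above, keyed by a short file-name tag.
DIAGNOSTICS = {
    "df": ["df", "-ih", "/usr"],
    "vmstat": ["vmstat", "-m"],
    "sysctl": ["sysctl", "debug"],
    "fstat": ["fstat", "-f", "/usr"],
}


def capture(label, commands, outdir):
    """Run each command and save its output as <outdir>/<label>-<tag>.txt."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    written = []
    for tag, argv in commands.items():
        result = subprocess.run(argv, capture_output=True, text=True)
        path = outdir / f"{label}-{tag}.txt"
        path.write_text(result.stdout + result.stderr)
        written.append(path)
    return written
```

Running `capture("before", DIAGNOSTICS, "snapshots")`, then the umount/mount, then `capture("after", DIAGNOSTICS, "snapshots")` leaves pairs of files that can be compared with diff.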

And Jeff, if you have any ideas do speak up :-)

	Kirk McKusick

From owner-freebsd-fs@FreeBSD.ORG  Sat Mar 16 19:48:38 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 3E8BA78D
 for <freebsd-fs@FreeBSD.org>; Sat, 16 Mar 2013 19:48:38 +0000 (UTC)
 (envelope-from marck@rinet.ru)
Received: from woozle.rinet.ru (woozle.rinet.ru [195.54.192.68])
 by mx1.freebsd.org (Postfix) with ESMTP id CC5D1FB9
 for <freebsd-fs@FreeBSD.org>; Sat, 16 Mar 2013 19:48:37 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
 by woozle.rinet.ru (8.14.5/8.14.5) with ESMTP id r2GJmQd4009240
 for <freebsd-fs@FreeBSD.org>; Sat, 16 Mar 2013 23:48:26 +0400 (MSK)
 (envelope-from marck@rinet.ru)
Date: Sat, 16 Mar 2013 23:48:26 +0400 (MSK)
From: Dmitry Morozovsky <marck@rinet.ru>
To: freebsd-fs@FreeBSD.org
Subject: HA iSCSI target on ZFS: model
Message-ID: <alpine.BSF.2.00.1303162331420.46383@woozle.rinet.ru>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
X-NCC-RegID: ru.rinet
X-OpenPGP-Key-ID: 6B691B03
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7
 (woozle.rinet.ru [0.0.0.0]); Sat, 16 Mar 2013 23:48:26 +0400 (MSK)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 16 Mar 2013 19:48:38 -0000

Dear colleagues,

I'm currently planning to architect and deploy a test HA iSCSI target using two
FreeBSD hosts, and would like to hear your comments.

What I already did:

- two hosts, booted from internal USB stick, with 4 HDDs and one SSD each
- LACP-based laggs, with the links connected to different halves of
a clustered switch, and an MTU of 9000
- 2 carps (thanks to araujo@'s help; -current, i.e. an interface property,
not a clone) holding shared addresses

- disk layout such as

root@cthulhu4:/usr/local/etc/istgt# gpart show -l
=>        34  1953525101  ada0  GPT  (931G)
          34        2014        - free -  (1M)
        2048  1952448512     1  ct4-0  (931G)
  1952450560     1074575        - free -  (524M)

=>        34  1953522988  ada1  GPT  (931G)
          34        2014        - free -  (1M)
        2048  1952448512     1  ct4-1  (931G)
  1952450560     1072462        - free -  (523M)

=>        34  1953525101  ada2  GPT  (931G)
          34        2014        - free -  (1M)
        2048  1952448512     1  ct4-2  (931G)
  1952450560     1074575        - free -  (524M)

=>        34  1953525101  ada3  GPT  (931G)
          34        2014        - free -  (1M)
        2048  1952448512     1  ct4-3  (931G)
  1952450560     1074575        - free -  (524M)

=>       34  234441581  ada4  GPT  (111G)
         34       2014        - free -  (1M)
       2048    2097152     1  ct3-zil4  (1.0G)
    2099200    2097152     2  ct4-zil4  (1.0G)
    4196352  230244352     3  ct4-cache  (109G)
  234440704        911        - free -  (455k)

=>      0  4005886  da0  BSD  (1.9G)
        0       16       - free -  (8.0k)
       16  4005870    1  (null)  (1.9G)

(da0 is USB, ada0-ada3 HDDs, ada4 SSD)

- 2 hast sets such as

root@cthulhu4:/usr/local/etc/istgt# hastctl status
Name    Status   Role           Components
d0      complete secondary      /dev/ada0p1     cthulhu3
d1      complete secondary      /dev/ada1p1     cthulhu3
d2      complete primary        /dev/ada2p1     cthulhu3
d3      complete primary        /dev/ada3p1     cthulhu3
zil3    complete secondary      /dev/ada4p1     cthulhu3
zil4    complete primary        /dev/ada4p2     cthulhu3

- 2 ZFS setups like
root@cthulhu4:/usr/local/etc/istgt# zpool status
  pool: ct4
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Thu Mar 14 17:58:48 2013
config:

        NAME             STATE     READ WRITE CKSUM
        ct4              ONLINE       0     0     0
          hast/d2        ONLINE       0     0     0
          hast/d3        ONLINE       0     0     0
        logs
          hast/zil4      ONLINE       0     0     0
        cache
          gpt/ct4-cache  ONLINE       0     0     0


Now, it's time to create exportable entities. I'm thinking of creating thin
(non-preallocated) ZFS volumes in the pools, and sharing them via istgt.

What zfs properties would be appropriate for this?
I'm thinking at least about volblocksize=4k (main usage will be vSphere), but
I'm not sure about it.  And, more importantly, what about the sync property?
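
A minimal sketch of the thin-volume creation described above, expressed as a helper that only builds the `zfs create` argument list (sparse volume via `-s`, size via `-V`, properties via `-o`). The helper name, the volume name `vm0`, and the 100G size are hypothetical; whether 4k is the right volblocksize and which sync setting to use remain the open questions asked in this message.

```python
def zvol_create_argv(pool, name, size, volblocksize="4k", sync=None):
    """Build the argv for creating a sparse (thin) ZFS volume.

    sync is left unset by default so the pool's inherited value applies;
    pass "standard", "always" or "disabled" to set it explicitly.
    """
    argv = ["zfs", "create", "-s", "-V", size,
            "-o", f"volblocksize={volblocksize}"]
    if sync is not None:
        argv += ["-o", f"sync={sync}"]
    argv.append(f"{pool}/{name}")
    return argv


# Example: a 100G thin volume in the ct4 pool shown above.
print(" ".join(zvol_create_argv("ct4", "vm0", "100G")))
```

The resulting list can be passed to `subprocess.run` on the host that currently holds the pool.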

Did I miss something obvious?

(And yes, the supporting scripts that check whether the paired resource is
alive are all still to be written...)

Thanks in advance!

-- 
Sincerely,
D.Marck                                     [DM5020, MCK-RIPE, DM3-RIPN]
[ FreeBSD committer:                                 marck@FreeBSD.org ]
------------------------------------------------------------------------
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- marck@rinet.ru ***
------------------------------------------------------------------------