Date: Thu, 1 Apr 2010 21:28:29 +0300
From: Gleb Kurtsou
To: freebsd-arch@freebsd.org
Cc: Alexander Kabaev
Subject: Namecache improvements (Summer of Code proposal)
Message-ID: <20100401182829.GA2306@tops>

I plan to submit the following idea as a Summer of Code project. I'd like it to be treated as a research project rather than a ready-for-production solution. Comments and suggestions are welcome.

Intro.

First of all, there can be several paths to a vnode: hard links, nullfs mounts, snapshot directories, etc.

There are several traditional ways of solving the problem of getting a full path from the cache:

Pass granular updates to the namecache on file system changes, or cache the full path for each vnode. This approach is taken by Solaris, XNU and many others. In FreeBSD, the cache for the entire directory is purged on file remove or rename.

The namecache in DragonFly keeps references to all vnodes from the path root to the given vnode. Linux does a similar thing, but its VFS is very different. This approach doesn't work well enough for network filesystems, though; tweaks are needed to make NFS usable. AFAIK Linux also had problems with NFS due to its VFS design (long ago); there is a d_validate call used to verify that a "cached" entry (dentry) is valid. Such problems arise because of the stateless and nameless operation of NFS.

Random thoughts to make my intent clear.

I do not like the idea of keeping all vnodes from the path root around: it serves no good purpose and creates another way of having a vnode in a semi-valid state. Keeping the cache in sync by updating it from outside of the filesystem is hard, unreliable and error prone. Basically, there is no reliable way to find out whether a path has changed for a network filesystem or a nullfs mount.

Our VFS keeps names and vnodes separate, and the same happens at the filesystem level: one inode, one vnode, several names (just a hard link). By design there is no easy way of doing a reverse lookup (getting a name for an inode/vnode).

Proposal.
Generalize dirhash into a generic directory cache (dircache) updated by the filesystem itself. It's much like dirhash in UFS or dircache in pefs (the latter is rather different from dirhash; it's more like a cache for a network filesystem). A dircache entry ("strong entry") contains the inode number, a mount struct pointer, internal filesystem data if needed, and, in the general case, no vnode reference. The filesystem keeps entries in sync on create/rename/remove/etc. By 'no vnode reference' I mean that a dircache entry may exist without a vnode reference attached to it. On VOP_LOOKUP, VOP_CREATE and VFS_VGET the cache entry is updated with a vnode reference. The vnode reference is removed on VOP_RECLAIM.

Unify the current namecache and dircache. There are strong and weak entries in the cache. Strong entries are added by dircache; all entries added by oldcache (the current cache_* functions) are weak (inode number = 0). If there's no inode number for an entry, it's weak. A snapshot directory on a filesystem using dircache is likely to be weak. Nullfs and NFS entries are all weak. Only dircache entries may be strong. There are no weak entries without a vnode reference in the cache at any time.

Strong nodes are permanent: they remain in the cache as long as there are references to them. The cache remains consistent, i.e. if the cache grows large, only leaf strong nodes that do not reference a vnode are removed. Entries are removed not one by one but all entries of a directory at once, so that the cache remains valid. It also means that the entries forming the full path to the mount point are always in the cache. Traditional weak entries behave the same way as in the current cache (removed with cache_purge(vp)). A filesystem uses either dircache or oldcache, but not both.

Strong nodes form a directed acyclic graph supporting reliable full path resolution (among strong nodes only). This needs more thinking; the componentname struct should be extended with support for "namespaces". Resolving weak entries remains the same; there is not much one can do without knowing that the cache is valid. I.e. rely on the namecache to get a full path only in cases where it's known to be reliable, and do not try to make it reliable with various tweaks.

A vnode won't contain the v_cache_src/v_cache_dst lists, just a reference to a cache entry. Rewriting existing filesystems to support dircache won't be necessary; oldcache should remain fully functional. Dircache is intended only for local filesystems, but it can be extended with a validate operation to become useful for cases like nullfs.

An additional goal is to make it possible to get a vnode for a *strong* dircache entry by traversing the graph and, if the vnode doesn't exist, calling VFS_VGET(inode number) for the entry, thus giving a considerable performance improvement by avoiding VOP_LOOKUP calls for all elements in the path. Access check issues are not hard to solve; there is VOP_GETACL. I do not mean that a vnode can be created for an arbitrary dircache entry: the full path of vnodes has to be created at least once, ACLs cached, etc. But even ignoring this goal, filesystem-level data, namely the inode number, is still necessary for dircache to remain connected with the filesystem.

In other words, namecache+dircache is a per-filesystem metadata cache; it's also aware of vnodes if there are any attached to entries. The namecache is no longer a VFS subsystem but moves in between VFS and filesystem code. Anyway, there is internal name handling in filesystems; in the UFS case there is dirhash, which partially does what dircache should. The idea is to expose it to upper levels and use it.

Thanks,
Gleb.
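
P.S. To make the strong/weak rules from the proposal a bit more concrete, here is a rough userland sketch in C. It is only an illustration under stated assumptions, not kernel code: struct dc_entry and the dc_* functions are hypothetical names, the vnode is an opaque pointer, and each entry has a single parent pointer (a tree), whereas the real structure would have to be a DAG to cover hard links. It shows three things: an entry with inode number 0 is weak, a weak entry must always carry a vnode reference, and only strong leaf entries without a vnode are candidates for eviction. Full path resolution succeeds only if every component up to the mount root is strong.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cache entry; field names are illustrative assumptions. */
struct dc_entry {
	struct dc_entry	*parent;	/* NULL for the mount root */
	const char	*name;		/* path component name */
	uint64_t	 ino;		/* 0 means weak, nonzero means strong */
	void		*vnode;		/* optional for strong entries */
	unsigned	 nchildren;	/* number of cached child entries */
};

/* An entry is strong iff the filesystem supplied an inode number. */
static bool
dc_is_strong(const struct dc_entry *e)
{
	return (e->ino != 0);
}

/* Invariant: a weak entry never exists without a vnode reference. */
static bool
dc_is_valid(const struct dc_entry *e)
{
	return (dc_is_strong(e) || e->vnode != NULL);
}

/* Only strong leaves with no vnode attached may be dropped. */
static bool
dc_can_evict(const struct dc_entry *e)
{
	return (dc_is_strong(e) && e->vnode == NULL && e->nchildren == 0);
}

/*
 * Build the path relative to the mount root by walking parent links.
 * Returns 0 on success, -1 if any component is weak or buf is too small.
 */
static int
dc_fullpath(const struct dc_entry *e, char *buf, size_t len)
{
	size_t n, pos = len;

	if (e == NULL || len < 2)
		return (-1);
	buf[--pos] = '\0';
	for (; e->parent != NULL; e = e->parent) {
		if (!dc_is_strong(e))
			return (-1);
		n = strlen(e->name);
		if (pos < n + 1)	/* room for name plus '/' */
			return (-1);
		pos -= n;
		memcpy(buf + pos, e->name, n);
		buf[--pos] = '/';
	}
	if (!dc_is_strong(e))		/* the root must be strong too */
		return (-1);
	if (buf[pos] == '\0')		/* e was the root itself */
		buf[--pos] = '/';
	memmove(buf, buf + pos, len - pos);
	return (0);
}
```

With a strong root, a strong directory "usr" and a strong leaf "motd", dc_fullpath() on the leaf yields "/usr/motd", while the same call on a weak (inode 0) entry fails. The sketch leaves out the rule that all entries of a directory are removed together; dc_can_evict() only tests a single entry.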