Date:      Sat, 25 Apr 2015 08:33:42 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Julian Elischer <julian@freebsd.org>
Cc:        freebsd-current@freebsd.org, John Baldwin <jhb@freebsd.org>,  Jilles Tjoelker <jilles@stack.nl>
Subject:   Re: readdir/telldir/seekdir problem (i think)
Message-ID:  <1781764425.25665403.1429965222745.JavaMail.root@uoguelph.ca>
In-Reply-To: <553B0326.1090306@freebsd.org>

Julian Elischer wrote:
> On 4/25/15 9:39 AM, Rick Macklem wrote:
> > Jilles Tjoelker wrote:
> >> On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:
> >>> Yes, this isn't at all safe.  There's no guarantee whatsoever that
> >>> the offset on the directory fd that isn't something returned by
> >>> getdirentries has any meaning.  In particular, the size of the
> >>> directory entry in a random filesystem might be a different size
> >>> than the structure returned by getdirentries (since it converts
> >>> things into a FS-independent format).
> >>> This might work for UFS by accident, but this is probably why ZFS
> >>> doesn't work.
> >>> However, this might be properly fixed by the thing that ino64 is
> >>> doing where each directory entry returned by getdirentries gives
> >>> you a seek offset that you _can_ directly seek to (as opposed to
> >>> seeking to the start of the block and then walking forward N
> >>> entries until you get an inter-block entry that is the same).
> >> The ino64 branch only reserves space for d_off and does not use it in
> >> any way. This is appropriate since actually using d_off is a major
> >> feature addition.
> >>
> > Well, at some point ino64 will need to define a new getdirentries(2)
> > syscall and I believe this new syscall can have different/additional
> > arguments.
> yes, posix only specifies 2 mandatory fields (d_ino and d_name) and
> everything else is implementation dependent.
> > I'd suggest that the new getdirentries(2) syscall should return a
> > flag to indicate that the underlying file system is filling in
> > d_off.
> > Then the libc functions can use d_off if it is available.
> > (They will still need to "work" at least as well as they do now if
> >   the file system doesn't support d_off. The old getdirentries(2)
> >   syscall will be returning the old/current "struct dirent" which
> >   doesn't have the field anyhow.)
> >
> > Another bit of fun is that the argument for seekdir()/telldir() is a
> > long and ends up 32bits for some arches. d_off is 64bits, since that
> > is what some file systems require.
> what does linux use?
> ------
>        In glibc up to version 2.1.1, the return type of telldir() was
>        off_t.  POSIX.1-2001 specifies long, and this is the type used
>        since glibc 2.1.2.
> 
> also from the linux man page: this is interesting..
> 
> --------
>         In early filesystems, the value returned by telldir() was a
>         simple file offset within a directory.  Modern filesystems use
>         tree or hash structures, rather than flat tables, to represent
>         directories.  On such filesystems, the value returned by
>         telldir() (and used internally by readdir(3)) is a "cookie"
>         that is used by the implementation to derive a position within
>         a directory.  Application programs should treat this strictly
>         as an opaque value, making no assumptions about its contents.
> ------
> but glibc uses the contents in a nonopaque (and possibly wrong) way
> itself in seekdir.
> (not following their own advice.)
> 
I believe that most of the FreeBSD file systems except UFS and ZFS just
copy the fields of their internal directory structure to fields in
"struct dirent", filling blocks sequentially. (Actually, I only took a
quick look, but ZFS might also be this way.)
As such, the "offsets" for FreeBSD are byte offsets into these "logical directory"
blocks.
The problem is (as already discussed) that there is no way to predict
how these will change for a given file system when entries are removed/added.
(I think the only way to "know" what the modified "logical directory" looks
 like is to read it again from the beginning, so that all the directory entries
 go through the conversion to logical again.)
UFS and the NFS client ensure that no "struct dirent" crosses a 512-byte block
boundary. I don't think the other file systems do this. I mention this, since
the libc functions can't assume the UFS behaviour for this.
(At one time, UFS just "consumed" removed entries into the preceding "struct direct"
 or set it invalid, if it was the first entry in a 512-byte block. This implied that
 the byte offsets (logical == physical) didn't change for subsequent entries upon
 a removal. It sounds like UFS is no longer doing this, from one of your posts?)
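
To make the "logical directory" offsets concrete, here is a minimal
sketch of assigning byte offsets to variable-length records so that none
crosses a 512-byte boundary. It assumes nothing about the real UFS or NFS
client code: pack_offsets() and DIRBLK are hypothetical names, and the
padding is simply skipped rather than folded into the previous record.

```c
#include <assert.h>
#include <stddef.h>

#define DIRBLK 512  /* block size mentioned above; name is hypothetical */

/*
 * Hypothetical helper: give each record a byte offset such that no record
 * straddles a DIRBLK boundary.  Returns the total number of bytes used.
 */
size_t pack_offsets(const size_t *reclen, size_t n, size_t *off)
{
    size_t pos = 0;

    for (size_t i = 0; i < n; i++) {
        assert(reclen[i] > 0 && reclen[i] <= DIRBLK);
        size_t room = DIRBLK - pos % DIRBLK;  /* bytes left in this block */
        if (reclen[i] > room)
            pos += room;                      /* pad to the next block */
        off[i] = pos;
        pos += reclen[i];
    }
    return pos;
}
```

With three 200-byte records the offsets come out as 0, 200 and 512. If the
first record is removed and the rest are packed again, the record that was
at 512 lands at 200, which is the kind of shift that makes a saved offset
meaningless after the directory changes.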

I am curious to see what glibc does, since I had assumed it just read the
entire directory at opendir/first-readdir.

rick
ps: This is what I recall from fooling with "struct dirent" a few months ago
    and I'm getting old, so it may all be wrong;-)
> 
> > Maybe the library code can only use d_off if it is a 64bit arch and
> > the file system is filling it in. (Or maybe the library can keep track
> > of 32<->64bit mappings for the offsets. I haven't looked at the libc
> > functions for a while, so I can't remember what they keep track of.)
> 
> one supposes a 32 bit system would not have such large file systems on
> it.. (maybe?)
> >
> > rick
> >
> >> A proper d_off would still be useful even if UFS's readdir keeps
> >> masking off the offset so a directory read always starts at the
> >> beginning of a 512-byte directory block, since this allows more
> >> distinct offset values than safely using getdirentries()'s *basep.
> >> With d_off, one outer loop must read at least one directory block to
> >> avoid spinning indefinitely, while using getdirentries()'s *basep
> >> requires reading the whole getdirentries() buffer.
> >>
> >> Some Linux filesystems go further and provide a unique d_off for each
> >> entry.
> >>
> >> Another idea would be to store the last d_ino instead of dd_loc into
> >> the struct ddloc. On seekdir(), this would seek to loc_seek as before
> >> and skip entries until that d_ino is found, or to the start of the
> >> buffer if not found (and possibly return some entries again that
> >> should not be returned, but Samba copes with that).
> >>
> >> --
> >> Jilles Tjoelker
> >> _______________________________________________
> >> freebsd-current@freebsd.org mailing list
> >> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> >> To unsubscribe, send any mail to
> >> "freebsd-current-unsubscribe@freebsd.org"
> >>
> >
> 
> 
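
Jilles's quoted idea of remembering the last d_ino rather than a buffer
offset can be sketched in memory as below. This is not the libc code:
struct entry, struct saved_pos and resume_index() are hypothetical
stand-ins, and the sketch assumes the buffer has already been re-read
starting at the saved seek position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a converted directory entry (hypothetical). */
struct entry {
    uint64_t    ino;
    const char *name;
};

/* Hypothetical saved position: the inode number last handed out. */
struct saved_pos {
    uint64_t last_ino;
};

/*
 * Return the index of the next entry to hand out from buf.  Entries are
 * skipped until the remembered inode number is found; if it is gone, the
 * scan restarts at the beginning of the buffer, which may return some
 * entries a second time.
 */
size_t resume_index(const struct entry *buf, size_t n, struct saved_pos pos)
{
    assert(buf != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (buf[i].ino == pos.last_ino)
            return i + 1;       /* resume just after the last one seen */
    return 0;                   /* not found: start of the buffer */
}
```

If the remembered entry is still present, reading resumes right after it
even when earlier entries in the block were removed. If the remembered
entry itself was removed, the caller sees repeats, which is the trade-off
noted in the quoted text.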
