Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 18 Jul 2014 18:59:59 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Mateusz Guzik <mjguzik@gmail.com>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: current fd allocation idiom
Message-ID:  <20140718155959.GN93733@kib.kiev.ua>
In-Reply-To: <20140718144012.GA7179@dft-labs.eu>
References:  <20140717235538.GA15714@dft-labs.eu> <20140718130629.GJ93733@kib.kiev.ua> <20140718144012.GA7179@dft-labs.eu>

next in thread | previous in thread | raw e-mail | index | archive | help

--whB3aSVTmcGVRwAq
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Jul 18, 2014 at 04:40:12PM +0200, Mateusz Guzik wrote:
> On Fri, Jul 18, 2014 at 04:06:29PM +0300, Konstantin Belousov wrote:
> > On Fri, Jul 18, 2014 at 01:55:38AM +0200, Mateusz Guzik wrote:
> > > The kernel has to deal with concurrent attempts to open, close, dup2 =
and
> > > use the same file descriptor.
> > >=20
> > > I start with stating a rather esoteric bug affecting this area, then I
> > > follow with a short overview of what is happening in general and a
> > > proposal how to change to get rid of the bug and get some additional
> > > enchancements. Interestingly enough turns out Linux is doing pretty m=
uch
> > > the same thing.
> > >=20
> > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D
> > > THE BUG:
> > > /*
> > >  * Initialize the file pointer with the specified properties.
> > >  *
> > >  * The ops are set with release semantics to be certain that the flag=
s, type,
> > >  * and data are visible when ops is.  This is to prevent ops methods =
=66rom being
> > >  * called with bad data.
> > >  */
> > > void
> > > finit(struct file *fp, u_int flag, short type, void *data, struct fil=
eops *ops)
> > > {
> > >         fp->f_data =3D data;
> > >         fp->f_flag =3D flag;
> > >         fp->f_type =3D type;
> > >         atomic_store_rel_ptr((volatile uintptr_t *)&fp->f_ops, (uintp=
tr_t)ops);
> > > }
> > >=20
> > > This itself is fine, but it assumes all code obtaining fp from fdtabl=
e places
> > > a read memory barrier after reading f_ops and before reading anything=
 else.
> > > As you could guess no code does that and I don't believe placing rmb's
> > > in several new places is the way to go.
> > I think your analysis is correct for all cases except kern_openat().
> >=20
> > For kern_openat(), we install a file into fdtable only after the struct
> > file is fully initialized, see kern_openat(). The file is allocated
> > with falloc_noinstall(), then file is initialized, then finstall() does
> > FILEDESC_LOCK/UNLOCK, which ensures the full barrier. Other file
> > accessors do fget_unlocked(), which does acquire (of fp->f_count, but
> > this does not matter).
> >=20
> > The only critical constraint is that other accessors must not see
> > f_ops !=3D badfileops while struct file is not fully visible.  IMO,
> > the full barrier in finstall() and acquire in fget*() guarantee
> > that we see badfileops until writes to other members are visible.
> >=20
>=20
> I agree, but I doubt everything can be converted to this model (see below=
).
>=20
> > For falloc(), indeed the write to f_ops could become visible too early
> > (but not on x86).
> >=20
> > >=20
> > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D
> > > GENERAL OVERVIEW OF CURRENT STATE:
> > >=20
> > > What readers need to do:
> > > - rmb() after reading fp_ops
> > > - check fp_ops for badfileops
> > How can readers see badfileops ?
> >=20
>=20
> Not sure what you mean. fp is installed with badfileops, anything
> accessing fdtable before finit finishes will see this.
I referenced falloc_noinstall().

>=20
> > >=20
> > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D
> > > PROPOSAL:
> > >=20
> > > struct file *fp;
> > > int fd;
> > >=20
> > > if (error =3D falloc(&fp, &fd))
> > Use falloc_noinstall() there.
> >=20
> > > 	return (error);
> > > if (error =3D something(....)) {
> > > 	fdclose(fp, fd);
> > > 	return (error);
> > > }
> > >=20
> > > finit(fp, ....);
> > > factivate(fd, fp);
> > This function is spelled finstall().
> >=20
> > It seems that all what is needed is conversion of places using
> > falloc() to falloc_noinstall()/finstall().
> >=20
>=20
> This postpones fd allocation to after interested function did all work
> it wanted to do, which means we would need reliable ways of reverting
> all the work in case allocation failed. I'm not so confident we can do
> that for all current consumers and both current and my proposed approach
> don't impose such requirement.
Cleanup should be identical to the actions done on close(2).

>=20
> Of course postponing fd allocation where possible is definitely worth
> doing.
Yes, and after that the rest of the cases should be evaluated.
But my gut feeling is that everything would be converted.

>=20
> For cases where it is not possible/feasible, the only problem we have is
> making sure we update fd entry in proper table in factivate.
>=20
> The easiest solution is to FILEDESC_XLOCK, but this may have measurable
> impact. We can get away with FILEDESC_SLOCK just fine, but this still
> writes and may ping-pong with other cpus.
>=20
> This is another place where we could just plop sequence counters.
>=20
> fdgrowtable would +/-:
> seq_write_begin(&fdp->fd_tbl_seq);
> memcpy(....);
> assign the new pointer
> seq_end_begin(&fdp->fd_tbl_seq);
>=20
> Then factivate would be +/-:
>=20
> do {
> 	fdtable =3D __READ_ONCE(..., fdp->fd_tbl);
> 	rmb();
> 	seq =3D seq_read(fdp->fd_tbl_seq);
> 	fdtable[fd] =3D fp;
> while (!seq_consistent(fdp->fd_tbl_seq, seq));
>=20
> This in worst case updates old fdtable (which we never free so it is
> harmless) and retries. seq_read never returns seq in 'modify' state,
> and we check that seq is the same before and after the operation. If the
> condition is met there was no concurrent fdtable copy.
>=20
> This also means that table readers can 'suddenly' get fps, but this
> should be fine. NULL fp is used to denote unused fd. If fp is non NULL,
> it is safe for use. If it is NULL, fd is ignored which should not matter.
>=20
> Functions like fdcopy just have to make sure they read fp once.
>=20
> --=20
> Mateusz Guzik <mjguzik gmail.com>

--whB3aSVTmcGVRwAq
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBAgAGBQJTyUR+AAoJEJDCuSvBvK1BQcYP/iGyymAyxnX4nZ55SIS0sz1n
KS9bJzNmOsnQrqQNcTP1AArkmCPqa2MGTWEXxzvS54dhXCBRxA14WmSQykfUV7jg
2B2AsnnQ5lNOxwOjjHwvvkhrWtqoxvuVhg7a6ADFIu4AS5TCtVxi1bohO5X+T3ED
qoPxtDai6rNNiRJKu3aNfKYT0idNVt1HROHRu3pRelBkfFqwdxO8D4GsI8e2rQGY
74ZIi3kkVq+N3GA1KbSnMzrEhzE5jrUNz5v9YCszoYmbsyKY1myToTQoqX/E70l2
KnhIotBbdQFEYBb2EffCHfrzn/Tl8y6uE//7Aqag6DVj3iJ5Khv5FLe1jiJXx573
sxfOj5N+eaiBY9Y8T7pNkYkrHAz2910UR+yW04atAV7+/yKHH1uyVewZej+JAFwN
IFYKcfwhBOlbCBc+mfV1WZN0axszJX3eX+i7Fi5Mo2GdaLeEFyqZSWbBlNHTPpw2
OwrzDFZAcm9EsLQGf6RP4Mawq3TeuKNvdqjuNcegeZHNHGyFU3XCrzBLKVz2OOWY
Dh+sO5b79K7MZctNvZNhjOMawy5BALaFapsjnJc5QvR9Gknxml256Sc18eSfDRgt
qQ/fQA/SoQRndzhRjRsHEGOCIzwqeNIgaMKv86NiAND/zB3Xi5jFKzpeSoig3TvU
0HhnD6SBejgndg23BgKI
=UyEP
-----END PGP SIGNATURE-----

--whB3aSVTmcGVRwAq--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20140718155959.GN93733>