From owner-freebsd-bugs@FreeBSD.ORG Wed Jul 7 08:30:24 2004 Return-Path: Delivered-To: freebsd-bugs@hub.freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A89AC16A4CE for ; Wed, 7 Jul 2004 08:30:24 +0000 (GMT) Received: from freefall.freebsd.org (freefall.freebsd.org [216.136.204.21]) by mx1.FreeBSD.org (Postfix) with ESMTP id 8B22843D2D for ; Wed, 7 Jul 2004 08:30:24 +0000 (GMT) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (gnats@localhost [127.0.0.1]) by freefall.freebsd.org (8.12.11/8.12.11) with ESMTP id i678UO9w066473 for ; Wed, 7 Jul 2004 08:30:24 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.12.11/8.12.11/Submit) id i678UOOr066472; Wed, 7 Jul 2004 08:30:24 GMT (envelope-from gnats) Date: Wed, 7 Jul 2004 08:30:24 GMT Message-Id: <200407070830.i678UOOr066472@freefall.freebsd.org> To: freebsd-bugs@FreeBSD.org From: Bruce Evans Subject: Re: kern/68690: write(2) returns wrong vlalue when EFAULT X-BeenThere: freebsd-bugs@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list Reply-To: Bruce Evans List-Id: Bug reports List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 07 Jul 2004 08:30:24 -0000 The following reply was made to PR kern/68690; it has been noted by GNATS. From: Bruce Evans To: KOIE Hidetaka Cc: freebsd-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org Subject: Re: kern/68690: write(2) returns wrong vlalue when EFAULT Date: Wed, 7 Jul 2004 18:23:57 +1000 (EST) On Mon, 5 Jul 2004, KOIE Hidetaka wrote: > >Description: > Invoking write(fd, buf, size), if buf has both validand invalid segment, > write return -1, but it's file pointer has been advanced. > The caller mistakes the pointer stays. This is a longstanding bug. It is more of a problem for non-seekable devices since physically written can't be undone, so the best possible error handling is to return a short write as specified in all versions of POSIX. Do you actually see the file pointer advanced? This may be file system dependent. ffs is supposed to back out of the write, and it does so for me. Output: %%% pos=0 write(20480)->-1 (should be 12288) write: Bad address pos=0 (should be 12288) %%% This is correct. We tried to write 20K but the buffer size is only 12K and the was an i/o error after 12K. ffs_write() handled the error by truncating the file to its original size and rerstoring the original file system, and since the write was at the end of the file this has the same result as if the write had failed with EFAULT after 0K. I think there is also no way for other processes to see the write while it is in progress, since even if they have mmapped the file they can't see beyond its original length. To show an actual bug for ffs, modifiy the example slightly so that it overwrites existing data. Then truncating the file has no effect, and it is just wrong for write() to return -1 after clobbering some data. > >Fix: (1) Here is ffs_write()'s buggy error handling: % for (error = 0; uio->uio_resid > 0;) { % ... % if (uio->uio_offset + xfersize > ip->i_size) % vnode_pager_setsize(vp, uio->uio_offset + xfersize); Not error handling, but general setup. Hopefully extending the size at the vm level doesn't allow other other processes to see up to the new EOF yet. % % /* % * We must perform a read-before-write if the transfer size % * does not cover the entire buffer. % */ % if (fs->fs_bsize > xfersize) % flags |= BA_CLRBUF; % else % flags &= ~BA_CLRBUF; Security-related code. Other processes can see buffer contents via mmap(), at least after completion of extension of the file at the vfs level, so we must clear the buffer contents now. This also avoids potential leakage of unwritten buffer contents to the same process after buggy EFAULT handling. % ... % /* % * If the buffer is not valid we have to clear out any % * garbage data from the pages instantiated for the buffer. % * If we do not, a failed uiomove() during a write can leave % * the prior contents of the pages exposed to a userland % * mmap(). XXX deal with uiomove() errors a better way. % */ % if ((bp->b_flags & B_CACHE) == 0 && fs->fs_bsize <= xfersize) % vfs_bio_clrbuf(bp); More buffer clearing, for another case. It's not so clear that this is done before other processes can see the buffer. % ... % /* % * If we successfully wrote any data, and we are not the superuser % * we clear the setuid and setgid bits as a precaution against % * tampering. % */ % if (resid > uio->uio_resid && ap->a_cred && % suser_cred(ap->a_cred, PRISON_ROOT)) { % ip->i_mode &= ~(ISUID | ISGID); % DIP(ip, i_mode) = ip->i_mode; % } Misplaced code. We are going to back out of the write in some cases, so we don't know if we successefully wrote any data here. % ... % if (error) { % if (ioflag & IO_UNIT) { % (void)UFS_TRUNCATE(vp, osize, % IO_NORMAL | (ioflag & IO_SYNC), % ap->a_cred, uio->uio_td); % uio->uio_offset -= resid - uio->uio_resid; % uio->uio_resid = resid; % } % } ... Here is where we back out of the write in some cases. (a) We only back out in the IO_UNIT case, but that is the usual case and is always the case for write(2). (There is another bug suite involving IO_UNIT. Most references to IO_UNIT in the kernel except those related to userland i/o don't actually give atomic writes since writes are made non-atomic using vfs_rdwr_inchunks() anyway. OTOH, there seems to be no reason for ffs to skip the backout in the !IO_UNIT case, so there is no need for the IO_UNIT flag to exist.) (b) We cannot back out if anything was written before the original end of the file. Truncation just increases the mess in that case, since it only reduces the file to its original size, but the file position (in uio->uio_offset) is reduced by the full amount. Fixes for ffs_write(): ignore IO_UNIT, and don't rewind the file pointer below osize. This gives short for writes that wrote something before osize and then failed. Similarly for other file systems. Here is write(2)'s buggy error handling (in dofilewrite()): % if ((error = fo_write(fp, &auio, td->td_ucred, flags, td))) { % if (auio.uio_resid != cnt && (error == ERESTART || % error == EINTR || error == EWOULDBLOCK)) % error = 0; (auio.uio_resid != cnt) means that the write was short. It should be returned to userland as a short write by setting error to 0 unconditionally. LOwer layers mostly get this write and return a short write if and only if they couldn't back out of the write, but they still return a nonzero error code to indicate that there was a problem. The userland interface (write(2) in this case) just doesn't support returning both an error code and a byte count, so one of them must be discarded, and it must be the error code since it is useful and POSIX standard to return the byte count. Since ffs attempts to backs out of failing writes, and the backup is successful in most cases, this bug mainly affects devices. Similarly for read(2) and writev(2), etc. Bruce