Date:      Tue, 8 Dec 2015 12:28:46 -0700
From:      Warner Losh <imp@bsdimp.com>
To:        Steven Hartland <killing@multiplay.co.uk>
Cc:        "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject:   Re: DELETE support in the VOP_STRATEGY(9)?
Message-ID:  <CANCZdfrrW61kRdH=yrkN60coo4=9E9U8MOue7ASyy0aVjN0zPA@mail.gmail.com>
In-Reply-To: <56672C94.30404@multiplay.co.uk>
References:  <CAH7qZftSVAYPmxNCQy=VVRj79AW7z9ade-0iogv2COfo2x%2Ba2Q@mail.gmail.com> <201512052002.tB5K2ZEA026540@chez.mckusick.com> <CAH7qZfs6ksE%2BQTMFFLYxY0PNE4hzn=D5skzQ91=gGK2xvndkfw@mail.gmail.com> <86poyhqsdh.fsf@desk.des.no> <CAH7qZftVj9m_yob=AbAQA7fh8yG-VLgM7H0skW3eX_S%2Bv75E-g@mail.gmail.com> <86fuzdqjwn.fsf@desk.des.no> <CANCZdfo=NfKy51%2B64-F_v%2BDh2wkrFYP4gXe=X9RWSSao49gO9g@mail.gmail.com> <CANCZdfqHoduhdCss0b6=UsBPAxfRZv4hF8vyuUVLBdP5gYUduQ@mail.gmail.com> <864mfssxgt.fsf@desk.des.no> <CANCZdfoXdcD%2B9jeVR1Np16gafBf0_4B2wombwxze8DvJwf7cMg@mail.gmail.com> <86wpsord9l.fsf@desk.des.no> <566726ED.2010709@multiplay.co.uk> <0DB97CBA-4DC3-4D52-AE9D-54546292D66F@bsdimp.com> <56672C94.30404@multiplay.co.uk>

On Tue, Dec 8, 2015 at 12:16 PM, Steven Hartland <killing@multiplay.co.uk>
wrote:

>
>
> On 08/12/2015 19:03, Warner Losh wrote:
>
>> On Dec 8, 2015, at 11:52 AM, Steven Hartland <killing@multiplay.co.uk>
>>> wrote:
>>>
>>>
>>>
>>> On 08/12/2015 18:44, Dag-Erling Smørgrav wrote:
>>>
>>>> Warner Losh <imp@bsdimp.com> writes:
>>>>
>>>>> Dag-Erling Smørgrav <des@des.no> writes:
>>>>>
>>>>>> But the filesystem does not know whether the underlying storage is
>>>>>> electromechanical or solid-state, nor does it know whether the user
>>>>>> cares much about seek times (unless we introduce the heuristic
>>>>>> "avoid creating holes unless the file already has them, in which
>>>>>> case the userland probably does not care").
>>>>>>
>>>>> Actually, the filesystem does know. Or has some knowledge of what
>>>>> is supported and what isn't. BIO_DELETE support is a strong indicator
>>>>> of a flash or other log-type system.
>>>>>
>>>> The filesystem can ask the layer below if BIO_DELETE is supported, but
>>>> should not assume anything about what it means.  For instance, I could
>>>> write a gnop-like module that translates BIO_DELETE into an all-zeroes
>>>> BIO_WRITE and passes everything else unmodified.  It would provide a
>>>> stronger guarantee than, say, SATA TRIM but would also have a completely
>>>> different performance profile (even on SSDs, since it would do its work
>>>> synchronously whereas TRIM works asynchronously).
>>>>
>>> That ship has sailed. UFS, at least, assumes that if TRIM is supported
>> then
>> relocating files to be contiguous is bad.
>>
>> But writing a gnop module that did the BIO_DELETE thing would be bogus.
>> BIO_DELETE does not mean that blocks will read back as zeros. But that’s
>> not what BIO_DELETE means. So, sure you could invent a stupid thing that
>> breaks the rules, and thus the assumptions of the other code, but why
>> would
>> you want to do that?
>>
>> The SATA trims are actually synchronous (in the absence of power
>> failures).
>> Once you TRIM The data, it is gone. And depending what bits are set in
>> the identify response, you can count on different things. But to say they
>> happen asynchronously because of implementation details about when the
>> data
>> is actually erased is missing the point. Also, your BIO_DELETE example
>> wouldn’t guarantee the data is erased either. Writes to log append devices
>> (like SSDs) are like a TRIM followed by a write: the old LBA mapping is
>> discarded and a new one replaces it.
>>
>
> Not all SATA TRIMs are synchronous; some FW does process them in the
> background.
>
> Saying once you TRIM data it's gone is actually too strong I'm afraid, as
> it's advisory; the FW can ignore you if it so chooses.
>
> There is the concept of DSM deterministic read which if set "should"
> result in returning the same values from read of a TRIMed sector every
> time, but even this is unreliable due to FW bugs (yes I've seen this).


I guess I've been lucky. In FreeBSD we only depend on the data reading
without error after a BIO_DELETE, and on a subsequent BIO_WRITE making
BIO_READ deterministic again. But I was mostly trying to say that once you
issue a TRIM to the drive and it returns, the TRIM is done, in the sense
that there's not another TRIM_COMPLETED message that comes back from the
drive.
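
To make that contract concrete, here is a toy Python model. It is not FreeBSD
code, and the names ToyDisk, bio_write, bio_delete and bio_read are made up
for illustration. It only sketches the two things an upper layer may rely on:
a read after a delete succeeds but returns unspecified contents, and a write
makes reads of that block deterministic again.

```python
import os


class ToyDisk:
    """Hypothetical model of a device with loose BIO_DELETE semantics."""

    BLOCK = 512  # block size in bytes (an assumption for this sketch)

    def __init__(self, nblocks):
        self.nblocks = nblocks
        # lba -> bytes; a missing key means "contents unspecified"
        self.blocks = {}

    def _check(self, lba):
        if not 0 <= lba < self.nblocks:
            raise IndexError("LBA out of range")

    def bio_write(self, lba, data):
        self._check(lba)
        if len(data) != self.BLOCK:
            raise ValueError("write must be exactly one block")
        self.blocks[lba] = bytes(data)

    def bio_delete(self, lba):
        # Returns once the mapping is dropped; there is no later
        # "delete completed" message in this model.
        self._check(lba)
        self.blocks.pop(lba, None)

    def bio_read(self, lba):
        self._check(lba)
        if lba in self.blocks:
            return self.blocks[lba]
        # Deleted or never written: the read succeeds, but the bytes
        # could be zeros, stale data, or anything else.
        return os.urandom(self.BLOCK)
```

A caller that wants known contents after a delete has to write the block
first; nothing in this model promises zeros.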


> Anyway, my point is that Maxim needs to revise his assumptions.
>>>>
>>> Just to clarify most consumer devices process TRIM synchronously, not
>>> asynchronously.
>>>
>> It also depends on what you mean by ‘process’ here.
>>
> Indeed it does, here I mean when / if the data is removed from the media
> by the HW.


I agree. Most firmware is asynchronous in this sense. You have to do
something called a SECURE ERASE to have the data actually be gone. The
granularity of that command, though, is the entire drive.


> Your example isn't actually just an example; CAM scsi_da has a number of
>>> different ways it can process BIO_DELETE:
>>> * ATA TRIM
>>> * SCSI UNMAP
>>> * Write Same 16
>>> * Write Same 10
>>> * Zero
>>>
>>> So your example actually exists in practice in the FreeBSD code base
>>> ;-)
>>>
>> All these are effectively TRIM operations. The devices that implement them
>> use them as hints to optimize storage. DES’ BIO_DELETE -> WRITE zero
>> example doesn’t optimize storage at all, nor does it give the lower layers
>> any clue about how to optimize the storage. All the SCSI delete types
>> do give that hint.
>>
> This is true, just wanted to highlight that "TRIM" can mean very different
> things even at the CAM layer.
>

Agreed. There are many different ways to implement BIO_DELETE's rather
loose semantics. This is one reason why we give people the knobs to turn it
off if performance is hurt in their application. This is the ultimate escape
hatch when the performance profile of BIO_DELETE in the actual drive doesn't
match the upper layer's assumptions.
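
As a rough illustration of what such a knob amounts to, here is a small
Python sketch. It is hypothetical: ToyFilesystem, trim_enabled and
issue_delete are invented names, not the real FreeBSD tunables or
interfaces. The point is only that freeing a block and telling the lower
layer about it are separable, so the hint can be switched off without
changing allocation.

```python
class ToyFilesystem:
    """Hypothetical upper layer that may or may not send delete hints."""

    def __init__(self, issue_delete, trim_enabled=True):
        # issue_delete: callable taking an LBA, standing in for a
        # BIO_DELETE request sent to the layer below (an assumption).
        self.issue_delete = issue_delete
        self.trim_enabled = trim_enabled
        self.free = set()

    def free_block(self, lba):
        # The block is always returned to the free pool...
        self.free.add(lba)
        # ...but the hint to the device is only sent when the knob is on.
        if self.trim_enabled:
            self.issue_delete(lba)


sent = []
fs = ToyFilesystem(sent.append, trim_enabled=False)
fs.free_block(5)   # freed locally, no delete issued to the lower layer
```

With the knob off, the device simply never hears about freed blocks, so
whatever its delete path costs drops out of the picture.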

Warner


