Date:      Wed, 12 Feb 2003 19:32:10 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Brad Knowles <brad.knowles@skynet.be>
Cc:        Rahul Siddharthan <rsidd@online.fr>, freebsd-chat@freebsd.org
Subject:   Re: Email push and pull (was Re: matthew dillon)
Message-ID:  <3E4B11BA.A060AEFD@mindspring.com>
References:  <20030211032932.GA1253@papagena.rockefeller.edu> <a05200f2bba6e8fc03a0f@[10.0.1.2]> <3E498175.295FC389@mindspring.com> <a05200f37ba6f50bfc705@[10.0.1.2]> <3E49C2BC.F164F19A@mindspring.com> <a05200f43ba6fe1a9f4d8@[10.0.1.2]> <3E4A81A3.A8626F3D@mindspring.com> <a05200f4cba70710ad3f1@[10.0.1.2]>

Brad Knowles wrote:
> At 9:17 AM -0800 2003/02/12, Terry Lambert wrote:
> >  In terms of I/O throughput, you are right.
> >
> >  But we are not interested in I/O throughput, in this case, we
> >  are interested in minimizing dynamic pool size, for a given
> >  pool retention time function, over a given input and output
> >  volume.
> 
>         Under what circumstances are you not interested in I/O throughput?!?

When the problem is recipient maildrop overflow, rather than
inability to handle load.  Since a single RS/6000 with 2 166MHz
CPUs and a modified Sendmail can handle 500,000 average-sized
email messages in a 24 hour period, load isn't really the problem:
it's something you can "throw hardware at", and otherwise ignore.
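
To put rough numbers on that (back-of-the-envelope; the 8K average
message size is borrowed from the transit figures later in this
message):

    messages_per_day = 500000
    seconds_per_day = 24.0 * 60 * 60
    avg_msg_kb = 8.0                  # the 8K figure used later on
    msgs_per_sec = messages_per_day / seconds_per_day
    print("%.1f msgs/sec, %.0f KB/sec sustained"
          % (msgs_per_sec, msgs_per_sec * avg_msg_kb))
    # About 5.8 msgs/sec and ~46 KB/sec: a rate you can "throw
    # hardware at".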


>         I have seen some mail systems that were short of disk space, but
> when we looked carefully at the number of messages in the mailboxes
> and the number of recipients per message, there just wasn't a whole
> lot of disk space that we could potentially have recovered.  This was
> across 100,000+ POP3/dialup users at an earlier time in the life of
> Belgacom Skynet, the largest ISP in Belgium.

The issue is not real limits, it is administrative limits, and, if
you care about being DOS'ed, it's about aggregate limits not
resulting in overcommit.


>         Virtually all other mail systems I've ever seen have not had disk
> space problems (unless they didn't enforce quotas), but were instead
> critically short of I/O capacity in the form of synchronous meta-data
> updates.  This was true for both the MTA and the mailbox storage
> mechanism.

You are looking at the problem from the wrong end.  A quota is good
for you, but it sucks for your user, who loses legitimate traffic
if illegitimate traffic pushes them over their quota.

What this comes down to is the level of service you are offering
your customer.  Your definition of "adequate" and their definition
of "adequate" are likely not the same.

If we take two examples, HotMail and Yahoo Mail (formerly Rocket
Mail), it's fairly easy to see that the "quota squeeze" was
originally intended to act as a circuit breaker for the disk space
issue.

However, we now see that it's being used as a lever to attempt to
extract revenue from a broken business model ("buy more disk space
for only $9.95/month!").

The user convenience being sold here lies in the ability for the
user to request what is, in effect, a larger queue size, in
exchange for money.

If this queue size were not an issue, then we would not be having
this discussion: it would not have value to users, and, not having
any value, it would not have a market cost associated with its
reduction.


> >  The Usenet parallel is probably not that apt.  Usenet provides
> >  an "Expires:" header, which bounds the pool retention time to a
> >  fixed interval, regardless of volume.
> 
>         Almost no one ever uses the Expires: header anymore.  If they do,
> it's in an attempt to ensure that the message stays around much, much
> longer than normal as opposed to the reverse.

Whether the expiration is enforced by default, by the poster, or
administratively is irrelevant to the fact that there is a limited
lifetime in the distributed persistent queueing system that is Usenet.


>         No, what I was talking about was the fundamental fact that you
> cannot possibly handle a full USENET feed of 650GB/day or more, if
> you don't have enough spindles going for you.  It doesn't matter how
> much disk space you have, if you don't have the I/O capacity to
> handle the input.

This is a transport issue -- or, more properly, a queue management
and data replication issue.  It would be very easy to envision a
system that could handle this, with "merely" enough spindles to
hold 650GB/day.  An OS with a log structured or journalling FS,
or even soft updates, which exported a transaction dependency
interface to user space, could handle this, no problem.
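
For scale, here is the same back-of-the-envelope treatment of that
feed; the whole point of the log-structured/journalling argument is
that the writes can be made effectively sequential:

    feed_bytes_per_day = 650 * 1024**3     # 650GB/day
    seconds_per_day = 24.0 * 60 * 60
    mb_per_sec = feed_bytes_per_day / seconds_per_day / 1024**2
    print("%.1f MB/sec sustained" % mb_per_sec)
    # About 7.7 MB/sec: a modest number of spindles can absorb that
    # *if* the writes are sequential, rather than dominated by
    # synchronous metadata updates.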

Surely, you aren't saying an Oracle Database would need a significant
number of spindles in order to replicate another Oracle Database,
when all bottlenecks between the two machines, down to the disks,
are directly managed by a unified software set, written by Oracle?


> >  Again, I disagree.  Poor design is why they don't scale.
> 
>         Right, and another outcome of poor design is their stupid choice
> of single-instance store -- a false economy.

I'm not positive that it matters, one way or the other, in the
long run, if things are implemented correctly.  However, it is
aesthetically pleasing, on many levels.


> >>          These slides have absolutely nothing whatsoever to do with the
> >>  MTA.  They have to do with the mailbox, mailbox delivery, mailbox
> >>  access, etc....  You need message locking in some fashion, you may
> >>  need mailbox locking, and most schemes for handling mailboxes involve
> >>  either re-writing the mailbox file, or changing file names of
> >>  individual messages (or changing their location), etc....  These are
> >>  all synchronous meta-data operations.
> >
> >  You do not need all the locking you state you need, at least not
> >  at that low a granularity.
> 
>         Really?  I'd like to hear your explanation for that claim.

Why the heck are you locking at a mailbox granularity, instead
of a message granularity, for either of these operations?
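
To make that concrete, here is a minimal sketch of per-message
delivery that needs no mailbox-wide lock at all, along the lines of
the Maildir layout that comes up later in this thread (the naming
scheme and directory layout are illustrative, not any particular
implementation):

    import os, socket, time

    def deliver(maildrop, data):
        # Write the message to tmp/ under a unique name, then
        # rename() it into new/.  rename() within one filesystem is
        # atomic, so readers never see a partial message and
        # concurrent deliveries never contend on a mailbox lock --
        # the locking granularity is the individual message file.
        unique = "%d.%d.%s" % (time.time(), os.getpid(),
                               socket.gethostname())
        tmp_path = os.path.join(maildrop, "tmp", unique)
        new_path = os.path.join(maildrop, "new", unique)
        f = open(tmp_path, "wb")
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
        f.close()
        os.rename(tmp_path, new_path)

Deletion and status changes are likewise per-file rename/unlink
operations, so a reader only ever touches individual messages.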


>         I would be very interested to know at what time they have ever
> used any Vax/VMS systems anywhere in the entire history of the
> company.  I have plenty of contacts that I can use to verify any
> claims.

Sorry, I was thinking of CompuServe, which had switched over to
FreeBSD for some of its systems, at one point.


> >  Sendmail performance tuning is not the issue, although if you
> >  are a transit server for virtual domains, you should rewrite the
> >  queueing algorithm.
> 
>         My point is not that Sendmail is the issue.  My point is that
> Nick has designed and built some of the largest open-source based
> mail systems in the world, and he and I worked extensively to create
> the architecture laid out in my LISA 2002 talk.

And I was the architect for the IBM Web Connections NOC, and
for an unannounced IBM Services product.  This isn't a size
contest...  8-).


> >                       See:
> >
> >       ftp://ftp.whistle.com/pub/misc/sendmail/
> 
>         This was written for sendmail 8.9.3, way before the advent of
> multiple queues and all other sorts of new features.  It is no longer
> relevant to modern sendmail.

I was at the Sendmail MOTM (Meeting Of The Minds) architecture
discussion in 2000.  I was one of about 8 outside people in the
world who were invited to the thing.  I am aware of Sendmail.  I
still have the thermal underwear.

The answer is that it *is* relevant to modern sendmail, because
the multiple queues in modern sendmail are, effectively,
hashed traversal domains.  If you read the presentation that
David Wolfskill did for BayLISA (the "mpg" in that directory),
you will see the difference.

The point of the queue modification, in this system, which was a
large transit mail server designed for 50,000 virtual domains on
a single server instance, was to ensure a 100% hit rate in all
queue runs.

The main problem sendmail faces, when performing fractional queue
runs, is that it must open each queue file and examine its contents,
in order to know whether or not a given queue element should be
operated upon.
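
In sketch form (this is the shape of the problem, not sendmail's
actual queue-run code; parse_control_file() and process() are
hypothetical stand-ins):

    import os

    def fractional_queue_run(queue_dir, wanted_domain):
        # Every control ("qf") file has to be opened and parsed just
        # to decide whether this run cares about it, so the work is
        # O(total queue depth) even when the number of hits is tiny.
        hits = 0
        for name in os.listdir(queue_dir):
            if not name.startswith("qf"):
                continue
            envelope = parse_control_file(os.path.join(queue_dir, name))
            if envelope.recipient_domain == wanted_domain:
                process(envelope)
                hits += 1
        return hits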

Breaking up the queue runs into multiple hash directories, and
running each queue with a separate process avoids the global
queue lock issue, but only *statistically* decreases the chance of
a run-collision between two runners.  It does *not* increase the
probability of a "hit" for a given queue element, *at all*.

Even if you did "the logical thing", and ensured that all domain
destinations ended up in the same hash bucket (I would be rather
amazed if you could do that, and simultaneously balance queue
depth between queues, given a fixed hash selection algorithm!),
the increase in "hit" probability will only go up by the average
total queue depth divided by the average number of queue entries
per queue.  This number is *negligible*, until the number of
queues approaches the number of domains.
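
A worked version of that ratio, under the same assumptions (50,000
destination domains, as in this system; the queue counts are just
illustrations):

    # Fraction of opened queue files that are "hits" for one
    # destination domain, assuming entries are spread evenly over the
    # domains and all same-domain entries land in the same hash
    # bucket (the best case described above).
    domains = 50000
    for queues in (1, 10, 100, 1000, 50000):
        hit_fraction = min(1.0, float(queues) / domains)
        print("%6d queues: %7.3f%% hits per queue run"
              % (queues, 100.0 * hit_fraction))
    # Only when the number of queues approaches the number of domains
    # (one queue per destination domain) does the hit rate stop being
    # negligible.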

Compare: before modification, the maximum load on a SPARC Center
10 was 300 client machines making a queue run every hour.  After
modification, the maximum load on an RS6000/50 was 50,000 client
machines making a queue run every hour.  Before modification, the
number of messages which could transit the system was 30,000 8K
messages in a 24 hour period.  After, it was 5,000,000 8K messages
in a 24 hour period.
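
Turning those before and after figures into sustained rates
(straight arithmetic on the numbers above):

    seconds_per_day = 24.0 * 60 * 60
    msg_size_kb = 8.0                    # "8K messages", per the text
    for label, msgs_per_day in (("before", 30000), ("after", 5000000)):
        rate = msgs_per_day / seconds_per_day
        print("%s: %6.2f msgs/sec, %6.1f KB/sec"
              % (label, rate, rate * msg_size_kb))
    # Roughly 0.35 -> 58 msgs/sec and ~2.8 -> ~463 KB/sec: the
    # limiting factor was queue management, not raw I/O bandwidth.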

Assuming 100 hash queues, the increase in processing capacity over
the 30,000-message baseline would be *negligible*.  The reason for
this is that we have reached queue saturation: a triggered run must
take place in all queues anyway.


> >  The Open Source book is wrong.  You can not build such a system
> >  without significant modification.  My source tree, for example,
> >  contains more than 6 million lines of code at this point, and
> >  about 250,000 of those are mine, modifying Cyrus, modifying
> >  OpenLDAP, modifying Sendmail, modifying BIND, etc..
> 
>         IIRC, Nick talks about the changes that were required -- some,
> but not excessive.  Read the paper.

I have read it.  The modifications he proposes are small ones,
which deal with impedance issues.  They are low-hanging fruit,
available to a system administrator, not in-depth modifications
by a software engineer.


> >  Because Open Source projects are inherently incapable of doing
> >  productization work.
> 
>         True enough.  However, this would imply that the sort of thing
> that Nick has done is not possible.  He has demonstrated that this is
> not true.

*You've* demonstrated it, or you would just adopt his solution
wholesale.  The issue is that his solution doesn't scale nearly
as well as is possible, it only scales "much better than Open
Source on its own".

Try an experiment for me: tune the bejesus out of a FreeBSD box
with 4G of RAM.  Do everything you can think of doing to it, in
the same time frame, and at the same level of depth of understanding
that Nick applied to his system.  Then come back, and tell me two
numbers: (1) Maximum number of new connections per second before
and after, and (2) Total number of simultaneous connections, before
and after.
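
If you want to actually run that comparison, here is a deliberately
naive sketch of collecting the two numbers from a single client (the
host, port, and duration are placeholders; a real test needs several
client machines so the client itself isn't the bottleneck):

    import socket, time

    def measure(host, port, duration=10.0):
        # Open connections as fast as possible for `duration` seconds
        # and hold them open.  Returns rough stand-ins for the two
        # numbers asked for above: new connections per second, and
        # simultaneous connections reached before something gave out.
        conns = []
        start = time.time()
        while time.time() < start + duration:
            try:
                conns.append(socket.create_connection((host, port),
                                                      timeout=1.0))
            except OSError:
                break            # fd limit, accept queue full, etc.
        elapsed = time.time() - start
        return len(conns) / elapsed, len(conns)

    # rate, simultaneous = measure("mailhost.example.com", 143)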


>         Yes, using open source to do this sort of thing can be difficult
> (as I am finding out), but it doesn't have to be impossible.

It doesn't have to be, agreed, but it takes substantially more
investment than it would cost to "brute force" the problem by
building out multiple instances of commercial software, plus the
machines to run them.  Either that, or the resulting system ends
up being fragile.


> >  $0 is not really true.  They are paying for you, in the hopes
> >  that it will end up costing less than a prebuilt system.
> 
>         It's a different color of money.  They had already signed the
> contract stating that I would be working for them through April (at
> the earliest), before this project was dumped in my lap.  So, that's
> not anything extra.  Buying new machines, or buying software, now
> that's spending extra.

I can't imagine a business which did not run on a cost accounting
basis.  8-).


> >  Contact Stanford, MIT, or other large institutions which have
> >  already deployed such a system.
> 
>         I've already read much of Mark Crispin's writings.  I know how
> they did it at UW, and they didn't use NFS.  I've read the Cyrus
> documentation, and they didn't use NFS either.

UW is not the place you should look.  Stanford (as I said)
already has a deployed system, and they are generally helpful
when people want to copy what they have done.


>         That only leaves Courier-IMAP, and while I've read the
> documentation they have available, I am finding it difficult to find
> anyone who has actually built a larger-scale system using
> Courier-IMAP on NFS.  Plenty of people say they've heard of it being
> done, or it should be easily do-able, but I'm not finding the people
> themselves who've actually done it.

If you are looking at IMAP4, then Cyrus or a commercial product
are your only options, IMO, and neither will work well enough, if
used "as is".


> >  Not in Open Source; Open Source does not perform productization or
> >  systems integration.
> 
>         Therein lies the problem.  You may be able to write or re-write
> all of the open source systems in existence, but that sort of thing
> is not within my capabilities, and would not be accepted for this
> project.  They're looking askance at my modifications to procmail to
> get it to use Maildir format and hashed mailboxes -- there's no way
> they'd accept actual source code changes.

How many maildrops does this need to support?  I will tell you if
your project will fail.  8-(.

-- Terry
