Date: Wed, 12 Feb 2003 19:32:10 -0800
From: Terry Lambert <tlambert2@mindspring.com>
To: Brad Knowles
Cc: Rahul Siddharthan, freebsd-chat@freebsd.org
Subject: Re: Email push and pull (was Re: matthew dillon)
Message-ID: <3E4B11BA.A060AEFD@mindspring.com>

Brad Knowles wrote:
> At 9:17 AM -0800 2003/02/12, Terry Lambert wrote:
> > In terms of I/O throughput, you are right.
> >
> > But we are not interested in I/O throughput, in this case, we
> > are interested in minimizing dynamic pool size, for a given
> > pool retention time function, over a given input and output
> > volume.
>
> Under what circumstances are you not interested in I/O throughput?!?

When the problem is recipient maildrop overflow, rather than inability
to handle load.  Since a single RS/6000 with two 166MHz CPUs and a
modified Sendmail can handle 500,000 average-sized email messages in a
24 hour period, load isn't really the problem: it's something you can
"throw hardware at", and otherwise ignore.

> I have seen some mail systems that were short of disk space, but
> when we looked carefully at the number of messages in the mailboxes
> and the number of recipients per message, there just wasn't a whole
> lot of disk space that we could potentially have recovered.  This was
> across 100,000+ POP3/dialup users at an earlier time in the life of
> Belgacom Skynet, the largest ISP in Belgium.

The issue is not real limits, it is administrative limits, and, if you
care about being DoS'ed, it's about aggregate limits not resulting in
overcommit.

> Virtually all other mail systems I've ever seen have not had disk
> space problems (unless they didn't enforce quotas), but were instead
> critically short of I/O capacity in the form of synchronous meta-data
> updates.  This was true for both the MTA and the mailbox storage
> mechanism.

You are looking at the problem from the wrong end.  A quota is good
for you, but it sucks for your user, who loses legitimate traffic if
illegitimate traffic pushed them over their quota.

What this comes down to is the level of service you are offering your
customer.  Your definition of "adequate" and their definition of
"adequate" are likely not the same.
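To put rough numbers on the "pool size, not throughput" point: the
dynamic pool is essentially arrival rate times retention time.  A
back-of-the-envelope sketch (the 500,000/day and 8K figures are the
ones quoted in this thread; the four-day retention is purely an
illustrative assumption):

#include <stdio.h>

int
main(void)
{
	double msgs_per_day   = 500000.0;  /* the RS/6000 figure above */
	double avg_msg_kbytes = 8.0;       /* "8K messages" -- the size used later in this mail */
	double retention_days = 4.0;       /* assumed maildrop retention before pickup */

	double pool_gbytes = msgs_per_day * retention_days * avg_msg_kbytes
	    / (1024.0 * 1024.0);

	printf("delivery rate:     %.1f msgs/sec\n", msgs_per_day / 86400.0);
	printf("steady-state pool: %.1f GB of maildrop\n", pool_gbytes);
	return (0);
}

At under six messages a second, the I/O path is the easy part; the
fifteen-odd gigabytes of retained maildrop is what the quota exists to
cap, and it grows with retention time no matter how fast your spindles
are.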
If we take two examples, HotMail and Yahoo Mail (formerly Rocket
Mail), it's fairly easy to see that the "quota squeeze" was originally
intended to act as a circuit breaker for the disk space issue.
However, we now see that it's being used as a lever to attempt to
extract revenue from a broken business model ("buy more disk space for
only $9.95/month!").

The user convenience being sold here lies in the ability for the user
to request what is, in effect, a larger queue size, in exchange for
money.  If this queue size were not an issue, then we would not be
having this discussion: it would not have value to users, and, not
having any value, it would not have a market cost associated with its
reduction.

> > The Usenet parallel is probably not that apt.  Usenet provides
> > an "Expires:" header, which bounds the pool retention time to a
> > fixed interval, regardless of volume.
>
> Almost no one ever uses the Expires: header anymore.  If they do,
> it's in an attempt to ensure that the message stays around much, much
> longer than normal as opposed to the reverse.

Whether the expiration is enforced by default, by the poster, or
administratively is irrelevant to the fact that there is a limited
lifetime in the distributed persistent queueing system that is Usenet.

> No, what I was talking about was the fundamental fact that you
> cannot possibly handle a full USENET feed of 650GB/day or more, if
> you don't have enough spindles going for you.  It doesn't matter how
> much disk space you have, if you don't have the I/O capacity to
> handle the input.

This is a transport issue -- or, more properly, a queue management and
data replication issue.  It would be very easy to envision a system
that could handle this with "merely" enough spindles to hold
650GB/day.  An OS with a log-structured or journalling FS, or even
soft updates, which exported a transaction dependency interface to
user space, could handle this, no problem.

Surely you aren't saying an Oracle database would need a significant
number of spindles in order to replicate another Oracle database, when
all bottlenecks between the two machines, down to the disks, are
directly managed by a unified software set, written by Oracle?

> > Again, I disagree.  Poor design is why they don't scale.
>
> Right, and another outcome of poor design is their stupid choice
> of single-instance store -- a false economy.

I'm not positive that it matters one way or the other, in the long
run, if things are implemented correctly.  However, it is
aesthetically pleasing, on many levels.

> >> These slides have absolutely nothing whatsoever to do with the
> >> MTA.  They have to do with the mailbox, mailbox delivery, mailbox
> >> access, etc....  You need message locking in some fashion, you may
> >> need mailbox locking, and most schemes for handling mailboxes involve
> >> either re-writing the mailbox file, or changing file names of
> >> individual messages (or changing their location), etc....  These are
> >> all synchronous meta-data operations.
> >
> > You do not need all the locking you state you need, at least not
> > at that low a granularity.
>
> Really?  I'd like to hear your explanation for that claim.

Why the heck are you locking at a mailbox granularity, instead of a
message granularity, for either of these operations?
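For the record, this is the kind of thing I mean by message
granularity: a Maildir-style delivery (the layout qmail introduced,
and what the procmail changes discussed below produce) never takes a
mailbox-wide lock at all.  A rough sketch, with simplified naming and
error handling -- an illustration of the granularity argument, not
anybody's production code:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>
#include <sys/stat.h>

static int
deliver(const char *maildir, const char *data, size_t len)
{
	char tmppath[1024], newpath[1024], uniq[128];
	int fd;

	/* Unique per-message name; time+pid is good enough for a sketch. */
	snprintf(uniq, sizeof(uniq), "%ld.%d.sketch",
	    (long)time(NULL), (int)getpid());
	snprintf(tmppath, sizeof(tmppath), "%s/tmp/%s", maildir, uniq);
	snprintf(newpath, sizeof(newpath), "%s/new/%s", maildir, uniq);

	/* Write into tmp/, which readers never scan.  O_EXCL is the only
	 * "lock", and it covers exactly one message. */
	if ((fd = open(tmppath, O_WRONLY | O_CREAT | O_EXCL, 0600)) == -1)
		return (-1);
	if (write(fd, data, len) != (ssize_t)len || fsync(fd) == -1) {
		close(fd);
		unlink(tmppath);
		return (-1);
	}
	close(fd);

	/* rename() is atomic within a filesystem: the message appears in
	 * new/ whole, so a reader never sees a partial message and never
	 * needs a mailbox lock to protect itself from the writer. */
	if (rename(tmppath, newpath) == -1) {
		unlink(tmppath);
		return (-1);
	}
	return (0);
}

int
main(void)
{
	const char *msg = "From: nobody@example.org\n\nhello\n";

	/* Assumed layout: Maildir/{tmp,new,cur} under the current dir. */
	mkdir("Maildir", 0700);
	mkdir("Maildir/tmp", 0700);
	mkdir("Maildir/new", 0700);
	mkdir("Maildir/cur", 0700);

	return (deliver("Maildir", msg, strlen(msg)) == 0 ? 0 : 1);
}

The only "lock" is the O_EXCL create of a per-message file in tmp/,
and rename() makes the message visible in new/ atomically, so the unit
of contention is the individual message, never the mailbox.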
> I would be very interested to know at what time they have ever
> used any Vax/VMS systems anywhere in the entire history of the
> company.  I have plenty of contacts that I can use to verify any
> claims.

Sorry, I was thinking of CompuServe, which had switched some of its
systems over to FreeBSD at one point.

> > Sendmail performance tuning is not the issue, although if you
> > are a transit server for virtual domains, you should rewrite the
> > queueing algorithm.
>
> My point is not that Sendmail is the issue.  My point is that
> Nick has designed and built some of the largest open-source based
> mail systems in the world, and he and I worked extensively to create
> the architecture laid out in my LISA 2002 talk.

And I was the architect for the IBM Web Connections NOC, and for an
unannounced IBM Services product.  This isn't a size contest... 8-).

> > See:
> >
> > ftp://ftp.whistle.com/pub/misc/sendmail/
>
> This was written for sendmail 8.9.3, way before the advent of
> multiple queues and all other sorts of new features.  It is no longer
> relevant to modern sendmail.

I was at the Sendmail MOTM (Meeting Of The Minds) architecture
discussion in 2000.  I was one of about 8 outside people in the world
who were invited to the thing.  I am aware of Sendmail.  I still have
the thermal underwear.

The answer is that it *is* relevant to modern sendmail, because the
multiple queues in modern sendmail are, effectively, hashed traversal
domains.  If you read the presentation that David Wolfskill did for
BayLISA (the "mpg" in that directory), you will see the difference.

The point of the queue modification in this system, which was a large
transit mail server designed for 50,000 virtual domains on a single
server instance, was to ensure a 100% hit rate in all queue runs.  The
main problem sendmail faces, when performing fractional queue runs, is
that it must open each queue file and examine its contents in order to
know whether or not a given queue element should be operated upon.

Breaking the queue up into multiple hash directories, and running each
queue with a separate process, avoids the global queue lock issue, but
only *statistically* decreases the chance of a run-collision between
two runners.  It does *not* increase the probability of a "hit" for a
given queue element, *at all*.

Even if you did "the logical thing", and ensured that all destinations
for a given domain ended up in the same hash bucket (I would be rather
amazed if you could do that and simultaneously balance queue depth
between queues, given a fixed hash selection algorithm!), the "hit"
probability only goes up by the average total queue depth divided by
the average number of queue entries per queue.  This number is
*negligible* until the number of queues approaches the number of
domains.

Compare: before modification, the maximum load on a SPARC Center 10
was 300 client machines making a queue run every hour.  After
modification, the maximum load on an RS6000/50 was 50,000 client
machines making a queue run every hour.  Before modification, the
number of messages which could transit the system was 30,000 8K
messages in a 24 hour period.  After, it was 5,000,000 8K messages in
a 24 hour period.  Assuming 100 hash queues, the improvement over the
30,000 message figure is *negligible*.  The reason for this is that we
have reached queue saturation, since a triggered run must take place
in all queues anyway.
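To put rough numbers on the hit-rate argument (the 100 hash queues and
the 50,000 domains are the figures above; the total queue depth is an
assumed number, for illustration only):

#include <stdio.h>

int
main(void)
{
	double depth    = 500000.0;	/* assumed: messages sitting in the queue */
	double nqueues  = 100.0;	/* hash queue directories, as above */
	double ndomains = 50000.0;	/* virtual domains on the box */

	/* A run triggered for one domain can act on only that domain's
	 * messages: on average depth/ndomains of them. */
	double deliverable = depth / ndomains;

	/* Hashed queues: those messages are scattered across every hash
	 * directory, so each runner opens its whole directory to find
	 * its share of them. */
	double opened_per_runner = depth / nqueues;
	double hits_per_runner   = deliverable / nqueues;

	printf("hashed:     %.0f runners each open %.0f files for %.2f hits "
	    "(%.4f%% hit rate)\n", nqueues, opened_per_runner,
	    hits_per_runner, 100.0 * hits_per_runner / opened_per_runner);
	printf("per-domain: one runner opens %.0f files, all hits "
	    "(100%% hit rate)\n", deliverable);
	return (0);
}

A hit rate measured in thousandths of a percent is why the box
saturates on triggered runs long before the hashing buys you anything;
the hit rate only recovers as the number of queues approaches the
number of domains, which is what the per-domain queue modification
arranges directly.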
> > The Open Source book is wrong.  You can not build such a system
> > without significant modification.  My source tree, for example,
> > contains more than 6 million lines of code at this point, and
> > about 250,000 of those are mine, modifying Cyrus, modifying
> > OpenLDAP, modifying Sendmail, modifying BIND, etc..
>
> IIRC, Nick talks about the changes that were required -- some,
> but not excessive.  Read the paper.

I have read it.  The modifications he proposes are small ones, which
deal with impedance issues.  They are low-hanging fruit, available to
a system administrator, not an in-depth modification by a software
engineer.

> > Because Open Source projects are inherently incapable of doing
> > productization work.
>
> True enough.  However, this would imply that the sort of thing
> that Nick has done is not possible.  He has demonstrated that this is
> not true.

*You've* demonstrated it, or you would just adopt his solution
wholesale.  The issue is that his solution doesn't scale nearly as
well as is possible; it only scales "much better than Open Source on
its own".

Try an experiment for me: tune the bejesus out of a FreeBSD box with
4G of RAM.  Do everything you can think of doing to it, in the same
time frame, and with the same depth of understanding, that Nick
applied to his system.  Then come back and tell me two numbers:
(1) maximum number of new connections per second, before and after,
and (2) total number of simultaneous connections, before and after.

> Yes, using open source to do this sort of thing can be difficult
> (as I am finding out), but it doesn't have to be impossible.

It doesn't have to be, that's agreed, but it takes substantially more
investment than it would cost to "brute force" the problem by building
out with multiple instances of commercial software, plus the machines
to run it.  Or the resulting system ends up being fragile.

> > $0 is not really true.  They are paying for you, in the hopes
> > that it will end up costing less than a prebuilt system.
>
> It's a different color of money.  They had already signed the
> contract stating that I would be working for them through April (at
> the earliest), before this project was dumped in my lap.  So, that's
> not anything extra.  Buying new machines, or buying software, now
> that's spending extra.

I can't imagine a business which did not run on a cost accounting
basis.  8-).

> > Contact Stanford, MIT, or other large institutions which have
> > already deployed such a system.
>
> I've already read much of Mark Crispin's writings.  I know how
> they did it at UW, and they didn't use NFS.  I've read the Cyrus
> documentation, and they didn't use NFS either.

UW is not the place you should look.  Stanford (as I said) already has
a deployed system, and they are generally helpful when people want to
copy what they have done.

> That only leaves Courier-IMAP, and while I've read the
> documentation they have available, I am finding it difficult to find
> anyone who has actually built a larger-scale system using
> Courier-IMAP on NFS.  Plenty of people say they've heard of it being
> done, or it should be easily do-able, but I'm not finding the people
> themselves who've actually done it.

If you are looking at IMAP4, then Cyrus or a commercial product are
your only options, IMO, and neither will work well enough if used
"as is".

> > Not in Open Source; Open Source does not perform productization or
> > systems integration.
>
> Therein lies the problem.  You may be able to write or re-write
> all of the open source systems in existence, but that sort of thing
> is not within my capabilities, and would not be accepted for this
> project.  They're looking askance at my modifications to procmail to
> get it to use Maildir format and hashed mailboxes -- there's no way
> they'd accept actual source code changes.
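For what it's worth, the "hashed mailboxes" part of that is the
simplest piece to explain: it is nothing more than a deterministic
mapping from the user name to a two-level directory, so that no single
spool directory grows to 100,000 entries.  Whatever hash your procmail
patch actually uses (it isn't shown here), the general shape is
something like this sketch, where the path layout is purely
illustrative:

#include <stdio.h>

/* Cheap, stable string hash (djb2); any stable hash will do. */
static unsigned long
hash(const char *s)
{
	unsigned long h = 5381;

	while (*s != '\0')
		h = h * 33 + (unsigned char)*s++;
	return (h);
}

/* Map a user to /var/spool/maildirs/<hh>/<hh>/<user>/Maildir (assumed layout). */
static int
mailbox_path(const char *user, char *buf, size_t len)
{
	unsigned long h = hash(user);

	return (snprintf(buf, len,
	    "/var/spool/maildirs/%02lx/%02lx/%s/Maildir",
	    (h >> 8) & 0xffUL, h & 0xffUL, user));
}

int
main(void)
{
	char path[1024];

	mailbox_path("someuser", path, sizeof(path));
	printf("%s\n", path);
	return (0);
}

The only point is bounding the size of each directory, which matters
a great deal on an FFS-style filesystem, where directory lookups are
effectively linear scans.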
How many maildrops does this need to support?  I will tell you if your
project will fail.  8-(.

-- 
Terry