From owner-freebsd-chat Fri Feb 14 10:44: 8 2003
Delivered-To: freebsd-chat@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
    by hub.freebsd.org (Postfix) with ESMTP id C2BDA37B407
    for ; Fri, 14 Feb 2003 10:43:58 -0800 (PST)
Received: from wolfbert.skynet.be (wolfbert.skynet.be [195.238.3.13])
    by mx1.FreeBSD.org (Postfix) with ESMTP id A101743F93
    for ; Fri, 14 Feb 2003 10:43:56 -0800 (PST)
    (envelope-from brad.knowles@skynet.be)
Received: from riker.skynet.be (riker.skynet.be [195.238.3.89])
    by wolfbert.skynet.be (8.12.7/8.12.7/Skynet-OUT-FALLBACK-2.22) with ESMTP id h1EExLZp022199
    for ; Fri, 14 Feb 2003 15:59:21 +0100 (MET)
    (envelope-from )
Received: from [10.0.1.2] (ip-26.shub-internet.org [194.78.144.26] (may be forged))
    by riker.skynet.be (8.12.7/8.12.7/Skynet-OUT-2.21) with ESMTP id h1EEwn2o002496;
    Fri, 14 Feb 2003 15:58:51 +0100 (MET)
    (envelope-from )
Mime-Version: 1.0
X-Sender: bs663385@pop.skynet.be
Message-Id:
In-Reply-To: <3E4CB9A5.645EC9C@mindspring.com>
References: <20030211032932.GA1253@papagena.rockefeller.edu> <3E498175.295FC389@mindspring.com> <3E49C2BC.F164F19A@mindspring.com> <3E4A81A3.A8626F3D@mindspring.com> <3E4B11BA.A060AEFD@mindspring.com> <3E4BC32A.713AB0C4@mindspring.com> <3E4CB9A5.645EC9C@mindspring.com>
Date: Fri, 14 Feb 2003 15:58:09 +0100
To: Terry Lambert
From: Brad Knowles
Subject: Re: Email push and pull (was Re: matthew dillon)
Cc: Brad Knowles, Rahul Siddharthan, freebsd-chat@freebsd.org
Content-Type: text/plain; charset="us-ascii" ; format="flowed"
Sender: owner-freebsd-chat@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.org

At 1:40 AM -0800 2003/02/14, Terry Lambert wrote:

> If you are using an NFS server (which you are), then it's based on your
> ability to saturate your network device.

You're still limited by disk devices that may be used temporarily on the local server, as well as the disk devices on the other end of that network connection. Putting them on the network does not magically solve the problem that disk I/O is still many orders of magnitude slower than any other thing we ever do on computer systems.

> Disagree. These locking issues are an artifact of the system
> design (FS, application, or both).

And you have magically solved all these problems in what way?

> Simple answer: Don't use a metadata intensive storage mechanism.

So, use what -- a pure memory-based file system for hundreds of gigabytes or even multiple terabytes of storage? Even that will still have synchronous meta-data update issues with regard to the in-memory directory structure, even if those operations do take place much faster.

> In other words, the message takes up your disk space, no matter
> what.

In other words, I can protect the entire system from being taken down by a concerted DOS attack on a single user. They're going to have to work harder than that if they want to take down my entire system.

>> SIS increases SPOFs, reduces reliability, increases complexity,
>> increases the probability of hot-spots and other forms of contention,
>> and all for very little possible benefit.
>
> The only one of these I agree with is that it increases complexity.

In what way does SIS *not* increase SPOFs, reduce reliability, increase the probability of hot-spots and other forms of contention, and in what way does it magically solve all the storage problems of the system?
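
For readers following the SIS (single instance store) argument, here is a minimal Python sketch of the general idea. It is not the implementation of any particular mail server. The class name `SingleInstanceStore`, the choice of SHA-256 over the complete message as the storage key, and the sample messages are all assumptions made for illustration. Each distinct message is stored once and mailboxes hold references to it, so copies that differ by even one header line are stored separately.

```python
import hashlib


class SingleInstanceStore:
    """Toy single-instance store: one stored blob per distinct message."""

    def __init__(self):
        self.blobs = {}      # content digest -> message bytes
        self.mailboxes = {}  # user -> list of content digests

    def deliver(self, user, message):
        # Assumption: the key is a hash of the complete message (headers and body).
        digest = hashlib.sha256(message).hexdigest()
        self.blobs.setdefault(digest, message)
        self.mailboxes.setdefault(user, []).append(digest)
        return digest

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = SingleInstanceStore()
flood = b"Subject: hello\r\n\r\nSame text for everyone.\r\n"

# Byte-identical copies delivered to three users share one stored blob.
for user in ("alice", "bob", "carol"):
    store.deliver(user, flood)
identical_blobs = len(store.blobs)

# The same text with a per-copy Message-Id header is stored once per copy.
for n, user in enumerate(("alice", "bob", "carol")):
    store.deliver(user, b"Message-Id: <%d@example.org>\r\n" % n + flood)
total_blobs = len(store.blobs)

print(identical_blobs, total_blobs)
```

In this toy run the three identical copies occupy one blob, and the three copies with differing headers add three more, for four in total.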

> This discussion *started* because there was a set of list floods,
> and someone made a stupid remark about an important researcher
> indicating he was cancelling his subscription to the -hackers
> mailing list over it, and I pointed out to the person belittling
> the important researcher that such flooding has consequences that
> depend on the mail transport technology over and above "just having
> to delete a bunch of identical email".

Okay, so let's say that you've got this magical SIS which solves all storage problems, and you let your users have unlimited disk space. All it takes is someone applying trivial changes to the messages so that they are not all actually identical, and you're back to storing at least one copy of each.

Such transformations are typically found in message headers (message-ids are supposed to be unique, and combinations of date/time stamps and process ids will probably be unique, especially when taken over the entire message and the multiple hops it might have traversed). Such transformations are becoming much more typical with spam, where the recipient's name is part of the message body.

So, you're right back where you started, and yet you've paid such a very high price.

> As far as "dealing with DOS", in for a penny, in for a pound: if
> you are willing to burn CPU cycles, then implement Sieve or some
> other technology to permit server-side filtering.

We're doing that, too. However, server-side filtering can only do so much. Yes, it can eliminate duplicates that have the same message-id (although there is some risk that you'll eliminate unique messages that have colliding ids), and there is the possibility to program it so that it can actually inspect the content and eliminate additional messages that have the same message body fingerprint as previously seen. But even that can only go so far. See above.

> We also know that, for most DOS cases on maildrops, the user
> simply loses, and that's that.

True enough.
But I don't have to throw out all of my users simply because just one of them was the target of a DOS.

> Let's quit talking about the free services.

Yes, please.

> So let's limit ourselves to the realm of LWCYM - "Lunches Which
> Cost You Money".

Sounds good.

> The replication model is actually a pretty profound issue. Prior
> to replication, if you connect to one of the replicas, the message
> can be seen as "in transit". Post deletion on an original prior to
> the replication, and the deletion can be seen as "in transit". The
> worst case failure modes are that a message has increased apparent
> delivery latency, or the message "comes back" after it's deleted.

Yes, at another level, the particular replication model chosen will be important. However, at this level what we really care about is the fact that the message/mailbox is replicated, and we don't really care how.

>> That's what I was calling the "recipient system". It is the
>> system where the message was received.
>
> This is not useful to talk about in terms of a POP3 maildrop.

Sure it is. I've got limited disk space that I can afford to give each user, in accordance with the amount of money that they are paying for their service (or that is being paid on their behalf). But their local disk storage is limited only by their own budget (or the budget of their group), and is not an expense that I have to account for.

So, when defining "recipient system", it makes perfect sense that this would be the point at which the mail is accumulated into some sort of a mailbox or queue and held on their behalf, regardless of whether that mailbox/queue is downloaded/retrieved with UUCP, POP3, IMAP4, or some other protocol.

> To all intents and purposes, messages in a POP3 maildrop are
> "in transit on a point to point mail transport". That's really
> the whole point of acknowledging a "pull" technology exists, in
> the first place.

Yes, there is another component to the system, which comprises the system of the end user, their bandwidth to the server that holds their mail, etc. But this is not the "recipient system". This is the "end-user system". It's an important system in the overall scheme of things, but is different from the one we're talking about -- they manage their own end-user system, but I manage the recipient system(s).

> The majority of that latency is an artifact of the FS technology,
> not an artifact of the disk technology, except as it impacts the
> ability of the FS technology to be implemented without stall
> barriers (e.g. IDE write data transfers not permitting disconnect
> ruin your whole day).

Again, I'd like to know where you get this magic filesystem technology that solves all disk I/O performance issues and makes them as fast as a RAM disk, while also being 100% perfectly safe.

> Unless I can use someone else's stored copy of the message to
> recover my corrupted stored copy of the message, that's not
> replication, it's duplication.

Correct. But with only ~1.3 recipients per message (on average), there isn't much duplication to be had anyway. The whole replication issue is a different matter.

> The reason I brought up SIS again is that you seemed more than
> willing to let a message sit in the main mail queue, but almost
> panicked at the idea of throwing it into the user mailbox instead.

No, I don't panic "...at the idea of throwing it into the user mailbox...". I have defined queueing & buffering mechanisms that function system-wide, which help me resist problems with even large-scale DOS attacks, and help ensure that all the rest of my customers continue to receive service even if a single user has an overflowing mailbox. But it's easier to solve this problem at the system-wide level where I can allocate relatively large buffers, as opposed to inflicting it on the end user and letting them try to deal with it across their slow dial-up line (or whatever).
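
As a rough illustration of the approach described above (per-user limits, with overflow absorbed by a bounded system-wide buffer instead of by the individual mailbox), here is a small Python sketch. The actual mechanisms are not described in this message, so everything here is hypothetical: the names `QUOTA_BYTES`, `SYSTEM_QUEUE_LIMIT`, and `deliver`, the numeric limits, and the three-way stored/deferred/refused outcome are assumptions chosen only to show the shape of the idea.

```python
from collections import deque

QUOTA_BYTES = 5 * 1024 * 1024   # assumed per-user mailbox quota
SYSTEM_QUEUE_LIMIT = 1000       # assumed cap on deferred messages, system-wide

mailbox_usage = {}              # user -> bytes currently stored
deferred = deque()              # system-wide queue of (user, size) held for retry


def deliver(user, size):
    """Return 'stored', 'deferred', or 'refused' for one incoming message."""
    used = mailbox_usage.get(user, 0)
    if used + size <= QUOTA_BYTES:
        mailbox_usage[user] = used + size
        return "stored"
    if len(deferred) < SYSTEM_QUEUE_LIMIT:
        # Over quota: hold the message centrally instead of growing the mailbox.
        deferred.append((user, size))
        return "deferred"
    # The shared buffer is full as well: push back on the sender.
    return "refused"


# A flood aimed at one user fills only that user's quota and the shared queue.
results = [deliver("victim", 1024 * 1024) for _ in range(8)]
other = deliver("bystander", 1024 * 1024)
print(results, other)
```

In this run the flooded mailbox stops growing at its quota, the excess lands in the shared queue, and delivery to the other user is unaffected.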

> Nope; I want to do it to get you to agree to turn off quotas,
> if your business model is not based on the idea that it's OK
> to drop email into /dev/null for customers who don't pay you
> more money.

Bait not taken. The customer is paying me to implement quotas. This is a basic requirement. Moreover, even if it wasn't a basic requirement, I'd go back to the customer and make sure that they understood that they're placing the entire mail system for all thousands of users at risk if there is a single mail loop or a large DOS attack on a single user, where I have better tools to constrain these issues at a system-wide level. If they still said that they didn't want quotas, then I'd let someone else build the system for them -- I wouldn't want my name on it.

I don't drop the stuff in /dev/null. I just put some limits on things so that I've got brakes that will automatically kick in and start slowing the train down if there is an excessive overspeed problem for an excessive period of time.

> FS design issue. And metadata updates in FreeBSD (with soft
> updates) or SVR4.2 or Solaris (with delayed ordered writes) are
> *NOT* synchronous, they are merely ordered.

Well, we're not talking about FreeBSD. I wish we were. However, I can assure you that UFS+Logging definitely has synchronous meta-data update issues -- making them ordered or putting them into a commit log and doing them in larger chunks does not eliminate them.

Fortunately, in this case I have architected the system so that we shouldn't run into those problems very often. However, there's nothing I can do about synchronous meta-data issues with the network & filesystem implementation of the NFS server, and any related problems with the NFS client.

> You limited my options to Open Source, however.

Because there is no additional money to spend, open source is really the only practical choice.
However, neither UW-IMAP nor Cyrus will work on NFS, thus leaving us with either the complete Courier package, or just the Courier-IMAP component.

> Maildir is a kludge around NFS locking. Nothing more, and nothing
> less.

Yup. And I'm convinced that it introduces more problems than it solves. But I still don't have much choice.

> MS Exchange does, and so does Lotus Notes. I know they suck, but
> they are examples.

They're not IMAP servers. They are proprietary LAN e-mail systems that may happen to have an interface to this alien IMAP protocol.

>> Nope. mmap on NFS doesn't work.
>
> Who's using mmap?!?

Cyrus. All those databases it keeps to help inform it what the status is of the various messages, etc., are using mmap to access the information inside the database files. Or are you not familiar with the method of operation of tools like Berkeley DB?

> This is interesting to know; from the documentation available,
> they imply they scale, and a single instance of one seems to
> match their claims for a single instance. I guess it's always
> worse than the marketing literature, when you deploy it. 8-(.

Actually, the Netscape/iPlanet e-mail server is just a re-badged SIMS, which is itself a partial port of PMDF from VAX/VMS to Unix, which was a port of the original MMDF from Unix to VAX/VMS.

While I have a lot of respect for PMDF and the work that Innosoft did, we know from practical experience that SIMS can't scale beyond ~60,000 POP3 users with 5MB mailbox quotas, if you're using a Sun Enterprise 5500. At that point, if you want to add any new users, you must first delete some old ones.

Belgacom Skynet bought a small ISP co-op in southern Belgium that was using SIMS as their mail system, and one of the reasons they were selling themselves to us was the fact that their mail system couldn't scale.
We moved their users over to a system on a Sun E420R with an external Comparex D1400/Hitachi Data Systems DF400 RAID array which was already serving several hundred thousand users, and we didn't even notice.

SIMS and Netscape/iPlanet mail server are dead-end products. Scott McNealy was very unpleasantly surprised when the Sun Europe guys sprung SIMS on him, and it is definitely going the way of the dodo. Note that Sun is a major investor in Sendmail, Inc. and they have on their payroll one of the key members of the Sendmail Consortium.

> 40 seconds to transfer on a Gigabit ethernet... assuming you can get
> it off the disks. 8-). Do you really expect them all simultaneously?

Not a one of these machines has Gigabit Ethernet. They all have 100Base-TX FastEthernet, and the front-end machines may also have a second 100Base-TX FastEthernet interface (if I can scrounge a couple of NICs).

The big problem is that most of the users will also have 100Base-TX FastEthernet. It won't take too many of them trying to access the server at once to completely swamp it.

> You don't need to assert a lock over NFS, if the only machine doing
> the reading is the one doing the writing, and it asserts the lock
> locally (this was more talking about the Cyrus cache files, not
> maildir).

This assumes that there is only one machine ever writing to a particular mailbox. This is not a valid assumption.

-- 
Brad Knowles,

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety."
    -Benjamin Franklin, Historical Review of Pennsylvania.

GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message