From owner-freebsd-questions Mon Jan 24 15:39: 2 2000
Delivered-To: freebsd-questions@freebsd.org
Received: from www.beastie.net (cr13646-a.lngly1.bc.wave.home.com [24.113.138.52])
        by hub.freebsd.org (Postfix) with ESMTP id 68CA014BEE
        for ; Mon, 24 Jan 2000 15:38:56 -0800 (PST)
        (envelope-from beastie@beastie.net)
Received: from [192.168.1.2] (helo=Beastie)
        by www.beastie.net with smtp (Exim 3.03 #1)
        id 12CsQj-000Ao8-00; Mon, 24 Jan 2000 14:57:53 -0800
Message-ID: <003d01bf66be$b97168e0$0201a8c0@uniserve.com>
From: "David Fuchs"
To: "Sean Heber" ,
References:
Subject: Re: Update regarding stuck file systems
Date: Mon, 24 Jan 2000 14:59:42 -0800
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.00.2615.200
X-Mimeole: Produced By Microsoft MimeOLE V5.00.2615.200
Sender: owner-freebsd-questions@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Well... just think... what happens other than your backup script at 1:00am
to 2:30am on your server?? By default the cron process starts its daily
server maintenance at precisely 1:00. I know that when I'm around the
server during this time it's making a hell of a racket... you should
probably try turning system maintenance off for one night just to see if it
solves the problem.

I know it isn't a definite answer, but it should bring you a step closer to
the solution. :)

-David Fuchs

----- Original Message -----
From: Sean Heber
To:
Sent: Monday, January 24, 2000 1:50 PM
Subject: Update regarding stuck file systems

> Ok, you may remember my previous e-mail about this a few days ago.. I
> have since done a LOT of testing. I don't have much of a conclusion
> (which is why I'm writing again).
>
> As you may recall, my system had an odd problem. If I ran my backup
> script (which tars files on one hard drive and puts them on another hard
> drive), all file system access stopped.
> So, the box would still be up,
> top would still be running on the console, but nothing would work because
> the OS couldn't seem to read from the drive.
>
> The kicker, though, is no error messages. Nothing in the logs. Nothing
> on the console. It would just stop and the processes would happily wait
> for data from the drives, but none would ever come.
>
> So, after a whole lot of swearing and Dew drinking, I have narrowed it
> down only slightly. It seems that for some reason this only happens
> around 1:00 - 2:30 AM or so. Never any other times.
>
> For example, as I write this a backup is being performed. For testing
> purposes I've been running one backup after another since 8:00 AM (3:30 PM
> now). No problems at all.
>
> I can't think of any reason why this would fail in the early morning hours
> and never any other time. It's not uptime related since just yesterday I
> had the box up and down (while testing this) and everything was going
> great. When I tried to run the backup again around 1:30AM, it died. I
> was forced to hit the reset button. Once the system came back up, I
> figured I would try to narrow things more. So, I unloaded vinum on my two
> IDE backup drives (see below), reformatted one and gave it the same mount
> point. (So the backup would still work. I don't need all that space just
> yet.) Once that was done, vinum was not loaded and I gave it another
> shot. The backup froze again. The box had only been up about 30 minutes.
>
> The first night I made the backup process, I put it at the end of my
> daily.local cron script. It runs at 1:59 or something like that. Before
> that time, the box was up for 2 days. That first night brought it down
> with a frozen file system.
>
> The night after I gave the backup script its own entry in crontab for
> 3:00AM. It worked just fine. When I woke up in the morning things still
> worked.
>
> Just the other night I changed the cron's run time to 12:05 AM. That also
> made it through the night just fine.
>
> Does any of this make any sense? It doesn't to me.
>
> I suppose I have two basic questions here:
> 1) Is there any way to make this work aside from the obvious "Don't run it
> between 1:00 and 2:30 AM"? Because this really bothers me. I have no
> idea if heavy server load would cause this to happen or if this is just a
> backup problem due to something stupid I'm doing.
>
> 2) I really need a better backup method. The idea originally was to have
> a duplicate structure on the backup drive as well as the main drive so
> that in the event of a disk failure the broken drive could just be
> unplugged. Is that reasonable? Obviously using tar the way I am doesn't
> really allow this. The catch (at least it seems like one to me) is the
> drives are all different sizes.. (see below)
>
>
> Ok, the famed "below":
>
>
> Running FreeBSD 3.3-RELEASE (I had 3.4-STABLE before. Don't ask. Long
> story. But the problem is still the same in either case.)
> SMP Kernel
> 256 MB RAM
> Dual PII-400MHz
> Currently sitting in my room with no other active users and no outside
> activity via web or anything (it's still being configured, after all)
>
> Drives:
> SCSI id6: 4.5 GB (boot: /, /usr, swap)
> SCSI id9: 9.0 GB (backup: /eddie)
> IDE bus1master: 37 GB (data: /sites)
> IDE bus1slave: none
> IDE bus2master: 25 GB (backup1)
> IDE bus2slave: 20 GB (backup2)
>
> The last two backup drives are concatenated using vinum. Mounted as
> /wowbagger.
>
> The idea is that everything on the boot SCSI drive could be on the backup
> SCSI drive, and the same for the IDE. This layout is like this because
> our original plan was to have the ability to unplug the broken drive and
> get things back up with minimum pain. But using tar sort of defeats the
> purpose--which is why I would like some more suggestions. :-)
>
> The backup script does this right now:
>
> echo "Backup /:"
> tar -cslpf /eddie/root.tar /
> echo
>
> # Backup by itself to be handy, maybe.
> echo "Backup /usr/local:"
> tar -clspf /eddie/usr.local.tar /usr/local
> echo
>
> echo "Backup all of /usr:"
> tar -clpsf /eddie/usr.tar /usr
> echo
>
> echo "Backup /sites:"
> tar -clpsf /wowbagger/sites.tar /sites
> echo
>
> Make sense? One thing I just realized, though, is that I might hit that
> famed 2GB file limit. I imagine FreeBSD is prone to this? Oh well. I
> need a better method anyway..
>
> Just so you know, here's the current df:
>
> Filesystem         1K-blocks     Used     Avail Capacity  Mounted on
> /dev/da0s1a            99183    45741     45508      50%  /
> /dev/da0s1e          3713364   507654   2908641      15%  /usr
> /dev/da1s1e          8679993  1227161   6758433      15%  /eddie
> /dev/wd0s1e         35503710   449097  32214317       1%  /sites
> /dev/vinum/vinum0   43643010   996729  39154841       2%  /wowbagger
> procfs                     4        4         0     100%  /proc
>
>
> As you can see, the partitions that are being backed up are not over 2GB,
> so that shouldn't be the problem right now.
>
> Anyway.. I'm looking for some input here. It's very very hard to make
> this problem happen. I can try all day and nothing will come of it, but
> wait until 1:30AM or so, and it happens almost(key word) every time. Is
> something deadlocking? Perhaps something to do with SMP? Or am I doing
> something terribly stupid? (feel free to flame.. I need to learn
> sometime, right? :-)
>
> I hope someone has a clue of where to start digging, at least. The last
> e-mail generated one response. The person suggested I try removing drives
> one by one from the equation. I'm going to attempt that tonight in more
> detail. The problem is, setting the clock to 1:30 AM myself doesn't
> seem to matter. Maybe it's tied to the BIOS time... Or perhaps it's not
> time related at all and just really really coincidental that it happens
> around that time all the time regardless of how long the box was up, how
> hot it is, etc.
>
> l8r
> Sean
>
> PS> ARG!!!!
> (This has been driving me nuts for the past 4.5 days now)
>
>
>
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-questions" in the body of the message