Date: Thu, 16 Feb 2006 10:30:09 -0500
From: Chuck Swiger <cswiger@mac.com>
To: Ashley Moran <ashley.moran@codeweavers.net>
Cc: questions@freebsd.org
Subject: Re: Log analysis server suggestions? [long]
Message-ID: <43F49A81.4090203@mac.com>
In-Reply-To: <43F4951E.5090203@wmptl.com>
References: <200602161418.32982.ashley.moran@codeweavers.net> <43F4951E.5090203@wmptl.com>
Ashley Moran <ashley.moran@codeweavers.net> wrote:
[ ... ]

I'm not sure who the original poster was, but whoever is interested in this
topic might benefit by reading a thread from the firewall-wizards mailing
list:

-------- Original Message --------
Subject: [fw-wiz] parsing logs ultra-fast inline
Date: Wed, 01 Feb 2006 16:03:38 -0500
From: Marcus J. Ranum <mjr@ranum.com>
To: firewall-wizards@honor.icsalabs.com

OK, based on some offline discussion with a few people about doing "large
amounts" of system log processing inline at high speeds, I thought I'd post
a few code fragments and some ideas that have worked for me in the past.

First off, if you want to handle truly ginormous amounts of log data
quickly, you need to build a structure wherein you're making decisions
quickly at a broad level, then "drilling down" based on the results of the
decision. This allows you to parallelize infinitely, because all you do is
make the first branch in your decision tree "stripe" across all your
analysis engines. So, hypothetically, if we were handling typical UNIX
syslogs at a ginormous volume, we might have one engine (CPU/process, or
even a separate box/bus/backplane/CPU/drive array) responsible for
(sendmail | named) and another one responsible for (apache | imapd), etc.

If you put some simple counters in your analysis routines (hits versus
misses) you can "load balance" your first tree-branch appropriately using a
flat percentage. Also, remember, if you standardize your processing, it
doesn't matter where it happens; it can happen at the edge/source, or back
in a central location, or any combination of the two.
Simply design your analysis as an always 3-stage process consisting of:

- weeding out and counting instances of uninteresting events
- selecting, parsing sub-fields of, and processing interesting events
- retaining events that fell through the first two steps as "unusual"

The results of these 3 stages are:

- a set of event-IDs and counts
- a set of event-IDs and interesting fields and counts
- residual data in raw form

Back-haul the event-IDs and counts and fields and graph them or stuff them
into a database, and bring the residual data to the attention of a human
being.

I suppose if you needed to, you could implement a "log load balancer" in
the form of a box that had N interfaces, collected a fat stream of log
data, ran a simple program that sorted the stream into 1/N sub-streams, and
forwarded them to backend engines for more involved processing. You could
scale your logging architecture to very, very large loads this way. It
works for Google and it'd work for you, too.

The first phase of processing is to stripe across engines if necessary;
then, inside each engine, you stripe the processing into functional
sub-parsers that deal with a given message format. The implementation is
language-irrelevant, though your language choice will affect performance.
Typically you write a main loop that looks like:

    while ( get a message ) {
        if (message is a sendmail message)
            parse sendmail message
        if (message is an imap message)
            parse imap message
        ...
    }

Once your system has run on a sample dataset, you will be able to determine
which messages come most frequently, and you can put that test at the top
of the loop. This can result in an enormous performance boost. Each
sub-parse routine follows the same structure as the main loop, performing a
sorted series of checks to sub-parse the fields of the message-specific
formats.
I.e.:

    parse sendmail message( ) {
        if (message is a stat=sent message) {
            pull out recipient;
            pull out sender;
            increment message sent count;
            add message size to sender score;
            done
        }
        if (message is a stat=retry message) {
            ignore; // done
        }
        if (message is a whatever) {
            whatever;
            done
        }
        // if we fell through to here we have a new message structure
        // we have never seen before!!
        vector message to interesting folder;
    }

Variant messages are a massive pain in the butt; you need to decide whether
to deal with variants as separate cases or to make the sub-parser smarter
in order to deal with them. This is one of the reasons I keep saying that
system log analysis is highly site specific! If your site doesn't get
system logs from a DECstation 5000 running ULTRIX 3.1D, then you don't need
to parse that data. Indeed, if you build your parse tree around that
notion, then if you suddenly start getting ULTRIX-format log records in
your data stream, that'd be - shall we say - interesting, and you want to
know about it. I remember when I was looking at some log data at one site
(Hi Abe!) we found one log message that was about 400K long, on a single
line. It appeared that a fairly crucial piece of software had decided to
spew a chunk of its memory into a log message, for no apparent reason. This
is what I mean by "interesting."

A philosophical digression: why not use regexps? Other than the fact that
they look like modem line noise (kids these days probably don't even know
what that looks like, I bet... arrgh!), it's hard to look quickly at a
regexp and a string and tell if it'll match. So most of the log processing
approaches that I've seen resort to having a loop like:

    while ( get a message ) {
        if (regexp #1 matches) {
            sub_parse format #1;
        } else if (regexp #2 matches) {
            ...
    }

What you wind up doing is parsing the string twice or, worse, having loads
of regexps in the main loop, which makes your parse tree too broad to
execute quickly.
The worst case of this is really, really ugly, because you wind up
executing every regexp in your pattern-set against every line of the input,
which can suck quite badly if your pattern-set is large. If you have to use
regexps for this, use a pre-parse switch, i.e.:

    while ( get a message ) {
        if (message begins with "sendmail") {
            if (regexp #1 matches) {
                sub_parse format #1;
            } else
                ...
        }
        ...
    }

I think that's pretty pointless, though, since by the time you've done the
first match you can just jump straight into the sub-parse and forget the
regexp entirely. So here's a code-fragment of evil that demonstrates this
kind of technique:

------------------------
    #include <stdio.h>
    #include <string.h>

    #define B_DATEOFF 16    /* the end of the date/time stamp */

    int
    main(void)
    {
        char in[BUFSIZ * 8];

        /* hypothetical fields we want to skin out of the messages */
        char b_ho[256];         /* hostname */
        char b_prog[256];       /* program name */
        int b_nstore1;          /* stash a number */
        char b_cstore1[256];    /* stash a string */
        int p;                  /* offset */
        int r;

        while (fgets(in, sizeof(in), stdin) != NULL) {
            /* pluck the program name from the message */
            /* magic here is that p stores the match offset into the string */
            if ((r = sscanf(in + B_DATEOFF, "%s %s%n", b_ho, b_prog, &p)) != 2)
                continue;

            /* branch on the program name */
            if (strcmp(b_prog, "/bsd:"))
                continue;

            /* -------------------------------------------------------
               from here on, demux on next term(s) with individual
               matches; you should match with the case that is most
               likely to occur first, then the second most likely, etc. */

            /* RESIDUAL: dkcsum: wd0 matched BIOS disk 80 */
            r = sscanf(in + B_DATEOFF + p, " dkcsum: %s matched BIOS disk %d",
                b_cstore1, &b_nstore1);
            if (r == 2) {
                printf("found BIOS disk b_cstore1=%s b_nstore1=%d\n",
                    b_cstore1, b_nstore1);
                continue;
            }

            /* RESIDUAL: root on wd0a */
            r = sscanf(in + B_DATEOFF + p, " root on %s", b_cstore1);
            if (r == 1) {
                printf("found root on b_cstore1=%s\n", b_cstore1);
                continue;
            }
        }
        return 0;
    }

You can see how this structure could be nested a bit deeper - in fact, the
more you nest it, in terms of logical tokens left-to-right, the faster it
will run, because you have to perform fewer tests. So in the example above
you could branch on ("root" | "dkcsum") if there were more than a few
cases. This approach runs faster than hell on even low-end hardware and can
crunch through a lot of logs extremely rapidly. It has the disadvantage
that it's hard-coded, but the fact that it's a hand-constructed, hard-coded
parse tree is what makes it so fast.

So suppose you were getting 8 squadjillion firewall permit/deny log
messages and wanted to do something with them - filtering in this manner,
you could pretty easily rip statistics into memory and periodically
checkpoint counts to a BSD DB b-tree or even (if you like slow!) an SQL
database. Just don't do the updates in-line; collect statistics in-line and
checkpoint them to the SQL every minute or something like that. If your
logging system crashes, the fact that you lost a minute's worth of your log
data is kind of irrelevant, because you'll lose several minutes' worth by
the time it's down and back up. With something like a firewall that's
spewing out scads of permit/deny, your parse routine is very small, right?
It's about the size of that little program I posted up above. And if you
get something other than permit/deny - well, that's interesting, so bounce
it out to a human's attention.
The "gotcha" with this approach is that it is data-centric, and the level
of work expands linearly with the number of different types of data that
you want to deal with. So if you want to rapidly parse 400 different kinds
of log messages, you're going to write some code. Maybe a thousand lines,
by the time you're done. That's not scary!!

What _is_ scary is if you wanted to handle _all_ the variant forms of log
messages. My summer project 2 summers ago ("what I did on my summer
vacation", by Marcus Ranum) was going out to the San Diego Supercomputer
Center and plowing around in SDSC's syslogs with Abe and a couple bottles
of tequila and a ton of gummi bears. The goal of the project was to attempt
to automatically generate a parse tree engine that could recognize all the
variants of syslog messages seen. So first I wrote something that attempted
to enumerate the total # of variants seen. After it ran for a day (much
data, but the computer I was using was fast!) my analysis program concluded
there were 50,000+ different forms of syslog messages in 10 years of syslog
data. Arrrgh. That's a lot. What the hell impels programmers to keep
changing the structure of their log messages? There were lots of variants
of sendmail messages (dang it!) - the biggest culprit. Anyhow, you might
only want a few of them, so it's not like you'd need parse rules for all
50,000+ forms of message. The important part of the process is the sitting
down and seeing what you've got and deciding what you want to do with it.

It comes back to what I was saying in my last few postings on this topic:

- look at your data
- then decide what you want to collect from it
- then collect it
- count the stuff you throw away
- vector the rest to a human's attention

mjr.
_______________________________________________
firewall-wizards mailing list
firewall-wizards@honor.icsalabs.com
http://honor.icsalabs.com/mailman/listinfo/firewall-wizards

-------- Original Message --------
Subject: Re: [fw-wiz] parsing logs ultra-fast inline
Date: Thu, 02 Feb 2006 13:00:14 -0500
From: Chuck Swiger <chuck@codefab.com>
Organization: CodeFab
To: Marcus J. Ranum <mjr@ranum.com>
CC: firewall-wizards@honor.icsalabs.com
References: <6.2.0.14.2.20060201142844.06e4c5a8@ranum.com>

Marcus J. Ranum wrote:
> OK, based on some offline discussion with a few people, about
> doing "large amounts" of system log processing inline at high
> speeds, I thought I'd post a few code fragments and some ideas
> that have worked for me in the past.

Thank you, there is good stuff lurking here for those who spend the time to
think about your posting carefully. One of the most important points you've
made is that you need to pay attention and handle the common cases
efficiently, but you should also look for and notice exceptional events
which are not expected. Those should stand out from the routine information
even if they only happen once. Perhaps especially if the event only happens
once: uniqueness can be quite interesting.

I have one analysis tool to offer which can do histograms:

    http://www.pkix.net/~chuck/histogram.py

This makes a really good tool to point at a large mass of logdata to see
what kinds of patterns appear, and how often. If you've got, say, an httpd
logfile in combined format, try the following:

    histogram.py -F\" -f 2 -n 20 httpd-access.log
    histogram.py -F\" -f 6 -n 20 httpd-access.log
    histogram.py -F\" -f 2 -n 20 -P jpg httpd-access.log
    histogram.py -F\" -f 2 -n 20 -X HTTP/1.0 httpd-access.log
    histogram.py -F\" -f 2 -n 20 -X HTTP/1.1 httpd-access.log

Well, I'm sure you get the idea. Add the -r flag, and you can see the least
common events instead.

> [ ... ] If you put some simple counters in your analysis routines
> (hits versus misses) you can "load balance" your first tree-branch
> appropriately using a flat percentage. Also, remember, if you
> standardize your processing, it doesn't matter where it happens;
> it can happen at the edge/source or back in a central location
> or any combination of the two. Simply design your analysis as
> an always 3-stage process consisting of:
> - weeding out and counting instances of uninteresting events
> - selecting, parsing sub-fields of, and processing interesting events
> - retaining events that fell through the first two steps as "unusual"

This is a fine explanation of event parsing. Note that the order is not
necessarily best done in sequential fashion: i.e., while you want to weed
out common uninteresting events quickly, do the most common matches first,
and have tests matching common interesting events before you test for and
discard uncommon uninteresting events.

[ ... ]

> // if we fell through to here we have a new message structure
> // we have never seen before!!
> vector message to interesting folder;
> }

...and then you go back a week later and add matching cases for the
messages you've never seen before. Lather, rinse, repeat, until you're done
for the moment, or until a new version of the software comes out, or
someone decides to make syslogd format the date field differently.

> Variant messages are a massive pain in the butt; you need to decide
> whether to deal with variants as separate cases or to make the
> sub-parser smarter in order to deal with them. This is one of the
> reasons I keep saying that system log analysis is highly site
> specific!

I think I can tell which software has been written by people who have had
to update log analysis tools, and which has not: the former never change
logging output, especially of important messages, between minor program
versions.
Other software seems to fiddle with the output details from line to line
and from version to version. Add fields at the end, or add new log lines,
instead. Please!

[ ... ]

> A philosophical digression: Why not use regexps? Other than the
> fact that they look like modem line noise (kids these days probably
> don't even know what that looks like, I bet... arrgh!) it's hard to look
> quickly at a regexp and a string and tell if it'll match. So most of
> the log processing approaches that I've seen resort to having a
> loop like:
>
>     while( get a message) {
>         if(regexp #1 matches) {
>             sub_parse format #1;
>         } else
>         if(regexp #2 matches) {
>         ...
>     }
>
> What you wind up doing is parsing the string twice or, worse,
> having loads of regexps in the main loop, which makes your
> parse tree too broad to execute quickly. The worst case of this
> is really really ugly because you wind up executing every
> regexp in your pattern-set against every line of the input,
> which can suck quite badly if your pattern-set is large.

Absolutely - you should use static string matching instead of regexes for
your most common test cases if at all possible. You can make life easy for
your own log analysis by having your program log messages of something
like:

    EVENTTYPE: ...message...

...where "...message..." is either a human-readable string, or perhaps a
set of key-value pairs. Most of your logging messages should correspond to
the stuff the program is supposed to be doing. An individual log line
should provide enough context to be understandable, or at least
integratable with related log messages, if the logfile is truncated or
rotated, or if the log data is processed in a non-sequential fashion.
Sendmail queue IDs, or PIDs in general logfiles, are common examples of
data which can be used to make such correlations.
The closer your events are to the significant real things or changes in
state that are happening in your program, the more effective your logging
will be for the people using the software. (I only repeat the obvious
because there is a vast amount of software which misses this point.) If you
need to add debugging code to figure out what's going on in some weird or
exceptional case that you don't understand, use an event type of "DEBUG",
"INFO", "FATAL", or whatever is appropriate, followed by a human-oriented
message. Once you understand the situation better, classify it, change the
event type of the message, and add it to your parser. If you do this
consistently, you can easily process the messages your code generates into
discrete events, and you can easily separate out "normal" logging output
from the stuff that a human should review out-of-band.

[ ... ]

> What _is_ scary is if you wanted to handle _all_ the variant
> forms of log messages. My summer project 2 summers ago
> ("what I did on my summer vacation", by Marcus Ranum)
> was going out to San Diego Supercomputer Center and
> plowing around in SDSC's syslogs with Abe and a couple
> bottles of tequila and a ton of gummi bears. The goal of the
> project was to attempt to automatically generate a parse
> tree engine that could recognize all the variants of syslog
> messages seen. So first I wrote something that attempted
> to enumerate the total # of variants seen. After it ran for a
> day (much data, but the computer I was using was fast!)
> my analysis program concluded there were 50,000+ different
> forms of syslog messages in 10 years of syslog data.
> Arrrgh. That's a lot. What the hell impels programmers to
> keep changing the structure of their log messages? There
> were lots of variants of sendmail messages (dang it!) - the
> biggest culprit. Anyhow, you might only want a few of them
> so it's not like you'd need parse rules for all 50,000+
> forms of message.
> The important part of the process is
> the sitting down and seeing what you've got and deciding
> what you want to do with it.

Agreed, agreed. But it's never too late to try to become more consistent.
(Starting with not breaking things which work now.)

--
-Chuck

_______________________________________________
firewall-wizards mailing list
firewall-wizards@honor.icsalabs.com
http://honor.icsalabs.com/mailman/listinfo/firewall-wizards

--
-Chuck

PS: And you should review the usage of the code with regard to security.
You might even induce Marcus to buy you a Guinness Stout. :-)