
I recently had cause to search through some mail log files. I was trying to do some correlation and had gotten the transaction identifiers (TIDs) used for the individual mails I was interested in. The difficulty was that these transaction identifiers didn’t give me the initial connection information, just the specific transaction within the connection. I wanted the connection information too, so it seemed like an easy problem: use the connection identifiers (CIDs) and grep on those. This ran into trouble as well, because each connection can have multiple transactions within it, and if I grep on just the CID, I get all the transactions, not just the TID I’m looking for. That might not be a big deal for a connection with a dozen or so transactions, but some of the connections had literally thousands. Sorting through all of those other transactions was much too much of a pain to deal with. If only there was a way to tell grep “Grep for the CID in the file as long as the line a) has the specific TID I’m looking for or b) doesn’t have any TID at all.” Sadly, there’s not . . . but there is a way to do that in awk! Here are the givens I’m working with:

  • If I split on spaces, I know the CID is always the fifth bit of information and is of the form “c=CID”
  • If I split on spaces, I know the TID, if present, is always the seventh bit of information and is of the form “t=TID”
  • I’ve got the TID and CID in shell variables (likely from a surrounding ‘while read TID CID; do … done’ loop)
  • I’ve also got the name of the log file, which is compressed using gzip
  • awk is a pain in the ass and you can’t just use shell variables inside the code, so you need to somehow pass $TID and $CID into the system without being able to quote them explicitly inside the awk code
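
To make the last two givens concrete, here is a minimal sketch of the surrounding loop handing shell variables to awk with ‘-v’. The function name show_awk_vars and the sample pairs are made up for illustration; the awk program only has a BEGIN block, so it prints the variables it was given and never touches a log file.

```shell
#!/bin/sh
# Hypothetical helper: reads "TID CID" pairs on stdin, one pair per line,
# and shows what awk sees once the values are passed in with -v.
show_awk_vars() {
    while read -r TID CID; do
        # -v name=value creates an awk variable before the program runs.
        # The shell expands ${CID} and ${TID} inside the double quotes;
        # the awk code in single quotes never sees the shell variables.
        # </dev/null keeps awk from consuming the loop's stdin.
        awk -v cid="c=${CID}" -v tid="t=${TID}" 'BEGIN { print cid, tid }' </dev/null
    done
}

# Made-up pairs, purely for demonstration.
printf '5001 1001\n5002 1001\n' | show_awk_vars
```

The point is only that the prefixing (“c=” and “t=”) happens in the shell, so the awk code can compare whole fields against ready-made strings.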

So, here’s how to crack this particular nut:

   > zgrep -E "$CID" "$logfile" | awk -v cid="c=${CID}" -v tid="t=${TID}" '{ if ( $5 ~ cid && $7 !~ /^t=/ ) print $0; else if ( $5 ~ cid && $7 == tid ) print; }'

Walking through this, the ‘zgrep -E’ does the heavy lifting of uncompressing the file and getting everything that has the CID, then passing it on through STDIN. This is the easy part. The awk is where it gets tricky. We use ‘-v’ to create two variables we will use inside the awk code, ‘cid’ and ‘tid’. I made them lower case to easily distinguish between the awk variables and the shell variables. Inside the single quotes is the actual awk code, which runs once for every input line. We have a simple ‘if … else if …’ construct here (awk has no ‘then’ keyword; the statement simply follows the condition). The logic is as follows:

  • If the fifth field contains the contents of ‘cid’ AND the seventh field does not start with ‘t=’, go ahead and print the line (which is $0).  This prints out any line that contains the CID but doesn’t contain any TID (so we get the connection information for the overall connection)
  • If the fifth field contains the contents of ‘cid’ AND the seventh field is exactly equal to the contents of ‘tid’, go ahead and print the line as well (a bare ‘print’ is the same as ‘print $0’).  This prints out any line that contains the CID and has the TID we’re looking for (so we get the transaction information for the specific transaction we’re interested in)
  • That’s it
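
To see the two conditions at work without a real log, here is a small self-contained sketch. The log lines are invented for the demonstration (only the field positions match the givens above), and filter_log and sample_log are hypothetical names. One deliberate difference from the one-liner: this variant compares the fifth field with ‘!=’ rather than ‘~’, on the assumption that you want an exact CID match. With ‘~’, a CID of 1001 would also match a line carrying c=10012.

```shell
#!/bin/sh
# Hypothetical helper: $1 = CID, $2 = TID; reads log lines on stdin.
filter_log() {
    awk -v cid="c=$1" -v tid="t=$2" '
        $5 != cid   { next }          # different connection: drop the line
        $7 !~ /^t=/ { print; next }   # no TID in field 7: connection-level line
        $7 == tid   { print }         # the one transaction we care about
    '
}

# Made-up sample data: field 5 is c=CID, field 7 (if present) is t=TID.
sample_log() {
    cat <<'EOF'
Jan 01 00:00:01 mx1 c=1001 connect from=192.0.2.10
Jan 01 00:00:02 mx1 c=1001 m=1 t=5001 mail from=<a@example.com>
Jan 01 00:00:03 mx1 c=1001 m=2 t=5002 mail from=<b@example.com>
Jan 01 00:00:04 mx1 c=10012 m=1 t=5001 mail from=<c@example.com>
Jan 01 00:00:05 mx1 c=1001 disconnect
EOF
}

sample_log | filter_log 1001 5001
```

Run against the sample, this keeps the connect line, the t=5001 line for c=1001, and the disconnect line, and drops both the t=5002 transaction and the look-alike connection c=10012.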

This handily filters out all lines that are specific to other transaction IDs since, if a line doesn’t match either of the two conditions stated above, it’s silently discarded. The time spent digging into the arcana of awk to pass the variables and get the matching right was around half an hour. However, using awk (instead of grep alone) saved me at least that much time on the several connections with 1000+ transactions I ran into while looking through the logs, and it made the output cleaner, which vastly reduced the possibility of mistakes from an overflow of information. And knowing this now, I’ll be able to use it when appropriate later and save even more time. So, it’s not time spent, it’s time invested . . . and long-term investments can reap great rewards, or so I hear.