
I’ve had a truism for a while that I write programs three times: the first time to get it working; the second time to add in the features I didn’t realise I needed the first time through; and the third time to make the program scalable and maintainable long-term.  In point of fact, I often go back several more times after I finish writing a program, but there are three major rewrites.  As an example, let’s go through a program I recently began and follow its evolution.

Before writing any code, you first have to have a problem to solve.  In this case, I was reading a conversation between folks about a change in the guest list for Dragon*Con.  Having gone down the revision control path previously, I figured it should be pretty easy to periodically query the Guest List page, parse out the guests, record them, compare the list to what we saw before and note any changes.

Let’s do this the manly way, the way God and Brian Fox intended

I’ve been doing quite a bit of shell programming, so I decided to see if I could do this in bash.  Here’s what I originally came up with:

    #!/bin/bash
    #
    # Grab the list of guests from the DragonCon site
    # Compare that list to what we've gotten previously.
    #   If there's a difference, make a note of it to a separate file with a date stamp
    #   Take current list and list of differences and put them into an html page

    # Set a righteous PATH
    PATH=/bin:/usr/bin

    BASEDIR=/home/docxstudios/web
    CURR=$BASEDIR/dcg.current.out
    PREV=$BASEDIR/dcg.prev
    DIFF=$BASEDIR/dcg.diff
    LOG=$BASEDIR/dcg.log
    HTML=$BASEDIR/dc_guest_list.html
    HR='<hr width=50% align=center>'

    # Get current list
    curl -s | grep dc_guest_detail.php | sed -e "s/^.*id=[0-9]*'> *//;s/<\/a>.*//;s/<span style='color: Red;'>//;" | sort -fu > $CURR

    # Send both stdout and stderr to the diff file (the redirect order matters)
    diff $CURR $PREV > $DIFF 2>&1

    # If there's a difference, append it to the log, then copy the current list into the 'prev' list
    if [ $? -eq 1 ]; then
      echo "Difference detected on " `date` "<br>" >> $LOG
      grep '^[<>]' $DIFF | sed -e 's/^> /Removed guest "/; s/^< /Added guest "/; s/$/"<br>/' >> $LOG
      echo "$HR" >> $LOG
      cp -p $CURR $PREV
    fi

    # Now assemble the current guests and log of guests into a proper HTML page
    echo "<html><title>Dragon*Con Guest List and difference tracker</title><body>" > $HTML
    echo "<h1>Current Guest List</h1>" >> $HTML
    echo "This is an alphabetical list of Dragon*Con guests from the <a href='http://'>Dragon*Con Guest Page</a> and is accurate as of " `date` " (when the list was retrieved)<br>" >> $HTML
    echo "If you want to see what's changed, click <a href='#changelog'>here</a> to skip down to the log of changes<br>" >> $HTML
    echo "<dl>" >> $HTML
    sed -e 's/^/<li> /' $CURR >> $HTML
    echo "</dl>" >> $HTML
    echo "<h1><a name='changelog'>Log of changes to guest list</a></h1>" >> $HTML
    echo "<i>Please note: Changes only go back as far as 15 June 2011</i>" >> $HTML
    echo "$HR" >> $HTML
    cat $LOG >> $HTML
    echo "</body>" >> $HTML
    echo "</html>" >> $HTML

The script itself is pretty utilitarian.  Use curl to grab the page, grep out the specific lines you’re looking for, strip out the extra bits and save the result in a temporary file.  Then, do a diff between what we just got and what we’d previously retrieved, saving the output in another temporary file.  If there’s a difference, do some text modification on the difference file to translate from “diff” to English, append that to a log file, then copy what we just retrieved to the file we’re going to use for comparison later on. After that, we assemble the list of guests and the log of differences into a bare-bones HTML page.
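For comparison, the same compare step can be sketched in Ruby using array difference rather than `diff` and `sed`. This is only an illustration and not part of the script above; the method name `guest_changes` and the sample guest names are made up for the example.

```ruby
# Hypothetical helper: compare the current guest list with the previous one.
# Returns the names that appeared and the names that disappeared.
def guest_changes(current, previous)
  {
    added:   current - previous,   # in the new list but not the old one
    removed: previous - current    # in the old list but not the new one
  }
end

# Made-up sample data standing in for dcg.prev and dcg.current.out
previous = ["Alice Example", "Bob Sample"]
current  = ["Alice Example", "Carol Placeholder"]

changes = guest_changes(current, previous)
changes[:added].each   { |name| puts %(Added guest "#{name}") }
changes[:removed].each { |name| puts %(Removed guest "#{name}") }
```

Keeping additions and removals in separate lists means nothing downstream has to decode the `<` and `>` markers that diff uses.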

Wouldn’t an RSS feed be nice?

So, after doing this and letting it run for a couple of days, I realised it would be much more convenient if I could simply monitor the changes via my RSS reader. So, I needed to add an RSS feed to this. Given there aren’t many RSS creation utilities I know of that would integrate seamlessly into what I’d written already, I decided to write an external utility that takes the data already generated and turns it into an XML feed.  In order to do this in a proper fashion, I needed to make some edits to my previous script.  Here’s the diff output.

    [12:12:04] > diff
    11c11,13
    < BASEDIR=/home/docxstudios/web
    ---
    > HOMEDIR=/home/docxstudios
    > BASEDIR=$HOMEDIR/web
    > UTILDIR=$HOMEDIR/dc-utils
    27c29,30
    <   echo "Difference detected on " `date` "<br>" >> $LOG
    ---
    >   LINK=$(date | sed -e 's/ /_/g')
    >   echo "<a name='$LINK'>Difference detected on " `date` "</a><br>" >> $LOG
    30a34,35
    >   # Run XML generation utility
    >   $UTILDIR/mk_rss.rb
    38a44
    > echo "Also, there's an <a href='dc_guest_list.xml'>XML feed</a> if you'd like to be notified via your XML reader when there's an update<br>" >> $HTML

I modified some of the variables to allow for a ‘dc-utils’ directory so I didn’t have to keep my executables either in my home directory or in the web directory. I add <a name=”datestamp”> anchors for each of the edits so the RSS links can point to the actual entry, not to the generic URL for the page.  To do this, I run ‘date’ and replace all the white spaces with ‘_’ (just to make the URLs a little easier for me to read when debugging things).  After the log is written, I run the ‘mk_rss.rb’ utility to generate the XML file. Finally, I update the HTML for the page to add a pointer to the RSS feed.

“But what about that ‘mk_rss.rb’ script there? What does that do?” I hear you ask.
Well, here’s the code:

    [12:48:07] > cat dc-utils/mk_rss.rb
    #!/usr/bin/ruby
    #
    # Take the dcg.log file generated from and create an
    # XML feed out of it

    require 'rss/2.0'
    require 'rss/maker'

    version = "2.0"
    max_feed_entries = 50
    rss_feed = "/home/docxstudios/web/dc_guest_list.xml"
    src_file = "/home/docxstudios/web/dcg.log"
    site_url = ''
    entries = []
    dates   = []
    DEBUG = false

    # Get the data from the source file
    File.open(src_file, "r") do |src|
      while (line = src.gets)
        if line =~ /Difference detected on\s+(.*?)\s*<.a><br>/
          dates << $1
          # Clear out the entry string so we can use it below
          entry = ""
          # I want to make this stuff grammatically correct, so add a count
          # here to see how many guests there are and, if there are more than
          # one, say 'guests have'. If there's only one say 'guest has'
          count = 0
          while ((line = src.gets) !~ /hr width=50% align=center/)
            count += 1
            line.sub!(/"\s*<br>/, ', ')
            line.sub!(/(Removed|Added) guest "/, '')
            entry << line
          end
          # Remove the trailing ', ' and turn it into a '.'
          entry.sub!(/, $/, '.')
          if(count > 1)
            entry = 'The following guests have been added: ' << entry
          else
            entry = 'The following guest has been added: ' << entry
          end
          entries.unshift(entry)
        end
      end
    end

    if DEBUG
      0.upto(entries.count - 1) do |i|
        print "== Entry #{i} ==\n"
        print dates[i] << "\n"
        print entries[i]
      end
    end

    # Count up the number of entries and record that. If it's greater than
    # 'max_feed_entries' set the record of the count to 'max_feed_entries'
    max_vals = entries.count - 1
    if(max_vals > max_feed_entries)
      max_vals = max_feed_entries
    end
    DEBUG and p max_vals

    content = RSS::Maker.make(version) do |m|
      m.channel.title = "Dragon*Con Guest list Updates"
      m.channel.link = ""
      m.channel.description = "Completely unofficial tracking of Dragon*Con Guest List modifications"
      m.items.do_sort = true  # Sort items by date

      # For each item we grabbed from 'dcg.log', add an RSS entry for it
      0.upto(max_vals) do |num|
        DEBUG and p "Doing entry #{num}"
        i = m.items.new_item
        i.title   = "Guest Updates as of #{dates[num]}"
        i.description = entries[num]
        mod_date = dates[num]
        mod_date.gsub!(/\s/, '_')
        DEBUG and p "mod_date: #{mod_date}"
        my_link  = "#{site_url}##{mod_date}"
        DEBUG and p "my_link: #{my_link}"
        i.link   = my_link
        i.date   = Time.parse(dates[num])
      end
    end  # End block from content = RSS::Maker.make(version) do |m|

    File.open(rss_feed,"w") do |f|
      f.write(content)
    end

A quick overview of what the script does.  It opens up the ‘dcg.log’ file and reads in all of the entries from it.  It does this by looking for a line containing ‘Difference detected on DATESTAMP</a><br>’.  This line indicates the beginning of an entry.  The script takes that DATESTAMP from the line and adds it to an array so we can use it later for our links.  The script then gets lines, does some text munging, and adds them to a temporary string until it sees a line containing ‘hr width=50% align=center’.  This is an indication that the entry has finished.  We take the temporary string, replace the trailing ‘, ‘ with a ‘.’, then add it to an array that we’ll use later.

Next comes some debugging output that only runs when DEBUG is set to true.  Then we have a section to truncate the number of entries to a reasonable number.  If the guest list changes too much, the RSS feed will get quite long.  
To deal with that in a reasonable fashion, I cap the number of results we will return to “max_feed_entries”, which is set earlier in the script to 50 entries.

The next section of the script is adapted from example code posted online. I create the “headers” or static information for the RSS feed, then add in each entry from the ‘entries’ array up to the maximum number we set previously. And the last thing we do is write out the RSS content to a file.
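To make the log format concrete, here is a minimal sketch of the same parsing idea written as a standalone method. It is an illustration under assumptions rather than the script above: `parse_log`, the record layout (`:date` and `:lines`) and the sample log text are all invented for the example, though the two patterns are the ones the script matches on.

```ruby
# Hypothetical sketch: split the log text into one record per change entry.
# Each record keeps the date stamp and the raw change lines that follow it.
def parse_log(text)
  records = []
  current = nil
  text.each_line do |line|
    if (m = /Difference detected on\s+(.*?)\s*<.a><br>/.match(line))
      # Start of a new entry: remember its date stamp
      current = { date: m[1], lines: [] }
    elsif line =~ /hr width=50% align=center/
      # The horizontal rule marks the end of the entry
      records << current if current
      current = nil
    elsif current
      current[:lines] << line.strip
    end
  end
  records
end

# Made-up sample in the shape the bash script writes to dcg.log
sample_log = <<~LOG
  <a name='Wed_Jun_15_12:00:00_EDT_2011'>Difference detected on  Wed Jun 15 12:00:00 EDT 2011 </a><br>
  Added guest "Alice Example"<br>
  <hr width=50% align=center>
LOG

records = parse_log(sample_log)

# Newest entries first, capped at 50 as in the script
newest_first = records.reverse.first(50)
puts newest_first.first[:date]
```

Keeping the date and its lines together in one record avoids having to keep two parallel arrays in step.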

“What a piece of junk!”

Well, she will make point five past lightspeed, but currently, this is not good code. It works and does what it should, but even without going into moving the data store from a flat file to a database and adding in some CSS to make it pretty, there are some fundamental flaws.  It’s not modular, it’s not commented well enough and it contains bugs.  Let me talk about what I mean for each of those.

It’s not modular.  The code pretty much runs from the beginning to the end.  If functionality needs to be used, it gets defined inline where it’s used.  As a primary example, the section creating the RSS contents for each entry should be in its own subroutine.  Conceptually, that’s a single idea: “Grab the entries from the array and stuff them into the RSS object”.  Actually doing that takes a few lines of code, but when I’m looking over the code I like to keep things simple: I don’t want to mix conceptual chunks together, so each chunk of code is dedicated to doing primarily one thing.  This helps when thinking about the code as well as when debugging.  If I modify a subroutine and stuff breaks, I can keep my debugging to that single subroutine, even if the problem manifests itself in another section of the code.  If the code blocks get bigger and bigger and mix several ideas together, debugging gets very complex very quickly, something I generally try to avoid.  In addition, if I want to change existing functionality and the code is in a subroutine, I can make the modification there.  If the functionality is completely new, I can write a new subroutine and add it into a standard code block to test it at that point.

When I segment out code into individual subroutines, it becomes easier to comment it.  When a subroutine fits into 10-20 lines, I can add a couple lines to describe the intent of the subroutine and a few lines within the subroutine describing how it achieves that intent.  
When the code blocks get bigger and bigger, describing what they do becomes quite the burden and, like all lazy programmers, I don’t comment as well as I should (or as well as I expect others to comment).  When I come back to the code later, I expect to be able to read the comments and use them to get back in the head space for what I was doing when I originally wrote the code.  If I haven’t commented properly, getting back into that head space becomes much, much harder and takes more time than it would have taken to comment the code in the first place.

Also, the script is buggy.  It works well enough as guests are added, but what happens if a guest is removed? Right now it doesn’t handle that well, if at all: the feed text always says guests have “been added”, even when the log line says “Removed guest”.  I’ll be updating this code to a) make it more modular, b) fix the bugginess and c) provide me the impetus to properly comment the code.  But, given how long I’ve been writing this today, I’m going to leave that for another time and I will post when I have made those updates.
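As a rough sketch of the direction described above (an illustration only, not the eventual rewrite), the description-building step could be split into small commented subroutines that also report removals. The method names `parse_change`, `sentence_for` and `describe_entry` and the sample guest names are invented for this example; the line format is the one the bash script writes to the log.

```ruby
# Pull the kind of change and the guest name out of one log line,
# e.g. 'Added guest "Alice Example"<br>' => [:added, "Alice Example"].
# Returns nil for lines that are not guest changes.
def parse_change(line)
  m = /(Added|Removed) guest "(.*)"\s*<br>/.match(line)
  m && [m[1].downcase.to_sym, m[2]]
end

# Build one sentence for a list of names, picking 'guest has' or
# 'guests have' as appropriate. Returns nil when there are no names.
def sentence_for(names, verb)
  return nil if names.empty?
  subject = names.length > 1 ? "guests have" : "guest has"
  "The following #{subject} been #{verb}: #{names.join(', ')}."
end

# Turn the lines of one log entry into a feed description that
# reports additions and removals separately.
def describe_entry(lines)
  changes = lines.map { |line| parse_change(line) }.compact
  added   = changes.select { |kind, _| kind == :added }.map { |_, name| name }
  removed = changes.select { |kind, _| kind == :removed }.map { |_, name| name }
  [sentence_for(added, "added"), sentence_for(removed, "removed")].compact.join(" ")
end

# Made-up sample lines in the log's format
lines = [
  'Added guest "Alice Example"<br>',
  'Added guest "Bob Sample"<br>',
  'Removed guest "Carol Placeholder"<br>'
]
puts describe_entry(lines)
# prints: The following guests have been added: Alice Example, Bob Sample. The following guest has been removed: Carol Placeholder.
```

Each method does one thing and fits in a few lines, so a two-line comment is enough to describe it, and a problem with removals can be chased down in `parse_change` or `describe_entry` without touching the feed-writing code.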