I had a 280MB xml file containing about 120,000 records. I needed a way to parse out a subset of the records (about 15,000) and then put their data into a db.

I was developing on a VPS with only 256MB of RAM, so I wanted to avoid memory-intensive operations. I’m using Ruby, so I started looking into how to ‘stream’ the XML in such a way that I could check whether each record was the type I wanted to keep, manipulate and store its data, and move on to the next record. The more I looked into that strategy, the more complex and ominous it seemed. I really wanted a simpler way that wouldn’t require all the apparent tedium of streaming the XML file (it may seem simple enough, but there are dependencies you have to install, APIs you have to read and learn, etc). There were just too many moving pieces, and it made me uneasy about the outcome. Another aspect of it was pure laziness. I just don’t care how I get that data into the db – I just want to get it in there.

Fortunately, the glorious wonderfulness of Linux utilities saved me from a long, tedious solution. Behold:

  csplit -q catalog.xml '/<title_index_item>/' '{*}'

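This splits the large XML file into smaller files named ‘xx’ (the default prefix) followed by an incrementing number, xx10004 for example. A new file begins at every line that matches the <title_index_item> opening tag, which is the tag for the items I want. The '{*}' tells csplit to repeat the pattern as many times as it can, and -q suppresses the byte counts it would otherwise print for every file. See `man csplit` for more info…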
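To see what the split actually produces, here is a small sketch that runs the same csplit command on a made-up three-record file in a scratch directory. The sample contents (the `<catalog>` wrapper, `<title>` and `<format>` elements) are assumptions for illustration only, not the structure of the real catalog, and '{*}' assumes GNU csplit.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A made-up miniature catalog: three records, each starting with the tag we split on.
cat > catalog.xml <<'EOF'
<?xml version="1.0"?>
<catalog>
<title_index_item>
  <title>First</title>
  <format label="instant"/>
</title_index_item>
<title_index_item>
  <title>Second</title>
  <format label="disc"/>
</title_index_item>
<title_index_item>
  <title>Third</title>
  <format label="instant"/>
</title_index_item>
</catalog>
EOF

# Same command as in the post: start a new file at every opening tag.
csplit -q catalog.xml '/<title_index_item>/' '{*}'

# List the pieces and peek at the first line of the first record.
ls xx*
head -n 1 xx01
```

Note that xx00 holds whatever came before the first record (here the XML declaration and the opening `<catalog>` line), and the last piece also carries the closing `</catalog>` line, so the pieces are fragments rather than complete XML documents.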

  find xx* | xargs grep -L 'label="instant"' | xargs rm -f

The ‘find’ lists all the split files, then `grep -L` prints the names of the files whose contents do not match what I’m looking for, and `rm -f` deletes those files. I’m left with a directory of 15,000+ XML files with just the type of data I’m interested in.
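The one-liner above relies on the shell expanding `xx*` into every file name before find even starts, which may run into the system's argument-length limit when there are enough files. A slightly more defensive variant, sketched here on made-up fragments in a scratch directory, lets find match the names itself and passes them NUL-separated. The -print0, -Z and -0 options assume the GNU versions of find, grep and xargs.

```shell
# Scratch directory with three made-up fragments standing in for csplit output.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '<title_index_item>\n<format label="instant"/>\n' > xx01
printf '<title_index_item>\n<format label="disc"/>\n'    > xx02
printf '<title_index_item>\n<format label="instant"/>\n' > xx03

# find selects the xx files itself, so the shell never expands a huge glob.
# grep -L prints the names of files that do NOT contain the pattern;
# -Z terminates each name with NUL so xargs -0 can read them safely.
find . -maxdepth 1 -type f -name 'xx*' -print0 \
  | xargs -0 grep -LZ 'label="instant"' \
  | xargs -0 rm -f

# Only the fragments that contain label="instant" remain.
ls
```

It is still one pipeline; the extra flags only change how the file names travel between the three commands.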

  find . | wc -l

This command is just so I can track the progress of the operations. Obviously `ls` would return too much data, so we pipe the file list to `wc -l`, which gives us the number of lines. In this case that is roughly the number of files in the dir (`find .` also lists `.` itself, so the count is one higher than the number of entries).
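If an exact count of the split files matters, find can be told to report only those. A minimal sketch in a scratch directory with placeholder files (the file names are stand-ins, and -maxdepth assumes GNU or BSD find):

```shell
# Scratch directory with the original file plus three placeholder pieces.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch catalog.xml xx00 xx01 xx02

# Everything under the current directory, including "." itself.
total=$(find . | wc -l)

# Only regular files whose names start with xx.
pieces=$(find . -maxdepth 1 -type f -name 'xx*' | wc -l)

echo "all entries: $total, split files: $pieces"
```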

So, two lines of code on the command line instead of a far more complex Ruby/streaming-XML solution. Now I can have a Ruby script process each individual file and add the data to the db – a much simpler problem to deal with.
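The per-file step in this post is meant to be a Ruby script. Purely to illustrate the shape of that loop, here is a rough shell sketch instead: it visits each remaining file, pulls out one field, and appends a line to a CSV that a database import could read. The `<title>` element, the records.csv file name and the CSV layout are all assumptions made up for this example, and a title containing a comma would need proper quoting.

```shell
# Scratch directory with two made-up fragments; the <title> element is hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '<title_index_item>\n  <title>First</title>\n  <format label="instant"/>\n</title_index_item>\n' > xx01
printf '<title_index_item>\n  <title>Third</title>\n  <format label="instant"/>\n</title_index_item>\n' > xx03

# Start with an empty output file.
: > records.csv

# One small file at a time: extract the text of the first <title> element
# and append "filename,title" to the CSV.
for f in xx*; do
  title=$(sed -n 's:.*<title>\(.*\)</title>.*:\1:p' "$f" | head -n 1)
  printf '%s,%s\n' "$f" "$title" >> records.csv
done

cat records.csv
```

Because each file holds a single record, memory use stays tiny no matter how large the original file was.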

My point is that you can sometimes accumulate code debt by choosing the seemingly ‘right’ solution. There are probably coders who will cringe that I just used a couple of shell commands to do this rather than write up a long, well-documented, properly OOP, TDD, etc “right” solution. However, the “right” solution in that case would have accumulated code debt – more code to maintain, more moving pieces, more things that can go wrong. I’d take maintaining two lines of shell code over tens or hundreds of lines of Ruby code and their dependencies any day of the week.