Saturday, 24 February 2007

Using awk to parse HTML

Sorry for disappearing away like that… we had lots of stuff happening on campus (read our placements). And between tracking job posts and interview preps and and the interviews themselves, my blog was the last item on my priority list. Anyways, now that placements are done (and successfully for me), let me get back to the second post about working with data from the web.

The objective
One thing I did not make clear in the last post was the objective of our exercise. What we were trying to build was a model that could suggest the price for a user if (s)he decides to put up a car for sale. The model would be based on how people are currently pricing their used cars.

So, in order to build a model, we needed to work up some data from, as we did the last time. The next step is to convert all of that data in the 17,000 html files into a single spreadsheet. Each listed car on the website should have its details on a separate record in the spreadsheet.

In this post, let's explore how to do this using awk.

A simple sample HTML
Let’s take a look at the listing The webpage looks like this

In order to build our spreadsheet, we need all the data such as the year of manufacturing, the make of the car, the model and the style. Let’s take a look at the html source code.

For most of the variables that we’re interested in, there are specific identifiers that have been written in the html, like lblYear, lblMake and so on.

Given all such identifiers, it becomes a simple programming task to list out all the lines that match these identifiers, get rid of the tags, and bingo! we’ll have all the information that we need.

Let’s try out a simple awk script to see how we can do this.

The first awk script
I’m not going to delve into the details of how awk works. You can find out more detail on the GNU awk guide at Yep, GNU’s implementation of awk is known as gawk. Though I have used gawk for all of the examples, I’ve interchangeably used the names awk and gawk. The name’s inconsequential, the result is the same.

Back to our job in hand, we’re interested in specifying a pattern and printing out the details of the lines that these patterns match.
/lblYear/ ||
/lblMake/ ||
/lblModel/ ||
/lblStyle/ {
print $0;
In the above snippet, we’re telling awk to print the entire line (represented by the $0) if it matches any of the regular expressions listed between the two forward slashes.

If we save the above snippet into a file named “split.awk” and run the following command
C:\> gawk –f split.awk buy-used-car130625.html
We get the following output:
1.1 GS Zip Drive
Eliminating the span tags
Let’s look at the above information closely – if we can eliminate the stuff between and including the angle brackets, we’d be left with exactly the information that we need. Let’s modify the awk script we wrote earlier.

The two gsub commands are global search and replace commands. They will look for the regular expression (specified as the first argument) and replace it with the substitution string ( the second argument -- in this case, a blank) in the text that is specified as the third argument). Now that we’ve wrapped that up in a neat function, let’s look at the output that this gives us
C:\> gawk –f split.awk buy-used-car130625.html
1.1 GS Zip Drive

In a spreadsheet, please!!
The last teeny-weeny bit of stuff that we need to do is to plug the above into a single line, preferably in a comma-separated or tab-delimited format, so that it can open up in a spreadsheet program like MS Excel.

We can use a simple trick. Each time the pattern is found, instead of printing the output after stripping off the
tags, we can concatenate that output to a string, that can be printed off later using the END code block of awk.

Here we go…

And the corresponding output
C:\> gawk –f split.awk buy-used-car130625.html
2002 Hyundai Santro 1.1 GS Zip Drive
If I decide to redirect this into a file using the > operator of DOS,
C:\> gawk –f split.awk buy-used-car130625.html > file.txt
I can open file.txt using MS Excel (by using the tab-delimited option) to get something like this:

Expanding the above
The last bit that remains is to run the above technique across all the HTML files that we’ve generated. This can be done by writing a single batch script in DOS with the awk command that we’ve used for all the 17000 HTML files that we’ve downloaded

Of course, since we needed more information than the simple example that we’ve seen above, our split.awk script ran into many more lines. There were some fields that we handled using associative arrays

Our final code looked like this:

And our output in a spreadsheet:

Downloading awk
If you’re going to use awk for the first time, I’d recommend you try it on a linux or Unix machine. There are a whole lot of other tools that gel well with awk, using the pipe mechanism that Unix scripts provide. Plus, I’ve still not been able to write simple awk scripts that can be written on the windows command line. Even simple stuff like
C:\> awk ‘BEGIN{ print “hello world”; }’
fails miserably. Probably the quotes or something – haven’t been able to figure it out.

Nevertheless, if you do need to use awk on Windows, then awk comes as part of an entire Unix toolkit available at Try the zip picker – it should give you the entire stuff – awk, sed, the whole nine yards.

Alternately, you could try using cygwin. Runs a bit slower as it uses something called a Unix emulation layer. But it's the closest replication of a Unix environment on Windows that I've seen, and is packed with all the unix tools that one can imagine.

Additional reading
  1. The GNU Awk User’s Guide -
  2. Unix classic tools - Awk Programming -
  3. Awk man pages for Linux -

Other Comments

In response to my previous post, Punit had asked me why I didn’t use RSS and an XML parser - here's why:
  1. It didn't look like has been refreshing it's RSS feed (I wasn't able to subscribe to it) and the XML file that's on the server contains some very old data from 2005, and about 15 records from 2007 - it was insufficient for us.
  2. RSS is typically used to only publish the latest n items that the site has listed. n could be 15, 100, or 1000, depending on the webmaster. If you want all the data that a site has, like we wanted, you'd be extremely lucky to get that off RSS. In our case, even if the RSS feed has been refreshed regularly, we would have needed to monitor it over a long period of time (probably a year) to build a sufficiently large data set.
  3. Depending on how the webmaster has configured the XML, the description may not contain all the fields that one is looking for. In our case, we wanted everything - from the price of the car to the location to whether it has power steering. The site's webmaster didn't publish all of that. So, RSS was clearly out.
awk was useful for us in this exercise because of the way html is structured. Had the html been a little more complicated, I’d probably have had to use perl and an HTML parser module off CPAN. In the end, using awk has to be a conscious judgement decision – it can give you results quickly, but you need to be careful – it can be quite painful to debug. And if you’re the kind of person who doesn’t believe in documenting code, you’ll end up having loads of fun :-)
    Our project
    What I've described above was just the initial part of our project, where we gathered the data from the web. We went on to build a price-prediction model that turned out to be relatively successful. We had an error in the range of 10,000 rupees for the top 37% data, I guess that was fairly acceptable. Anyways, the report has now been submitted.. let’s wait for the grades.

    1 comment:

    Anonymous said...

    Thanks for ur gawk script was very handy dude !