Saturday, 24 February 2007

3,2,1: My placement story

shortlists - 3
offers - 2
accepted - 1

Hyderabad = home for quite some time in the near future :-)

BTW, does anyone know a place that I can rent?

Using awk to parse HTML

Sorry for disappearing like that… we had lots of stuff happening on campus (read: our placements). And between tracking job posts, interview preps, and the interviews themselves, my blog was the last item on my priority list. Anyways, now that placements are done (and successfully for me), let me get back to the second post about working with data from the web.

The objective
One thing I did not make clear in the last post was the objective of our exercise. What we were trying to build was a model that could suggest the price for a user if (s)he decides to put up a car for sale. The model would be based on how people are currently pricing their used cars.

So, in order to build the model, we first needed some data, which we downloaded as described in the last post. The next step is to convert the data in those 17,000 html files into a single spreadsheet. Each listed car on the website should have its details on a separate record in the spreadsheet.

In this post, let's explore how to do this using awk.

A simple sample HTML
Let's take a look at a listing. The webpage looks like this:

In order to build our spreadsheet, we need all the data such as the year of manufacturing, the make of the car, the model and the style. Let’s take a look at the html source code.

For most of the variables that we’re interested in, there are specific identifiers that have been written in the html, like lblYear, lblMake and so on.

Given all such identifiers, it becomes a simple programming task to list out all the lines that match these identifiers, get rid of the tags, and bingo! we’ll have all the information that we need.

Let’s try out a simple awk script to see how we can do this.

The first awk script
I'm not going to delve into the details of how awk works. You can find more detail in the GNU awk user's guide. Yep, GNU's implementation of awk is known as gawk. Though I have used gawk for all of the examples, I've used the names awk and gawk interchangeably. The name's inconsequential; the result is the same.

Back to the job at hand: we're interested in specifying a pattern and printing out the lines that the pattern matches.
/lblYear/ ||
/lblMake/ ||
/lblModel/ ||
/lblStyle/ {
    print $0;
}
In the above snippet, we're telling awk to print the entire line (represented by $0) if it matches any of the regular expressions enclosed between the forward slashes.

If we save the above snippet into a file named "split.awk" and run the following command
C:\> gawk -f split.awk buy-used-car130625.html
We get the following output:
1.1 GS Zip Drive
Eliminating the span tags
Let’s look at the above information closely – if we can eliminate the stuff between and including the angle brackets, we’d be left with exactly the information that we need. Let’s modify the awk script we wrote earlier.
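Here's a minimal sketch of what that modification could look like, run from a Unix shell. Note that the span markup in the sample line below is an assumption about how the site wrote its html (the real pages had more around these fields):

```shell
# sketch of the modified split.awk: the clean-up is wrapped in a function
cat > split.awk <<'EOF'
function strip(s) {
    gsub(/<[^>]*>/, "", s);          # drop everything between (and including) angle brackets
    gsub(/^[ \t]+|[ \t]+$/, "", s);  # trim surrounding whitespace
    return s;
}
/lblYear/ || /lblMake/ || /lblModel/ || /lblStyle/ {
    print strip($0);
}
EOF
printf '<span id="lblYear">2002</span>\n' > sample.html   # assumed markup
awk -f split.awk sample.html    # prints: 2002
```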

The two gsub commands are global search and replace commands. They look for the regular expression (specified as the first argument) and replace it with the substitution string (the second argument; in this case, a blank) in the text specified as the third argument. Now that we've wrapped that up in a neat function, let's look at the output that this gives us
C:\> gawk -f split.awk buy-used-car130625.html
2002
Hyundai
Santro
1.1 GS Zip Drive

In a spreadsheet, please!!
The last teeny-weeny bit of stuff that we need to do is to plug the above into a single line, preferably in a comma-separated or tab-delimited format, so that it can open up in a spreadsheet program like MS Excel.

We can use a simple trick. Each time the pattern is found, instead of printing the output after stripping off the tags, we can concatenate that output to a string, which can be printed later using awk's END code block.

Here we go…
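A sketch along those lines, again with assumed span markup in the sample file: each matched field gets appended to a string, and the END block prints the whole row, tab-delimited.

```shell
# sketch of split.awk using string concatenation and awk's END block
cat > split.awk <<'EOF'
/lblYear/ || /lblMake/ || /lblModel/ || /lblStyle/ {
    gsub(/<[^>]*>/, "", $0);          # strip the tags
    gsub(/^[ \t]+|[ \t]+$/, "", $0);  # trim whitespace
    row = row $0 "\t";                # append the field, tab-delimited
}
END { print row; }                    # one record per html file
EOF
printf '<span id="lblYear">2002</span>\n<span id="lblMake">Hyundai</span>\n' > sample.html
awk -f split.awk sample.html    # prints the two fields, tab-separated
```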

And the corresponding output
C:\> gawk -f split.awk buy-used-car130625.html
2002 Hyundai Santro 1.1 GS Zip Drive
If I decide to redirect this into a file using the > operator of DOS,
C:\> gawk -f split.awk buy-used-car130625.html > file.txt
I can open file.txt using MS Excel (by using the tab-delimited option) to get something like this:

Expanding the above
The last bit that remains is to run the above technique across all the HTML files that we've generated. This can be done by writing a single DOS batch script that runs the awk command over all 17,000 downloaded HTML files.
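A sketch of such a batch script; the filename pattern and the id range here are assumptions, not the real ones from the site:

```batch
REM sketch: run split.awk over every downloaded page, one output row per car
REM (filename pattern and range are assumptions)
FOR /L %%i IN (1,1,17262) DO gawk -f split.awk buy-used-car%%i.html >> all-cars.txt
```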

Of course, since we needed more information than the simple example above, our split.awk script ran into many more lines. There were some fields that we handled using associative arrays.
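For instance, here's a hypothetical sketch (the label names and markup are assumptions) in which an associative array collects each field under its label id, so the END block can print the columns in a fixed order regardless of how the lines appear in the html:

```shell
# sketch: collect fields into an associative array keyed by label id
cat > split.awk <<'EOF'
match($0, /lbl[A-Za-z]+/) {
    key = substr($0, RSTART, RLENGTH);   # e.g. "lblYear"
    gsub(/<[^>]*>/, "", $0);             # strip the tags
    gsub(/^[ \t]+|[ \t]+$/, "", $0);     # trim whitespace
    field[key] = $0;                     # associative array, keyed by label
}
END { print field["lblYear"] "\t" field["lblMake"]; }
EOF
printf '<span id="lblMake">Hyundai</span>\n<span id="lblYear">2002</span>\n' > sample.html
awk -f split.awk sample.html    # prints year and make, tab-separated, in that order
```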

Our final code looked like this:

And our output in a spreadsheet:

Downloading awk
If you're going to use awk for the first time, I'd recommend you try it on a Linux or Unix machine. There are a whole lot of other tools that gel well with awk, using the pipe mechanism that Unix shells provide. Plus, I've still not been able to get even simple awk one-liners to run on the Windows command line. Even simple stuff like
C:\> awk 'BEGIN{ print "hello world"; }'
fails miserably. Probably the quotes or something – haven’t been able to figure it out.

Nevertheless, if you do need to use awk on Windows, it comes as part of an entire Unix toolkit. Try the zip picker – it should give you the entire set – awk, sed, the whole nine yards.

Alternatively, you could try using cygwin. It runs a bit slower, as it uses a Unix emulation layer, but it's the closest replication of a Unix environment on Windows that I've seen, and it's packed with all the unix tools one can imagine.

Additional reading
  1. The GNU Awk User’s Guide -
  2. Unix classic tools - Awk Programming -
  3. Awk man pages for Linux -

Other Comments

In response to my previous post, Punit had asked me why I didn’t use RSS and an XML parser - here's why:
  1. It didn't look like the site had been refreshing its RSS feed (I wasn't able to subscribe to it), and the XML file that's on the server contains some very old data from 2005 and about 15 records from 2007 – it was insufficient for us.
  2. RSS is typically used to publish only the latest n items that the site has listed. n could be 15, 100, or 1000, depending on the webmaster. If you want all the data that a site has, like we wanted, you'd be extremely lucky to get that off RSS. In our case, even if the RSS feed had been refreshed regularly, we would have needed to monitor it over a long period of time (probably a year) to build a sufficiently large data set.
  3. Depending on how the webmaster has configured the XML, the description may not contain all the fields that one is looking for. In our case, we wanted everything - from the price of the car to the location to whether it has power steering. The site's webmaster didn't publish all of that. So, RSS was clearly out.
awk was useful for us in this exercise because of the way the html is structured. Had the html been a little more complicated, I'd probably have had to use perl and an HTML parser module off CPAN. In the end, using awk has to be a conscious judgement call – it can give you results quickly, but you need to be careful – it can be quite painful to debug. And if you're the kind of person who doesn't believe in documenting code, you'll end up having loads of fun :-)
Our project
What I've described above was just the initial part of our project, where we gathered the data from the web. We went on to build a price-prediction model that turned out to be relatively successful. We had an error in the range of 10,000 rupees for the top 37% of the data, which I guess was fairly acceptable. Anyways, the report has now been submitted… let's wait for the grades.

Thursday, 8 February 2007

Crawling web data using "wget"

This is the first of a two-part post in which I discuss tools that we used for our project in our Business Intelligence using Data Mining course. During this project, we were to work on some real-world data to come up with some intelligent analysis using the tools we learned during the course.

Let's take a look at the website. If you try to search for all cars that are for sale, you reach a listings page. If you click on any vehicle listed there, the site shows each individual vehicle on a separate webpage.

Let's take a look at the URL. If you change the trailing number (17262) to some other number, the site takes you to a different vehicle that they're listing. For our project, we wanted to download the html files for all the cars listed on the site – we're talking about 17,000 files here. If you find yourself in a similar situation, chances are you would not want to save each page manually – even if you can save 15 pages a minute, it'd take you almost 19 hours to save 17,000 web pages. Whew!

wget provides a smarter solution. From wget's website,

    GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.

In a nutshell, it means you can open a command window, and type something like

C:\> wget

and hit enter. And wget will download that html file for you.

Now our job has become a little simpler. We can take any spreadsheet program and have it generate the numbers 1 through 17262. Use a simple search and replace to add the leading string "wget" and the trailing ".html" to each generated number. Save that into a text file. Copy the entire file into a command window, and then go and have a cup of black tea. Do make sure you add a dash of lemon. Depending on your network connection speed, you'd have downloaded all 17,000 files by the time you finish your tea.
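In fact, awk itself can generate that command list. Here's a sketch, with example.com standing in as a placeholder for the site's real base URL:

```shell
# generate one wget command per listing id; example.com is a
# placeholder (an assumption), not the site's real address
awk 'BEGIN {
    for (i = 1; i <= 17262; i++)
        printf "wget http://example.com/buy-used-car%d.html\n", i
}' > fetch-all.txt
head -n 1 fetch-all.txt   # wget http://example.com/buy-used-car1.html
wc -l < fetch-all.txt     # the file has 17262 lines
```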

Now that we have the 17,000 files with us, they are still in html. We now need to convert the data from html into a spreadsheet, so that the data for a given car lies on a single row. I'll discuss how to do exactly that using awk in my next post.

Wednesday, 7 February 2007

Misguiding content

The website of the National Institute of Smart Governance (NISG) loops through five images on its opening page – two of these images are pictures from ISB. Even on the photo gallery page, there are seven images shot at ISB.

I'm surprised that an organisation would put up photographs of facilities that are not its own. I mean, let them put up pics of the IIIT campus where they reside. Or at least, let them mention that these pics were shot at ISB.

BTW – does this count as image plagiarism?