Thursday, 8 February 2007

Crawling web data using "wget"

This is the first of a two-part post in which I discuss the tools we used for our project in our Business Intelligence using Data Mining course. For this project, we had to take some real-world data and come up with some intelligent analysis using the techniques we learned during the course.

Let's take a look at the website www.autoindia.com. If you search for all cars that are for sale, you'd reach http://autoindia.com/BuyUsedVehicle/BikeCarListings.aspx. If you click on any vehicle listed there, the site takes you to a separate webpage for that individual vehicle.


Let’s take a look at the URL - http://autoindia.com/UsedVehicle/buy-used-car17262.html. If you change the trailing number (17262) to some other number, the site takes you to a different vehicle that they’re listing. For our project, we wanted to download the HTML files for all the cars listed on the site – we’re talking about 17,000 files here. If you find yourself in a similar situation, chances are you would not want to save each page manually – even if you could save 15 pages a minute, it would take you nearly 19 hours to save 17,000 web pages. Whew!

wget provides a smarter solution. From wget’s website,

GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.

In a nutshell, it means you can open a command window, and type something like

C:\> wget http://autoindia.com/UsedVehicle/buy-used-car17262.html

and hit enter. And wget will download that html file for you.

Now our job has become a little simpler. We can take any spreadsheet software and have it generate the numbers 1 through 17262. A simple search and replace then adds the leading string “wget http://autoindia.com/UsedVehicle/buy-used-car” and the trailing “.html” to each generated number. Save that into a text file, paste the entire file into a command window, and then go and have a cup of black tea. Do make sure you add a dash of lemon. Depending on your network connection speed, you should have all 17,000 files downloaded by the time you finish your tea.
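If you’d rather not fire up a spreadsheet, the same thing can be done straight from the command window. Here’s a rough sketch (my assumption being that the listing IDs simply run from 1 to 17262), which first writes all the URLs into a text file and then hands that file over to wget:

C:\> (for /L %i in (1,1,17262) do @echo http://autoindia.com/UsedVehicle/buy-used-car%i.html) > urls.txt
C:\> wget -i urls.txt

The -i switch makes wget read its URLs from the file, one per line, so you don’t have to paste 17,000 lines into the console; any number that isn’t a live listing simply comes back as a 404 and wget moves on to the next one. If you want to go easy on the site, you can also add --wait=1 to pause for a second between downloads.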

Now we have 17,000 files with us, but they are still in HTML. We need to convert the data from HTML into a spreadsheet, so that the data for a given car lies on a single row. I’ll discuss how to do exactly that using awk in my next post.

4 comments:

Unknown said...

Don't you think picking up their RSS feeds and extracting the required information from their XML is an easier and better idea?

Chiranth Channappa said...

@punit
Yes, parsing an XML file would have definitely been easier and far more elegant than the messy awk script I wrote.

But RSS wasn't the option for us for reasons other than parsing - I'll write about that in a future post.

Thanks for your comment.

Jo said...

the second post on this subject??

Chiranth Channappa said...

@jo

sorry jo, had got busy with our placements.. but have uploaded the next post now. hope to blog more regularly henceforth.