Let's take a look at the website www.autoindia.com. If you search for all the cars on sale, you land at http://autoindia.com/BuyUsedVehicle/BikeCarListings.aspx. Click on any vehicle listed there, and the site shows that vehicle on a separate web page with a URL like http://autoindia.com/UsedVehicle/buy-used-car17262.html. If you change the trailing number (17262) to some other number, the site takes you to a different vehicle in its listings. For our project, we wanted to download the HTML files for all the cars listed on the site – we're talking about 17,000 files here. If you find yourself in a similar situation, chances are you would not want to save each page manually – even at 15 pages a minute, it'd take you nearly 19 hours to save 17,000 web pages. Whew!
wget provides a smarter solution. From wget's website:
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
In a nutshell, it means you can open a command window, type something like

wget http://autoindia.com/UsedVehicle/buy-used-car17262.html

and hit Enter. And wget will download that HTML file for you.
Now our job has become a little simpler. Take any spreadsheet software and have it generate the numbers 1 through 17262. Use a simple search and replace to add the leading string "wget http://autoindia.com/UsedVehicle/buy-used-car" and the trailing ".html" to each generated number. Save that into a text file, paste the entire file into a command window, and then go and have a cup of black tea. Do make sure you add a dash of lemon. Depending on your network connection speed, you'd have downloaded all 17,000 files by the time you finish your tea.

Now that we have the 17,000 files, they are still in HTML. We need to convert the data from HTML into something spreadsheet-friendly, so that the data for a given car sits on a single row. I'll discuss how to do exactly that using awk in my next post.
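As an aside: if you'd rather stay in the shell than fire up a spreadsheet, the same list of commands can be generated in a couple of lines. This is a minimal sketch, assuming the URL pattern shown in the example above (buy-used-car1.html through buy-used-car17262.html); adjust the range and pattern for your own case.

```shell
#!/bin/sh
# Generate one wget command per listing into a text file,
# mirroring the spreadsheet-plus-search-and-replace approach.
# "&" in the sed replacement stands for the matched number.
seq 1 17262 | sed 's|.*|wget http://autoindia.com/UsedVehicle/buy-used-car&.html|' > commands.txt

# Alternatively, generate just the URLs and hand the whole list
# to wget, which downloads them one after another:
seq -f "http://autoindia.com/UsedVehicle/buy-used-car%g.html" 1 17262 > urls.txt
# wget -i urls.txt
```

Either file can then be run (or pasted) in one go, and the tea break proceeds as planned.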