Monday, 23 July 2007

long time, no rant

So here's one.

Why is it that some people have to put up HP#7 spoilers in their google talk status message? Do they want to show off that they are speed readers, or that they had been reading spoilers all along, or do they just want to ruin the read for the people on their contact list for the fun of it?

I'd been ignoring the HP fan sites for months, just in case I read a spoiler; and no, I did not read TOI on Sunday, the 22nd either. But (and here's the big mistake) I logged into google talk this morning, and boom - now I know how the story will unfold.

Thanks for ruining my reading, mate.

Friday, 4 May 2007

Ladakh Snaps

Here are the Ladakh snaps!!
There are a few more snaps in this web-album.. and the story follows soon, after I move to biryani-land :-)

Tuesday, 17 April 2007

Graduation Snaps

Here are some snaps from our convocation rehearsal, the class photo-session and the convocation ceremony.

More snaps here


the stage is set


shine on you...


crossing swords

with my parents

our class is too big for a single snap

walking in to the makeshift auditorium...

... and Mr Tee poses for the camera

Back to Term-2: A crash course in macroeconomics

Graduated!!

oops! I threw my hat too high!

hat throwing: take-2

waiting for the snaps to be burnt to CD... didn't anyone study Service Ops Management?

ISB's class of 2007: the final handshake

Wednesday, 11 April 2007

Placement Story

While watching TV last night, I came across CNN-IBN playing this news snippet:




To quote from the text version on CNN-IBN's site:
Five students from the prestigious Indian School of Business have bagged international jobs worth Rs one crore annually with the highest clocking Rs 1.2 crore.
Five crorepatis? Well, sure 416 is a big class, but I didn't know it was big enough for me to miss meeting (or even knowing) these five classmates of mine.


Update [17 April 2007]
After intense speculation, some friends of mine & I did come up with five possibilities who could have hit the one-plus mark. So I (probably) do know these five classmates of mine.

Comment on an earlier post

I received a pretty strongly worded comment on one of my earlier posts, titled Tale of two networks. I thought it deserved the attention of a full post of its own.

Here goes the comment by Mr Anonymous (or maybe Ms Anonymous):
Your stand about viral marketing in Tagged.com is a clear stupidity. It's still at your own discretion whether you want to send invitation to the list in your addressbook.

Haven't you wonder that it just makes easier for you to send invitation to your friends. If you think about blogging something , think twice before posting it.
Hmm... it looks like I touched someone's nerve there.

OK, Mr Anonymous, please look at another post of mine - First Gazzag, now Tagged & Facebox. Now you say it is at the user's discretion whether to send invitation to the list in his/her addressbook. Then please explain to me how to avoid the screen that I've listed in Step 3 which not only asked me for my hotmail password, but also refused to allow me to go ahead unless I typed the password in. It takes quite a bit of ingenuity to bypass this step - something that the average hotmail/yahoo user would not be able to do - therefore it is no longer at the user's discretion as to whom to invite.

Your last question was whether it doesn't just make it easier for me to send invitations to my friends. No, it does not. It just adds to the spam in my friends' mailboxes. And I detest spam, especially if it goes out in my name without my explicit intention.

Graduated

We graduated on a sunny Saturday morning under a lot of glare. Proud parents, smiling guests, excited faculty - it must be the most memorable day for most of my classmates.

Huge posters saying "Graduation - Class of 2007" hung on either side of the door of a makeshift auditorium. And when we entered it, we were welcomed by a roar that still rings in my ears. I'd seen a lot of videos where a person walks through an eagerly waiting crowd, and when I experienced that for myself, I did feel like a celebrity, even though it was for a tiny fleeting moment.

Well, I am an MBA now. I did enjoy the one year I spent at b-school and learned a few things along the way. Maybe I could have learned a lot more. Nevertheless, I'm happy, and my parents are proud.

Tuesday, 3 April 2007

Going offline

My laptop loses network connectivity tomorrow. So, I guess this'll be my last post from ISB. We have our convocation this weekend.. finally I'll be an MBA. Reminds me, I ought to change my "about me" description.

What's next for me? A few weeks in Bangalore; then about five days in Leh (really looking forward to this one); another week in Bangalore; and then I return to full-time employment in Hyderabad.

Saturday, 31 March 2007

Project Discotheque

Here's a snap I had posted to section C about a classmate of ours.


The person on the right once acted in a movie called DiscoDancer. The one on the left used to work for Cisco. We let our logic take a hop-step-and-jump from these facts, and end up naming the person on the left as Disco (it doesn't matter what Disco's real name is - don't ask, and I won't tell).

While the name Disco is reasonably popular within our class at ISB, I am getting worried that his name isn't getting recognised among our alumni network. What would happen if subsequent batches at ISB don't recognise him as Disco? It'd be such a shame... we're talking about a missed legacy here.

So, here is my prescription for correcting this problem. Let's do a googlebomb :-)

To quote wikipedia,
Because of the way that Google's algorithm works, a page will be ranked higher if the sites that link to that page use consistent anchor text. A Google bomb is created if a large number of sites link to the page in this manner
Currently, if you google with the search phrase Disco + Cisco + ISB, Disco's blog shows up as the second search result.


If you just search for Disco + ISB, his blog is the 21st.

To me, these search results are unacceptable.

The objective of Project Discotheque is as follows - push Disco's blog higher in the search rankings for the phrase "Disco" :-) Within a week of this announcement, Disco's blog should be the first search result for the phrase "Disco Cisco ISB" on google, yahoo and msn search. Within a month, the blog should be the first search result for the phrase "Disco ISB".

So, if you're interested in helping out, here's what you can do. If you are a blogger, or you maintain your own website, please link to http://isb2007.blogspot.com/ using "Disco" as the anchor text.
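
For anyone who hasn't hand-written a link before, here is a small sketch of what that looks like. The helper name make_link is made up for illustration; the URL and the anchor text are the ones mentioned above.

```sh
# Hypothetical helper: print an HTML link for a given URL and anchor text.
make_link() {
  printf '<a href="%s">%s</a>\n' "$1" "$2"
}

# The snippet to paste into a blog post or web page.
make_link "http://isb2007.blogspot.com/" "Disco"
```

The text between the opening and closing tags is the anchor text that the search engines look at.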

Oh, and if you have any other suggestions, please do let me know.

Friday, 30 March 2007

done with it

This time last year, I was dreading my last day at work. And today, I'm done with my last examination at ISB; nothing left but the convocation next weekend.

Yep, the roller-coaster is over.

And the emptiness will sink in pretty soon.

Thursday, 29 March 2007

A Tip

For the incoming batch at ISB, here's a tip:

Beware of the funnel.

But if you are really adventurous, then maybe...

Monday, 26 March 2007

Losers' Lessons from World Literature

Here's a small sequence of events in the book Asterix and the Chieftain's Shield

The defeat of a proud Gaul


The place of defeat gets a bad name

..and the attitude sticks, even among the commoners
In the legend, it says "An attitude which has persisted down the centuries, with the result that scene of the Gauls' defeat by Caesar is still unknown. A regrettably chauvinist state of affairs."

It's a trifle sad the Caribbean Islands are already well-charted territory, else we could forget about the battles of Trinidad & Tobago just as easily as the Gauls forgot Alesia.


Wednesday, 21 March 2007

Why did India lose to Bangladesh?

Question: why did India lose the opening match against Bangladesh?
Answer: because this is how they spent their practice session.



Scarcity & Spam

Question: How does a real-estate broker try to close a deal faster?
Answer: He creates a perceived scarcity in the mind of the buyer. As soon as he has explained the nuts and bolts of the deal, you can be sure the first thing he's going to say is "I've got two other buyers who're looking to lease this place." That the demand is perceived to be greater than supply will automatically make the buyer willing to pay a higher price. Of course, knowing about this little gimmick of real-estate brokers doesn't help much - we are not 100% rational.. we somehow can't seem to account for this gimmick.

Real-estate is an essential, going by the roti-kapda-makan philosophy. Others may have lists longer than that - you could count education, telephone connections, maybe even Internet connections. But why would anyone consider orkut accounts as an essential? Well, let's take a look at this e-mail I received:
from: (name hidden)
reply-to: (email hidden)
to: my friends
date: 20-Mar-2007 23:05
subject: Message from Orkut
mailed-by: orkut.bounces.google.com

HEY ITS DIANNA, FROM THE DIRECTOR OF ORKUT,EVERYBODY SORRY FOR THE INTERRUPTION BUT ORKUT IS CLOSING THE SYSTEM DOWN BECAUSE TOO MANY BOTTERS ARE TAKING UP ALL THE NAMES, WE ONLY HAVE 57 NAMES LEFT, IF YOU WOULD LIKE TO CLOSE YOUR ACCOUNT, DONT SEND THIS MESSAGE, IF YOU WANT TO KEEP YOUR ACCOUNT ,SEND THIS MESSAGE TO EVERYONE ON YOUR LIST. THIS IS NOT A JOKE, YOU'LL BE SORRY IF YOU DONT SEND IT. THANKS DIRECTOR OF ORKUT, TIM BUISKI. WHOEVER DOESNT SEND THIS MESSAGE, YOUR ACCOUNT WILL BE DEACTIVATED AND IT WILL COST YOU $ 10.00 A MONTH TO USE IT.
This message was sent to you by xxxx. To see xxxx's profile click: http://www.orkut.com/Profile.aspx?uid=xxxxx

For a moment, let's forget that this is a hoax. Let's also forget that Tim Buiski has been "the director" of bebo.com, yahoo.com and msn, before taking over as director of orkut ;-)

Now, the underlying principle of this hoax was scarcity. The email speaks of something that is running out, and the reader is somehow able to connect with that. But why wouldn't the sender stop to consider whether the item being spoken of is an essential at all? It doesn't really matter if my orkut account expires - it doesn't really affect me. And I can hazard a guess that most of my friends don't treat their orkut account as an essential.

So, question for the day: why do hoaxes thrive? And why do they thrive as much on the Internet?

If you have any pointers, please leave a comment. I shall try to collate them and write down my own thoughts in a future post.

Monday, 19 March 2007

We were discussing cricket with Prof Montealegre at a small get together that we had last Thursday. The conversation went something like this:

Prof Montealegre: So, I hear it takes a whole day to play a game of cricket.. that's quite fantastic!
ISBian Cricket Buff #1: Sir, there is another format that lasts five days.
Prof Montealegre: What! Oh.. that's amazing! So, which are the good teams?
ISBian Cricket Buff #1: Australia is the best. South Africa's good. India is good too. Pakistan are alright, but they play their best against India.
Prof Montealegre: Really?
ISBian Cricket Buff #2: Yeah, Indo-Pak games are very sought after. The roads are empty when there's an India-Pakistan match. Everyone takes off, nobody works.. and there are firecrackers if we win.
Prof Montealegre: Hmm.. very interesting. So, when is India's first game?
ISBian Cricket Buff #1: Day after tomorrow. We're playing Bangladesh.
Prof Montealegre: Okay, are they a good team?
ISBian Cricket Buff #2: Well, they're new, they have a lot to learn. They'll learn from us.
Prof Montealegre: Hmm.. nice, nice.

I hope the Professor has followed up on the latest.

Thursday, 15 March 2007

Tale of two networks

I think I'm getting slightly obsessed with this topic of web-based social networks, so I solemnly swear this'll be my last post on this subject. For a while, in any case :)

Some people believe that it takes a maximum of six people to link up any two people in the world. This is a familiar theory called the six degrees of separation. Mathematically, this makes sense. Let's say every person on earth knows 500 people. I don't think 500 is too high a number - I have 472 people on my orkut list, which includes seven people I haven't ever met (they could be fake for all I know), but I think I know more people who don't have an orkut account. The 500 number means that through six other people, I could theoretically form links with 500^5 = 31,250,000,000,000 people, which is definitely greater than the population of the earth. Yes, there might be repeats, but there's a good chance I could cover everyone. At least that's what the theory says.

But here's where I disagree - there might be some clusters of people so cut off from the rest of the world that they might not have a link at all. This is especially true of tribes that have been untouched by other civilised humans.. so it's difficult for me to believe in the Six Degrees theory outright. Nevertheless, if you keep increasing the degrees - from six to eight to maybe 15 - you are bound to connect most humans on the planet. For me, the more interesting fact about the "Six Degrees of Separation" theory isn't the number six, but that most of us are connected to each other, most of us know each other. At least indirectly.
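
A quick check of that arithmetic, as a rough sketch only. It assumes every person knows exactly 500 others with no overlap at all, and it takes the world's population as about 6.6 billion, which is a round figure assumed for illustration rather than anything measured here.

```sh
# Reach at each degree if everyone knows `friends` people and nobody overlaps.
# Both numbers passed in with -v are rough assumptions, not data.
reach_table() {
  awk -v friends=500 -v population=6600000000 'BEGIN {
    for (degree = 1; degree <= 5; degree++) {
      reach = friends ^ degree
      printf "degree %d: %.0f people%s\n", degree, reach,
             (reach > population ? " (more than the world)" : "")
    }
  }'
}

reach_table
```

Under these assumptions the count overtakes the world's population well before the fifth degree, so it is the repeats, not the exponent, that decide whether everyone really gets covered.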

Let's draw a parallel to this in the Internet space - quite arbitrary, but please bear with me. Let's say that email addresses are the equivalent of humans. Well, they are born, they live and then they die. And while they're living, they contact other email addresses. How many degrees separate any two email addresses in the world? Maybe six, maybe more, maybe infinite (you do have closed-loop intranets where all email addresses are internal to the network - pretty much like the tribes that have been left in isolation). If I assume the equivalent of knowing another person is an email sent from one email address to another, how many degrees would separate any two email addresses on the Internet?

Here's what I'm getting at. There are two networks - one of humans and one of email addresses. These two networks are separate, they are distinct, and they should not be confused for each other. If I am serious about Web 2.0 in any form, I have to draw a distinction between Internet users and the means they adopt to communicate on the Internet.

What orkut, yahoo! 360 and Linked in have done is to differentiate between the two networks. On Linked in, you don't invite an email address to hook up with you, you invite a person to join your circle. Even on orkut, if I ignore the malicious bots created by crackers (crackers, not hackers), it is people that I am connecting with. When the programmers behind a system forget that they're trying to connect people and not email addresses, you end up with a system like tagged.com. The website starts trying to find more and more email addresses to send spam invitations to, in the vain hope that viral marketing will work. Well, sir, it doesn't. If gazzag.com were to get hold of my addressbook and send invitations to everyone on it, they'd send invites to email addresses like that of the placement office at ISB. They wouldn't be inviting the people working in the placement office, but the placement office itself - now what kind of a social network is that?

Lesson for running a web 2.0 site that bases its growth on viral marketing - let the users decide whom to invite and whom not to. That would be social; cracking through users' addressbook is not.

Monday, 12 March 2007

First Gazzag, now Tagged & Facebox

The site www.tagged.com is a new social networking site. I haven't been tracking its popularity, but the site claims to be the fastest growing teen social community on the Internet. From the corporate information available on the site,
Tagged.com is the premier social networking destination for the Millennial Generation and an ideal place for advertisers who are trying to reach the teen market.
OK, so that tells me (and my friends) to stay away from the site.. we've not been teen-agers for quite some time now. But there are bigger problems with this site than that.

A few months back, I had written the articles de-gazzaging my inbox and more on gazzag about a popular web nuisance that called itself Gazzag. I think tagged.com is just another example of such Internet annoyances. Let's walk through tagged.com's registration process to see why:

Step 1
OK, here's the website. It does look a whole lot uglier on Internet Explorer with some nasty banner advertisements. Anyway, let's create a new account on this.

Step 2
Yep, filled in my details. But take a look at something that I've done - I specified my date of birth as the 31st of February, 1978. Not only did the website allow the non-existent date, it didn't reject me even though I am not a teenager. I'm wondering what happened to their tagline of being a teen social community.

Step 3

Whoa! Give me a minute.. why does Tagged.com want my hotmail password?!? Yep, one reason only - to get my addressbook, so that it can spam all of my friends with invitations to Tagged.com. Another example of viral marketing in which the existing users don't get a say in who gets contacted. What's more, there doesn't seem to be a straightforward way to get around this screen. I guess the naïve user would not look to bypass this screen and would simply provide their hotmail password. While I didn't check if the tagged.com guys have devised a mechanism to import contacts from yahoo and gmail, I won't be surprised if they have.

The same problem exists on the website facebox.com, another social networking site. When you register on this, there does not appear to be any way to create your account without importing contacts from your e-mail address books.


But if you click on the "by email address" option, you'd get the option to bypass this step:

Nasty, isn't it?

Let me re-iterate what I had written earlier about gazzag and other sites that ask for e-mail passwords:
... The tool that allows one to import contacts from yahoomail & orkut requires one to enter the username and password of those sites. And a lot of my friends, including a few tech savvy ones, have done exactly that. My question is why??

Would you trust me with your yahoomail password? I think not, I hope not. Then why would one trust gazzag or any other site asking for your yahoomail password? ...
Tagged & facebox are just a few more sites that are trapping users into sharing their addressbooks. But the point is that it could get worse.. what if someone in Tagged.com's office is reading your e-mail? Their privacy policy will probably mention that they won't - but how do you know they're going to stand by that? Anyways, the next time a site asks you for your hotmail/gmail password, please think about whether you'd like to share your password with me.

If you've sent me a tagged or a facebox or a gazzag invitation recently, I hope you'll understand why I've not responded.

Friday, 9 March 2007

Work-ex for an MBA at ISB

As an ISBian, I keep getting queries from potential applicants whether their profile is good enough to be selected at ISB. One of the more worrying aspects for many aspirants is the amount of work experience they need before they can apply. ISB has a policy that reads
Preferably two years of full-time post qualification work experience.
We'll come back to my interpretation of the "preferably" in this statement in a short while; let me look at the aspirants first. Some of these people are really young; it's not rare to come across people in their final year of graduation who go - "How should I spend my next two years before I apply to ISB?". Whew! talk about focus! Do you know why you want to do an MBA, I ask. And the reply usually starts with "ummm..."

Back to the subject - the number of years of work-ex is just one method of quantifying how much you have worked. The other method to measure the work-ex, perhaps not quantitatively, is to look at one's résumé. When one writes one's résumé, it takes a single line to mention how many years one has worked. What about the other lines? What is the story that they tell? Do they speak of leadership that raised the value of your organisation? Do they speak of a benchmark you set for yourself, or for others? Do they speak of knowledge? Do they say that you chased your professional desire? Well, what do they speak of?

At the end of the day, when ISB's admission committee receives a huge bunch of profiles, it's the résumés that are going to get compared (yes, there are some other things that get compared too - a GMAT score, recommendations, essays, yadda yadda yadda - but let's stick to CVs for now). Naturally, a person who has worked for long would have a greater chance of having a richer résumé than one who's worked for say, just a year.

I also came across this article earlier today: Opposing the Youth Movement. It justifies business schools seeking only those who have lots of work-ex. Essentially,
The whole concept of an MBA program is based around peer-to-peer learning—people coming with diverse experiences into a structured framework, where they're learning potentially how to run a business. So I don't feel that people who have experience are going to gain as much from [a program that admits undergrads]
This is probably a good description of the thought process running at ISB.

Right, so what can an ISB aspirant draw from all that's written above? Here would be my quick list:
  1. Forget an MBA - focus on your career. When the time to think about an MBA comes, you will know.
  2. Pick a career that gives you exposure to multiple aspects of business. B-schools value diversity all the way - in fact, a person who has spent two years in a programming job, followed by two years in sales might have a better looking résumé than one who's done four years purely in sales. Not trying to make a blanket statement here - but do think about it.
  3. Try to stay with an organisation long enough to have seen different aspects & levels of the company's hierarchy. Unless you've spent enough time to experience hierarchy first-hand, it might get difficult to learn from it, or comment on it.
If you prefer to comment or ask questions by e-mail about this post, my address is pqtsocss@trashmail.net.

Wednesday, 7 March 2007

0417 hrs - 2nd platoon Bravo team being overrun by the Viet Cong
the heroes - Charlie Sheen, Willem Dafoe, and a few others
the movie - Platoon
the sound - rapid fire, going rat-tat-tat...

1043 hrs - the Professor's voice overrun by the typing of students
the heroes - my classmates
the lecture - Web 2.0
the sound - frantic typing to complete an assignment, going rat-tat-tat...

Tuesday, 6 March 2007

Nike & holi


Managed to come across some very good advertisements recently, thought I'd write about them:

Nike



Not only has the airing of this advertisement been timed very well with the world cup season, it also captures the Indian love for the game very accurately. As soon as the advertisement starts, the traffic stops, and the cricketers take over. It doesn't matter who's bowling or who's batting.. it's just the game that has to be played.. the game has to go on till the fat lady sings, or in this case, the traffic starts moving.

When the first batsman hits the ball too hard and the bowler goes "kya kar raha hai, yaar?", it's almost like the kids playing in a neighbourhood street with rules like "if you hit the ball beyond that wall, you're out". There's more that makes the ad very Indian - the chicken that pops its head out of the basket, the kerosene lanterns, the elephant that adds to the madness, the confused thulla.. it's hard to make out it's a Nike ad except for the less than half-a-second frame where the familiar swoosh pops by.




A donation
A friend shared this link with me on the day of Holi. Unless you've seen this before, make sure you watch this clip till the end.



I'm surprised I haven't seen this on television. I would expect this to be a lot more effective than the Aishwarya Rai advertisements that used to be aired on the same subject.

Placements at IIMI, the Chronicle version

A friend of mine forwarded me this article from Central Chronicle about placements at IIM Indore: http://www.centralchronicle.com/20070303/0303102.htm

Here is Chronicle's version of the companies that came visiting IIMI:

This year several renowned companies visited here to offer placements to the students; some of them are Macancy, Assenger business consultant, ... Deloty consultancy, ... Lehmen brothers, Goldman Saches, Duche bank, ... City group, ....

Some IT companies include .. Cognijent, ... Kotek Mahindra and Kevonsis etc.

:-)

Saturday, 24 February 2007

3,2,1: My placement story

shortlists - 3
offers - 2
accepted - 1

Hyderabad = home for quite some time in the near future :-)

BTW, does anyone know a place that I can rent?

Using awk to parse HTML

Sorry for disappearing like that… we had lots of stuff happening on campus (read our placements). And between tracking job posts and interview preps and the interviews themselves, my blog was the last item on my priority list. Anyways, now that placements are done (and successfully for me), let me get back to the second post about working with data from the web.

The objective
One thing I did not make clear in the last post was the objective of our exercise. What we were trying to build was a model that could suggest a price to a user if (s)he decides to put up a car for sale. The model would be based on how people are currently pricing their used cars.

So, in order to build a model, we needed to work up some data from www.autoindia.com, as we did the last time. The next step is to convert all of that data in the 17,000 html files into a single spreadsheet. Each listed car on the website should have its details on a separate record in the spreadsheet.

In this post, let's explore how to do this using awk.

A simple sample HTML
Let’s take a look at the listing http://autoindia.com/UsedVehicle/buy-used-car130625.html. The webpage looks like this


In order to build our spreadsheet, we need all the data such as the year of manufacturing, the make of the car, the model and the style. Let’s take a look at the html source code.


For most of the variables that we’re interested in, there are specific identifiers that have been written in the html, like lblYear, lblMake and so on.

Given all such identifiers, it becomes a simple programming task to list out all the lines that match these identifiers, get rid of the tags, and bingo! we’ll have all the information that we need.

Let’s try out a simple awk script to see how we can do this.

The first awk script
I’m not going to delve into the details of how awk works. You can find more detail in the GNU awk guide at http://www.gnu.org/software/gawk/manual/gawk.html (yep, GNU’s implementation of awk is known as gawk). Though I have used gawk for all of the examples, I’ve used the names awk and gawk interchangeably. The name’s inconsequential, the result is the same.

Back to our job in hand, we’re interested in specifying a pattern and printing out the details of the lines that these patterns match.
# Print every line that mentions one of the field identifiers.
/lblYear/ ||
/lblMake/ ||
/lblModel/ ||
/lblStyle/ {
    print $0;
}
In the above snippet, we’re telling awk to print the entire line (represented by the $0) if it matches any of the regular expressions listed between the two forward slashes.

If we save the above snippet into a file named “split.awk” and run the following command
C:\> gawk -f split.awk buy-used-car130625.html
We get the following output:
2002
Hyundai
Santro
1.1 GS Zip Drive
Eliminating the span tags
Let’s look at the above information closely – if we can eliminate the stuff between and including the angle brackets, we’d be left with exactly the information that we need. Let’s modify the awk script we wrote earlier.
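
A minimal sketch of what such a modification could look like. It assumes the matched lines carry markup of the form `<span id="lblYear">2002</span>`; the names strip_fields and strip_span and the sample input are made up for illustration and are not taken from the original script. It is wrapped in a shell function so that it can be tried on a Unix command line.

```sh
# Sketch only: the earlier pattern, now printing each match without its tags.
strip_fields() {
  awk '
    # Remove the opening and closing span tags from a line of text.
    function strip_span(text) {
      gsub(/<span[^>]*>/, "", text)   # opening tag, with any attributes
      gsub(/<\/span>/, "", text)      # closing tag
      return text
    }
    /lblYear/ || /lblMake/ || /lblModel/ || /lblStyle/ {
      print strip_span($0)
    }
  ' "$@"
}

# Hypothetical input standing in for the real page source.
printf '%s\n' \
  '<span id="lblYear">2002</span>' \
  '<p>some other markup</p>' \
  '<span id="lblMake">Hyundai</span>' | strip_fields
```

Called with a file name instead of piped input (strip_fields some-page.html), the same function reads the page directly.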


The two gsub commands are global search and replace commands. They look for the regular expression (specified as the first argument) and replace it with the substitution string (the second argument; in this case, an empty string) in the text that is specified as the third argument. Now that we’ve wrapped that up in a neat function, let’s look at the output that this gives us:
C:\> gawk -f split.awk buy-used-car130625.html
2002
Hyundai
Santro
1.1 GS Zip Drive

In a spreadsheet, please!!
The last teeny-weeny bit of stuff that we need to do is to plug the above into a single line, preferably in a comma-separated or tab-delimited format, so that it can open up in a spreadsheet program like MS Excel.

We can use a simple trick. Each time the pattern is found, instead of printing the output after stripping off the tags, we can concatenate that output to a string that can be printed later using the END code block of awk.

Here we go…
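
A minimal sketch of that trick, under the same assumptions as before: span markup of the form `<span id="lblYear">2002</span>`, with the function names and sample input invented for illustration.

```sh
# Sketch: build one tab-separated row per page instead of one line per field.
page_to_row() {
  awk '
    function strip_span(text) {
      gsub(/<span[^>]*>/, "", text)
      gsub(/<\/span>/, "", text)
      return text
    }
    /lblYear/ || /lblMake/ || /lblModel/ || /lblStyle/ {
      # Append the value to the row, with a tab between fields.
      if (row != "") row = row "\t"
      row = row strip_span($0)
    }
    END { print row }   # runs once, after the last input line
  ' "$@"
}

# Hypothetical input standing in for the real page source.
printf '%s\n' \
  '<span id="lblYear">2002</span>' \
  '<span id="lblMake">Hyundai</span>' \
  '<span id="lblModel">Santro</span>' \
  '<span id="lblStyle">1.1 GS Zip Drive</span>' | page_to_row
```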


And the corresponding output
C:\> gawk -f split.awk buy-used-car130625.html
2002 Hyundai Santro 1.1 GS Zip Drive
If I decide to redirect this into a file using the > operator of DOS,
C:\> gawk -f split.awk buy-used-car130625.html > file.txt
I can open file.txt using MS Excel (by using the tab-delimited option) to get something like this:


Expanding the above
The last bit that remains is to run the above technique across all the HTML files that we’ve generated. This can be done by writing a single DOS batch script that runs the awk command we’ve used on each of the 17,000 HTML files that we’ve downloaded.
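
We used a DOS batch file for this; the sketch below shows the same idea as a Unix shell loop instead, which is a swapped-in equivalent and not the script we ran. The directory layout, the function name pages_to_sheet and the sample pages are all assumptions made for illustration.

```sh
# Sketch: run a row-building awk program over every page in a directory,
# producing one tab-separated line per page.
pages_to_sheet() {
  for page in "$1"/*.html; do
    awk '
      /lblYear/ || /lblMake/ || /lblModel/ || /lblStyle/ {
        gsub(/<[^>]*>/, "")     # drop every tag on the line
        row = row sep $0        # sep is empty for the first field
        sep = "\t"
      }
      END { print row }
    ' "$page"
  done
}

# Two made-up pages in a scratch directory, just to exercise the loop.
demo=$(mktemp -d)
printf '<span id="lblYear">2002</span>\n<span id="lblMake">Hyundai</span>\n' > "$demo/car1.html"
printf '<span id="lblYear">2005</span>\n<span id="lblMake">Maruti</span>\n' > "$demo/car2.html"
pages_to_sheet "$demo"
rm -r "$demo"
```

Redirecting the function's output into a file gives the tab-delimited spreadsheet in one go.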

Of course, since we needed more information than the simple example that we’ve seen above, our split.awk script ran into many more lines. There were some fields that we handled using associative arrays.
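
To give a flavour of the associative-array idea, here is a small sketch; it is not our actual script. It assumes the same span markup as the simple example, and the function name, the array names and the sample input are invented for illustration. The values are stored by field name, so the columns come out in a fixed order even when the page lists them differently.

```sh
# Sketch: collect values in an associative array keyed by field name,
# then print them in a fixed column order.
page_to_columns() {
  awk '
    BEGIN { ncols = split("lblYear lblMake lblModel lblStyle", cols, " ") }
    {
      for (i = 1; i <= ncols; i++) {
        if (index($0, cols[i]) > 0) {
          gsub(/<[^>]*>/, "")          # drop the tags, keep the text
          value[cols[i]] = $0          # e.g. value["lblMake"] = "Hyundai"
        }
      }
    }
    END {
      for (i = 1; i <= ncols; i++)
        printf "%s%s", value[cols[i]], (i < ncols ? "\t" : "\n")
    }
  ' "$@"
}

# Hypothetical input with the fields deliberately out of order.
printf '%s\n' \
  '<span id="lblStyle">1.1 GS Zip Drive</span>' \
  '<span id="lblMake">Hyundai</span>' \
  '<span id="lblYear">2002</span>' \
  '<span id="lblModel">Santro</span>' | page_to_columns
```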

Our final code looked like this:

And our output in a spreadsheet:


Downloading awk
If you’re going to use awk for the first time, I’d recommend you try it on a linux or Unix machine. There are a whole lot of other tools that gel well with awk, using the pipe mechanism that Unix shells provide. Plus, I’ve still not been able to get even simple awk scripts to run when typed directly on the Windows command line. Even simple stuff like
C:\> awk 'BEGIN{ print "hello world"; }'
fails miserably. Probably the quotes or something – haven’t been able to figure it out.
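
A likely explanation, offered as an assumption rather than something verified here: the Windows command prompt does not treat single quotes as quoting characters, so awk never receives the program as one argument. The sketch below shows the form that works on a Unix shell, with an untested guess at a Windows equivalent left as a comment.

```sh
# On a Unix shell, single quotes protect the whole awk program.
hello() {
  awk 'BEGIN { print "hello world" }'
}
hello

# On the Windows command prompt, an untested guess at the equivalent is to
# wrap the program in double quotes and escape the inner ones:
#   C:\> gawk "BEGIN { print \"hello world\" }"
```

Putting the program in a file and running it with the -f option, as in the examples above, sidesteps the quoting question altogether.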

Nevertheless, if you do need to use awk on Windows, it comes as part of an entire Unix toolkit available at http://www.delorie.com/gnu/ (try the zip picker; it should give you everything: awk, sed, the whole nine yards).

Alternately, you could try using cygwin. Runs a bit slower as it uses something called a Unix emulation layer. But it's the closest replication of a Unix environment on Windows that I've seen, and is packed with all the unix tools that one can imagine.

Additional reading
  1. The GNU Awk User’s Guide - http://www.gnu.org/software/gawk/manual/gawk.html
  2. Unix classic tools - Awk Programming - http://www.softpanorama.org/Tools/awk.shtml
  3. Awk man pages for Linux - http://www.die.net/doc/linux/man/man1/awk.1.html

Other Comments

In response to my previous post, Punit had asked me why I didn’t use RSS and an XML parser - here's why:
  1. It didn't look like autoindia.com has been refreshing its RSS feed (I wasn't able to subscribe to it), and the XML file that's on the server contains some very old data from 2005 and about 15 records from 2007 - it was insufficient for us.
  2. RSS is typically used to only publish the latest n items that the site has listed. n could be 15, 100, or 1000, depending on the webmaster. If you want all the data that a site has, like we wanted, you'd be extremely lucky to get that off RSS. In our case, even if the RSS feed had been refreshed regularly, we would have needed to monitor it over a long period of time (probably a year) to build a sufficiently large data set.
  3. Depending on how the webmaster has configured the XML, the description may not contain all the fields that one is looking for. In our case, we wanted everything - from the price of the car to the location to whether it has power steering. The site's webmaster didn't publish all of that. So, RSS was clearly out.
awk was useful for us in this exercise because of the way the html was structured. Had the html been a little more complicated, I’d probably have had to use perl and an HTML parser module off CPAN. In the end, using awk has to be a conscious judgement call - it can give you results quickly, but you need to be careful - it can be quite painful to debug. And if you’re the kind of person who doesn’t believe in documenting code, you’ll end up having loads of fun :-)

Our project
What I've described above was just the initial part of our project, where we gathered the data from the web. We went on to build a price-prediction model that turned out to be relatively successful. We had an error in the range of 10,000 rupees for the top 37% of the data; I guess that was fairly acceptable. Anyways, the report has now been submitted.. let’s wait for the grades.