Monday, April 26, 2010

In the computer science community, disambiguation refers to the problem of identifying the correct meaning (sense) of a word in a sentence. More recently, search engines have studied it as the problem of understanding what a user means by a query typed into the search box. For example, if the query is "Avatar", what does the user mean? Is it the latest James Cameron movie, Avatar (hell ya!), the hit Nickelodeon animated series Avatar: The Last Airbender, or the deliberate descent of a deity from heaven to earth in Hinduism?

Disambiguation is a hard problem, mainly because (a) English is a crazy language and (b) we have abused it. Here is a taste of it!

Blame English!

(a) English is a silly language - [Borrowed from a poem. Read its full text here]
There is no egg in eggplant, nor ham in hamburger;
Neither apple nor pine in pineapple.
And while no one knows what is a hotdog,
You can be pretty sure it isn't canine.
English muffins were not invented in England
Nor French-fries in France.
(b) Round and round - As per the Oxford dictionary, the 500 most frequently used words have an average of 23 different meanings each. The word "round" alone has 70 distinct meanings and usages. Usually it is the surrounding context that reveals which meaning is intended, so a word cannot be disambiguated unless its context is clear.

(c) Paradoxes - Consider the Liar's paradox, a grammatically correct statement that is logically impossible: "This sentence is false". Or the classic Socrates quote: "As for me, all I know is that I know nothing". Then there are oxymorons such as "silent scream" and "clearly misunderstood". Researchers have tried to apply natural language processing techniques to understand context, but constructs like these make that approach extremely difficult.

(d) Other - There are many other such constructs in the English language, for example amphibology, double entendre, and polysemy, and the list goes on.
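To make point (b) concrete, here is a minimal sketch of context-based disambiguation in the spirit of the simplified Lesk algorithm: pick the sense whose definition shares the most words with the sentence. The tiny sense inventory, the glosses, the stopword list, and the function names are all made up for illustration; a real system would use a proper dictionary and far more context.

```python
import re

# Hypothetical mini sense inventory. The glosses are illustrative,
# not quotes from any real dictionary.
SENSES = {
    "round": {
        "shape": "shaped like a circle or ball",
        "boxing": "one period of a boxing or wrestling match",
        "drinks": "a set of drinks bought for everyone in a group at a bar",
    }
}

# Small hand-picked stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "an", "the", "of", "or", "in", "for", "at", "to",
             "is", "like", "and", "one"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def disambiguate(word, sentence):
    """Return the sense whose gloss shares the most words with the sentence."""
    context = tokens(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & tokens(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("round", "He bought a round of drinks at the bar"))
print(disambiguate("round", "The boxer was knocked out in the third round of the match"))
```

With no useful context at all, every sense scores zero and the sketch simply falls back to the first sense listed, which is exactly the situation a one-word search query puts us in.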

Language Abuse!

(a) Anarchy - There is complete anarchy when it comes to choosing names and titles. There still does not exist any rule or regulation, not even a guideline, on how to go about naming things. Otherwise, who in their right mind would make a movie about Harvey Milk and name it just Milk? Isn't there already enough confusion between the different meanings of the word milk - Milk (food), soy milk, coconut milk, Milk (band), Milk magazine ....

(b) Timeliness / popularity - We are a very lazy species. For the query "Harry Potter", the intended meaning changes with what is happening around the world. If J.K. Rowling publishes another book, we want that; if a new movie based on an earlier book comes out, we want the movie information. But we would never type anything beyond "Harry Potter". We expect the search engine to have learnt the magic of legilimency.

(c) Duh! Isn't it obvious - For 95% of the world, there is only one "San Francisco" and it is in California, USA. But as per this Wikipedia page, there are 27 different places on this planet, each named San Francisco. How do we determine whether you belong to the 95% bucket, or which one of the other 26 places you care about?
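Points (b) and (c) suggest one simple way to think about it: give every meaning a baseline popularity, then boost it temporarily when something newsworthy happens. The sketch below is only an illustration of that idea, not how any real search engine works. The priors, the event weights, the day numbers, and the 30-day half-life are all invented.

```python
import math

# Hypothetical senses for the query "Harry Potter". "prior" is a made-up
# baseline popularity; "events" holds (day, weight) pairs for made-up
# news events such as a movie release.
HARRY_POTTER = {
    "book series": {"prior": 0.5, "events": []},
    "film series": {"prior": 0.4, "events": [(100, 1.0)]},
}


def score(sense, now, half_life_days=30.0):
    """Baseline popularity plus event boosts that halve every half_life_days."""
    boost = 0.0
    for event_day, weight in sense["events"]:
        age = now - event_day
        if age >= 0:  # ignore events that have not happened yet
            boost += weight * math.pow(0.5, age / half_life_days)
    return sense["prior"] + boost


def rank_senses(senses, now):
    """Return sense names ordered from most to least likely on day `now`."""
    return sorted(senses, key=lambda name: score(senses[name], now), reverse=True)


print(rank_senses(HARRY_POTTER, now=100))  # the day the movie comes out
print(rank_senses(HARRY_POTTER, now=400))  # long after the buzz has faded
```

On the release day the film outranks the books; three hundred days later the boost has decayed to almost nothing and the baseline wins again. The same baseline idea covers San Francisco: absent any other signal, the California one gets the top spot simply because it has the largest prior.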

Such is the extent of our abuse that even the term disambiguation is ambiguous. There is a disambiguation page on Wikipedia just to disambiguate disambiguation. Don't believe me? Check it out yourself: http://en.wikipedia.org/wiki/Disambiguation_(disambiguation)
