Monday 4 July 2011


Evolving Language Technologies


SCRIPT CONVERSION The Parivartan tool by Priya Informatics transliterates English into Devanagari script.
Credit: Department of Information Technology, Language Technologies
APPLICATIONS
As technologies for language computing evolve in India, developers, industrialists, and academia are finding new and innovative uses for them. One prominent use is the digitization, or creation of e-books, of the vast body of rich literature in different Indian languages. “A very interesting use that is emerging of OCR is tagging, that is, scanning of hard copies of old books in regional libraries and creating an index for book search. This would help greater and better digitization of libraries across the Indian cultural terrain,” says Santanu Chaudhury, professor, IIT-Delhi, and head of the OCR project at DIT. He mentions that DIT has received huge demand for Braille books and accessibility solutions, which are now being worked on. Physical documents can be converted into e-documents, and these can then be read out using text-to-speech engines developed by private companies and institutions.
At IBM Research Labs India, senior researcher Arun Kumar works on a project called the Spoken Web, or as IBM likes to call it, the World Wide Telecom Web. IBM says that it is the start of its vision of creating a parallel voice-driven ecosystem, much like the World Wide Web, to bring the Internet to more people, including challenged people. The Spoken Web is a collection of voice sites, each of which has a unique virtual phone number, called the VoiNumber, which maps to a physical phone number. When a user dials the VoiNumber of a voice site, he or she hears the content of that site over the phone.
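The mapping described above, from a virtual VoiNumber to a physical phone line and its spoken content, can be pictured as a simple lookup table. The following is only a minimal sketch under stated assumptions: the class names, fields, phone numbers, and fallback message are hypothetical illustrations, not IBM's actual design or API.

```python
from dataclasses import dataclass


@dataclass
class VoiceSite:
    """A hypothetical voice site: a name, the physical line that hosts it,
    and the text that would be spoken to a caller."""
    name: str
    physical_number: str
    content: str


class VoiNumberDirectory:
    """Toy registry mapping each unique VoiNumber to one voice site."""

    NOT_FOUND = "The number you dialled is not a registered voice site."

    def __init__(self):
        self._sites = {}

    def register(self, voinumber, site):
        # Each VoiNumber must be unique, so refuse duplicates.
        if voinumber in self._sites:
            raise ValueError(f"VoiNumber {voinumber} is already taken")
        self._sites[voinumber] = site

    def resolve(self, voinumber):
        # Return the physical phone number behind a VoiNumber.
        return self._sites[voinumber].physical_number

    def dial(self, voinumber):
        # Return the content a caller would hear, or a fallback message.
        site = self._sites.get(voinumber)
        return site.content if site is not None else self.NOT_FOUND


if __name__ == "__main__":
    directory = VoiNumberDirectory()
    directory.register(
        "900100",  # made-up VoiNumber
        VoiceSite("Village clinic", "+91-00-0000-0000", "The clinic opens at nine."),
    )
    print(directory.dial("900100"))
```

A real deployment would sit behind a telephony platform and play recorded or synthesized audio; the sketch only shows the lookup step.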
Another application of language computing is cross-lingual search, which, along with a wordnet, is being developed by Pushpak Bhattacharyya, professor of computer science and engineering at IIT-Bombay and head of the Laboratory for Intelligent Internet Access at the institute. Cross-lingual search addresses the need to find resources on the Web in languages other than the one typed. Currently, search engines like Google do a template-based search, that is, if a user types “professor” in the search box, documents and pages on the Web containing the exact string will show up. But there might be relevant documents in other languages which the user may want to see. Bhattacharyya is working on a search engine which can also serve vernacular results to users. He has also developed the widely recognized Wordnet, an online lexical reference system, for the Hindi and Marathi languages. Wordnet’s structure supports language processing and research in applications such as voice recognition and dictation systems.
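The difference between exact-string search and cross-lingual search can be shown with a toy example. This is a minimal sketch, assuming a tiny hand-made bilingual lexicon and three made-up documents; it is not Bhattacharyya's system, and a real engine would draw on resources such as a wordnet and proper indexing rather than a dictionary lookup.

```python
# Hypothetical English-to-Hindi lexicon with a single entry, for illustration.
TOY_LEXICON = {"professor": ["प्राध्यापक"]}

# Made-up documents in English and Hindi.
DOCUMENTS = {
    "en1": "The professor teaches computer science.",
    "hi1": "प्राध्यापक कक्षा में पढ़ाते हैं।",
    "en2": "Students visit the library.",
}


def tokenize(text):
    """Split on whitespace and strip common punctuation, including the danda."""
    return {token.strip(".,;:!?।").lower() for token in text.split()}


def exact_search(term, documents):
    """Return ids of documents containing the typed term itself."""
    term = term.lower()
    return [doc_id for doc_id, text in documents.items() if term in tokenize(text)]


def expand_query(term, lexicon):
    """Return the typed term plus its known translations."""
    term = term.lower()
    return {term, *lexicon.get(term, [])}


def cross_lingual_search(term, documents, lexicon):
    """Return ids of documents containing the term or any of its translations."""
    variants = expand_query(term, lexicon)
    return [doc_id for doc_id, text in documents.items() if variants & tokenize(text)]


if __name__ == "__main__":
    print(exact_search("professor", DOCUMENTS))
    print(cross_lingual_search("professor", DOCUMENTS, TOY_LEXICON))
```

With these toy inputs, the exact search finds only the English document, while the expanded query also retrieves the Hindi one.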
SOCIAL COLLABORATION
Another prominent player in the computer industry working on cross-lingual information retrieval is Microsoft. Microsoft India has begun to show keen interest in the Indian localization market and, as a software giant, has made many versions of the Microsoft Windows operating system, as well as the Microsoft Office software, available in different Indian languages.
Kumaran from Microsoft Research has spearheaded a project with Wikipedia called WikiBhasha, which enables people to create Wikipedia pages in regional languages. The WikiBhasha name combines “Wiki” and “Bhasha” (which means language in Hindi). It is a multilingual content creation tool that lets users source existing English pages from Wikipedia, convert them into their selected language, and then manually edit or add content to create a parallel Wikipedia page in an Indian language. Microsoft says that WikiBhasha Beta has been released as an open-source MediaWiki extension.
WikiBhasha uses Microsoft’s machine translation systems, which are based on statistical algorithms. Another aim of the project is to crowdsource data on a wide range of topics in different Indian languages, which could then be used for further research. “This will help in solving the basic data problem that language computing experts are facing at the basic level,” says Kumaran. To support his point, he adds, “To do translation between any two languages, we need four million sentence pairs to develop robust machine translation systems for the purpose.”
India now stands at a point where these efforts seem to be producing results, and the teams are seeing their work affirmed as European countries call on them for consultation. The world now wants to take lessons from India on how to manage the huge complexity of its many combinations of languages, scripts, and dialects, and how to successfully develop models for integrating natural languages into machines.
