Technology

VERNACULAR TOOLS A virtual keyboard (top) developed by C-DAC for input in Devanagari script. Cross-lingual software (bottom) that converts English into Marathi.
Credit: Department of Information Technology, Language Technologies

COMPUTING

C-DAC centers in Bangalore, Kolkata, Mumbai, Noida, Pune, and Thiruvananthapuram work on language computing technologies. Their activities range from smaller utilities, such as desktops and Internet access in Indian languages, to core research in machine translation, optical character recognition (OCR), cross-lingual access, search engines, standardization, digital libraries, and more. C-DAC Noida works on English to Indian language translation and has already built it for Bengali, Hindi, Malayalam, Nepali, Punjabi, Telugu, and Urdu. It has also developed Indian to Indian language text-to-text translation for Punjabi-Urdu-Hindi combinations, three closely related languages, even though they are written in different scripts (Gurmukhi, Perso-Arabic, and Devanagari, respectively). C-DAC Noida has also collaborated with language computing labs in various countries on speech-to-speech translation of other languages into Indian languages. The international languages which can be translated are Japanese, Thai, Cantonese, Arabic, and Brahmic. With the inclusion of foreign languages, the center's initial Asian Speech Translation (A-Star) system has been renamed Universal Speech Translation (U-Star).

Anuvadaksh, a consortium of English to Indian language machine translation (EILMT) and a part of the TDIL program, also allows for translation of English into six Indian languages: Bengali, Hindi, Marathi, Oriya, Tamil, and Urdu. It has advanced the development of technical modules, such as a named entity recognizer, word sense disambiguation, a morph synthesizer, collation and ranking, and evaluation. “It is vital that linguists and technology experts both work in collaboration on such projects because language experts are not adept at technology and usually technology experts are not familiar with the nuances of languages,” says Manoj Jain, scientist, TDIL, DIT.

EILMT has also developed AnglaMT, a pattern-directed, rule-based system with a structure resembling a context-free grammar for English (the source language). It generates a pseudo-target (pseudo-interlingua) applicable to a group of Indian languages (the target languages), such as the Indo-Aryan family (Asamiya, Bengali, Gujarati, Hindi, Marathi, Oriya, Punjabi, and more) and the Dravidian family (Kannada, Malayalam, Tamil, and Telugu). AnglaMT is designed as a practical translation aid: the machine is meant to do 90 percent of the task, with 10 percent left to human post-editing and processing. Another system from EILMT, called Sampark, works on six pairs of Indian languages: Hindi to Punjabi, Hindi to Telugu, Punjabi to Hindi, Tamil to Hindi, Telugu to Tamil, and Urdu to Hindi. The Sampark system is based on the analyze-transfer-generate paradigm. First, the source language is analyzed, then vocabulary and structure are transferred to the target language, and finally the target language is generated. Each phase consists of multiple modules, of which 13 are major. One main advantage of the Sampark system is that once the language analyzer has been developed for a single language, it can be paired with other language generators to get multilingual output. The DIT strongly believes that the Sampark approach helps control the complexity of the overall system.
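The analyze-transfer-generate idea can be illustrated with a deliberately tiny Python sketch. It is not Sampark code: the function names, the `Token` type, the two toy target languages, and their made-up vocabulary are all assumptions chosen for illustration. What it tries to show is the point made above, that a single analysis of the source sentence can be reused with several generators.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Token:
    word: str  # surface or dictionary form
    role: str  # coarse grammatical role: "subj", "verb", or "obj"


def analyze(sentence: str) -> List[Token]:
    """Analyze: tag a three-word subject-verb-object toy sentence."""
    subj, verb, obj = sentence.lower().split()
    return [Token(subj, "subj"), Token(verb, "verb"), Token(obj, "obj")]


def transfer(tokens: List[Token], lexicon: Dict[str, str]) -> List[Token]:
    """Transfer: map source vocabulary to the target; keep unknown words."""
    return [Token(lexicon.get(t.word, t.word), t.role) for t in tokens]


def generate(tokens: List[Token], order: List[str]) -> str:
    """Generate: linearize the tokens in the target language's role order."""
    by_role = {t.role: t.word for t in tokens}
    return " ".join(by_role[role] for role in order)


# Hypothetical targets: each is a (lexicon, word order) pair. The words are
# invented placeholders, not real translations.
TARGETS: Dict[str, Tuple[Dict[str, str], List[str]]] = {
    "toy_a": ({"reads": "padho", "books": "kitabo"}, ["subj", "obj", "verb"]),
    "toy_b": ({"reads": "lego", "books": "libro"}, ["subj", "verb", "obj"]),
}


def translate_all(sentence: str) -> Dict[str, str]:
    """Run the analyzer once, then pair it with every target generator."""
    analysis = analyze(sentence)
    return {
        name: generate(transfer(analysis, lexicon), order)
        for name, (lexicon, order) in TARGETS.items()
    }


if __name__ == "__main__":
    print(translate_all("Ravi reads books"))
```

Adding a third target in this sketch means adding one lexicon and one word order; the analyzer is untouched. A real system would of course need far richer analysis than a three-word split.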

STANDARDIZATION

The Government of India is working closely with international agencies such as the World Wide Web Consortium (W3C), the Unicode Consortium, the International Organization for Standardization, and the Bureau of Indian Standards on standardization in the fields of the Indian script keyboard, transliteration, SMS, speech resources, and electronic language resource development for all official Indian languages. While the inclusion of 22 Indian languages in the Unicode Standard is complete, the W3C is working toward a seamless Web for every Indian and toward the inclusion of Indian languages in the international standards for the mobile Web, Web accessibility, styling, and Web browsing. Standardization is important for developing applications for mass usage while retaining the fundamentals of Indian language usage.

THE CHALLENGES

Developing any tool or utility requires a database of sentences and words, in text and speech, for every language, since most programs are based on statistical algorithms. In India, some languages are spoken by a large number of people while others are limited to a smaller group. The criteria for sample collection require the target group to be computer savvy and conversant in English as well as the local language. This narrows down the number of people who can be contacted to provide samples of the local language.

Since localization is done for the common man, certain words and phrases commonly used for working with computers have to be reinvented in many cases and made user-friendly. Initially, the material identified was a set of 60,000 basic strings in English and 200,000 strings for advanced users, all of which needed to be localized in every language. Many of these strings were not grammatically complete sentences; they were just computer commands. Also, words like document, folder, delimiters, and add-ons are not listed in any dictionary of Indian languages. While in some languages these terms have been transliterated and retained as they are, experts in other languages went on to create a whole new set of words corresponding to the IT terminology. After this mammoth task, a glossary has been created with the consent of various language experts.
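A glossary of this kind is typically consumed as a lookup table. The Python sketch below is a minimal illustration under stated assumptions: the `localize` function and the `GLOSSARY_HI` table are hypothetical, and the two Hindi entries are illustrative examples of a transliterated term and a reused existing word, not entries taken from the official glossary.

```python
from typing import Dict

# Illustrative Hindi entries only (assumed, not from the official glossary).
GLOSSARY_HI: Dict[str, str] = {
    "folder": "फ़ोल्डर",      # English term transliterated and retained
    "document": "दस्तावेज़",  # existing Hindi word reused for the IT sense
}


def localize(term: str, glossary: Dict[str, str]) -> str:
    """Return the glossary entry for a term, or the English term if none exists."""
    return glossary.get(term.lower(), term)


if __name__ == "__main__":
    for term in ("Folder", "Document", "delimiter"):
        print(term, "->", localize(term, GLOSSARY_HI))
```

Falling back to the English string keeps the interface usable while a term is still awaiting agreement among the language experts.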

WEB DOMAIN

Many countries have been pushing for the creation of multilingual domain names, that is, domain names like .com or .in typed in the country’s own local language(s). These are encoded in computer systems in multibyte Unicode and are usually stored in the Domain Name System as ASCII strings. In 2009, the Internet Corporation for Assigned Names and Numbers, which manages domain names across the world, approved the use of Internet extensions based on a country’s name and its local language(s). The move has given an impetus to the growth of the Web, and India too has joined the league of nations applying for a domain name written in a script other than Latin. Shanker of DIT is propelling this project ahead in India with the help of Govind, who is a senior director in DIT. India has now got approval for the domain name .bharat, which can be written in all 11 scripts, and foresees extensive use of it in applications such as e-governance and e-learning. Internationalized domain names provide a convenient mechanism for users to access websites in their local language. For example, if a person wants to give his or her system a domain name in his or her local language, such as Hindi, then that will look like www.bhasha.bharat.
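The Unicode-to-ASCII step described above can be tried with Python's built-in `idna` codec, which converts each non-ASCII label of a domain name into an ASCII label beginning with `xn--`. This is only a sketch: the helper names `to_ascii` and `to_unicode` are hypothetical, the Devanagari spelling भारत for "bharat" is an assumption used for illustration, and the built-in codec follows the older IDNA 2003 rules, so a production registry or resolver may apply newer rules.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to the ASCII form stored in the DNS."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (xn--) domain name back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    name = "www.bhasha.भारत"  # assumed Devanagari rendering of ".bharat"
    ascii_name = to_ascii(name)
    print(ascii_name)              # ASCII-only form, last label starts with xn--
    print(to_unicode(ascii_name))  # round-trips to the Unicode form
```

Labels that are already ASCII, such as `www` and `bhasha`, pass through unchanged; only the Devanagari label is rewritten.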


Monday, 4 July 2011
