Monday 4 July 2011



Evolving Language Technologies

Indian researchers and government are turning English-dominant communications technologies into vernacular mediums to make them easily accessible for every Indian.
JUNE 2011
The growth of information and communications technology (ICT) is accelerating rapidly, and India is trying to keep up with the pace. Every year the number of computers in India increases, and so does the dependence on these machines for education and daily livelihood. According to a Gartner report, computer shipments to India are expected to grow by 24.7 percent over 2010, to a total of 13.2 million units in 2011. Yet while these machines are largely accessible only in English, the share of Indians who can speak, read, and write English remains abysmally low at about 10 percent, compared with roughly 95 percent in the U.S. and 97 percent in the UK.
[Image credit: Shilpi Bhargava]
India’s economy is one of the fastest growing in the world, and its mobile market is the fastest growing anywhere, with around 771 million mobile subscribers, yet the country ranks near the bottom in ICT penetration. While illiteracy can be seen as the umbrella problem, a large part of the literate populace cannot access these technologies, which promise advancement and better employment, simply because they are not available in local languages. India also ranks very low in the reach and usage of the Internet. Because most Web pages are in English, literate Indians who are otherwise adept in their local languages shy away from using it. “The very fact that the number of readers of local language newspapers is greater than the number of readers of English newspapers in India points to the fact that, by and large, people of India still prefer to read in their regional language. If, through the efforts of researchers in the language computing domain, the number of Web pages available in local languages is increased, then more people will be attracted to using the Internet and will reap its benefits,” says A. Kumaran, research manager of the Multilingual Systems Research group at Microsoft Research Lab in Bangalore.
COMPUTING LOCAL LINGO
Language computing systems can be broadly classified into machine translation, optical character recognition (OCR, which converts images of handwritten or printed text into machine-readable text), and speech recognition (the conversion of speech to text), along with transliteration between scripts. N. Ravi Shanker, joint secretary in the Department of Information Technology (DIT), Ministry of Communications and Information Technology, Government of India, oversees language computing work across the country. In 1991, the DIT’s Language Computing Group started a program to develop tools and resources so that computing and browsing the Internet could be done in local Indian languages as well as in English. The program, named Technology Development for Indian Languages (TDIL), was conceptualized even before computers became widely available in India between 1993 and 1995. The government has been working closely with some of its institutions, such as the Centre for Development of Advanced Computing (C-DAC) in Pune. The TDIL program aimed to make language computing tools available to the masses free of cost, in turn enabling wider penetration of, and better awareness about, such tools. Phase I of TDIL was completed in 2010, with beta versions of all the language computing engines available in 10 major Indian languages, and has given way to the roll-out of Phase II, which will add more languages, smoother interfaces, and improved accuracy.
LANGUAGE PROCESSING
Natural language processing technology, one aspect of artificial intelligence, converts spoken and written human language into a form machines can compute over. The technology encompasses the development of utilities for speech synthesis, optical character recognition, text-to-speech conversion, and machine translation.
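To make one of these stages concrete, here is a minimal Python sketch of the grapheme-to-phoneme step a text-to-speech front end performs for Devanagari. The character tables and rules are toy assumptions for illustration, not the actual TDIL or C-DAC rule sets.

# Toy grapheme-to-phoneme (G2P) front end for a text-to-speech system.
# The tables cover only a few Devanagari characters; real engines use far
# richer rules, including models of schwa deletion.

CONSONANTS = {"क": "k", "म": "m", "ल": "l", "न": "n"}
VOWEL_SIGNS = {"ा": "aa", "ि": "i", "ी": "ii", "ु": "u"}
INDEPENDENT_VOWELS = {"अ": "a", "आ": "aa", "इ": "i"}

def g2p(word: str) -> list[str]:
    """Convert a Devanagari word to a rough phoneme sequence."""
    phones = []
    for i, ch in enumerate(word):
        if ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
            # Each consonant carries an inherent 'a' vowel unless a vowel
            # sign follows it (final schwa deletion is not modelled here).
            nxt = word[i + 1] if i + 1 < len(word) else ""
            if nxt not in VOWEL_SIGNS:
                phones.append("a")
        elif ch in VOWEL_SIGNS:
            phones.append(VOWEL_SIGNS[ch])
        elif ch in INDEPENDENT_VOWELS:
            phones.append(INDEPENDENT_VOWELS[ch])
    return phones

print(g2p("कमल"))  # ['k', 'a', 'm', 'a', 'l', 'a']

A real front end would also model schwa deletion, so that कमल surfaces as “kamal” rather than “kamala”; that extra rule layer is part of what makes Indian-language speech synthesis hard.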
VN Shukla, director of the Special Applications group at C-DAC, Noida, UP, says, “Language computing starts right from the hardware level up to the software level and then at the applications layer. And when we talk about hardware, it is not only in the form of input devices, but the whole PC structure.” So the basic tool for language computing is a computer equipped with keyboard drivers, display drivers, language fonts, rendering engines, translation tools, and more to meet all language needs.
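As one illustration of what such a keyboard driver does, the following Python sketch turns Latin keystroke sequences into Devanagari with a greedy longest-match lookup. The five-entry keymap is hypothetical; real input drivers implement full layouts such as INSCRIPT or complete ITRANS-style tables.

# Toy phonetic keyboard handler: Latin keystrokes in, Devanagari out.

KEYMAP = {"ka": "क", "kha": "ख", "ga": "ग", "ma": "म", "la": "ल"}

def transliterate(keys: str) -> str:
    out, i = "", 0
    while i < len(keys):
        for size in (3, 2):            # try the longest keystroke chunk first
            chunk = keys[i:i + size]
            if chunk in KEYMAP:
                out += KEYMAP[chunk]
                i += size
                break
        else:                          # no mapping: pass the key through
            out += keys[i]
            i += 1
    return out

print(transliterate("kamala"))  # कमल  (ka-ma-la)
print(transliterate("khaga"))   # खग  (kha-ga)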
In 1992, an Indian standard for Indian scripts, the Indian Standard Code for Information Interchange (ISCII), was developed; it plays the same role for Indian scripts that the American Standard Code for Information Interchange (ASCII) plays for English. Assigning a standard code to every character aids standardization, but supporting it at the hardware level would have required reverse engineering. Researchers in India therefore took the alternate path and started working at the software layer, allowing people to build applications for Indian languages on top of existing software. “The first thought was to localize the operating system, at that time DOS, to Hindi, but then we realized there was no point, and we started creating drivers for Indian languages for input and display purposes,” reminisces Shukla, who was a scientist with C-DAC at the time of TDIL’s conception in India. He worked closely with professor RMK Sinha of the Indian Institute of Technology, Kanpur (IIT-K), who is known as the father of language computing in India.
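The snippet below shows the idea of per-character codes in Python: Devanagari letters occupy fixed codepoints just as English letters do in ASCII. (Unicode’s Devanagari block was in fact derived from the ISCII layout, which is why the two interoperate cleanly on modern systems.)

# Each Devanagari letter has a fixed codepoint, analogous to ASCII for English.

for ch in "कमल":                     # the Hindi word "kamal" (lotus)
    print(ch, hex(ord(ch)))          # क 0x915, म 0x92e, ल 0x932

for ch in "kml":
    print(ch, ord(ch))               # ASCII: k 107, m 109, l 108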
Researchers working on Indian language computing soon realized that tools available in the global market could not simply be replicated in India, owing to the complexity of the multiple languages that exist in the country. (India has not only 22 major languages and as many as 1,652 dialects, but also 11 scripts to represent these languages.) Swarn Lata, head of the TDIL program and director of the Human Centred Computing group at the DIT, explains that “in Indian languages, one-to-one mapping, or translation of each word as it is to form a sentence, is not workable. The methodology to be followed here is to first process the source language, convert words according to the target language, and then process it all again with respect to the target language for the conversion to make sense.”
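A minimal Python sketch of what Lata describes, assuming a toy three-word lexicon and a plain subject-verb-object input, both hypothetical simplifications: word-for-word substitution keeps English word order, so the generator must reorder into Hindi’s subject-object-verb pattern for the output to make sense.

# Toy transfer-based translation of one sentence pattern, showing the
# analyze / transfer / generate stages.

LEXICON = {"children": "बच्चे", "eat": "खाते हैं", "rice": "चावल"}

def translate(sentence: str) -> str:
    subject, verb, obj = sentence.lower().split()            # 1. analyze source
    s, v, o = LEXICON[subject], LEXICON[verb], LEXICON[obj]  # 2. lexical transfer
    return f"{s} {o} {v}"                                    # 3. generate in SOV order

print(translate("Children eat rice"))
# बच्चे चावल खाते हैं   (literally "children rice eat")
# A word-by-word mapping would wrongly yield "बच्चे खाते हैं चावल".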
Beyond the nature of Indian languages themselves, culture also affects language usage and pronunciation. For example, in northern parts of India, Hindi is spoken in varied forms across different states and cities. Thus there cannot be one generic tool, especially for translation, and tools have to be developed for each of the languages, adds Shukla. In a C-DAC report he notes that although most Indian languages have emerged out of Sanskrit, the core ancient language, and nearly all of them follow Paninian grammar, that itself is a problem, as different languages depend on Sanskrit and Panini in different ways. Accuracy for any of these systems is therefore never 100 percent.
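In practice, one way speech systems absorb such variation is by carrying several pronunciations per word in the recognizer’s lexicon. The Python sketch below uses illustrative variants, not data from any deployed system.

# Toy multi-pronunciation lexicon for a speech recognizer.

PRONUNCIATIONS = {
    "वह": ["v a h", "v o"],              # "he/she/that": careful vs colloquial
    "बहुत": ["b a h u t", "b o h o t"],  # "very": spelling-based vs spoken
}

def variants(word: str) -> list[str]:
    """All phone sequences the recognizer should accept for this word."""
    return PRONUNCIATIONS.get(word, [])

print(variants("बहुत"))   # ['b a h u t', 'b o h o t']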
