Evolution of the 16 Bit Encoding Scheme for Tamil
1. Introduction to Font Encoding
2. ISCII (Indian Script Code for Information Interchange)
3. Unicode
4. Simple and Complex Scripts
5. The Unicode (16-bit) Encoding Scheme for Tamil
6. Disadvantages of current Unicode
7. Proposed All Character Encoding for Tamil - All Tamil Block
8. Advantages of All Character Encoding (All Tamil Block)
9. Should the efficiency of the coding be considered?
1. Introduction to Font Encoding
A computer stores text in the form of numbers and not in its visual written form. Each letter is stored internally as a number, but displayed on the screen as a glyph (written form). A font is used to convert the internal number into a visual form on the screen. By changing the font, the same text can be displayed in any typeface.
"Hello" would be represented as 72 101 108 108 111 internally
72 101 108 108 111 would appear as Hello when viewed with a Times Font
72 101 108 108 111 would appear as Hello when viewed with an Arial Font
By changing the shapes defined in the font, the computer can represent any language.
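As a rough illustration (a short Python sketch, not tied to any particular encoding standard), the numbers behind "Hello" can be seen directly:

```python
# Text is stored as numbers; the font only decides how they are drawn.
text = "Hello"
codes = [ord(ch) for ch in text]        # the numbers stored internally
print(codes)                            # [72, 101, 108, 108, 111]
print("".join(chr(n) for n in codes))   # back to "Hello"; the font supplies the shapes
```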
For a long time, a 'byte' (consisting of 8 bits) was the basic unit of a character in the computer. A byte can accommodate 256 different characters. Out of these 256 locations, at least 32 are reserved for use by the operating system. Hence only 224 locations are freely available. To represent a language in the computer, one had to allot one of these locations to each character or glyph of the language. This process of allotting a location to a character or glyph is called encoding.
The current English language encoding in the Windows operating system is the ANSI scheme. Since the English alphabet contains only 52 letters (26 capital letters and 26 small letters), the assignment of numbers to locations is straightforward.
The Tamil language has 313 letters, including the Grantha letters. Although it is desirable to allot one location to each letter, we cannot do so because of the limited space (224 locations). Hence we assign one location to each glyph (e.g. kaal, kombu, kokki, pulli etc.). For example, in the TAM encoding scheme, glyphs such as the kombu (ெ) and the consonants க and ச are each given their own single-byte locations.
Thus கெ can be represented as 170 232, செ as 170 234, and so on. In these examples, instead of representing கெ and செ as single bytes, we represent them with two bytes each. This needs more space, but there is no other option, since the number of Tamil letters exceeds the available locations.
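A small Python sketch of this glyph encoding idea is given below. Only the values 170 (kombu), 232 (க) and 234 (ச) are taken from the example above; they should be read as an illustration, not as the complete official TAM table.

```python
# Glyph encoding: every stored byte corresponds to exactly one displayed glyph.
# Values follow the example in the text; the full TAM table is not shown here.
GLYPH_TO_BYTE = {"ெ": 170, "க": 232, "ச": 234}
BYTE_TO_GLYPH = {v: g for g, v in GLYPH_TO_BYTE.items()}

def encode(glyphs):
    return bytes(GLYPH_TO_BYTE[g] for g in glyphs)

def decode(data):
    return "".join(BYTE_TO_GLYPH[b] for b in data)

# கெ is stored in visual order: the kombu glyph first, then the க glyph.
stored = encode(["ெ", "க"])
print(list(stored))      # [170, 232]
print(decode(stored))    # the two glyphs, which the font draws side by side as கெ
```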
This system is called the "Glyph Encoding". By doing this we reduce the number of locations required to implement the Tamil script. The Government of Tamilnadu has announced two such encoding schemes for Tamil. They are the TAM and TAB standards. TAM is a monolingual encoding scheme and the TAB is a bilingual encoding scheme.
In these encoding schemes, since there is a one-to-one relation between storage and display (i.e. every byte stored is displayed), the Tamil script could be implemented in any 'ready-made' software package without any additional support. While it was relatively easy to encode the Tamil script in this way, because of its smaller number of characters, the uniformity of the script and the absence of 'conjunct consonants', it was more difficult to encode the other Indian languages.
2. ISCII (Indian Script Code for Information Interchange)
The Indian government, with a view to having a common encoding for all Indian languages, developed the ISCII standard.
Features
• This encoding is bilingual in nature.
• The first 128 locations are exactly as per the ANSI encoding standard and contain the English alphabet, commonly used punctuation marks and symbols.
• The next 128 locations contain the encoding for each Indian language.
• Since these 128 locations are not enough to accommodate the letters of most of the Indian languages, only vowels, consonants and vowel modifiers were encoded.
• Similar vowels, consonants and vowel modifiers of the various languages occupy the same slot, i.e. the vowel 'a' is allocated the same location in all the languages.
• This system of encoding enables text to be viewed in various languages just by changing the font. A text typed in Tamil can easily be read in transliterated form in Hindi by using a Hindi font. The reverse may not hold, since not all Hindi consonants have a Tamil equivalent.
Disadvantages
• It requires more space, since almost all the letters are stored in their broken-down form.
• There is no one-to-one relation between the stored bytes and the displayed glyphs. For example, the Tamil letter 'ku' is displayed as a single glyph but stored internally as the consonant 'k' plus the vowel modifier 'u': a two-to-one relation between storage and display (see the sketch after this list).
• This added complexity in display makes it unsuitable for use in 'ready-made' software.
• As a result, it was hardly used for Tamil.
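The sketch below shows the two-to-one relation in Python. The byte values used for the consonant and the vowel modifier are illustrative placeholders only; they are not the actual ISCII code points.

```python
# ISCII-style character encoding (placeholder byte values, NOT the real ISCII table).
KA = 0xB3        # hypothetical slot for the consonant 'k'
U_MATRA = 0xDC   # hypothetical slot for the vowel modifier 'u'

stored = bytes([KA, U_MATRA])
print(len(stored))   # 2 bytes in memory ...
# ... but the screen shows the single glyph 'ku', so every application needs an
# extra rendering step to combine the two codes: the two-to-one relation.
```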
It must be noted that the same 256 locations are used by different languages. For example, the ASCII scheme uses location 65 to represent 'A', while the TAM encoding scheme uses the same location for the Tamil letter 'A'.
An English word such as 'and' would be read as a meaningless string of Tamil glyphs if we used a TAM-encoded Tamil font on the English text. Hence it is apparent that unless we know which language a text belongs to, we cannot choose the appropriate font to view it. This led to a chaotic situation.
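A minimal sketch of this ambiguity, assuming a hypothetical TAM-style table in which location 65 holds the Tamil letter 'A' (the remaining assignments are placeholders, not the real TAM layout):

```python
# The same stored bytes are read differently depending on the font's mapping.
tam_table = {65: "அ", 66: "ஆ", 67: "இ"}     # illustrative assignments only

data = bytes([65, 66, 67])
print(data.decode("ascii"))                   # 'ABC' when viewed with an English font
print("".join(tam_table[b] for b in data))    # Tamil glyphs when viewed with a TAM font
# Without knowing which language the text belongs to, the bytes are ambiguous.
```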
In order to avoid this confusion, alphabets of different languages had to be given different numbers. This required more locations. Thus was born the 'double byte' encoding scheme which uses 16 bits to represent a character instead of 8 bits. In the 16 bit space 65,536 locations are available as compared to 256 locations in the 8 bit space.
3. Unicode
Unicode is a 16-bit encoding scheme and the most widely used one today. It contains the characters of all the major world languages. It is developed by the Unicode Consortium, whose members include the major software developers and computer manufacturers. The Indian Government is also a voting member of the consortium.
It is a stated policy of Unicode that only characters will be encoded, not glyphs. It also states that it is not concerned with the efficiency of the encoding. It must be noted, however, that in the beginning all existing standard encoding schemes of different languages were implemented 'as is' in Unicode without consideration of these principles.
Because of this policy, the ISCII standard, which was primarily designed for an 8-bit environment, was used as the base for implementing the Indian languages in Unicode. Thus the Tamil block of Unicode was based on the Tamil encoding in ISCII, which had hardly been used until then. A simple script was thereby turned into a complex script.
4. Simple and Complex Scripts
In some Indian languages, when two consonants come together (conjunct consonants), the glyphs are not rendered in the usual way but in a special combined form. In these languages, more than one glyph rendering is possible for a character, depending on the context. Such languages are said to have complex scripts. For these languages a character-type representation in memory may be beneficial, but they have to pay the price of a rendering module.
In Tamil, each character has only one way of being rendered. Hence Tamil does NOT have a complex script.
Because a processing module is needed to show the letters on the screen and keep them in step with the memory content, software designed for English will not work as such for complex scripts, whereas it works smoothly for simple scripts. This is why glyph encodings have been popular in India, while ISCII has not become popular in the commercial world. In Tamil the use of ISCII is negligible.
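The difference can be seen with the standard Unicode Tamil code points: கெ is stored in logical order (consonant first, vowel sign next), even though the kombu is drawn to the left of the consonant, so a shaping engine has to reorder the pieces for display. A short Python illustration:

```python
# Current Unicode stores கெ as consonant + vowel sign, in logical order.
ke = "\u0b95\u0bc6"                   # க (U+0B95) followed by ெ (U+0BC6)
print([hex(ord(c)) for c in ke])      # ['0xb95', '0xbc6']
# The kombu is displayed to the LEFT of க, so the renderer must reorder.
# In a glyph encoding such as TAM the stored order already matches the
# visual order, and no such reordering module is needed.
```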
5. The Unicode (16-bit) Encoding Scheme for Tamil
65,536 combinations are possible with 16 bits. If 16 bits are used for every character, most of the languages of the world can be accommodated. Unicode is designed on this basis, and it is supposed to encode characters, not glyphs.
Formation of Indic blocks from ISCII
As stated above, in ISCII each Indian language was given 128 slots. Basically the same scheme was adopted in Unicode: each Indian language is given a 128-slot block.
Each block is different from the others. Hence the perceived advantage of ISCII – 'store in one language and view in any language' – is not valid in Unicode, while the disadvantages of ISCII have been carried over, with further setbacks.
For Indian languages with complex scripts, there may not be much difference between using Unicode and using ISCII. But Tamil does NOT have a complex script.
In ISCII, like the other languages, Tamil was given 128 slots. Since Tamil has a simple script, it should have been treated differently in Unicode: all its 313 characters and the special symbols could easily have been accommodated. Unfortunately, in Unicode too, Tamil is given only 128 slots. This has resulted in many disadvantages.
6. Disadvantages of current Unicode
6.1. Not a real character encoding
As already mentioned, this coding, taken from ISCII, is not a character encoding. Unicode is supposed to encode characters, but here that is not the case: pure consonants, which are fundamental in nature, do not have slots of their own and are treated in an unnatural way (as a consonant plus the pulli). This will be a constant irritant to programmers doing large-scale natural language processing in the future.
6.2. Multiple ways in representation
For some letters, like 'ko', there are two different ways of coding: one as 'ka' plus the vowel modifier for 'o', the other as 'ka' plus the vowel modifier for 'e' plus the vowel modifier for 'aa'. This ambiguity in representation can cause search and similar operations to give incorrect results.
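A short Python illustration using the standard Tamil code points: a naive comparison treats the two forms of கொ as different strings, and applications are left to normalise them (for example with Unicode NFC) before searching:

```python
import unicodedata

# Two representations of கொ ('ko') in current Unicode.
ko_one = "\u0b95\u0bca"            # க + vowel sign O
ko_two = "\u0b95\u0bc6\u0bbe"      # க + vowel sign E + vowel sign AA

print(ko_one == ko_two)            # False: a plain search misses one of the forms
print(unicodedata.normalize("NFC", ko_one) ==
      unicodedata.normalize("NFC", ko_two))    # expected True after normalisation
```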
6.3. Not a complex script
Above all, treating Tamil as having a complex script has enormous negative consequences, which will hinder the growth of the language's use in the future.
7. Proposed All Character Encoding for Tamil - All Tamil Block
Representing all the Tamil letters, each with a separate slot, is the natural way to treat Tamil. The letter table given in primary school books should form the basis of such a scheme, and the special symbols should also be included. Such a scheme is shown in Annexure A.
8. Advantages of All Character Encoding (All Tamil Block)
8.1. Real Character Encoding
It is the real character encoding, representing the true nature of Tamil. In the All Tamil Block, the space required for any Tamil text will be only about two thirds of what is required in the current Unicode scheme. Let us see an example. Consider the word Tamil (தமிழ்). In the current Unicode it is represented as five symbols: ta, ma, the ikara modifier, zha and the pulli. In the All Tamil scheme it is represented with only three letters: ta, mi and the pure consonant zh.
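The difference can be counted directly. The Unicode code points below are the standard ones for தமிழ்; the All Tamil Block values are placeholders, since the code points of Annexure A are not reproduced here.

```python
# Current Unicode: தமிழ் needs five 16-bit code points.
tamil_unicode = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"   # ta, ma, i-sign, zha, pulli
print(len(tamil_unicode))     # 5

# All Tamil Block: one code point per letter (ta, mi, zh) – placeholder values only.
tamil_all_block = [0xE0A1, 0xE0B5, 0xE0C2]
print(len(tamil_all_block))   # 3, roughly two thirds of the space
```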
Traditionally, most of the popular Tamil software over the past 20 years has encoded almost all the Tamil characters. The same is done in TAM, the TN Govt. standard encoding for the printing and publishing industry. The "All Tamil Block" not only follows this traditional way of encoding Tamil characters but also removes the earlier restrictions of 8-bit encoding.
8.2. Efficient Design
The 16 bits are constructed in a scientific way. Of the sixteen bits, the first 7 bits indicate the language, the next 5 bits give the serial number of the consonant part of a Tamil letter, and the last 4 bits give the serial number of the vowel part. A zero in either field means the absence of the consonant or vowel part, that is, a pure vowel or a pure consonant. Hence it is extremely easy to see what a letter contains. This simplicity comes from the natural way in which the coding is designed.
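A sketch of that bit layout in Python. The field widths (7 + 5 + 4 bits) follow the description above; the language prefix and the serial numbers are hypothetical, since Annexure A is not reproduced here.

```python
# 16-bit layout: [7 bits language][5 bits consonant serial][4 bits vowel serial]
LANG_TAMIL = 0x2A                       # hypothetical 7-bit language prefix

def pack(consonant, vowel):
    """Serial numbers start at 1; a value of 0 means the part is absent."""
    return (LANG_TAMIL << 9) | (consonant << 4) | vowel

def unpack(code):
    return (code >> 9) & 0x7F, (code >> 4) & 0x1F, code & 0x0F

print(unpack(pack(consonant=1, vowel=3)))   # (42, 1, 3): language, consonant, vowel
print(unpack(pack(consonant=1, vowel=0)))   # vowel field 0: a pure consonant
print(unpack(pack(consonant=0, vowel=3)))   # consonant field 0: a pure vowel
```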
8.3. Savings in Cost of Computer Storage Space
Its simplicity leads to enormous savings. The space requirement for a Tamil text in the current Unicode is about 50% more than what is required in the all character encoding.
8.4. Saving in Cost of Internet Communication Bandwidth
The time required to transmit Tamil text is also increased by about 50% in the current Unicode. It is common sense that any language processing takes more time when the text is longer.
8.5. Saving in Computer Display Time
With the current Unicode, Tamil data is displayed on a computer monitor much more slowly than it would be under the proposed scheme.
8.6. Other Savings
Rendering time, search time and the time for many other language-processing tasks are also significantly higher in the current Unicode. When the unproductive waiting time of users is included, the cost is far higher still. Added to this is the storage cost of about one and a half times what is really required. This is an avoidable, never-ending, recurring expenditure. The current Unicode will result in an enormous drain on the resources of the Tamil community; many crores of rupees will be wasted every month for many years to come.
9. Should the efficiency of the coding be considered?
Progress in the Western scientific world became possible only after it came to know of the decimal number system. While using Roman numerals, even simple calculations took an enormous amount of time, and progress was slow. Try adding two Roman numerals and you will feel the power of a good notation. History shows that a coding system has an enormous influence on the people who use it, and this lesson should not be forgotten. Many essays and books on history testify to this; a sample from the web is given below.
The following is from a website:
"Prior to the use of "Arab" numerals, as we know them today, the West relied upon the somewhat clumsy system of Roman numerals. Whereas in the decimal system, the number 1948 can be written in four figures, eleven figures were needed using the Roman system: MDCCCXLVIII. It is obvious that even for the solution of the simplest arithmetical problem, Roman numerals called for an enormous expenditure of time and labor. The Arab numerals, on the other hand, rendered even complicated mathematical tasks relatively simple.
The scientific advances of the West would have been impossible had scientists continued to depend upon the Roman numerals and been deprived of the simplicity and flexibility of the decimal system and its main glory, the zero. Though the Arab numerals were originally a Hindu invention, it was the
Arabs who turned them into a workable system; the earliest Arab zero on record dates from the year 873, whereas the earliest Hindu zero is dated 876. For the subsequent four hundred years, Europe laughed at a method that depended upon the use of zero, "a meaningless nothing." "
It may be noted that if Tamil had been treated as having a simple script, it would have been supported in Unicode more than 10 years ago. Because the ISCII-based code was imposed on Tamil, it has taken this long to provide even basic support for Tamil. This can be seen as the first negative effect of the encoding, and a proof of what history teaches us.
Just because a mistake has existed in Unicode for a few years does not mean that we have to live with it forever. If, citing stability and its policy of not considering efficiency in any manner, the Unicode Consortium does not agree to change the existing scheme, the best course will be the following:
1. Govt. of Tamilnadu to convince the Govt. of India to recommend the All Tamil Block scheme as the proposed scheme for Tamil and forward the proposal to the Unicode Consortium.
2. Govt. of Tamilnadu to present the case for the All Tamil Block scheme at the next Unicode Consortium meeting in person and convince the consortium of the need for the revised scheme.
Dr. Krishnamurthy, Mr. Elangovan and Mr. P. Chellappan
Kanithamizh Sangam