Using Unicode Conversion Gateway
Converting online pages
Use the portal Unicode Conversion Gateway to access the listed sites, converted into Unicode.
Use Transliteration for transliterated Indian-language Unicode sites. Supported scripts include Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Tamil, and Telugu.
On command line for mass conversion
Steps to be followed
- Download the latest version.
- Uncompress the package.
- cd into the created folder and run the following command:
php5 cmd_convert.php5 [website_name] < input_file_name > output_file_name
The PHP script cmd_convert.php5 reads its input from stdin and writes its output to stdout. The optional parameter website_name identifies the default font to use for this file.
Installing PHP for command line scripting.
As a web based document converter
Use the web based document converter to convert plain text documents in a proprietary encoding into Unicode.
Adding a New FontConverter
To add a new Font Encoding Converter, we need to convert all the characters in the font to an intermediate format called "Padma" and then convert from "Padma" to Unicode. Each Font Encoding Converter resides in a separate file in the code base. This file contains all the mappings and functions needed to convert the proprietary font encoding to Padma. Conversion from Padma to Unicode is handled by the language file of the corresponding font.
The transformer is responsible for this whole process. It makes use of a parser to convert the input text to an intermediate (Padma) format. The basic job of the parser is to break input text into syllables. The intermediate format is then converted into the desired output (Unicode) format by using a simple lookup table in the corresponding language file.
In general, the complexity of adding a new font depends on how well the font is designed. The criterion here is how many special rules need to be written for describing shapes that can be rendered by the font compared to how many that can be generated by using generic principles. Some of these are script dependent - it seems like Malayalam and Tamil fonts are better designed than Telugu or Devanagari. That's not always the case though - TeluguLipi happens to be one of the cleanest designs, I have seen so far. The worst offenders are Devanagari fonts that try to use the same glyph for the consonant stem and the vowel sign (maatra) for 'aa'.
--(words from the Author of Padma)
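The two-stage conversion described above (font encoding to intermediate format, then intermediate format to Unicode) can be sketched as follows. This is only an illustration under stated assumptions: the table contents, the symbol names such as consnt_KA, and the function names are hypothetical and are not taken from the Padma or Unigateway sources. The sketch also skips syllable parsing and maps one character at a time.

```javascript
// Hypothetical font table: legacy font character -> intermediate symbol.
// The symbol names are placeholders, not Padma's actual constants.
const fontToIntermediate = new Map([
  ["A", "consnt_KA"],
  ["B", "vowelsn_AA"],
  ["C", "consnt_MA"],
]);

// Hypothetical language table: intermediate symbol -> Unicode (Devanagari).
const intermediateToUnicode = new Map([
  ["consnt_KA", "\u0915"],  // DEVANAGARI LETTER KA
  ["vowelsn_AA", "\u093E"], // DEVANAGARI VOWEL SIGN AA
  ["consnt_MA", "\u092E"],  // DEVANAGARI LETTER MA
]);

// Stage 1: font encoding -> list of intermediate symbols.
// Characters without a mapping pass through unchanged.
function toIntermediate(text) {
  return Array.from(text, (ch) =>
    fontToIntermediate.has(ch) ? fontToIntermediate.get(ch) : ch
  );
}

// Stage 2: intermediate symbols -> Unicode, using a plain lookup table.
function toUnicode(symbols) {
  return symbols
    .map((s) => (intermediateToUnicode.has(s) ? intermediateToUnicode.get(s) : s))
    .join("");
}

function transform(text) {
  return toUnicode(toIntermediate(text));
}

console.log(transform("ABC")); // prints "काम"
```

Keeping the two tables separate mirrors the split described above: the first table belongs to the font-specific file, the second to the language file, so a new font only needs a new first table.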
- Open the font (.ttf file) in a font editor such as FontForge.
- Each glyph is represented by a code value. Map all these glyphs to their respective values in your font-specific file (for a PHP file, map the UTF-8 values).
- After mapping all the glyphs to their respective values, map each glyph to its corresponding intermediate Padma character.
The following attributes must be implemented in a font-specific file:
- fontFace - how the font is specified in HTML
- displayName - currently unused (once upon a time, it was used in the UI for heuristic transformer)
- script - the script in which the font is typically rendered. From the auto transform whitelist, users can configure the script in which they want the site to be rendered.
- hasSuffixes - a boolean, assumed to be false by default. Set it to true for fonts of scripts such as Devanagari, Gujarati, and Kannada, which have complex rules for handling conjuncts that contain 'ra'. For example, in arjun, the syllable 'rju' is rendered with the glyph for 'ra' following the glyph for 'ja'.
- maxLookupLen - the number of code points in the input that the parser should examine before concluding that it has the right mapping. This is the length of the longest mapping you will write. Ideally this would be 1, but some fonts use as many as 4. (This is used in conjunction with the isOverloaded() API; see below.)
The following functions must be implemented in a font-specific file:
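As a rough sketch of how these attributes might fit together, the example below defines a made-up font object and uses maxLookupLen to bound a greedy longest-match scan. Everything here is an assumption for illustration: the font name, the mapping table, the SYM_ symbol names, and the tokenize function are hypothetical, and the actual Padma parser may resolve ambiguous sequences differently.

```javascript
// Hypothetical font description; all values are illustrative only.
const exampleFont = {
  fontFace: "ExampleFont", // how the font would be named in HTML
  displayName: "Example",  // currently unused
  script: "Telugu",        // assumed plain string; the real code may use a constant
  hasSuffixes: false,
  maxLookupLen: 2,         // length of the longest key in the table below
  table: new Map([
    ["a", "SYM_1"],
    ["ab", "SYM_2"],       // two-code-point mapping
    ["c", "SYM_3"],
  ]),
};

// Greedy longest match: at each position, try maxLookupLen code points
// first and fall back to shorter sequences.
function tokenize(font, text) {
  const cps = Array.from(text); // split by code point, not UTF-16 unit
  const out = [];
  let i = 0;
  while (i < cps.length) {
    let matched = false;
    const longest = Math.min(font.maxLookupLen, cps.length - i);
    for (let len = longest; len >= 1; len--) {
      const key = cps.slice(i, i + len).join("");
      if (font.table.has(key)) {
        out.push(font.table.get(key));
        i += len;
        matched = true;
        break;
      }
    }
    if (!matched) {
      out.push(cps[i]); // no mapping: keep the code point as-is
      i += 1;
    }
  }
  return out;
}

console.log(tokenize(exampleFont, "abac")); // [ 'SYM_2', 'SYM_1', 'SYM_3' ]
```

With maxLookupLen set to 1, the sequence "ab" would never be tried as a unit, which is why the attribute has to equal the length of the longest mapping in the file.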
Note: Here 'str' represents a sequence of code points whose length is <= maxLookupLen.
- lookup(str) - returns the intermediate format of 'str' for the corresponding encoding.
- isPrefixSymbol(str) - prefixes are common to all the Indic scripts [Ex: Devanagari vowel sign 'i']. This API tells the parser whether 'str' is visually rendered before its logical position.
- isSuffixSymbol(str) - similar to the above, but it needs to be implemented only if the attribute 'hasSuffixes' is set to true.
- isOverloaded(str) - returns true if 'str' is part of more than one lookup sequence.
- handleTwoPartVowelSigns(str1, str2) - handles the many vowel signs that are made up of more than one glyph.
- isRedundant(str) - if 'str' adds no value for the parser (for example, talakattu in Telugu), the parser treats it as a redundant symbol and removes it.
- preProcessMessage(input) - currently, parsing is done in two phases: redundant code points are removed, and syllables are then extracted. In some cases it may make sense to rewrite the input string to avoid complicated special rules. In that case, preProcessMessage should be implemented.
Note: Implement either isRedundant(str) or preProcessMessage(input), but not both.
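A minimal sketch of some of these functions is given below, together with a small driver that shows why the parser needs isPrefixSymbol and isRedundant. The mapping table, the symbol names, and the reorder function are hypothetical and exist only for this illustration. In particular, isOverloaded is written here as one possible reading of the description above, and the driver is much simpler than a real syllable parser.

```javascript
// Hypothetical mapping table: font code sequence -> intermediate symbol.
const table = new Map([
  ["f", "vowelsn_I"],  // drawn before the consonant, pronounced after it
  ["k", "consnt_KA"],
  ["e", "vowelsn_E"],
  ["eA", "vowelsn_O"], // "e" also starts a longer sequence
]);

const exampleFont = {
  maxLookupLen: 2,
  // Returns the intermediate symbol, or undefined when there is no mapping.
  lookup(str) {
    return table.get(str);
  },
  isPrefixSymbol(str) {
    return str === "f";
  },
  // One reading of "part of more than one lookup sequence":
  // true when 'str' occurs in more than one key of the table.
  isOverloaded(str) {
    let count = 0;
    for (const key of table.keys()) {
      if (key.includes(str)) count += 1;
    }
    return count > 1;
  },
  // "~" stands in for a glyph that carries no information.
  isRedundant(str) {
    return str === "~";
  },
};

// Simplified driver: drop redundant codes and move each prefix symbol
// after the symbol that follows it visually.
function reorder(font, codes) {
  const out = [];
  let pendingPrefix = null;
  for (const code of codes) {
    if (font.isRedundant(code)) continue;
    const mapped = font.lookup(code);
    const symbol = mapped !== undefined ? mapped : code;
    if (font.isPrefixSymbol(code)) {
      pendingPrefix = symbol;
      continue;
    }
    out.push(symbol);
    if (pendingPrefix !== null) {
      out.push(pendingPrefix);
      pendingPrefix = null;
    }
  }
  if (pendingPrefix !== null) out.push(pendingPrefix);
  return out;
}

console.log(reorder(exampleFont, ["~", "f", "k"])); // [ 'consnt_KA', 'vowelsn_I' ]
```

The sketch implements isRedundant and leaves out preProcessMessage, in line with the note above that only one of the two should be provided.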
Testing a New FontConverter
These instructions are written for Linux. Windows users can follow similar steps for testing.
Follow the steps below to set up the environment and add a converter developed in PHP on your local machine.
Setup Test Environment For Unigateway
- Get the latest Unigateway code from the download page.
- Uncompress the package.
tar -xvzf unigateway-0.5.3.tar.gz for the tar archive (or)
unzip unigateway-0.5.3.zip for the zip archive
- For web based usage,
copy the contents of the uncompressed directory into your web server's home directory or one of its subdirectories.
For example, this would be /var/www in the case of Apache.
- Edit the file config.php5 and set the variable $server to the URL of the Unicode conversion gateway directory.
$server will be of the form http://<domain_name>/<directory_name>
Adding New Converter To Unigateway
Copy the encoding file (such as Eenadu.php5) into the Encoder/fonts/ directory.
Accessing Unigateway in your local machine
Access the file http://localhost/unigateway-0.5.3/unicode.php5 and pass the HTML page (with sample text to convert into Unicode) as the value of the file variable.
Here small_sample.html is a sample html file with proprietary encoding.
Setup Test Environment For Padma
- Install the latest version of Padma.
- Go to the Padma extension directory of Firefox (FF).
It could be like,
- Then enter the directory named "chrome". You should find a file named "padma.jar" inside it. It is a zipped file, so use the command "unzip padma.jar". This should create a directory named "content" there. You will need to go further into this directory, but first make sure FF reads the changes that you are going to make. To do so, follow the next step.
- In the parent directory, where you found the directory "chrome", you should also find a file named "chrome.manifest".
In this file, replace the line containing "content padma jar:chrome/padma.jar!/content/" with "content padma chrome/content/"
This tells FF to read from the content directory where you are going to make changes, and not from the original zipped archive.
Adding New Converter To Padma
- The transformer should be told about the new encoding by
defining a mapping from its name to its implementation in the source file content/transformers/Transformer.js,
i.e. something like
Transformer.dynFont_Eenadu = 0;//dynFont_Unknown should be the max
Transformer.dynFont_Class[Transformer.dynFont_Eenadu] = Eenadu;//Class that implements the font for eenadu font.
- The JS file for the encoding should go into the appropriate script folder.
- The JS file should be included in two XUL files in the src/content folder: padma.xul and padmaMailOverlay.xul.