FOSS Localization/Localization Efforts in the Asia-Pacific


A Survey

In the winter of 2003, a survey was done on FOSS localization efforts in several Asian countries. Progress in localization varies considerably from place to place. Governmental sponsorship of the project appears to be the single most important factor in achieving rapid success in localization. Without exception, the developers surveyed cited "Freedom to Develop" as their primary reason for choosing to localize FOSS instead of using proprietary software.

The CJK Initiative

China, Japan and Korea are officially cooperating in FOSS localization. The well-funded and extensively promoted "CJK" programmes encouraging the use of localized FOSS are on a par with those of the wealthiest nations.

With educational and technological infrastructure and a large pool of skilled technology and language specialists, these three countries have the potential to dominate East Asian FOSS development in the next decade.

As donors and advisors, all three countries have a track record of supporting FOSS initiatives worldwide. Others should emulate the technical and organizational competence of the CJK FOSS and localization initiative, which is being led by the Chinese Software Industry Association (CSIA), Japanese IT Services Industry Association (JISA), and the Federation of Korean Information Industries (KFII).

This primer is too short to fully detail the impressive work of the CJK initiative. However, a number of references to CJK as well as contact information for the key leaders of the movement are provided.[1]

Indian Languages

There are already multiple localized Indian language versions of GNU/Linux, Mozilla and OpenOffice.org available online. The depth of technical experience of Indian programmers in localization should enable them to provide assistance to other Asian countries in their localization efforts.

India has world-class IT infrastructure and skills. It also has a multiplicity of languages. Currently, the Indian Ministry of Information Technology is localizing to Hindi, Marathi, Kannada, Malayalam, Tamil, Telugu and Sanskrit.

At least two independent groups as well as the Ministry of Information Technology are currently localizing to Tamil.

An initiative to localize GNU/Linux to around 10 Indian languages, called Indix, is being spearheaded by the Centre for Development of Advanced Computing (CDAC), a scientific society of the Ministry of Communications and Information Technology, Mumbai. There is also a Hindi GNU/Linux and a Bangla version in progress.

English is widely used in India. Therefore, the underlying programming languages are easily understood, making the work much easier. For English language FOSS development, South Asia can realistically compete with Europe and the United States today.


Thai

Localization of FOSS in the Thai language was started in 1993 by Thai students in Japan in an effort to use Thai in Unix environments. Later, a similar movement commenced in Thailand. These two movements merged and LinuxTLE and the Thai Linux Working Group were launched in 1999. However, the movement was limited to a small community until mid-2001 when the Thai version of OpenOffice.org, dubbed Pladao and sponsored by Sun Microsystems (Thailand), was announced to the public. Documentation is being carried out by private publishers who, in exchange, will retain the copyright on their work for later sale to the public in the form of books and manuals.

Only 10 individuals are actively involved in the project at this time, with development resources provided by a combination of private companies and government.

Thai developers are using the GNU compiler and tools to modify GNU/Linux, Mozilla's "Bugzilla" for defect tracking, and CVS for version and source control. In addition, OpenOffice.org, a productivity suite, is being localized for Thai users.

The primary difficulties the Thais are experiencing include lack of feedback from users, lack of documentation, and conflicts with other Thai localization projects because of a lack of coordination and standards.

The Thais intend to roll out the software for both the private sector and public institutions, but are still in the planning stage at this time. The team has not yet seen a high-profile effort to promote the use of FOSS in Thailand. This may change in the near future, depending on government and private funding. In comparison to the CJK efforts, Thailand has fewer developers, less money, and less outside interest due to the size of its market. Other Thai localization projects are discussed in the Thai Localization case study.

Vietnamese

FOSS localization to the Vietnamese language began in 1998, primarily as an individual effort by an enthusiast. Today, between 20 and 50 people are working on the project. They are volunteers, providing their resources and time for free, and all are using FOSS tools for the work. But even with the number of individuals currently involved, there are not enough people to perform all of the necessary work.

Vietnamese localizations of GNU/Linux, Gnome, KDE, XFCE, Fluxbox, OpenOffice.org, Mozilla, XMMS, GnuCash, Gaim and Knoppix are in progress.

Plans for introducing the software to the Vietnamese people include the publication of books, formation of user groups in major cities, and provision of the software to private and public institutions. Already, there has been some effort to promote the use of FOSS in secondary schools, universities and government offices. Recently the government adopted a FOSS policy.

Malay

The Malaysian Institute of Microelectronic Systems (MIMOS) has launched a Web site to document projects in FOSS within Asia with links to key resources. MIMOS's goal is to localize key applications into the national language, Bahasa Melayu. MIMOS has already developed a local language GNU/Linux GUI and open source applications for the government. Two other localization projects are GNOME, led by Hasbullah Bin Pit, and OpenOffice.org, spearheaded by MIMOS.

Khmer

The Khmer Software Initiative has an ambitious though under-funded plan to localize FOSS for Khmer speakers in Cambodia and elsewhere. In addition to a localized version of the GNU/Linux user interface and some applications, they intend to create a development library, complete documentation both online and in print, and specialized training materials for developers and end users. It is not known how many individuals are working on the project currently.

As elsewhere, the Khmer team envisions providing the software to both private and government users. A publicity campaign is planned and the developers expect significant support from the international development community to complete much of the work.

The quality of the planning for Khmer, a crucial first step, is very high. As with other developing countries, funding and personnel to complete the work are the main hurdles. For more information on this area, see the Khmer Localization case study.

Case Studies

Thai Localization

Fortunately for localizers, Thai language support has been stabilized by means of standardization achieved in the preceding years. The only limitation is the readiness of the internationalization frameworks within Gnome, KDE, Mozilla and OpenOffice.org.

FOSS communities of Thailand contribute a lot to both software development and user support. Governmental organizations also play important roles in FOSS reaching the masses.

Thai Language

The official language of Thailand is Thai. Spoken by almost the entire population, with several dialects in different regions, Thai is a tonal, uninflected and predominantly monosyllabic language. Most polysyllabic words in the vocabulary have been borrowed, mainly from Khmer, Pali and Sanskrit.

Thai belongs to the Tai language family, which includes languages spoken in Assam, northern Myanmar, Thailand, Lao PDR, northern Viet Nam, and the Chinese provinces of Yunnan, Guizhou and Guangxi.[2]

Thai script belongs to the Brahmi family. The oldest evidence of Thai script dates back over 700 years to the Sukhothai Age. Over the years, Thai script has gradually changed. Contemporary Thai script is composed of 44 consonants, 21 vowel symbols (32 sounds when combined), four tone marks, two diacritic marks and 10 decimal digits. Tone marks, diacritic marks and some vowel signs are stacked over or below the base consonants. This stacking is a characteristic shared by many South-East Asian scripts, especially those derived from Brahmi, such as the Lao, Khmer and Myanmar scripts. However, there are no complex precombined conjuncts in Thai, unlike most Indic scripts (Devanagari, for example). Only stacking is required. Lao is the script closest to Thai.
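
The difference between base characters and stacked marks can be illustrated with a short Python sketch. It is not taken from any of the projects described here. It assumes that the Unicode general category "Mn" (nonspacing mark) is an adequate test for Thai tone marks, diacritics and above/below vowel signs, and the function name split_thai_cells is made up for this illustration.

```python
import unicodedata


def split_thai_cells(text):
    """Group each base character with the marks stacked above or below it."""
    cells = []
    for ch in text:
        # Nonspacing marks (category "Mn") attach to the preceding base
        # character; every other character starts a new display cell.
        if unicodedata.category(ch) == "Mn" and cells:
            cells[-1] += ch
        else:
            cells.append(ch)
    return cells


# THO THAHAN + SARA II + MAI EK, then NO NU + SARA II + MAI EK:
# two consonants, each carrying a vowel sign and a tone mark.
stacked = split_thai_cells("\u0e17\u0e35\u0e48\u0e19\u0e35\u0e48")

# SARA E + KO KAI + MAI EK + SARA AA: the leading vowel and the trailing
# vowel are spacing characters, so only the tone mark is stacked.
mixed = split_thai_cells("\u0e40\u0e01\u0e48\u0e32")
```

Because Thai has no conjuncts, this simple grouping is enough to find every stack in the two sample words; a script such as Devanagari would need considerably more logic.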

Standardization

Standardization is the key to the success of Thai language support in computers. It allows interoperability and resolves many localization issues. Important standards include character set, keyboard layout and input/output method specifications.

The need for IT standardization in Thailand has been recognized since 1984,[3] when there were many efforts to use the Thai language in computers. Different vendors defined more than 26 code pages, which resulted in incompatibility. As a solution, they were all unified in TIS 620-2529/1986, issued as the national standard by the Thai Industrial Standards Institute (TISI). A prominent legacy encoding was the code table defined by Kasetsart University (KU) in a successful R&D effort to enable the Thai system in MS-DOS. It was the most widely adopted code table, so computer programs were obliged to support both encodings until TIS-620 became more popular and the KU encoding became obsolete.

Therefore, when Microsoft released Windows into the Thai market, TIS-620 was the only encoding adopted. The same was true for Macintosh. Thus, the character encoding issue was firmly settled.

In 1990, TIS-620 was amended to conform to ISO standards, but the code table was left completely unchanged. This new version was called TIS 620-2533/1990. The amendment enabled TISI to actively join many international standardization activities. For example, it submitted the character set to the European Computer Manufacturers Association (ECMA) for registration in the ISO 2375 repertoire, where it was assigned ISO-IR-166 so that it could be used with ISO/IEC 2022 mixed code page encoding. An example of such an implementation is GNU Emacs. Around 1996, a TISI technical committee drafted a proposal for the Latin/Thai part (part 11) of ISO/IEC 8859, based on TIS-620. However, the proposal was suspended because combining characters were prohibited. It was reactivated in 1999, and endorsed as an international standard in 2000.

The TIS-620 code table was pushed for inclusion in the Unicode table. The influence of the ISCII encoding scheme (which forced all vowels, including the leading vowels, to always be encoded after consonants) led the Unicode Consortium to press Thai to change its encoding scheme, but TISI defended the TIS-620 practice, as Thai script did not need such complications. Although this made Thai (and Lao) different from other Indic scripts, it saved Thai (and possibly Lao) implementations from a major hindrance, namely the lack of supporting technology for the ISCII practice at that time, as well as from the burden of migrating away from a well-settled and widely adopted practice. As a result, all Thai language-processing code for TIS-620 and Unicode is fully compatible, apart from the necessary code conversion.
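
The "necessary code conversion" is small because the Unicode Thai block keeps the TIS-620 order, so a byte and its code point differ by a constant. The following Python sketch illustrates this under simplifying assumptions: it handles only ASCII and the Thai range, it does not reject the few unassigned positions, and the function names are invented for the example.

```python
# TIS-620 byte 0xA1 corresponds to U+0E01, 0xA2 to U+0E02, and so on,
# so the two encodings differ by a constant offset.
TIS620_OFFSET = 0x0E00 - 0xA0


def tis620_to_unicode(data):
    """Decode TIS-620 bytes (sketch: undefined byte values are not rejected)."""
    return "".join(chr(b) if b < 0x80 else chr(b + TIS620_OFFSET) for b in data)


def unicode_to_tis620(text):
    """Encode ASCII and Thai-block characters as TIS-620 bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif 0x0E01 <= cp <= 0x0E5B:
            out.append(cp - TIS620_OFFSET)
        else:
            raise ValueError(f"not representable in TIS-620: {ch!r}")
    return bytes(out)


# SARA E (a leading vowel) followed by KO KAI. The vowel is stored first
# in both encodings, which is the practice TISI defended.
word = "\u0e40\u0e01"
encoded = unicode_to_tis620(word)
```

Python's standard library also ships a tis_620 codec, so word.encode("tis_620") should produce the same bytes as the sketch.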

In addition to the character set, the Thai keyboard layout was standardized as TIS 820-2531 in 1988, and later amended by adding keys for special symbols and published again as TIS 820-2538 in 1995. Another keyboard variant designed after research on character frequencies, called Pattajoti, is also available. But it is not as popular and is not a national standard.

Thai input/output methods were also standardized through the efforts of the Thai API Consortium (TAPIC), which was formed by a group of vendors and academics and sponsored by the National Electronics and Computer Technology Center (NECTEC). The specification, called WTT 2.0 (from its Thai abbreviation 'Wor Tor Tor', which stands for "Wing Took Tee" or "runs everywhere"), was published in 1991. It consists of three parts, describing the character set and encoding scheme, input/output methods, and printer identification numbers.

Although not endorsed by TISI as a national standard, WTT 2.0 was adopted by virtually all the major vendors who participated in the draft, including DEC, Sun, Microsoft, Macintosh and IBM. WTT 2.0 served as the de facto national standard for seven years, until TISI designated it TIS 1566-2541 in 1998.

There are other activities[4] with international standards bodies that promote understanding of Thai language requirements among vendors. For example, in 1998 the "tis-620" MIME character set was registered with the Internet Assigned Numbers Authority (IANA) for information interchange on the Internet.[5] Another example is an annex of ISO/IEC 14651, International String Ordering, describing how the predefined sorting order for Unicode can be tailored to match the requirements of Thai string ordering.
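
One reason Thai needs a tailored ordering is that leading vowels are written, and encoded, before the consonant they follow in pronunciation, while dictionaries order words by the consonant first. The Python sketch below shows only this one rearrangement. It is a simplification, not the ISO/IEC 14651 tailoring itself: it ignores the lower-priority treatment of tone marks and other details, and the name thai_sort_key is hypothetical.

```python
# The five Thai leading vowels: SARA E, SARA AE, SARA O,
# SARA AI MAIMUAN and SARA AI MAIMALAI.
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")


def thai_sort_key(word):
    """Swap each leading vowel with the character that follows it."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


words = [
    "\u0e02\u0e32",  # KHO KHAI + SARA AA
    "\u0e40\u0e01",  # SARA E + KO KAI
    "\u0e01\u0e32",  # KO KAI + SARA AA
]
by_code_point = sorted(words)
by_consonant = sorted(words, key=thai_sort_key)
```

Sorting by raw code point places the word that starts with SARA E after every consonant-initial word, whereas the rearranged key files it among the other KO KAI words.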

With these established standards, specifications for Thai implementations are clear and interoperability is guaranteed. The standards have played an important role in several developments in the Thai computer industry, including FOSS localization.

Localization

Thai localization of FOSS was started in 1993 by Thai students in Japan with the ThaiTeX project initiated by Manop Wongsaisuwan as a first effort to use Thai in the versatile typesetting program. [6] Subsequently, the project was maintained by the Thai Linux Working Group (TLWG) [7] which was formed in 1999.

Apart from Thai LaTeX, other surrounding UNIX environments have been modified by the same group of people to support Thai. Their work may be summarized as follows:

Manop Wongsaisuwan
ThaiTeX (ttex), X bit-map fonts.
Thai Project (by Vuthichai Ampornaramveth) [8]
cttex (C version of ttex), xiterm+thai (Thai X terminal), likit (Thai text editor for X).
ZzzThai project (led by Poonlap Veerathanabutr) [9]
thailatex-component, X bit-map fonts, Thai support in xfig, Thai-HOWTO and Thai RPMs.

At the same time, other Thai support projects were developed by researchers of the NECTEC in Thailand, including:

Virach Sornlertlamvanich
Thai support in Omega (Unicode-based TeX kernel), Thai in GNU Emacs, machine translation, and many other NLP projects.
Surapant Meknavin
thailatex (babel-based), Thai search engine.
Phaisarn Charoenpornsawat
swath (Thai word break utility).
Theppitak Karoonboonyanan
Thai string collation, Thai locale. [10]
The National Fonts Project
Standardized font design specification, three public domain vector fonts (Kinnari, Garuda and Norasi).

Another project worth mentioning is Linux SIS (School Internet Server), [11] which was initiated by NECTEC for use in the SchoolNet project. [12] Although it is a server-centric distribution, it was during this project that another main task force for Thai FOSS localization was constituted. Through a mailing list for supporting its users, the volunteers agreed that another distribution for desktops was needed, and was feasible to develop, as almost all of the specialists mentioned above were there. This task was undertaken by the Thai Linux Working Group. A Web site (http://linux.thai.net) was created for general user support. A new GNU/Linux distribution called Linux TLE (Thai Language Extension) was created to collect, as comprehensively as possible, the existing works of Thai developers and package them for users.

Apart from being a tool for boosting the use of FOSS by Thai users, Linux TLE also provided a platform for development, and a test cycle where users could participate through bug reports. The ultimate goal was to improve Thai support for FOSS from the source. Therefore, getting patches checked-in to upstream projects was the final success indicator.

So far, a great deal of source code from TLWG and Linux TLE has been incorporated in upstream projects, including:

  1. Thai locale definition in GNU C library.
  2. Thai keyboard maps in XFree86.
  3. Thai XIM in XFree86.
  4. Thai fonts in XFree86.
  5. Thai Pango modules.
  6. Thai string ordering in MySQL.
  7. Thai support in GTK+, GLib, Qt, KDE, Mozilla, Xpdf, etc.

In 2000, Linux TLE was handed over to NECTEC for maintenance. Three versions (3.0, 4.0 and 4.1) were released and gained a lot of recognition from users in all parts of the country. A dedicated Web site was created for it at http://opentle.org in 2003. TLWG continued to build its user and developer communities.

Some of the TLWG projects that are hosted and maintained by the community are:

libthai [13]
a library of basic Thai support routines, including character set conversion, input/output methods, word break, string collation, etc. Some plug-ins for Pango and GTK+ IM are also provided.
thaifonts-scalable [14]
a collection of scalable fonts available to the public, plus some fonts developed in-house. All fonts are maintained and improved based on standard technical specifications.
thailatex [15]
Babel-based Thai support for LaTeX document preparation, based on Surapant Meknavin's work.

Other significant development efforts to boost the Thai environment in FOSS are Pladao [16] and OfficeTLE. [17] Both projects aim to develop Thai support in OpenOffice.org. Pladao was initiated by Sun Microsystems (Thailand) and was subsidized by Algorithms Co. OfficeTLE was initiated and operated by NECTEC. Both of these projects are working in parallel, emphasizing different aspects. Pladao is feature-rich, while OfficeTLE emphasizes Thai language processing quality. Many hope that they will merge, possibly through upstream projects.

Obstacles

These are the obstacles to the localization of FOSS in Thailand:

  1. Too few developers. With the growth of the user base, the lack of FOSS developers to serve the growing requirements and expectations is a big problem. The premature growth of FOSS adoption in Thailand has unbalanced the community, both in terms of the ratio between users and developers, and in terms of unrealistic expectations with regard to price and freedom. Worse, the community's existing developers have been fragmented because they are often employed by competing businesses. Proprietary competition has reduced the traditional cooperation responsible for the progress of FOSS in Thailand.
  2. Misconception. If one wishes a movement to be initiated by the government, it is necessary for the government to have a good understanding of all of the concerns. The government must realize the benefits of FOSS as a means for developing technologies, as well as for improving the technical skills of the developers. Otherwise, it just becomes exploitative. According to some, even though the Thai government has popularized FOSS through the recent campaign for affordable PCs, it has also damaged public opinion of FOSS by its failure to provide proper support. Promoting FOSS to uninitiated users before it is ready only creates a bad impression. On the other hand, claiming that Microsoft's decision to lower the price of proprietary software is actually a success for FOSS can be seen as exploiting FOSS as a negotiation tool. Moreover, the government tends to perceive "localization" as the development of local GNU/Linux distributions, which is not accurate. Thus, many government policies simply miss the point and sometimes make things worse.

Lao Localization

"Laonux" is the name of the Lao language version of GNU/Linux. This section deals with Laonux implementation. An overview of its success and obstacles to its development is also provided.

Traditionally, the Lao language and its literature have been written in two scripts, Lao and Tham. The Tham script is derived from the Lanna script (of present-day Chiang Mai and northern Thailand), which in turn originates from ancient Mon. [18] The Lao language belongs to the Tai language family, which includes languages spoken in Assam, northern Myanmar, Thailand, Lao PDR, northern Viet Nam, and the Chinese provinces of Yunnan, Guizhou, and Guangxi. [19] Lao script is believed to have originated from the Khmer writing system, which in turn originated in the Grantha script, the southern form of the ancient Indian Brahmi writing system. [20]

Lao script shares characteristics with other South-East Asian writing systems. It follows complex rules of layout involving consonants, vowels, special symbols, conjuncts and ligatures. Spaces are not used to separate words, and vowels appear before and after, under and over consonants.
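
Because spaces do not separate words, software that needs word boundaries (for line breaking or spell-checking, for example) has to find them itself. A common general technique is dictionary-based longest matching, sketched below in Python. The source does not say which method Laonux uses; the function name, the two-word lexicon and the fallback rule for unknown characters are all assumptions made for this illustration.

```python
def segment(text, lexicon):
    """Split text into words by greedy longest match against a lexicon."""
    longest = max((len(word) for word in lexicon), default=1)
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for length in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in lexicon:
                break
        else:
            # Nothing matched: emit a single character and move on.
            candidate = text[pos]
        words.append(candidate)
        pos += len(candidate)
    return words


# A tiny hypothetical lexicon: "language" and "Lao".
LEXICON = {
    "\u0e9e\u0eb2\u0eaa\u0eb2",  # PHO TAM, AA, SO SUNG, AA
    "\u0ea5\u0eb2\u0ea7",        # LO LOOT, AA, WO
}
tokens = segment("\u0e9e\u0eb2\u0eaa\u0eb2\u0ea5\u0eb2\u0ea7", LEXICON)
```

Greedy matching can choose the wrong boundary when words overlap, which is one reason a shared, agreed word list matters for this kind of processing.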

Obstacles and Successes

In localizing FOSS for the Lao script, the following tasks must be completed:

  1. Identifying the technical obstacles to be overcome.
  2. Creating the world's first English/Lao technical dictionary.
  3. Finding and coordinating technical volunteers.
  4. Establishing de facto technical standards.
  5. Performing and testing the work.
  6. User training and education.
  7. Identifying the sources of funding.

The work of investigating how GNU/Linux can be localized for Lao began in the summer of 1999 and proceeded slowly at first. Months of research and experimentation identified the technical difficulties and tools available to resolve them. Anousak Souphavanh, working from Rochester, New York, eventually formed a team of volunteers, teachers and students from the National University of Lao PDR to focus on translating KDE into Lao. The KDE desktop in Lao allows the end user to access the basic functionalities of a computer, including email, Web browser and office applications.

Until 2002, the work progressed slowly but steadily, aided largely by helpful and frequent advice from fellow FOSS enthusiasts worldwide. The community regularly provides invaluable information and assistance in tracking down documentation of technical issues, relating experiences in solving similar problems, or simply encouraging the volunteers in their efforts.

In 2002, the Jhai Foundation provided a small stipend to support the development of Laonux. As before, research had to be performed on the technical issues and tools, and more volunteers had to be found for the required translation and programming. The programming work was completed in 2003. However, the bulk of the translating remains undone. The major obstacle is the lack of an English/Lao technical dictionary. Without this, any translation will be inconsistent and confusing to users.

Once the dictionary is complete, it will be relatively simple to translate the remaining message strings and integrate them into the desktop environment. This work can be performed by either professional or student translators at a relatively low cost. But until the dictionary is completed, the work will continue at a snail's pace.

Funding for the dictionary and translators is now being sought from international development agencies. In addition, funding for professional documentation, training and user education is required.

Standardization

The lack of technical standards for the Lao script and its implementation in software remains a difficult challenge. The following standards must be completed and accepted officially:

  1. Character set.
  2. Keyboard layout.
  3. Unicode fonts.
  4. Input methods.
  5. Output methods.

To date this project has yielded some very useful de facto technical standards. Lao officials now recognize the vital importance of these standards and have adopted the goal as their own. These standards and the technical dictionary could dramatically speed up the ensuing work. Until the majority of Lao developers agree to follow standards, further development will be hampered.

Localization

Team members have resolved the following issues. These solutions follow ISO standards and should be adopted by future localization efforts to avoid duplication of effort and incompatibility between systems.

First, Unicode fonts were used instead of the existing legacy fonts. The legacy fonts replace the glyphs of English letters with Lao glyphs according to the keyboard layout, and are therefore not standards-compliant. Unicode is now the worldwide standard for FOSS developers. Also, development tools such as KBabel and KDE require Unicode compatibility.

Laonux is a KDE localization, which is why the rendering of fonts is handled by the Qt libraries (as opposed to Pango or X Window). The Qt library file "Qt-qfont_X11.cpp" was modified slightly to point to the appropriate range for Lao. Initially, Lao fonts were not rendered properly when stacking combining characters (i.e., a consonant and a vowel). With the help of Theppitak Karoonboonyanan, patches were submitted to fix the rendering. Resolving these issues was a major breakthrough for Laonux development.

Input is handled by XKB, the X keyboard extension. It converts keystrokes to the key symbols defined in "X11/keysymdef.h" and provides language switching. XKB Lao keys were created with the appropriate Lao Unicode range. Finally, the GNU C library locale definition for Lao was created. This consists mainly of UTF-8 locale settings for date/time formats and related items.
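
X11 reserves a range of key symbols for Unicode characters, formed by adding 0x01000000 to the code point. The source does not state exactly which values the Laonux keyboard map used, so the following Python sketch only illustrates that general convention for the Lao block (U+0E80 to U+0EFF). The function name lao_keysym is hypothetical.

```python
LAO_BLOCK_START = 0x0E80
LAO_BLOCK_END = 0x0EFF
UNICODE_KEYSYM_BASE = 0x01000000  # X11 convention for Unicode keysyms


def lao_keysym(ch):
    """Return the X11 Unicode keysym for a character in the Lao block."""
    code_point = ord(ch)
    if not LAO_BLOCK_START <= code_point <= LAO_BLOCK_END:
        raise ValueError(f"U+{code_point:04X} is outside the Lao block")
    return UNICODE_KEYSYM_BASE + code_point


# LAO LETTER KO is U+0E81, so its keysym is 0x1000e81.
ko_keysym = hex(lao_keysym("\u0e81"))
```

A keyboard map built this way needs no Lao-specific entries in the keysym header, since the value follows directly from the code point.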

Future Plans

  1. Continue work on Laonux.
  2. Begin work on localizing OpenOffice.org.
  3. Continue translation work for PHP Web portal tools.
  4. Begin work on localizing Mozilla and other Web tools.
  5. Train users and technical staff for localization projects.

The first priority is to create the English/Lao technical dictionary. Whether this will be funded by governmental sources, grants or international aid is unclear. Without this dictionary, translating and localizing the software cannot be completed. Another priority is to continue cooperating with other regional software localization efforts, especially where they have already addressed similar issues.

Additional modifications and updates to ISO and Lao government IT standards are required for the long term. These include input/output methods, keyboard layout, collation, locales, and additional standards for Lao OpenType Unicode fonts.

Cooperation with universities for localization is ongoing. With most of the technical issues resolved, localization is now primarily a language issue. To avoid unnecessary anglicisms, professional linguists are needed. If technical professionals are involved in this process, they are likely to impose what they already know, which is English. It is far better for the future to adopt native language terms wherever practical. However, without funding, it is likely that a hodge-podge of anglicisms will prevail.

On the technical level, support for Lao script in IBM's ICU library is needed. ICU is the script support base for OpenOffice.org and for Java (released by IBM and Sun Microsystems). It is a complete library, including script rendering, layout, collation (sorting of words), line breaking, spell-checking, etc.

Some difficult and important technical work has been done. However, full integration in ICU will force us to define all of these issues very clearly, and this will pave the way for developers to create Lao applications in Java and C++. Without this, professional Lao software development will remain difficult.

Khmer Localization

This case study has been contributed by Javier Solá.

Khmer Language

As in the case of Thai and Lao, Khmer script originates from the Grantha script, the south Indian form of the ancient Indian Brahmi writing system.

Khmer script follows complex rules of layout in which consonants may take two different forms (e.g., the small form is placed on a lower line if it immediately follows another consonant). Space is used not to separate words but to indicate a pause in reading (very much like a comma in English). Vowels pronounced after a consonant may appear before, after, above, below; before and after (formed by two glyphs); before and above; below and above; or under and after the consonant.
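
In Unicode, the lower ("subscript") form of a consonant is written as the sign COENG (U+17D2) followed by the consonant, and dependent vowels are stored after the consonant regardless of where they are drawn. This detail comes from the Unicode Khmer block rather than from the case study. The Python sketch below groups text into such clusters; the character ranges are a simplification, and the names are invented for the example.

```python
import re

COENG = "\u17d2"                       # next consonant takes the subscript form
BASE = "[\u1780-\u17b3]"               # consonants and independent vowels
CONSONANT = "[\u1780-\u17a2]"          # consonants only
MARKS = "[\u17b6-\u17d1\u17d3\u17dd]"  # dependent vowels and other signs

# One cluster: a base, any number of subscript consonants, then marks.
# Any other character (spaces, Latin text) becomes a cluster of its own.
CLUSTER = re.compile(f"{BASE}(?:{COENG}{CONSONANT})*{MARKS}*|.", re.DOTALL)


def khmer_clusters(text):
    """Split text into Khmer clusters (sketch with simplified ranges)."""
    return CLUSTER.findall(text)


# The word "Khmer": KHA, COENG, MO, vowel AE, RO.
# The vowel AE is drawn before the cluster but stored after MO.
word = "\u1781\u17d2\u1798\u17c2\u179a"
clusters = khmer_clusters(word)
```

A rendering engine works on clusters of this kind when it places subscripts and reorders the vowels that are drawn before the consonant.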

At present, the definition of the language is so poor that even the number of vowels in the language is not clear. The number of vowels in the official reference (the only available dictionary) is different from the number of vowels taught in schools. The reference dictionary is sorted phonetically, making a systematic collation algorithm that will follow the same order impossible. Words starting with the same consonant may be ordered under different listings depending on how that consonant is pronounced in that word. As in the Lao localization project, an English/Khmer technical dictionary is not available, and the lack of it severely hampers the efforts to translate software into the local language.

Obstacles and Successes

When the KhmerOS project was first being considered, the technical situation was as follows:

  1. Khmer had already been included in Unicode. Fixed in 1996 by a team of people who had no contact with the Cambodian government, the definition in Unicode was later disputed, but to no avail. The Unicode Consortium refused to change anything, including the addition of necessary Khmer vowels, on the basis that these could be formed by combining other existing Khmer characters. The Consortium only permitted adding comments to the existing standard. The Khmer government now considers the standard fixed (as of version 4.0).
  2. Microsoft had published OpenType specifications for Khmer, and included the language in its Uniscribe complex text layout engine. MS Publisher worked very well in Khmer, but MS still did not handle either line-breaking or sorting. MS Word crashed quite often while using Khmer.
  3. Some OpenType Khmer fonts already existed, though none in the public domain.
  4. No FOSS programs were implemented in Khmer in the GNU/Linux environment, but some FOSS (such as Mozilla) worked well in Khmer under Windows, using the Microsoft Uniscribe engine.
  5. Some people had been considering FOSS in Khmer, but the idea had not gone beyond mailing list discussions.
  6. There had been an amazing proliferation of legacy (non-Unicode) fonts. Up to 26 different font encodings had been defined. They worked well enough under MS Word by modifying "normal.dot".

It is also important to understand the social situation, as follows:

  1. The lack of Khmer-language computing increases the digital divide. Only people who speak English have access to jobs that require the use of computers.
  2. Because computer interfaces are in English, the words used for computer-related work are also in English, introducing many anglicisms that could have been easily avoided if the software had been translated.
  3. As people have to memorize new English words (in the menus), training for computer use takes a long time and it is soon forgotten if it is not used.
  4. There is no English/Khmer technical dictionary.
  5. The development of databases for governmental purposes is not possible, as no database management systems that handle either legacy or Unicode Khmer encodings have been developed.
  6. In order to join the World Trade Organization, Cambodia has passed an intellectual property law that, when in force, will in theory compel people to pay for proprietary software. FOSS allows developing countries to comply with the requirements of the WTO without having to spend large amounts of money.

Spanish computer scientist Javier Solá, who lives in Cambodia and has more experience with computers and strategy than with FOSS, has decided to see what can be done in this situation. He has started writing an ambitious project with the following deliverables:

  1. A full computer operating system, office suite and entertainment applications needed by an average computer user - entirely in the Khmer language. A user will see only the Khmer language script on his/her screen. The system will include full documentation in Khmer, in electronic and paper formats.
  2. A "development library" (a set of programs) to be used by software developers to include support for the Khmer language in their applications; documentation of the development library.
  3. A set of up to 50 computer fonts (Unicode OpenType fonts) to be used in the application menus, for word-processing or for computer design.
  4. A keyboard layout, supporting drivers and 5,000 physical keyboards that support the above-mentioned Unicode fonts.
  5. 5,000 copies of an installation disk that would be easy to use and would include all of the above software and documentation; with 1,000 of them accompanied by a full printed documentation.
  6. Training materials addressed to end-users, including the use of the system and the applications, and a training course for typing with the new keyboard.
  7. Computer end-user trainers trained to teach the new system using the above-mentioned materials; university professors, students and software development company personnel trained in advanced GNU/Linux and FOSS and in application development using the Khmer script support tools provided by the project; computer vendor personnel trained in the installation of the system.
  8. FOSS expertise centres in universities, including trained professors, students and computers with Internet connectivity.
  9. Software development companies empowered to develop applications based on FOSS that require support for the Khmer language script.
  10. Government personnel trained to use the system.
  11. Personnel expertise on software purchasing, coordination of applications between similar administrations, and analysis of priority applications for improved governance.
  12. Marketing materials for deployment of the system.
  13. Widespread publicity of the system directly or through computer and software vendors.

The project considers not only the final goal but also the order in which the programs will be translated. First the Mozilla browser and email client, and then the OpenOffice.org office suite, will be released for MS Windows. Finally, these will be included in a fully Khmer GNU/Linux release once the user interface is translated.

After considering and later rejecting (for lack of real industry interest) the creation of an Industry Consortium, the project soon found a home in the local Open Forum of Cambodia, an NGO with a long history of supporting technologies for social purposes in Cambodia.

Sharing the idea of the project with anybody who would listen, setting up a Web site (http://www.khmeros.info), and initiating contact with various people interested in Khmer and computers brought in the first volunteers.

Typographer Danh Hong put into the public domain one of his Khmer OpenType fonts. This was an important preliminary step to get the FOSS community interested in this language.

Lin Chear, a Canadian engineer of Khmer origin, started looking at Pango and in a few days created the necessary programs to support Khmer in Gnome and its family of products. Khmer will be supported by version 1.6 of Pango. KDE followed soon after.

With the help of Lin Chear, the European maintainer of Qt developed the routines needed for supporting Khmer in KDE and in other programs that use Qt for language support. Lin Chear also applied a Pango patch to Mozilla and got a version of Mozilla working in Khmer on GNU/Linux.

Back in Phnom Penh, the KhmerOS project established itself in a small office at the Open Forum facilities. Using USD1,500 from the first small donations, they hired two computer scientists and with a couple of old donated computers, started creating a glossary of Khmer computer terms as preliminary work before starting the translation. A donation from a Hong Kong businessman would later permit the purchase of new computers.

Meanwhile, work and discussion regarding keyboards and collation algorithms are ongoing. Implementations for the Thai language show that dictionary-based line-breaking can be done in languages that do not separate their words with spaces. Work on converting texts in legacy fonts to Unicode encoding has also started.
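
As a minimal sketch of dictionary-based line-breaking, the Python below finds word boundaries by greedy longest matching against a word list and then wraps lines only at those boundaries. It is not the algorithm of any particular Thai or Khmer implementation. The function names `segment` and `break_lines` and the tiny word lists are assumptions made for illustration, and widths are counted in code points rather than rendered glyphs.

```python
def segment(text, dictionary):
    """Split unspaced text into words by greedy longest matching.

    Characters that start no dictionary word become one-character tokens.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def break_lines(text, dictionary, width):
    """Wrap unspaced text so that lines break only between words."""
    lines = []
    current = ""
    for word in segment(text, dictionary):
        if current and len(current) + len(word) > width:
            lines.append(current)
            current = word
        else:
            current += word
    if current:
        lines.append(current)
    return lines


# Illustrative word lists only; a real dictionary holds many thousands of words.
thai_words = {"ภาษา", "ไทย"}
latin_words = {"the", "cat", "sat", "on", "mat"}

print(segment("ภาษาไทย", thai_words))
print(break_lines("thecatsatonthemat", latin_words, 6))
```

Greedy matching can choose the wrong boundary when dictionary words overlap, so the quality of the breaks depends heavily on the word list.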

"Evangelization" for Unicode is a necessary part of the project's work. Large amounts of money are donated to Cambodia or used by NGOs for computer-related projects, such as creating a database of all Cambodian laws. If these projects are not done in Unicode, the databases will be useless in a few years.

Future plans include increasing the project team to five translators, at least one of them a professional translator, and releasing email, browsing, word processing and spreadsheet programs in Khmer (Thunderbird and Firefox from Mozilla, and OpenOffice.org Writer and Calc).

These programs will all be released on MS Windows (2000 and XP), and will have full documentation in Khmer, as well as basic training modules in Khmer to be used by computer training professionals. Training materials are important in order to fix the language and terminology used by teachers, assuring standardization and avoiding use of too many anglicisms.

The translation of Gnome or KDE (or both) began in 2004, together with that of a series of minor applications; this will allow the release of a complete GNU/Linux-based system in Khmer in the second half of 2005.

Funding is still a major concern. Project speed is adjusted to suit funding, which comes either through grants or through contracts with corporations or other NGOs that need translated programs or services related to Khmer Unicode that can be provided by the project itself.

On the technical level, support is still required for Khmer script in IBM's ICU library. Full integration in ICU will compel the project to define all such issues clearly, as well as open the way for developing Khmer applications in Java.

Another concern is being able to use low-priced computers. In the Microsoft environment, Khmer Unicode will work only in Windows 2000 and XP, which require modern computers with a fair amount of memory and large hard disks. The way to low-priced (i.e., secondhand) Khmer computers will definitely have to be either through GNU/Linux or by getting Khmer script support in earlier versions of Windows such as Windows 98.

The project's distribution strategy is to try to have the software pre-installed by computer vendors and distributed by "two-dollars-a-CD" software vendors. This is to allow easy copying and wide distribution, thereby saving the project the cost of making copies.

The government, under an IDRC grant, is pushing for the localization of Windows and MS Office. They have neither plans nor resources to get involved in FOSS localization, but are said to be interested in coordinating efforts so as not to duplicate work and to ensure that the Khmer language (collation, etc.) is implemented consistently in both platforms.

The project aims not to be sectarian about FOSS. It is better to have people using email in Khmer on Windows in the short run than to try to make them change to a completely new system abruptly. There are still two or three years before the intellectual property law is expected to be strongly enforced. This period can be used as transition time, allowing gradual adoption rather than forcing an overnight change that would likely fail.

Status of Localization Projects

A survey [21] of the status of localization projects was conducted by the PAN Localization Project [22] during the training on "Fundamentals of Local Language Computing" in Lahore, Pakistan in January 2004. Participants from 13 Asian countries were asked to provide information about the status of localization tasks in their countries. The survey revealed the status of standardization, localization and national policy for local language computing of the participants' countries. Please note that the informal survey technique employed gives only a rough picture of the actual situation.

Table 1 summarizes the survey on standardization for character set, keyboard layout, key-pad layout (e.g., for mobile phones), collation sequence, terminology translation and locale definition.


Table 1 PAN Survey: Localization Standards
Country Language Char Set Kbd Keypad Coll. Seq Interface Term Locale
Myanmar Burmese x * *
Cambodia Khmer x x
Mongolia Mongolian x x  ? * * x
Lao PDR Lao x x
Nepal Nepali * * * * *
Sri Lanka Sinhalese x x * * *
Thailand Thai x x * * * x
Bhutan Dzongkha x x * *
China Tibetan x x  ?
Japan Japanese x x x x x x
Bangladesh Bangla x  ? * * x
Afghanistan Pashto x x  ? *
Iran Farsi x x * * *
Pakistan Urdu x * * * *

x: complete; *: partially done; ?: don't know; blank: no work done

From the table, character set and keyboard layout seem to be well defined for most countries, while the other issues need more work.

Table 2 summarizes the current status of localization of applications on the GNU/Linux platform, namely, keyboard input, fonts, sorting and find/replace utilities, natural language processing, spell checker, and thesaurus. It also shows whether any GNU/Linux distribution has been released in the country.

Table 2 PAN Survey: Basic Localized Applications on GNU/Linux
Country Language KBD Driver Font Coll. Find/Replace NLP Spell Check Thes. Linux Distr.
Myanmar Burmese *
Cambodia Khmer
Mongolia Mongolian x x * * *  ? x
Lao PDR Lao
Nepal Nepali x * *
Sri Lanka Sinhalese * x * *
Thailand Thai x * x x * * x
Bhutan Dzongkha
China Tibetan * *  ?
Japan Japanese x x x x  ?  ?  ? x
Bangladesh Bangla x * *  ? *
Afghanistan Pashto *  ? x
Iran Farsi x * x * x * * *
Pakistan Urdu * * x *

x: complete; *: partially done; ?: don't know; blank: no work done


The survey shows that FOSS localization activities are taking place in many countries. Sharing expertise and resources among countries can boost progress in this area.

For message translation, a number of Asian language localization projects are underway. All of these projects are tracked in real time on the Internet. The following links provide the current status of some of these projects.

  1. http://i18n.kde.org/stats/gui/HEAD/fullinfo.php
  2. http://www.mandrakelinux.com/l10n/status.php3
  3. http://l10n-status.gnome.org/gnome-2.6/index.html

As of June 2004, Japanese and Korean translators have already completed most of their initial work in FOSS localization. China is close behind, followed by India and the South-East Asian nations. The situation is highly fluid, however, and the above sites should be checked for detailed and updated information on the current status.

Footnotes

  1. See http://www.linuxinsider.com/story/32421.html , http://www.economist.com/business/displayStory.cfm?story_id=2054746 , http://encyclopedia.thefreedictionary.com/CJK , http://www.chinadaily.com.cn/english/doc/2004-05/11/content_329529.htm and http://www.firstmonday.dk/issues/issue8_10/jesiek/ .
  2. Thai Language Audio Resource Center, 'Some Historical Background of Thai Language'; available from http://thaiarc.tu.ac.th/thai/thai.htm .
  3. Koanantakool, T. and the Thai API Consortium, Computer and Thai Language, National Electronics and Computer Technology Center, 1987, in Thai.
  4. Karoonboonyanan, T. and Koanantakool T., Standardization Activities and Open Source Movements in Thailand, Country Report, MLIT-4, Myanmar; also available at www.nectec.or.th/it-standards/mlit99/mlit99-country.html.
  5. Tantsetthi, T., 'Campaign for Internet-Standard-Conforming Thai Usage'; available from http://software.thai.net/tis-620 .
  6. Wongsaisuwan, M., 'Introduction to ThaiTeX'; available from http://thaigate.nii.ac.jp/files/thaitex.pdf .
  7. Thai Linux Working Group, 'Thai LaTeX'; available from http://linux.thai.net/plone/TLWG/thailatex .
  8. Ampornaramveth, V., 'NACSIS R&D Thai Project Page'; available from http://thaigate.nii.ac.jp .
  9. eucthai, 'ZzzThai Project'; available from http://www.fedu.uec.ac.jp/ZzzThai/ .
  10. Karoonboonyanan, T., 'Thai Sorting Algorithms'; available from linux.thai.net/thep/tsort.html; Karoonboonyanan, T., Raruenrom S. and Boonma P., 'Thai-English Bilingual Sorting'; available from linux.thai.net/thep/blsort.html; Karoonboonyanan, T., 'Thai Locale'; available from linux.thai.net/thep/th-locale.
  11. National Electronics and Computer Technology Center, 'Linux SIS: Linux School Internet Server'; available from http://www.nectec.or.th/linuxsis .
  12. National Electronics and Computer Technology Center, 'SchoolNet Thailand'; available from http://www.school.net.th .
  13. Thai Linux Working Group, 'LibThai Library'; available from http://linux.thai.net/plone/TLWG/libthai/ and http://libthai.sourceforge.net .
  14. Thai Linux Working Group, 'ThaiFonts-Scalable'; available from http://linux.thai.net/plone/TLWG/thaifonts_scalable/ .
  15. Thai Linux Working Group, 'Thai LaTeX'; available from http://linux.thai.net/plone/TLWG/thailatex/ .
  16. See http://www.pladao.org .
  17. See http://opentle.org/office-tle .
  18. See http://www.lan-xang.com/literature/lit_3.html .
  19. Thai Language Audio Resource Center, 'Some Historical Background of Thai Language'; available from http://thaiarc.tu.ac.th/thai/thai.htm .
  20. See http://seasrc.th.net/font/alphabet.htm .
  21. Hussain, S. and Gul S., PAN Localization Project: A regional initiative to develop local language computing capacity in Asia, 2004.
  22. See http://www.panl10n.net/ .