Debian and Wordforge

This section documents the team effort to create an internationalization infrastructure for the Debian project, which implements Pootle.

This project is important for a number of reasons, not least of which is the need to explore the issues involved in creating a cohesive i18n process.

Background

At Debconf6 in Mexico, a number of events, discussions and decisions occurred to found this effort.

Here is Christian Perrier’s summary of that information.

1. Summary

The work on i18n at Debconf6 has been particularly interesting and productive.

The main topic at present is, of course, the discussion of the i18n infrastructure, covering both existing features (most of them summarized in the paper I published along with Javier Fernandez Sanguino) and future features.

The two main aims were to:

Two BOF sessions were organized, and the i18n talk by Javier and myself concluded the week.

Many other informal discussions also happened, with several people involved, among whom I’ll cite:

Discussions from the mailing list:

During our first session and the initial discussions, the main aim was to establish the needs of the infrastructure: what its targets (a.k.a. users) were, and what features might be needed by each of these targets.

The second session happened after some informal work between contributors, and was aimed at being a summary of the ideas that were floating around.

All this led to the following conclusions:

Infrastructure targets

We identified the following targets, or categories of users:

Needs of each user category

These design goals make it important for the project to be as modular as possible. The core of the project should be a backend into which all other modules plug.

Going with WordForge

The presentation by Javier Sola about the Pootle project has finished convincing us that working with the members of the WordForge project is certainly the way to go. Their project is aimed mostly at these goals and at even more which we had not yet formalized completely.

That project doesn’t raise the concerns we have with the Rosetta project from Ubuntu/Canonical. Indeed, we later confirmed that WordForge also works on open standards for communication between localisation projects, and that these could be used to communicate between the Debian infrastructure and that of Ubuntu.

These early specifications for features certainly have to be enhanced and completed in the future, but they will probably already allow Debian to merge them in the Pootle/WordForge specifications which are still under work.

We will align the WordForge specifications with ours (this needs agreement on how to “mark” Debian additions/needs).

Google Summer of Code project

The challenges with the GSoC project are multiple:

The first idea that came out was to request some work on some “bounties” of the WordForge project. However, some of us prefer that the specifications be ready before we enter such a path. Given that they obviously won’t be complete, we finally decided to launch a quite conservative project, just to ensure that the work done still has some benefit for Debian without compromising the future.

As a consequence, it has been decided that the requested work will be separating the frontend from the backend in Pootle, which will allow future work to be concentrated on the backend used by all future software in the framework.

Some ideas about modules

This part is more prospective and will need some reorganization. It’s more intended to be a summary of ideas that have been floating in the mailing list.

Import modules

up-to-date POT. At the minimum, reproduce the current layout)
* Import from programs’ PO (ditto)
* Import from man pages converted with po4a (standard layout mandatory)
* documentation (action from maintainers==opt-in)
* Web site (no action, source under control)

Translation modules

* Different processes for different types of translations
* Branching translations with merge features (manually or automatically for stable/testing/unstable)

Export modules

* Individual PO files
* Set of PO files as a tarball
* Individual XLIFF files
* Online work

Interface modules

* Web interface
* Mail interface

Communication modules

* Other Pootle servers
* Rosetta servers
* TP server (?)

Future plans for Debian i18n contributors

Reviving the DDTP

While the work on the new infrastructure advances, Grisu (Michael Bramer) will stabilize the current DDTP code to allow some maintenance of existing translations of package descriptions. Most of this work is already achieved, indeed.

In parallel, and because APT 0.6 now includes support for translated package descriptions, the use of these will be promoted. Temporarily, the Translations files will be hosted on another server, namely ddtp.debian.net.

Some discussion has to happen with the ftpmasters team to decide whether the use of Translations files on FTP servers is considered suitable and when their inclusion can be possible. Of course, for this to happen, the DDTP must have a working maintenance system so that maintainers of packages can be sure that the bug reports they might receive from the DDTP will be maintained.

The plan here is temporarily to use http://ddtp.debian.net as the demo case of what can be done with “Translations-*” files and the new APT. We should (re-)advertise this, have it used during a few weeks and then discuss with the ftpmasters having it integrated in the main repositories, along with Packages files.

Extremadura 2006

A meeting will be organized in Extremadura, from Thursday September 7th to Sunday September 10th.

This meeting will use the specifications previously finalized by the Debian i18n community to allow contributors to start building a consistent backend to the future Pootle system, benefiting from the work of Gintautas to separate the frontend from the backend.

Inviting the Pootle developers to the meeting is considered highly desirable. Hopefully, in the meantime, enough will have been achieved, especially by Gintas during his work.

The goal could be setting up the first Debian Pootle server.

For all information about Extremadura sessions, the #extremadura-2006 channel can be used on freenode (irc.debian.org).

Further information

You can see the slides used during the 2nd BOF of Debconf about i18n infrastructure, plus the summary of the current ideas about the infrastructure. N.B.: this material is provided for interest; it is still being developed.

Christian and Javier gave a talk on the i18n infrastructure which may possibly be seen in this video.

The Debian wiki hosts i18n pages on Ideas for the Debian i18n Framework, The i18n Infrastructure, l10n Coordination and L10n Workflow.

Work started

At that point (20th May, 2006), Gintautas said:

“if my application is accepted, we should have something running in a few weeks (my first milestone is on June 20th).”

and Alberto Escudero said:

“I am working on infrastructure issues related to Wordforge with Javier and other people in the Wordforge gang. Lately I am defining a mechanism to connect the Pootle backend to an OpenOffice.org build infrastructure (xml-rpc btw).”

BOF 2

On 22nd May 2006, Nicolas François posted this summary of the second BOF on the i18n infrastructure.

English English English Glossary
English English Word1
Translation Translation Translation1
Translation Translation2

(for all teams?, only at some point in time?)

[note from another BOF, I don’t know if this will be the case in the first implementation]

Some notes taken during Javier’s presentation about Wordforge

Speed and capacity

There has been considerable discussion on the most effective structure for the different components of the i18n infrastructure. Estimates of the number of files and users active on the proposed server at any one time are difficult to make at that stage, but the aim is to provide for different situations, and for growth.

gettext-0.15

Gettext 0.15 was released in August, 2006, containing, among other very useful improvements, a msgmerge capable of handling much bigger files much faster, and the long-awaited msgctxt (context) facility. Future planning for Pootle and the i18n interface needs to take this increased gettext capability into account.

scale

On 9th June 2006, Zejn Gasper said:

Not only databases, but also file systems can scale, e.g. you can set up an NFS cluster, so flat-files aren’t necessarily a bottleneck. Of course, only Pootle is given access to files.

XML files do not need to be parsed, because Python has pickle, which can store a Python object in binary format in a file. This may be a rather risky, hackish trick, but could probably greatly accelerate XML-handling. Of course, plain XML also gets stored besides the pickled version. Also, since Pootle is converting to Kid templates, I’m not afraid it would be slow, since Kid uses the element-tree XML parsing library, which has a C implementation. Pickle could also be used for RPC.
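The pickle idea above can be sketched roughly as follows. This is only an illustration, not Pootle code: the function names, the `.pickle` sidecar suffix and the simplified `trans-unit` layout are assumptions made for the example. Note that unpickling is only safe for files the server wrote itself, which is also why pickle is a questionable choice for RPC between machines.

```python
import os
import pickle
import xml.etree.ElementTree as ET


def parse_units(xml_path):
    """Parse the XML file and return a list of (source, target) pairs."""
    root = ET.parse(xml_path).getroot()
    return [
        (unit.findtext("source", default=""), unit.findtext("target", default=""))
        for unit in root.iter("trans-unit")
    ]


def load_units(xml_path):
    """Return the parsed units, reusing a pickled sidecar file when it is fresh."""
    cache_path = xml_path + ".pickle"  # hypothetical naming convention
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(xml_path)):
        with open(cache_path, "rb") as handle:
            # Only safe because this process wrote the file itself.
            return pickle.load(handle)
    units = parse_units(xml_path)
    with open(cache_path, "wb") as handle:
        pickle.dump(units, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return units
```

The plain XML stays on disk as the authoritative copy; the pickled file is a disposable cache that is rebuilt whenever the XML is newer.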

There are some proven solutions which should be used, e.g. memcached. It can be used as a distributed cache spread across a number of servers’ RAM, and has been in production use for a long time. The current cache system is less flexible and written in Python, while memcached is a dæmon written in C. The Python API binding is available, but not yet packaged in Debian.

Fuzzy matching should be implemented separately. This is usually a CPU-intensive task, so it could be handy to have ‘fuzzy servers’ separated from the main Pootle server that remains responsive at all times. This would also enable having e.g. one fuzzy server for Debian translations, one for GNOME, one for KDE ... of course - if needed. Also, if any group wants for some reason to disable fuzzy matching, implementing it separately makes perfect sense.
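To give a feel for why fuzzy matching is CPU-intensive and easy to isolate, here is a minimal sketch using Python's standard `difflib`. It is not the matcher Pootle uses; the function name, the cutoff and the in-memory dictionary standing in for a translation memory are all assumptions for illustration. Every lookup compares the new string against every known string, which is the kind of work a separate “fuzzy server” could take over.

```python
import difflib


def fuzzy_candidates(source, memory, cutoff=0.7, limit=3):
    """Return up to `limit` (score, known_source, translation) tuples.

    `memory` maps already-translated source strings to their translations.
    Candidates scoring below `cutoff` are dropped; the best match comes first.
    """
    scored = []
    for known, translation in memory.items():
        score = difflib.SequenceMatcher(None, known, source).ratio()
        if score >= cutoff:
            scored.append((score, known, translation))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:limit]
```

Because the function depends only on its arguments, it could sit behind any RPC mechanism without the main server needing to know how matches are computed.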

Indexing would be split; statistics, translation states and similar metadata would go into a relational database for faster access. But for the PO/XLIFF files, I think I’d somehow rather go with (pickled) flat-files.
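The split described here, with metadata in a relational database and the PO/XLIFF content left in files, might look something like the sketch below. It uses SQLite from the Python standard library purely for illustration; the table name, columns and state names are hypothetical and not taken from any Pootle schema.

```python
import sqlite3


def build_index(units):
    """Create an in-memory index from (file_name, unit_id, state) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE unit_state ("
        " file_name TEXT NOT NULL,"
        " unit_id TEXT NOT NULL,"
        " state TEXT NOT NULL,"
        " PRIMARY KEY (file_name, unit_id))"
    )
    conn.executemany("INSERT INTO unit_state VALUES (?, ?, ?)", units)
    conn.commit()
    return conn


def file_statistics(conn, file_name):
    """Return {state: count} for one file without opening the file itself."""
    rows = conn.execute(
        "SELECT state, COUNT(*) FROM unit_state"
        " WHERE file_name = ? GROUP BY state",
        (file_name,),
    )
    return dict(rows.fetchall())
```

Statistics pages can then be served from the index alone, while the translation files are only read when a translator actually opens them.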

So there we’d have indexing, caching, storage backend, fuzzy matching, middleware (this being the part of Pootle handling mails and similar) and web frontend. The Pootle server would glue this together in a simple-to-use package.

communication

Zejn also said:

I think XMPP (Jabber) is a good candidate for RPC because:

  1. The protocol is already XML-based, and XML can be nested.
  2. It would allow triggered updates, where the master server would notify slave servers of a new or fuzzy string needing translation, so there would be less need for polling the master server to stay updated. Of course, slave servers would check for new strings, e.g. daily and at Pootle restarts, but updates would still propagate much much faster this way.
  3. Jabber could be integrated into the web interface in many ways, e.g. instead of a standard username/password, the Pootle server would require your Jabber username, and would only log you in if your account is reachable (this is a subset of the Jabber account states). Using a Jabber account as a login means fellow translators know where to reach you, and since you are signed in, they can quickly intervene if you start messing around. Using Jabber for authentication over HTTP is also a SoC project.
  4. Encryption, compression and many similar features are already implemented: all the Pootle project needs to do is to set up a Jabber server.

I was mostly inspired by a blogpost seen on planetkde.org, but had been thinking about this earlier.

workflow

On 10th June 2006, Javier wrote:

I think that we can actually separate process information from workflow.

XLIFF files allow storage of <phase> elements, which can be “translation”, “authoritative review” (the reviewer can make changes in the target), “reviewer recommendations” (the reviewer adds comments in a file, but does not change the <target>), “approval” or any other that we define.

For example:

<phase phase-name="xxx" process-name="translation"
       date="2006-01-25T21:06:00Z" contact-name="Alberto Martinez"
       contact-email="alberto@martinez.com">
</phase>

These phases are elements that can be used in a given workflow. If your workflow is translation → review → approval, then Pootle will enforce that a review phase is mandatory after a translation phase, and then an approval phase. You could also require only a single translation stage, or a workflow in which a file can be reviewed by N reviewer-recommenders (who mainly add comments to the file), and record N phases in the Pootle file. Interestingly, with this method, reviewers can act at the same time, and their information and phase are added to the file when they upload. The file is then sent back to the translator, who in this case can act as “approver” (his/her translation editor must be able to display reviewer comments when in approver mode).
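The enforcement described above could be sketched as follows. This is a simplified illustration rather than Pootle's implementation: the header fragment omits XLIFF namespaces, the second phase and its contact are invented for the example, and the function names and the `WORKFLOW` list are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical workflow for one project: each process must follow in order.
WORKFLOW = ["translation", "review", "approval"]

# Simplified XLIFF-like header; the reviewer entry is invented for the example.
HEADER = """
<header>
  <phase-group>
    <phase phase-name="p1" process-name="translation"
           date="2006-01-25T21:06:00Z" contact-name="Alberto Martinez"/>
    <phase phase-name="p2" process-name="review"
           date="2006-01-26T09:00:00Z" contact-name="Hypothetical Reviewer"/>
  </phase-group>
</header>
"""


def recorded_processes(header_xml):
    """Return the process names recorded in the header, oldest first."""
    root = ET.fromstring(header_xml.strip())
    phases = sorted(root.iter("phase"), key=lambda p: p.get("date", ""))
    return [phase.get("process-name") for phase in phases]


def next_allowed(workflow, done):
    """Return the process expected next, or None when the workflow is complete.

    Raises ValueError if the recorded phases do not follow the workflow.
    """
    if done != workflow[:len(done)]:
        raise ValueError("phases %r do not follow workflow %r" % (done, workflow))
    return workflow[len(done)] if len(done) < len(workflow) else None
```

With the header shown, `next_allowed(WORKFLOW, recorded_processes(HEADER))` yields `"approval"`, so an upload claiming any other phase could be rejected by the process tool while the file format itself stays workflow-neutral.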

If we can define the rôles, then we do not need to consider workflows in the encoding of the files, only in the process tool (Pootle).

Workflow can be defined for each project (a project being the set of files of a piece of software that will be translated to a given language, such as Gaim 1.5 for French). It can be inherited by default (from the language level), to simplify configuration, but remain changeable.

If this is combined with allowing commit access (upstream) to whichever rôle the team decides (translator, approver), it should provide enough flexibility for any type of workflow. We will have to test a number of them to ensure that this assumption is correct.

Every string in the file can be associated with a specific phase. Some strings in the same file could be untranslated, some translated but not reviewed, some reviewed. There are loopholes in the XLIFF definition, but they can be overcome with implementation specifications.

The diagrams begin

On 9th June, 2006, during the “DB or file-based storage” discussion, Otavio posted:

I like the idea that an RDBMS is the way to store the data, and the “final” format (po or whatever) should be built just as a product of it. Also, I agree about the idea to have a “server layer” between the client and backend because it makes it trivial for the user to use the backend without much hassle and without the need to control locks and other internal details of it.

One thing that comes to my mind just now is that we might have the following design:

With that in mind, I realise that it might be easier to get collaboration with the Pootle people, since they can help us to design the layers below their compatibility layer, and at the same time improve their system in an easier way without needing to be careful not to break our system. So, with that we both are basically free and able to help each other.

The only drawback that I can see is that we won’t contribute much code back to Pootle directly. Basically, this makes the structure more “backend agnostic”, and we then plug our backend in there. That, IMHO, is a good consequence, since they’ll be free to choose and develop other backends without having to rely only on us.

Javier answered:

I would like to propose a course of action.

  1. Gintautas works on the API (which has to include the separation of front-end and back-end), and on the XML-RPC. This will really clear up things for further work, and will facilitate the specification of a DB. This work is done in a separate branch, as required by the project.
  2. The WordForge present staff will integrate these changes in Head as soon as they are clear, so that Gintautas’ branch and Head stay as close as possible, and integration is not a problem later.
  3. Independent of the work that is continuing on WordForge (front-end or file-based back-end), WordForge will be happy to fund somebody to work on designing and implementing an RDB back-end. This can be somebody working in parallel during the coming months, or Gintautas after he finishes (if he has time before going back to school). Then we just use, or continue working on, the best back-end that we have. We would work together with this person to ensure that all the data that we need in the project is present in the DB (even if most of it should already be defined by the API).

I somehow have a feeling that the optimal result will be something between file-based and DB-based, but again, I could be completely wrong. We will not really know what it is until we try.

Regardless of whether the best storage is using files or a DB, the standard formats are designed precisely so that they will last for a long time. XLIFF is an OASIS standard that will evolve, and will not be replaced in the coming decades. Just as with Unicode, the standards are there to last... and, even if change occurs, you would still need to write converters. On the contrary, when these standards advance (within backwards compatibility), you will have to modify the database to handle new types of information, but you would not need to modify a file-based system.

I agree about the idea to have a “server layer” between the client and backend because that makes it trivial for the user to use the backend without much hassle, and without the need to control locks etc.

I would see a separation between the Pootle web-based file server and the Pootle editor, maybe putting them on top of the API, even if this might not be the best place for the editor. Would this break the idea that you expressed with the server layer above?

Our idea, as I mentioned, is still that we work together on the different options. We would not like to see an “our back-end” and a “your back-end”. The advantage of working together is that there are more people thinking, and the good news is that we have sufficient funding to look at both options, and then pick the one that we all agree is the best, and continue working in that direction.

Packaged for Debian

On the 15th July, 2006, Christian Perrier announced that he had created some very early Debian packages for python-jtoolkit and Pootle.

It turned out that Nicolas François had also produced Debian packages for python-jtoolkit and Pootle, which were uploaded by Otavio Salvador later that same day. Fast work, Debian! :)

New life for the DDTP

On the 4th August, Christian Perrier commented on the resuscitation of the DDTP (a web interface to translate Debian package descriptions):

It’s good that the DDTP is now getting results.

However, the current translation process (the mail interface) is a bit different from the usual way translation teams are working. For instance, this explains why the French effort is currently mostly stopped for the DDTP.

What would be good now is getting an interface with the PO format so that translators can work on PO files for DDTP translations.

Before we get the nice infrastructure (probably based on Pootle) which we will discuss in Extremadura, it would be good if we could come up with a crude way to get PO files for the DDTP.

That requires writing something that will convert the Debian descriptions and their translations to PO format. IIRC, Otavio did write some stuff already, 1 or 3 years ago.

Maybe a po4a module would be possible... this should be checked with the po4a package maintainers: Nicolas François, Thomas Huriaux, Martin Quinson et al.

Note that Pootle integrates the po4a package.

Michael Bramer commented:

For the DDTP in the new i18n system we need:

  1. some importer to get package Descriptions from the Packages files from the ftp server
  2. do the work in the new i18n system (po files, translation, review, etc.)
  3. some exporter to generate the Translation files for the ftp master

He later added:

For Pootle, we need a Packages2po and a po2Translations script for the daily work.

We need a Translation2po script to include the translated Packages in the Pootle system. But I can write some lines to generate the po files with the translated descriptions from the database too.
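As a rough idea of what a Packages2po step involves, the sketch below reads Packages-style stanzas and emits one PO entry per description. It is not one of the scripts mentioned here: the function names are hypothetical, the PO header entry is omitted, and Debian conventions such as the ` .` blank-line marker in long descriptions are passed through untouched.

```python
def parse_descriptions(packages_text):
    """Yield (package, description) pairs from Packages-style stanzas."""
    for stanza in packages_text.strip().split("\n\n"):
        fields, current = {}, None
        for line in stanza.splitlines():
            if line[:1] in (" ", "\t") and current:
                # Continuation line: belongs to the previous field.
                fields[current] += "\n" + line[1:]
            elif ":" in line:
                current, _, value = line.partition(":")
                fields[current] = value.strip()
        if "Package" in fields and "Description" in fields:
            yield fields["Package"], fields["Description"]


def po_quote(text):
    """Quote a string for use as a PO msgid or msgstr."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def packages_to_po(packages_text):
    """Return PO entries (without a header) for every package description."""
    entries = []
    for package, description in parse_descriptions(packages_text):
        entries.append(
            "#. Package: %s\nmsgid %s\nmsgstr \"\"\n" % (package, po_quote(description))
        )
    return "\n".join(entries)
```

The reverse step (po2Translations) would read the filled-in `msgstr` values and write them back out in the layout expected for Translations files.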

Thomas Huriaux reminded:

Please don’t forget that intltool-debian scripts are already mainly doing this job. It’s not a hack to use them: even if the main frontends (debconf-updatepo and po2debconf) have a name which is debconf-related, these tools are generic and can be used in this situation.

From the 6th August, Martijn van Oosterhout started to add functionality to the current DDTS:

I’ve written a sort-of frontend for the DDTS. It’s a standalone system, it doesn’t have anything to do with the DDTS directly, except that it knows how to drive the system via email and process the responses.

It’s a web-interface-based system. What happens is that it tries to keep a configurable number of untranslated descriptions available. A user can click on one and provide the translation. After that it goes to the review list. From there, any number of people can review it. Once it has passed a configurable number of reviews (can be zero), it will be sent off to the DDTS.
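The review step described above can be modelled in a few lines. This is a hypothetical sketch, not Martijn's code: the class and attribute names are invented, and the `sent` list merely stands in for the email that would be sent to the DDTS.

```python
class ReviewQueue:
    """Hold translations until enough distinct reviewers have approved them."""

    def __init__(self, required_reviews=2):
        self.required_reviews = required_reviews  # configurable, may be zero
        self.pending = {}  # description id -> {"text": ..., "reviewers": set()}
        self.sent = []     # stand-in for mail sent to the DDTS

    def submit(self, desc_id, translation):
        """A translator provides a translation; it enters the review list."""
        self.pending[desc_id] = {"text": translation, "reviewers": set()}
        self._maybe_send(desc_id)

    def review(self, desc_id, reviewer):
        """Record one reviewer's approval; repeat approvals count once."""
        self.pending[desc_id]["reviewers"].add(reviewer)
        self._maybe_send(desc_id)

    def _maybe_send(self, desc_id):
        entry = self.pending[desc_id]
        if len(entry["reviewers"]) >= self.required_reviews:
            self.sent.append((desc_id, entry["text"]))
            del self.pending[desc_id]
```

Setting `required_reviews` to zero makes `submit` forward the translation immediately, which matches the “can be zero” case.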

Features include: you can configure a list of words that, if they appear in the untranslated text, will appear underlined and when you hover your mouse over each one, it provides a translation. This is good for words like script, command, executable, caller, library which tend to get varying translations depending on who’s translating.
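A glossary hint of this kind could be produced along the following lines. This is an illustrative sketch only: the function name is invented, glossary keys are assumed to be lowercase, and the `<abbr title="...">` markup is just one way to get an underlined word with a hover tooltip in a browser.

```python
import html
import re


def annotate(text, glossary):
    """Return HTML in which glossary words carry their suggested translation.

    `glossary` maps lowercase English words to the preferred translation.
    """
    if not glossary:
        return html.escape(text)
    words = sorted(glossary, key=len, reverse=True)  # prefer longer matches
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in words) + r")\b",
        re.IGNORECASE,
    )
    parts, last = [], 0
    for match in pattern.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        word = match.group(0)
        hint = glossary[word.lower()]
        parts.append(
            '<abbr title="%s">%s</abbr>'
            % (html.escape(hint, quote=True), html.escape(word))
        )
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Everything outside the glossary matches is HTML-escaped, so package descriptions containing characters such as `<` or `&` remain safe to display.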

On the 17th, Jens added some documentation for the DDTP temporary system.

The DDTS temporary system has served two useful purposes: it has made it possible for some languages to update their package descriptions for the upcoming Debian Etch release, and its development and testing have brought up a number of process issues which will be addressed by the new i18n infrastructure. Although the Wordforge team has been dealing with these issues for some time in producing Pootle and the Translate Toolkit, the future users of the Deb18n infrastructure (team leaders, project leaders and translators) have had an opportunity to test a web interface, and to consider how it would work best for their needs. After all, the more we put into designing this system, the more we will all get out of it. ;)