Pootle Metadata Storage

This is an attempt to crystallize some of the discussions we have had recently on improving the Pootle architecture. If you have a different idea for how things should work, the mailing list is the right place to discuss it, but clarifications and similar edits are welcome on this page.

These are some of the interacting issues we need to consider in parallel:

  1. Scaling to multiple processes (e.g. when running under Apache) requires better locking around our file access and changes.
  2. Scaling to larger numbers of files (e.g. 180000 for Debian) will probably require faster generation of statistics, and of translation metadata in general.
  3. Moving to a generic API for translation storage (the base classes we use for PO, XLIFF, etc.) requires reworking our storage interaction.

Locking and Base Classes are already being worked on in the Pootle-locking-branch. In addition, because of Base Classes, we have been factoring out the statistics generation, which was horribly intertwined with PO file interaction.

It may be helpful to review the terms used in the base classes; they are outlined under terminology.

Current status

Information Stored

Metadata here includes the following information that we currently store or use and that is either not stored in the PO file or is summary information:

  1. Counts of translation units (messages) in translation stores including
    • Number of strings in a translation store, and number of translated/untranslated strings
    • Number of words in original and translation of each string
    • Which strings in a file (referenced by position in the file) are translated/untranslated
    • Which strings in a file have suggestions waiting for processing
  2. Results of passing strings through checks
    • Which strings in each file have failed each of the checks (including fuzzy, untranslated, as well as all the punctuation checks etc)
  3. Assignment information
    • Strings within a file can be assigned individually or in groups (e.g. the whole file) to translators for a particular purpose (e.g. translate, review, etc.)
  4. Quick Statistics for a translation project (which is a set of translation stores translating a project into a language)
    • List of files
    • The number of words and strings, and the number of translated words and strings, for each file
  5. Goal information for a translation project
    • A number of goals can be defined for a project
    • Each goal has a list of files or directories (a directory implies all files within it) categorised in that goal
    • Each goal has a list of users assigned to that goal
  6. Rights for a translation project
    • There are default rights, rights for a ‘nobody’ user (not logged in), and rights that can be assigned to specific users
    • These rights currently include view, suggest, translate, review, download archive, compile to mo, assign strings/goals, and administrate
  7. Users for Pootle
    • Authentication info: username, email address, hash of password, activation status
    • Site-wide rights (project administrator)
    • User preferences: selected projects and languages (for shortcuts)
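
As an illustration of the counts in item 1 above, here is a minimal sketch of how they could be derived from a list of units. The `Unit` class and the `store_counts` function are hypothetical names invented for this example, not the real base classes, and splitting on whitespace is only an assumed approximation of word counting.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """Hypothetical stand-in for a translation unit (not the real base class)."""
    source: str
    target: str = ""


def store_counts(units):
    """Summarise one translation store.

    Assumption: a unit counts as translated when its target is non-empty,
    and words are approximated by splitting on whitespace.
    """
    counts = {
        "total": len(units),
        "translated": [],      # positions of translated units in the file
        "untranslated": [],    # positions of untranslated units in the file
        "source_words": 0,
        "target_words": 0,
        "translated_source_words": 0,
    }
    for position, unit in enumerate(units):
        source_words = len(unit.source.split())
        counts["source_words"] += source_words
        counts["target_words"] += len(unit.target.split())
        if unit.target.strip():
            counts["translated"].append(position)
            counts["translated_source_words"] += source_words
        else:
            counts["untranslated"].append(position)
    return counts
```

Keeping positions rather than only totals matches the list above, where translated and untranslated strings are referenced by their position in the file.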

Storage Formats

This information is currently stored in text files.

This is all far too messy and we need to clean it up properly.

Other Data

Other data that is not stored in the actual translation file (but isn’t strictly metadata):

Suggestions

Text Indexes

Plan for Relational Database

This is a proposal to move all of the above metadata (Counts, Checks, Quick Statistics, Assignment, Goals, Rights and User information) into a backend relational database. This move would also give us an opportunity to clean up exactly what metadata we need to store and how it interacts with changes and locking.
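
To make the proposal a little more concrete, the following is a rough sketch of what part of such a database could look like, using SQLite through Python's standard `sqlite3` module. The table and column names, the choice of SQLite, and the functions `open_metadata_db` and `quick_stats` are all assumptions made for illustration; nothing here is an agreed design, and only counts, checks and quick statistics are covered.

```python
import sqlite3

# Hypothetical schema: one row per translation store, one row per unit,
# and one row per (unit, failed check) pair.
SCHEMA = """
CREATE TABLE store (
    id INTEGER PRIMARY KEY,
    project TEXT NOT NULL,
    language TEXT NOT NULL,
    path TEXT NOT NULL,
    UNIQUE (project, language, path)
);
CREATE TABLE unit (
    store_id INTEGER NOT NULL REFERENCES store(id),
    position INTEGER NOT NULL,      -- position of the unit in the file
    source_words INTEGER NOT NULL,
    translated INTEGER NOT NULL,    -- 1 if translated, 0 otherwise
    PRIMARY KEY (store_id, position)
);
CREATE TABLE check_failure (
    store_id INTEGER NOT NULL,
    position INTEGER NOT NULL,
    check_name TEXT NOT NULL,
    FOREIGN KEY (store_id, position) REFERENCES unit(store_id, position)
);
"""


def open_metadata_db(path=":memory:"):
    """Open (or create) the metadata database and create the tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def quick_stats(conn, project, language):
    """Return (path, strings, translated strings, words, translated words)
    for each file in one translation project."""
    rows = conn.execute(
        """
        SELECT s.path,
               COUNT(u.position),
               COALESCE(SUM(u.translated), 0),
               COALESCE(SUM(u.source_words), 0),
               COALESCE(SUM(u.source_words * u.translated), 0)
        FROM store AS s
        LEFT JOIN unit AS u ON u.store_id = s.id
        WHERE s.project = ? AND s.language = ?
        GROUP BY s.id, s.path
        ORDER BY s.path
        """,
        (project, language),
    )
    return rows.fetchall()
```

With a layout along these lines, the Quick Statistics become a query over per-unit rows rather than a separately maintained summary, which is one way the move could reduce the amount of metadata we have to keep in sync.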

Contentious Issues

Discussions that this would raise:

These are the easiest things for people to suggest without getting into the nitty-gritty of solving problems. Discussion of these should take place separately from this discussion and planning. Reasons for this:

We can only handle so much change at one time and we already have 3 or 4 major changes going on, so let's make sure some of our current improvements land before we spend too much time discussing the above.

Other options considered

Other options we looked at for how to store metadata:

In the end, this is exactly the kind of thing relational databases were designed for, so a relational database seems the clear choice.

Issues for Design