<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
 "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">

<chapter id="ch-lingo">
<title>Language Validation and Derivation</title>

<para>Pology was designed with strong language-specific support in mind. This chapter describes the currently available features for validating and deriving translation, both as a whole and in its various parts.</para>

<!-- ======================================== -->
<sect1 id="sec-lglangenv">
<title>The Notion of Language in Pology</title>

<para>A versatile translation-supporting tool has to have some language-specific functionality. But it is difficult to agree on what is a language and what is a dialect, what is standard and what is jargon, what is derived from what, how any of these are named, and there are many witty remarks about existing classifications. Therefore, Pology takes a rather simple and non-formal approach to the definition of "language", but one that should provide good technical leverage for constructing language-specific functionality.</para>

<para>There are two levels of language-specificity in Pology.</para>

<para>The first level is simply the "language". In a linguistic sense this can be a language proper (whatever that means), a dialect, a variant written in a different script, etc. Each language in this sense is assigned a code in Pology, when the first elements of support for that language are introduced. By convention this code should be an <ulink url="http://en.wikipedia.org/wiki/ISO_639">ISO 639</ulink> code (either two- or three-letter) if applicable, but in principle it can be anything. Another convenient source of language codes is the GNU C library. For example, Portuguese as spoken in Portugal would have the code <literal>pt</literal> (ISO 639), while Portuguese spoken in Brazil would be <literal>pt_BR</literal> (GNU C library).</para>

<para>The second level of language-specificity is the "environment". In linguistic terms this covers distinct but minor variations in vocabulary, style, tone, or orthography, which are specific to certain groups of people within a single language community. Within Pology, this level is used to support variations between specific translation environments, such as long-standing translation projects and their teams. Although translating into the same language, translation teams will almost inevitably have some differences in terminology, style guidelines, etc. Environments also have codes assigned.</para>

<para>In every application in Pology, the language and its environments have a hierarchical relation. In general, language-specific elements defined outside of a specific environment ("environment-agnostic" elements) are a sort of relaxed least common denominator, and specific environments add their own elements to that. "Relaxed" means that environment-agnostic elements can sometimes include that which holds for most but not all environments, while each environment can override what it needs to. This prevents the environment-agnostic language support from becoming too limited just to cater for peculiarities of certain environments.</para>

<para>When processing PO files, it is necessary to somehow convey to Pology tools to which language and environment the PO files belong. The most effective way of doing this is by adding the necessary information to PO headers. All Pology tools that deal with language-specific elements will check the header of the PO file they process for the language and environment. Some Pology tools will also consult the user configuration (typically with lower priority than PO headers) or provide appropriate command line options (typically giving them higher priority). See <xref linkend="sec-cmheader"/> and <xref linkend="sec-cmconfig"/> for details.</para>

<sect2>
<title>Supported Languages and Environments</title>

<para>The following languages, and environments within those languages, currently have some level of support in Pology (assigned codes in parentheses; "t.t." stands for translation team):

<informaltable>
<tgroup cols="2">
<colspec colwidth="30%"/>
<colspec colwidth="70%"/>
<thead>
<row>
<entry>Language</entry>
<entry>Environments</entry>
</row>
</thead>
<tbody>

<row>
<entry>Catalan (<literal>ca</literal>)</entry>
</row>

<row>
<entry>French (<literal>fr</literal>)</entry>
</row>

<row>
<entry>Galician (<literal>gl</literal>)</entry>
</row>

<row>
<entry>Japanese (<literal>ja</literal>)</entry>
</row>

<row>
<entry>Low Saxon (<literal>nds</literal>)</entry>
</row>

<row>
<entry>Norwegian Nynorsk (<literal>nn</literal>)</entry>
</row>

<row>
<entry>Romanian (<literal>ro</literal>)</entry>
</row>

<row>
<entry>Russian (<literal>ru</literal>)</entry>
</row>

<row>
<entry>Serbian (<literal>sr</literal>)</entry>
<entrytbl cols="1">
<tbody>
<row><entry>KDE t.t. (<literal>kde</literal>)</entry></row>
<row><entry>The Battle for Wesnoth t.t. (<literal>wesnoth</literal>)</entry></row>
</tbody>
</entrytbl>
</row>

<row>
<entry>Spanish (<literal>es</literal>)</entry>
</row>

</tbody>
</tgroup>
</informaltable>
</para>

</sect2>

</sect1>

<!-- ======================================== -->
<sect1 id="sec-lgspell">
<title>Spell Checking</title>

<para>Pology can employ various well-known spell-checkers to check the translation in PO files. Currently there is standalone support for <ulink url="http://aspell.net/">Aspell</ulink>, and unified support for many spell-checkers (including Aspell) through <ulink url="http://www.abisource.com/projects/enchant/">Enchant, the spell-checking wrapper library</ulink> (more precisely, through the <ulink url="http://pyenchant.sourceforge.net">Python bindings</ulink> for Enchant).</para>

<para>Spell-checking of one PO file or a collection of PO files can be performed directly by <link linkend="ch-sieve">sieving</link> them through the <link linkend="sv-check-spell"><command>check-spell</command></link> (Aspell) or <link linkend="sv-check-spell-ec"><command>check-spell-ec</command></link> (Enchant) sieve. The sieve will report each unknown word, possibly with a list of suggestions, and the location of the message (file and line/entry numbers). It can also be requested to show the full message, with the unknown words in the translation highlighted.</para>

<para>Also provided are several <link linkend="hk-spell">spell-checking hooks</link>, which can be used as building blocks in custom translation validation chains. For example, a spell-checking hook can be used to define the spell-checking rule within Pology's <link linkend="sec-lgrules">validation rules</link> collection for a given language.</para>
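<para>To illustrate the hook idea, a spell-checking hook is essentially a callable over translation text. The following Python sketch is not Pology's actual API: the function name is invented, and the word set is a toy stand-in for a real backend such as Aspell or Enchant:</para>

```python
import re

# Toy stand-in for a real spelling dictionary backend (Aspell/Enchant).
KNOWN_WORDS = {"press", "the", "button", "to", "install", "comics"}

def check_spell_hook(text):
    """Return the list of unknown words found in the text."""
    unknown = []
    for word in re.findall(r"[A-Za-z]+", text):
        if word.lower() not in KNOWN_WORDS:
            unknown.append(word)
    return unknown

print(check_spell_hook("Press the buton to install comics"))  # ['buton']
```

A real hook would delegate the per-word check to the spell-checker, but the shape, text in and problem spans out, is what makes hooks composable in validation chains.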

<sect2 id="sec-lgspdicts">
<title>Internal Spelling Dictionaries</title>

<para>Pology collects internal language-specific word lists as supplements to system spelling dictionaries. One use of internal dictionaries is to record those words which are omitted from the system spelling dictionaries, but are actually proper words in the given language. Such words should be added into internal dictionaries only as an immediate fix for false spelling warnings, with an eye towards integrating them into the upstream spelling dictionaries of the respective spell-checkers.</para>

<para>More importantly, internal dictionaries serve to collect words specific to a given environment, i.e. words which are deemed too specific to be part of the upstream, general spelling dictionaries for the language. For example, this can be technical jargon, with newly coined terms which are yet to be more widely accepted. Another example could be translation of fiction, in books or computer games, where it is commonplace to make up words for fictional objects, animals, places, etc. which are not even intended to be more widely used.</para>

<para>In the Pology source tree, internal spelling dictionaries are located by language in <filename>lang/<replaceable>lang</replaceable>/spell/</filename> directories. Such a directory can contain an arbitrary number of dictionary files, which are all automatically picked up by Pology when spell-checking for that language is performed. Dictionary files directly in this directory are environment-agnostic, and should contain only the words which are standard (or standard derivations) in the language, but happen to be missing from the system spelling dictionary. Subdirectories represent specific environments; they are named after the environment code, and can also contain any number of dictionaries. An example of an internal dictionary tree with environments:
<programlisting>
lang/
    sr/
        spell/
            colors.aspell
            fruit.aspell
            ...
            science.aspell
            kde/
                general.aspell
            wesnoth/
                general.aspell
                propernames.aspell
</programlisting>
When one of Pology's spell-checking routes is applied for a given language without further qualifiers, only the environment-agnostic dictionaries of that language are automatically included. It must be explicitly requested to additionally include the dictionaries from one of the environments (e.g. by the <option>env:</option> parameter to the <command>check-spell</command> sieve).</para>
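<para>The dictionary selection just described can be sketched in Python. This is an illustration of the layout rules only; the function name and its arguments are assumptions, not Pology's API:</para>

```python
import os

def spelling_dictionaries(pology_root, lang, envs=()):
    """Collect .aspell files for a language: environment-agnostic ones
    always, per-environment ones only for explicitly requested envs."""
    spell_dir = os.path.join(pology_root, "lang", lang, "spell")
    dicts = []
    # Environment-agnostic dictionaries: files directly in spell/.
    for name in sorted(os.listdir(spell_dir)):
        path = os.path.join(spell_dir, name)
        if os.path.isfile(path) and name.endswith(".aspell"):
            dicts.append(path)
    # Dictionaries from explicitly requested environments only.
    for env in envs:
        env_dir = os.path.join(spell_dir, env)
        if os.path.isdir(env_dir):
            for name in sorted(os.listdir(env_dir)):
                if name.endswith(".aspell"):
                    dicts.append(os.path.join(env_dir, name))
    return dicts
```

Called without <literal>envs</literal>, this mirrors the default behavior: environment dictionaries stay out unless asked for.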

<para>Dictionary files are in the Aspell word list format (regardless of the spell-checker actually used), and must have the <filename>.aspell</filename> extension. This is a simple plain text format, listing one word per line. Only the first line, the header, is special: it states the language code, the number of words in the list, and the encoding. For example:
<programlisting>
personal_ws-1.1 fr 1234 UTF-8
apricot
banana
cherry
...
</programlisting>
Actually the only significant element of the header is the encoding. The language code and the number of words can be arbitrary, as Pology will not use them.</para>
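<para>A minimal reading of this format, reflecting the fact that only the encoding field of the header matters, could look like this (illustrative code, not Pology's parser):</para>

```python
def read_word_list(lines):
    """Parse an Aspell word list given as a list of text lines."""
    header = lines[0].split()  # e.g. ['personal_ws-1.1', 'fr', '1234', 'UTF-8']
    encoding = header[-1]      # the only header element that really matters
    words = [line.strip() for line in lines[1:] if line.strip()]
    return encoding, words

enc, words = read_word_list(["personal_ws-1.1 fr 1234 UTF-8", "apricot", "banana"])
# enc == 'UTF-8', words == ['apricot', 'banana']
```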

<para>Pology provides the <command>normalize-aspell-word-list</command> command, which sorts word list files alphabetically (and corrects the word count in the header, though it is not important), so that you do not have to manually insert new words in the proper order. The script is simply run with an arbitrary number of word list files as arguments, and modifies them in place. In case of duplicate words, it will report and eliminate the duplicates. In case of words with invalid characters (e.g. a space), the script will output a warning, but it will not remove them; automatic removal of invalid words can be requested with the <option>-r</option>/<option>--remove-invalid</option> option.</para>
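<para>The described behavior of <command>normalize-aspell-word-list</command> can be sketched as follows (an illustration of the normalization rules, not the actual script):</para>

```python
def normalize_word_list(lines, remove_invalid=False):
    """Sort a word list, drop duplicates, fix the header word count.
    Returns the normalized lines and a list of complaints."""
    header = lines[0].split()
    words, seen, complaints = [], set(), []
    for word in (line.strip() for line in lines[1:] if line.strip()):
        if " " in word:
            complaints.append("invalid word: %s" % word)
            if remove_invalid:   # only removed on explicit request
                continue
        if word in seen:
            complaints.append("duplicate word: %s" % word)
            continue
        seen.add(word)
        words.append(word)
    words.sort()
    header[2] = str(len(words))  # corrected word count
    return [" ".join(header)] + words, complaints
```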

</sect2>

<sect2 id="sec-lgspskip">
<title>Skipping Messages and Words</title>

<para>Sometimes a message, or a few words in it, should not be spell-checked. This can be the case, for example, when the message is dense computer input (like a command line synopsis), or when a word is part of a literal phrase (such as an email address). It may be possible to filter the text to remove some of the non-checkable words prior to spell-checking (especially when spell-checking is done as a <link linkend="sec-lgrules">validation rule</link>), but not all such words can be detected automatically. For example, onomatopoeic constructs are especially problematic ("Aaargh! Who released the beast?!").</para>

<para>For this reason it is possible to manually skip spell-checking on a message, or on certain words within a message, by adding a <link linkend="sec-cmskipcheck">special translator comment</link>. The whole message is skipped by adding the <literal>no-check-spell</literal> translator flag to it:
<programlisting language="po">
# |, no-check-spell
</programlisting>
Words within the message are skipped by listing them in a <literal>well-spelled:</literal> translator comment, comma- or space-separated:
<programlisting language="po">
# well-spelled: Aaarg, gaaah, khh
</programlisting>
Which of these two levels of skipping to use depends on the nature of the text. For example, if most of the text is composed of proper words, and there are only a few which should not be checked, it is probably better to list those words explicitly than to skip the whole message.</para>
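<para>A tool honoring these conventions would scan translator comments roughly like this (illustrative Python; comment lines are assumed to have the leading <literal>#</literal> and following space already stripped):</para>

```python
import re

def spell_check_exemptions(translator_comments):
    """Extract the skip-the-whole-message flag and per-word exemptions
    from translator comments."""
    skip_message, well_spelled = False, set()
    for comment in translator_comments:
        # Translator flag line: "|, flag1, flag2, ..."
        if comment.startswith("|,") and "no-check-spell" in comment:
            skip_message = True
        # Word exemptions: "well-spelled: w1, w2 w3"
        match = re.match(r"\s*well-spelled:\s*(.*)", comment)
        if match:
            well_spelled.update(
                w for w in re.split(r"[,\s]+", match.group(1)) if w)
    return skip_message, well_spelled
```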

</sect2>

</sect1>

<!-- ======================================== -->
<sect1 id="sec-lggrammar">
<title>Grammar Checking</title>

<para>With Pology you can use <ulink url="http://www.languagetool.org/">LanguageTool</ulink>, a free grammar and style checker, to check translation in PO files. At the moment LanguageTool is applicable only through <link linkend="sv-check-grammar">the <command>check-grammar</command> sieve</link>, so look up the details in its documentation.</para>

</sect1>

<!-- ======================================== -->
<sect1 id="sec-lguirefs">
<title>Automatic Insertion of UI Labels</title>

<para>In program documentation, but also in help texts in running programs, labels from the user interface are frequently mentioned. Here are two such messages, one a UI tooltip, the other a Docbook paragraph:
<programlisting language="po">
#: comic.cpp:466
msgid "Press the \"Get New Comics...\" button to install comics."
msgstr ""

#: index.docbook:157
msgid ""
"&lt;guimenuitem>Selected files only&lt;/guimenuitem> extracts only "
"the files which have been selected."
msgstr ""
</programlisting>
In the usual translation process, an embedded UI label is manually translated just like the surrounding text. You could directly translate the label, hoping that the original UI message was translated in the same way, but this will frequently not be the case (especially for longer labels). To be thorough, you could look up the UI message in its PO file, or run the program, to see how it was actually translated. There are two problems with being thorough in this way: it takes time to look up original UI messages, and worse, the translation of a UI message might change in the future (e.g. after a review) and leave the referencing message out of date.</para>

<para>An obvious solution to these problems, in principle, would be to leave embedded UI labels untranslated but properly marked (such as with <literal>&lt;gui*&gt;</literal> tags in Docbook), and have an automatic system fetch their translations from the original UI messages and insert them into referencing messages. However, there could be many implementational variations of this approach (such as the stage of the translation chain at which the automatic insertion happens), with some significant details to get right.</para>

<para>At present, Pology approaches automatic insertion of UI labels in a generalized way, which does not mandate any particular organization of PO files or translation workflow. It defines a syntax for wrapping and disambiguating UI references, for linking referencing and originating PO files, and provides a series of <link linkend="hk-uiref-resolve-ui">hooks to resolve and validate UI references</link>. A UI reference resolving hook will simply replace a properly equipped non-translated UI label with its translation. This implies that the PO files which are delivered must not be the same PO files which are directly translated, because resolving UI references in directly translated PO files would preclude their automatic update in the future<footnote>
<para>Another advantage is that the original text too will sometimes contain out-of-date UI references, which this process will automatically discover, enabling the translation to be more up to date than the original. Of course, reporting the problem to the authors would be desirable, or even necessary when the related feature no longer exists.</para>
</footnote>. It is upon the translator or the translation team to establish the separation between delivered and translated PO files. One way is by translating in summit (see <xref linkend="ch-summit"/>), which by definition provides the desired separation, and setting UI reference resolving hooks as <link linkend="sec-sucfghooks">filters on scatter</link>.</para>

<sect2 id="sec-lguiformat">
<title>Wrapping UI References</title>

<para>If UI references are inserted into the text informally (even if relying on certain orthographic or typographic conventions), they must be manually wrapped in the translation using an explicit UI reference directive. For example:
<programlisting language="po">
#: comic.cpp:466
msgid "Press the \"Get New Comics...\" button to install comics."
msgstr "Pritisnite dugme „~%/Get New Comics/“ da instalirate stripove."
</programlisting>
Explicit UI reference directives are of the format <replaceable>head</replaceable>/<replaceable>reference-text</replaceable>/. The directive head is <literal>~%</literal> in this example, which is the default, but another head may be specified as a parameter to UI resolving hooks. The delimiting slashes in the UI reference directive can be replaced with any other character, used consistently (e.g. if the UI text itself contains a slash). Note that the directive head must be fixed for a collection of PO files (though more than one head can be defined), while the delimiting character can be freely chosen from one directive to another.</para>
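<para>Extraction of explicit directives can be sketched as follows (illustrative Python, assuming well-formed directives; not Pology's implementation):</para>

```python
def find_ui_references(text, head="~%"):
    """Extract reference texts from head/reference-text/ directives.
    The character right after the head is taken as the delimiter,
    so any delimiter character works, as long as it is used in pairs."""
    refs, pos = [], 0
    while True:
        start = text.find(head, pos)
        if start < 0:
            return refs
        delim = text[start + len(head)]
        end = text.find(delim, start + len(head) + 1)
        refs.append(text[start + len(head) + 1:end])
        pos = end + 1

print(find_ui_references("Pritisnite dugme „~%/Get New Comics/“"))
# ['Get New Comics']
```

Note how a non-slash delimiter is picked up automatically: <literal>~%:I/O Settings:</literal> yields <literal>I/O Settings</literal>, even though the text contains a slash.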

<para>The other type is the implicit UI reference, which requires no special directive and is possible when UI text is indicated in the text through formal markup. This is the case, for example, in PO files coming from Docbook documentation:
<programlisting language="po">
#: index.docbook:157
msgid ""
"&lt;guimenuitem>Selected files only&lt;/guimenuitem> extracts only "
"the files which have been selected."
msgstr ""
"&lt;guimenuitem>Selected files only&lt;/guimenuitem> raspakuje samo "
"datoteke koje su izabrane."
</programlisting>
Here the translation contains nothing special, save for the fact that the UI reference is not translated. UI resolving hooks can be given a list of tags to be considered UI references, and for some common formats (such as Docbook) there are predefined specialized hooks which already list all UI tags.</para>

<para>If the message of the UI text is unique by its <varname>msgid</varname> string in the originating PO file, then it can be wrapped simply as in the previous examples. This means that even if it has a <varname>msgctxt</varname> string, the reference will still be resolved. But, if there are several UI messages with the same <varname>msgid</varname> (implying different <varname>msgctxt</varname>), then the <varname>msgctxt</varname> string has to be manually added to the reference. This is done by putting the context into the prefix of the reference, separated by the pipe <literal>|</literal> character. For example, if the PO file has these two messages:
<programlisting language="po">
msgctxt "@title:menu"
msgid "Columns"
msgstr "Kolone"

msgctxt "@action:inmenu View Mode"
msgid "Columns"
msgstr "kolone"
</programlisting>
then the correct one can be selected in an implicit UI reference like this:
<programlisting language="po">
msgid "...&lt;guibutton>Columns&lt;/guibutton>..."
msgstr "...&lt;guibutton>@title:menu|Columns&lt;/guibutton>..."
</programlisting>
In the very unlikely case of the <literal>|</literal> character being part of the context string itself, the <literal>¦</literal> character ("broken bar") can be used as the context separator instead.</para>

<para>If a UI reference equipped with context does not resolve to a message through a direct match on context, the given context string will next be tried as a regular expression match on the <varname>msgctxt</varname> strings of the messages with matching <varname>msgid</varname> (matching will be case-insensitive). If this results in exactly one matched message, the reference is resolved. This matching sequence allows simplification and robustness in the case of longer contexts, which would look ungainly in the UI reference and may slightly change over time.</para>
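<para>The resolution order for context-qualified references can be sketched like this (illustrative Python over (msgctxt, msgid, msgstr) triples; the function name and data shape are invented, and the <literal>¦</literal> fallback separator is omitted for brevity):</para>

```python
import re

def resolve_reference(reference, messages):
    """Resolve 'ctxt|text' or plain 'text' against (msgctxt, msgid, msgstr)
    triples: exact context match first, then unique case-insensitive
    regex match on msgctxt."""
    if "|" in reference:
        ctxt, text = reference.split("|", 1)
    else:
        ctxt, text = None, reference
    candidates = [m for m in messages if m[1] == text]
    if ctxt is None:
        return candidates[0][2] if len(candidates) == 1 else None
    # 1) Direct match on context ("" selects the context-less message).
    exact = [m for m in candidates if (m[0] or "") == ctxt]
    if len(exact) == 1:
        return exact[0][2]
    # 2) Case-insensitive regex match, accepted only if unique.
    rx = re.compile(ctxt, re.I)
    loose = [m for m in candidates if rx.search(m[0] or "")]
    return loose[0][2] if len(loose) == 1 else None
```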

<para>If two UI messages have an equal <varname>msgid</varname> but are not part of the same PO file, that is not a conflict, because one of those PO files has priority (see <xref linkend="sec-lguilink"/>).</para>

<para>If, of two UI messages with an equal <varname>msgid</varname>, one has a <varname>msgctxt</varname> and the other does not, the message without context can be selected by adding the context separator in front of the text with nothing before it (i.e. as if the context were "empty").</para>

<para>Sometimes, though rarely, it happens that the referenced UI text is not statically complete, that is, it contains a format directive which is resolved at runtime. In such cases, the reference must be transformed to exactly match an existing <varname>msgid</varname>, and the arguments are substituted with a special syntax. If the UI message is:
<programlisting language="po">
msgid "Configure %1..."
msgstr "Podesi %1..."
</programlisting>
then it can be used in an implicit UI reference like this:
<programlisting language="po">
msgid "...&lt;guimenuitem>Configure Foobar...&lt;/guimenuitem>..."
msgstr "...&lt;guimenuitem>Configure %1...^%1:Foobar&lt;/guimenuitem>..."
</programlisting>
Substitution arguments follow after the text, separated with the <literal>^</literal> character. Each argument specifies the format directive it replaces and the argument text, separated by <literal>:</literal>. In the unlikely case that <literal>^</literal> is part of the <varname>msgid</varname> itself, the <literal>ª</literal> character ("feminine ordinal indicator") can be used instead as the argument separator.</para>
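<para>Argument substitution can be sketched as follows (illustrative Python; the function name and the lookup interface are assumptions). Named directives replace all identical occurrences; a <literal>!</literal> prefix, as described next, replaces just one:</para>

```python
def substitute_arguments(reference, lookup, arg_sep="^"):
    """Split 'msgid^directive:text^...' and apply the arguments to the
    translation returned by lookup(msgid)."""
    parts = reference.split(arg_sep)
    msgid, args = parts[0], parts[1:]
    text = lookup(msgid)
    for arg in args:
        directive, value = arg.split(":", 1)
        if directive.startswith("!"):
            text = text.replace(directive[1:], value, 1)  # one occurrence
        else:
            text = text.replace(directive, value)         # "named": all
    return text

translations = {"Configure %1...": "Podesi %1..."}
print(substitute_arguments("Configure %1...^%1:Foobar", translations.get))
# Podesi Foobar...
```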

<para>If there are several format directives in the UI reference, they are by default considered "named". This means that all identical format directives will be replaced by the same argument. This is the right thing to do for some formats, e.g. <literal>python-format</literal> or <literal>kde-format</literal> messages, but not for all formats. In <literal>c-format</literal>, if there are two <literal>%s</literal> in the text, to replace just one of them with the current argument, the format directive attached to the argument must be preceded with <literal>!</literal>:
<programlisting language="po">
msgid "...&lt;guilabel>This Foo or that Bar&lt;/guilabel>..."
msgstr "...&lt;guilabel>This %s or that %s.^!%s:foo^!%s:bar&lt;/guilabel>..."
</programlisting>
</para>

<para>If the referenced UI text contains a format directive and the reference is found in a message of the same format (e.g. a UI reference in the text of a tooltip), then using the special syntax to substitute an argument will normally make the <command>msgfmt</command> command with <option>-c</option> signal an invalid message. For example:
<programlisting language="po">
#, kde-format
msgid "You can change this behavior under \"Configure Foobar...\"."
msgstr "Ovo ponašanje možete promeniti pod „~%/Configure %1...^%1:Foobar/“."
</programlisting>
is not valid because the <varname>msgstr</varname> now contains the <literal>kde-format</literal>-specific <literal>%1</literal> directive, but the <varname>msgid</varname> does not. To fix this, UI reference resolving hooks can be given an alternative directive start string, so as to mask it from <command>msgfmt</command>. In this example:
<programlisting language="po">
#, kde-format
msgid "You can change this behavior under \"Configure Foobar...\"."
msgstr "Ovo ponašanje možete promeniti pod „~%/Configure %~1...^%~1:Foobar/“."
</programlisting>
the alternative directive start is <literal>%~</literal>, which makes it no longer appear as a <literal>kde-format</literal>-specific directive. Of course, before the reference is looked up in a UI catalog, the normal form of the directive will be recovered internally.</para>

<para>In general, but especially with implicit references, the text wrapped as a reference may actually contain several references in the form of a UI path (<literal>"...go to Foo->Bar->Baz, and click on..."</literal>). To handle such cases, when it is not possible or convenient to wrap each element of the UI path separately, UI reference resolving hooks can be given one or more UI path separators (e.g. <literal>-></literal>) by which to split the path and resolve the element references on their own.</para>

<para>Sometimes the UI reference in the original text is not valid, i.e. such a message no longer exists in the program. This can happen due to a slight punctuation mismatch, small style changes, etc., in which case you can easily locate the correct UI message and use its <varname>msgid</varname> as the reference. However, if the UI reference is not valid because the documentation is outdated, there is no correct UI message to use in the translation. This should most certainly be reported to the authors, but until they fix it, it presents a problem for immediate resolution of UI references. For this reason, a UI reference can be temporarily translated in place, by preceding it with twin context separators:
<programlisting language="po">
msgid "...&lt;guilabel>An Outdated Label&lt;/guilabel>..."
msgstr "...&lt;guilabel>||Zastarela etiketa&lt;/guilabel>..."
</programlisting>
This will resolve into the verbatim text of the reference (i.e. the context separators will simply be removed), without the hook complaining about an unresolvable reference.</para>

</sect2>


<sect2 id="sec-lguinorm">
<title>Normalization of UI Text</title>

<para>The text of a UI message may contain some characters and substrings which should not be carried over into the text which references the message, or should be modified. To cater for this, UI PO files are normalized after being opened and before UI references are looked up in them. In fact, UI references are written precisely in this normalized form, rather than using the true original <varname>msgid</varname> from the UI PO file. This is both for convenience and out of necessity.</para>

<para>One typical thing to handle in normalization is the accelerator marker. UI reference resolving hooks eliminate accelerator markers automatically, but for that they need to know what the accelerator marker character is. To find this out, hooks will read <link linkend="hdr-x-accelerator-marker">the <literal>X-Accelerator-Marker</literal> header field</link>.</para>
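<para>In the simplest case, accelerator removal amounts to dropping the first occurrence of the marker character. This is a deliberate simplification for illustration; real normalization also has to handle doubled literal markers and similar corner cases:</para>

```python
def remove_accelerator(text, marker):
    """Drop the first occurrence of the accelerator marker character,
    as read from the X-Accelerator-Marker header field."""
    if marker:
        return text.replace(marker, "", 1)
    return text

print(remove_accelerator("&Open File", "&"))  # Open File
```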

<para>Another problem is when UI messages contain subsections which would invalidate the target format which is being translated in the referencing PO file, e.g. malformed XML in Docbook catalogs. For example, a literal <literal>&amp;</literal> must be represented as <literal>&amp;amp;</literal> in Docbook markup, thus this UI message:
<programlisting language="po">
msgid "Scaled &amp; Cropped"
msgstr ""
</programlisting>
would be referenced as:
<programlisting language="po">
msgid "...&lt;guimenuitem>Scaled &amp;amp; Cropped&lt;/guimenuitem>..."
msgstr "...&lt;guimenuitem>Scaled &amp;amp; Cropped&lt;/guimenuitem>..."
</programlisting>
Resolving hooks have parameters for specifying the type of escaping needed by the target format.</para>

<para>Normalization may flatten several different messages from the UI PO file into one. An example of this is when <varname>msgid</varname> fields are equal except for the accelerator marker. If this happens and the normalized translations are not equal for all flattened messages, a special "tail" is added to their contexts, consisting of a tilde and several alphanumeric characters. The first run of the resolving (or validation) hook will report ambiguities of this kind, as well as the assigned contexts, so that the proper context can be copied and pasted into the UI reference. The alphanumeric context tail is computed from the non-normalized <varname>msgid</varname> alone, so it will not change if, for example, messages in the UI PO file get reordered.</para>
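<para>One plausible scheme for such a tail, shown purely as an assumption since the chapter does not specify Pology's actual algorithm, is to take a few characters of a digest of the non-normalized <varname>msgid</varname>; any scheme works as long as it depends only on that string:</para>

```python
import hashlib

def context_tail(msgid, length=5):
    """Derive a stable alphanumeric context tail from the non-normalized
    msgid, so it survives reordering of the UI PO file.
    (Illustrative scheme; Pology's actual computation may differ.)"""
    digest = hashlib.md5(msgid.encode("utf-8")).hexdigest()
    return "~" + digest[:length]
```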

</sect2>

<sect2 id="sec-lguilink">
<title>Linking to Originating PO Files</title>

<para>In general, a UI message may not be present in the same PO file in which it is referenced in other messages. This is always the case for documentation PO files. Therefore UI reference resolving hooks need to know two things: the list of all UI PO files (those from which UI references may be drawn), and, for each PO file which contains UI references, the list of PO files from which it may draw UI references.</para>

<para>The list of UI PO files can be given to resolving hooks explicitly, as a list of PO file paths (or directory paths to search for PO files). This can, however, be inconvenient, as it implies either that the resolution script must be invoked in a specific directory (if the paths are relative), or that UI PO files must reside in a fixed directory on the system where the resolution script is run (if the paths are absolute). Therefore there is another way of specifying paths to UI PO files, through an environment variable which contains a colon-separated list of directory paths. Both the explicit list of paths and the environment variable which contains the paths can be given as parameters to hooks.</para>

<para>By default, for a given PO file, UI references are looked for only in the PO file of the same name, assuming that it is found among the UI PO files. This may be sufficient, for example, for UI references in tooltips, but it is frequently not sufficient for documentation PO files, which may have different names from the corresponding UI PO files. Therefore a PO file can be manually linked to the UI PO files from which it draws UI references, through the special header field <literal>X-Associated-UI-Catalogs</literal>. This field specifies only the PO domain names, as a space- or comma-separated list:
<programlisting language="po">
msgid ""
msgstr ""
"Project-Id-Version: foobar\n"
"..."
"X-Associated-UI-Catalogs: foobar libfoobar libqwyx\n"
</programlisting>
The order of domain names in the list is important: if the referenced UI message exists in more than one linked PO file, the translation is taken from the one which appears earlier in the list. Knowing the PO domain names, resolving hooks can look up the exact file paths in the supplied list of paths.</para>
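<para>The priority lookup implied by this header field can be sketched like this (illustrative Python; the function name, the catalog mapping, and the sample translations are assumptions for the example):</para>

```python
def resolve_via_catalogs(msgid, header_value, catalogs):
    """Look up a UI msgid across linked catalogs; earlier domains in the
    X-Associated-UI-Catalogs value take priority. `catalogs` maps a
    domain name to a {msgid: msgstr} dictionary."""
    for domain in header_value.replace(",", " ").split():
        msgstr = catalogs.get(domain, {}).get(msgid)
        if msgstr:
            return msgstr
    return None

catalogs = {
    "foobar": {"Columns": "Kolone"},
    "libfoobar": {"Columns": "kolone", "Rows": "Redovi"},
}
print(resolve_via_catalogs("Columns", "foobar libfoobar libqwyx", catalogs))
# Kolone
```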
0329 
0330 </sect2>
0331 
0332 <sect2 id="sec-lguinotes">
0333 <title>Notes on UI Reference Resolution</title>
0334 
0335 <para>When a UI reference cannot be resolved, for whatever reason -- it does not exist, there is a context conflict, the message is not translated, etc. -- resolving hooks will output warnings and fallback to original text.</para>
0336 
0337 <para>For each resolving hook there exists the counterpart validation hook. Validation hooks may be used in a "dry run" before starting to build PO files for delivery, or they may be built into a general translation validation framework (such as Pology's <link linkend="sec-lgrules">validation rules</link>).</para>
0338 
0339 </sect2>
0340 
0341 </sect1>
0342 
0343 <!-- ======================================== -->
0344 <sect1 id="sec-lgrules">
0345 <title>Validation Rules</title>
0346 
<para>There are a great many possible mistakes to be made when translating. Some of these mistakes can only be observed and corrected by a human reviewer<footnote>
<para>Taking into account the current level of development of artificial intelligence, which, granted, may grow more sophisticated in the future.</para>
0349 </footnote>, and review is indeed an important part of the translation workflow. However, many mistakes, especially those more technical in nature, can be fully or partially detected by automatic means.</para>
0350 
<para>A number of tools are available for performing various checks on translation in PO files. The basic one is Gettext's <command>msgfmt</command> command which, when run with the <option>-c</option>/<option>--check</option> option, will detect many "hard" technical problems. These are the kind of problems which may cause the program that uses the translation to crash, or which may cause loss of information to the program user. Another is <ulink url="http://translate.sourceforge.net/wiki/toolkit/pofilter">Translate Toolkit's <command>pofilter</command> command</ulink>, which applies heuristic checks to detect common (and not so common) stylistic and semantic slips in translation. Dedicated PO editors may also provide some checks of their own, or make use of external batch tools.</para>
0352 
<para>One commonality of existing validation tools is that they aim for generality, that is, they try to apply a fixed battery of checks to all <link linkend="sec-lglangenv">languages and environments</link> (although some differentiation by translation project may be present, such as in <command>pofilter</command>). Another commonality, unavoidable in heuristic approaches, is the wrong detection of valid translation as invalid, the so-called "false positives". These two elements produce a combined negative effect: since the number and specificity of checks is not that great compared to what a dedicated translator could come up with for a given language and environment, and since many reported errors are false positives with no possibility of cancellation, the motivation to apply automatic checks sharply decreases; the more so the greater the amount of translation.</para>
0354 
<para>Pology therefore provides a system for users to assemble collections of <emphasis>validation rules</emphasis> adapted to their language and environment, with multi-level facilities for applying or skipping rules in certain contexts, pre-filtering of text before applying rules, post-filtering, and opening problematic messages in PO editors. Rules can be written and tuned in the course of translation, and false positives can be systematically canceled, such that over time the collection of rules becomes both highly specific and highly accurate. Since Pology supports language and environment variations from the ground up, such rule collections can be committed to the Pology source distribution, so that anyone may use them when applicable.</para>
0356 
0357 <para>Validation rules are primarily based on pattern matching with <link linkend="sec-cmregex">regular expressions</link>, but they can in principle contain any Python code through Pology's <link linkend="sec-cmhooks">hook system</link>. For example, since there are spell-checking hooks provided, spell-checking can be easily made into one validation rule. One could even aim to integrate every available check into the validation rule system, such that it becomes the single and uniform source of all automatic checks in the translation workflow.</para>
0358 
0359 <para>The primary tool in Pology for applying validation rules is <link linkend="sv-check-rules">the <command>check-rules</command> sieve</link>. This section describes how to write rules, how to organize rule collections, and, importantly, how to handle false positives.</para>
0360 
0361 <sect2 id="sec-lgrltour">
0362 <title>Guided Tour of the Rule System</title>
0363 
0364 <para>There are many nuances to the validation rule system in Pology, so it is best to start off with an example-based exposition of the main elements. Subsequent sections will then look into each element in detail.</para>
0365 
<para>Rules are defined in rule files, which have a flat structure and minimalistic syntax, since the idea is that rules are written in the course of translation (or translation review). Here is one rule file with two rules:
0367 <programlisting>
0368 # Personal rules of Horatio the Indefatigable.
0369 
0370 [don't|can't|isn't|aren't|won't|shouldn't|wouldn't]i
0371 id="gram-contr"
0372 hint="Do not use contractions."
0373 
0374 {elevator}i
0375 id="term-elevator"
0376 hint="Translate 'elevator' as 'lift'."
0377 valid msgstr="lift"
0378 </programlisting>
0379 A rule file should begin with a comment telling something about the rules defined in the file. Then the rules follow, normally separated by one or more blank lines. Each rule starts with a <emphasis>trigger pattern</emphasis>, of which there are several types. The trigger pattern can sometimes be everything there is to the rule, but it is usually followed by a number of <emphasis>subdirectives</emphasis>.</para>
0380 
<para>The first rule above starts with a regular expression pattern on the translation, which is denoted by the <literal>[...]</literal> syntax. The regular expression matches English contractions, case-insensitively as indicated by the trailing <literal>i</literal> flag. The trigger pattern is followed by the <literal>id</literal> subdirective, which specifies an identifier for the rule (here <literal>gram-contr</literal> is short for "grammar, contractions"). The identifier does not have to be present, and does not even have to be unique if present (uses of rule identifiers will be explained later). If the rule matches a message, the message is reported to the user as problematic, along with the note provided in the <literal>hint</literal> subdirective.</para>
0382 
<para>The second rule starts with a regular expression pattern on the original (rather than the translation), for which the <literal>{...}</literal> syntax is reserved. Then the <literal>id</literal> and <literal>hint</literal> subdirectives follow, as in the first rule. But unlike the first rule, up to this point the second rule would be somewhat strange: report a problem whenever the word "elevator" is found in the original text? That is where the final <literal>valid</literal> subdirective comes in, by specifying a condition on the translation (<literal>msgstr=</literal>) which cancels the trigger pattern. So this rule effectively states "report every message which has the word 'elevator' in the original, but not the word 'lift' in the translation", making it a terminology assertion rule.</para>
0384 
0385 <para>If the given example rule file is saved as <filename>personal.rules</filename>, it can be applied to a collection of PO files by the <command>check-rules</command> sieve in the following way:
0386 <programlisting language="bash">
0387 $ posieve check-rules -s rfile:<replaceable>pathto</replaceable>/personal.rules <replaceable>PATHS...</replaceable>
0388 </programlisting>
The path to the rule file to apply is given by the <option>rfile:</option> sieve parameter. All messages which are "failed" by rules will be output to the terminal, with the spans of text that triggered the rule highlighted, and the note attached to the rule displayed after the message. Additionally, one of the parameters for automatically opening messages in a PO editor can be issued, to make correcting problems (or canceling false positives) all the more comfortable.</para>
0390 
<para>The <option>rfile:</option> sieve parameter can be repeated to add several rule files. If all rule files are put into one directory (and its subdirectories), a single <option>rdir:</option> parameter can be used to specify the path to that directory; all files with the <filename>.rules</filename> extension will be recursively collected from it and applied. Finally, if rule files are put into Pology's rule directory for the given language, at <filename>lang/<replaceable>lang</replaceable>/rules/</filename>, then <command>check-rules</command> will automatically pick them up when neither the <option>rfile:</option> nor the <option>rdir:</option> parameter is issued. This is a simple way to test rules intended for inclusion in the Pology distribution.</para>
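<para>For example, assuming that the rule files are collected in a directory named <filename>myrules/</filename> (the name is illustrative), the whole collection would be applied with:
<programlisting language="bash">
$ posieve check-rules -s rdir:<replaceable>pathto</replaceable>/myrules/ <replaceable>PATHS...</replaceable>
</programlisting>
</para>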
0392 
0393 <para>Instead of applying all defined rules, parameters <option>rule:</option>, <option>rulerx:</option>, <option>norule:</option>, <option>norulerx:</option> of <command>check-rules</command> can be used to select specific rules to apply or to not apply, by their identifiers. To apply only the no-contractions rule:
0394 <programlisting language="bash">
0395 $ posieve check-rules -s rfile:<replaceable>pathto</replaceable>/personal.rules -s rule:gram-contr <replaceable>PATHS...</replaceable>
0396 </programlisting>
0397 and to apply all but terminology rules, assuming that their identifiers start with <literal>term-</literal>:
0398 <programlisting language="bash">
0399 $ posieve check-rules -s rfile:<replaceable>pathto</replaceable>/personal.rules -s norulerx:term-.* <replaceable>PATHS...</replaceable>
0400 </programlisting>
0401 </para>
0402 
0403 <para>When the rule trigger pattern is a regular expression, it can always be made more or less specific. The previous example of matching English contractions could be generalized like this:
0404 <programlisting>
0405 [\w+'t\b]i
0406 </programlisting>
This regular expression will match one or more word characters (<literal>\w+</literal>) followed by 't (<literal>'t</literal>) positioned at a word boundary (<literal>\b</literal>). More general patterns increase the likelihood of false positives, but this is not really a problem, since tweaking the rules in the course of translation is expected. It is a bigger problem if the pattern is made too specific at first, such that it misses some cases. It is therefore recommended to start with "greedy" patterns, and then to constrain them as false positives are observed.</para>
0408 
0409 <para>However, tweaking trigger patterns can only go so far.<footnote>
0410 <para>And cause regular expressions to become horribly complicated.</para>
</footnote> The workhorse of rule flexibility is instead the aforementioned <literal>valid</literal> subdirective. Within a single <literal>valid</literal> subdirective there may be several tests, and many types of tests are provided. The trigger will be canceled if all the tests in the <literal>valid</literal> subdirective are satisfied (boolean AND linking). There may be several <literal>valid</literal> subdirectives, each with its own battery of tests, and then the trigger is canceled if any of the <literal>valid</literal> subdirectives is satisfied (boolean OR linking). For example, to disallow a certain word in translation unless it is used in a few specific constructs, the following set of <literal>valid</literal> subdirectives can be used:
0412 <programlisting>
0413 [foo]i
0414 id="style-nofoo"
0415 hint="The word 'foo' is allowed only in '*goo foo' and 'foo bar*' constructs."
0416 valid after="goo "
0417 valid before=" bar"
0418 </programlisting>
The first <literal>valid</literal> subdirective cancels the rule if the trigger pattern matched just after a "goo " segment, and the second if it matched just before a " bar" segment. Another example would be a terminology assertion rule where a certain translation is expected in general, but another translation is also allowed in a specific PO file:
0420 <programlisting>
0421 {foobar}i
0422 id="term-foobar"
0423 hint="Translate 'foobar' as 'froobaz' (somewhere 'groobaz' allowed too)."
0424 valid msgstr="froobaz"
0425 valid msgstr="groobaz" cat="gfoo"
0426 </programlisting>
0427 Here the second <literal>valid</literal> subdirective uses the <literal>cat=</literal> test to specify the other possible translation in the specific PO file. Tests can be negated by prepending <literal>!</literal> to them, so to require the specific PO file to have <emphasis>only</emphasis> the other translation:
0428 <programlisting>
0429 valid msgstr="froobaz" !cat="gfoo"
0430 valid msgstr="groobaz" cat="gfoo"
0431 </programlisting>
0432 </para>
0433 
0434 <para>When a regular expression is not sufficient as the rule trigger, a validation hook can be used instead (one of V* hook types). See <xref linkend="sec-cmhooks"/> for general discussion on hooks in Pology. For example, since there are spell-checking hooks already available, the complete rule for spell-checking could be:
0435 <programlisting>
0436 *hook name="spell/check-spell-ec-sp" on="msgstr"
0437 id="spelling"
0438 hint="Misspelled words detected."
0439 </programlisting>
The <literal>name=</literal> field specifies the hook, and the <literal>on=</literal> field the parts of the message it should operate on. The parts given by the <literal>on=</literal> field must be appropriate for the hook type; since <literal>spell/check-spell-ec-sp</literal> is a V3A hook, it can operate on any string in the message, including the translation as requested here. Validation hooks can provide some notes of their own (here a list of replacement suggestions for a misspelled word), which will be shown next to the note given by the rule's <literal>hint=</literal> subdirective.</para>
0441 
0442 <para>Examples so far all suffer from one basic problem: the trigger pattern will fail to match a word which has an accelerator marker inside it.<footnote>
<para>Why not remove accelerator markers automatically before applying rules? Because some rules may be exactly about accelerator markers, e.g. that the marker should not be placed next to certain letters.</para>
</footnote> This is actually an instance of a broader problem: some rules should operate on somewhat modified, filtered text, instead of on the original text. This is why the rule system in Pology also provides extensive filtering capabilities. If the accelerator marker is <literal>_</literal> (the underscore), here is how it could be removed before applying the rules:
0445 <programlisting>
0446 # Personal rules of Horatio the Indefatigable.
0447 
0448 addFilterRegex match="_" repl="" on="pmsgid,pmsgstr"
0449 
0450 # Rules follow...
0451 </programlisting>
The <literal>addFilterRegex</literal> directive sets a regular expression filter that will be applied to messages before any of the rules that follow. The <literal>match=</literal> field provides the pattern, <literal>repl=</literal> the replacement, and <literal>on=</literal> the parts of the message to filter.</para>
0453 
<para>The accelerator marker filter from the previous example is quite crude. It hardcodes the accelerator marker character, and it simply removes every underscore from the text, whether or not it actually marks an accelerator. Filters too can be hooks instead of regular expressions, and in this case it is better to use the dedicated accelerator marker removal hook:
0455 <programlisting>
0456 # Personal rules of Horatio the Indefatigable.
0457 
0458 addFilterHook name="remove/remove-accel-msg" on="msg"
0459 
0460 # Rules follow...
0461 </programlisting>
<link linkend="hk-remove-remove-accel">The <literal>remove/remove-accel-msg</literal> hook</link> is an F4A hook, and therefore the <literal>on=</literal> field specifies the whole message as the target of filtering. This hook will use information from PO file headers, respecting command line overrides, to determine the accelerator marker character, and then remove it only from valid accelerator positions.</para>
0463 
<para>Filters do not have to be given as global directives, influencing all the rules below them; a filter can also be defined for a single rule, using one of the rule subdirectives. Conversely, a global filter can be assigned a handle (using the <literal>handle=</literal> field), and then this handle can be used to remove the filter on a specific rule.</para>
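<para>For illustration, suppose that a global filter strips literal three-dot ellipses from messages, but that one rule needs to see them; the filter can then be given a handle and lifted for that rule alone. The following is an illustrative sketch of this scheme (the filter, the rule, the handle name, and the filter-removing subdirective shown here are examples, not rules from an actual collection):
<programlisting>
addFilterRegex match="\.\.\." repl="" on="pmsgid,pmsgstr" handle="ellipsis"

# This rule checks the dots themselves, so the filter must not apply to it.
[\.\.\.]
id="style-ellipsis"
hint="Use the ellipsis character instead of three dots."
removeFilter handle="ellipsis"
</programlisting>
</para>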
0465 
<para>The last important concept in Pology's validation rule system is rule environments. The examples so far defined rules for a given language, which means that they in principle apply to any PO file of that language. This is generally insufficient (e.g. due to terminology differences between translation projects), so rules too can be made to support Pology's <link linkend="sec-lglangenv">language and environment</link> hierarchy. Going back to the initial rule file example, let us assume that "elevator" should always become "lift", but that English contractions are rejected only in more formal translations. Then, the rule file could be modified to:
0467 <programlisting>
0468 # Personal rules of Horatio the Indefatigable.
0469 
0470 [don't|can't|isn't|aren't|won't|shouldn't|wouldn't]i
0471 environment formal
0472 ...
0473 
0474 {elevator}i
0475 ...
0476 </programlisting>
0477 The first rule now has the <literal>environment</literal> subdirective, which sets this rule's environment to <literal>formal</literal>. If <command>check-rules</command> is now run as before, only the second rule will be applied, as it is environment-agnostic. To apply the first rule as well, the <literal>formal</literal> environment must be requested through the <option>env:</option> sieve parameter:
0478 <programlisting language="bash">
0479 $ posieve check-rules -s rfile:<replaceable>pathto</replaceable>/personal.rules -s env:formal <replaceable>PATHS...</replaceable>
0480 </programlisting>
Another way to request an environment is to specify it inside the PO file itself, through <link linkend="hdr-x-environment">the <literal>X-Environment:</literal> header field</link>. This is generally preferable: it reduces the number of command line arguments (which may sometimes be accidentally omitted), it lets other parts of Pology make use of the environment information in the PO header, and, most importantly, it makes it possible for PO files processed in a single run to belong to different environments.</para>
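<para>For example, a PO file belonging to the <literal>formal</literal> environment would declare it in the header like this:
<programlisting language="po">
msgid ""
msgstr ""
"Project-Id-Version: foobar\n"
"..."
"X-Environment: formal\n"
</programlisting>
</para>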
0482 
0483 <para>If all the rules which belong to the formal environment are grouped at the end of the rule file, then the global <literal>environment</literal> directive can be used to set the environment for all of them, instead of the subdirective on each of them:
0484 <programlisting>
0485 # Personal rules of Horatio the Indefatigable.
0486 
0487 {elevator}i
0488 ...
0489 
0490 environment formal
0491 
0492 [don't|can't|isn't|aren't|won't|shouldn't|wouldn't]i
0493 ...
0494 </programlisting>
A more usual application of the global <literal>environment</literal> directive is to split environment-specific rules into a separate file, and then put the <literal>environment</literal> directive at the top. Most flexibly, <literal>valid</literal> subdirectives provide the <literal>env=</literal> test, so that the rule trigger can be canceled by a condition involving the environment. In the running example, this could be used as:
0496 <programlisting>
0497 # Personal rules of Horatio the Indefatigable.
0498 
0499 [don't|can't|isn't|aren't|won't|shouldn't|wouldn't]i
0500 ...
0501 valid !env="formal"
0502 
0503 {elevator}i
0504 ...
0505 </programlisting>
Which method of environment sensitivity to use depends on the particular organization of the rule files and on the types of rules. Filters too are sensitive to environments, either conforming to global environment directives in the same way as rules, or using their own <literal>env=</literal> fields.</para>
0507 
<para>When requesting environments in validation runs (through the <option>env:</option> sieve parameter or the <literal>X-Environment:</literal> header field), more than one environment can be specified. Then the rules from all those environments, plus the environment-agnostic rules, will be applied. Here comes another function of rule identifiers (provided with the <literal>id=</literal> rule subdirective): if two rules in different environments have the same identifier, the rule from the more specific environment overrides the rule from the less specific one. The more specific environment is normally taken to be the one encountered later in the requested environment list.</para>
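<para>As an illustration of overriding (with made-up terms), an environment-agnostic terminology rule can be replaced in the <literal>formal</literal> environment by giving the overriding rule the same identifier:
<programlisting>
{foobar}i
id="term-foobar"
hint="Translate 'foobar' as 'froobaz'."
valid msgstr="froobaz"

environment formal

{foobar}i
id="term-foobar"
hint="Translate 'foobar' as 'frobazzium'."
valid msgstr="frobazzium"
</programlisting>
When the <literal>formal</literal> environment is requested, the second rule overrides the first; otherwise, only the first rule is applied.</para>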
0509 
0510 </sect2>
0511 
0512 <sect2 id="sec-lgrlfiles">
0513 <title>Layout of Rule Files</title>
0514 
0515 <para>Rule files are kept simple, to facilitate easy editing without
0516 verbose syntax getting in the way. A rule file has the following layout:
0517 <programlisting>
0518 # Title of the rule collection.
0519 # Author name.
0520 # License.
0521 
0522 # Directives affecting all the rules.
0523 <replaceable>global-directive</replaceable>
0524 ...
0525 <replaceable>global-directive</replaceable>
0526 
0527 # Rule 1.
0528 <replaceable>trigger-pattern</replaceable>
0529 <replaceable>subdirective-1</replaceable>
0530 ...
0531 <replaceable>subdirective-n</replaceable>
0532 
0533 # Rule 2.
0534 <replaceable>trigger-pattern</replaceable>
0535 <replaceable>subdirective-1</replaceable>
0536 ...
0537 <replaceable>subdirective-n</replaceable>
0538 
0539 ...
0540 
0541 # Rule N.
0542 <replaceable>trigger-pattern</replaceable>
0543 <replaceable>subdirective-1</replaceable>
0544 ...
0545 <replaceable>subdirective-n</replaceable>
0546 </programlisting>
The rather formal top comment (license, etc.) is required for rule files inside the Pology distribution. In most contexts rule files are expected to have the <filename>.rules</filename> extension, so it is best to always use it (this is mandatory for rule files in the Pology distribution). Rule files must be UTF-8 encoded.</para>
0548 
0549 </sect2>
0550 
0551 <sect2 id="sec-lgrltrigpat">
0552 <title>Rule Triggers</title>
0553 
0554 <para>The rule trigger is most often a regular expression pattern, given within curly or square brackets, <literal>{...}</literal> or <literal>[...]</literal>, to match the original or the translation part of the message, respectively. The closing bracket may be followed by single-character matching modifiers, as follows:
0555 <itemizedlist>
0556 <listitem>
<para><literal>i</literal>: case-insensitive matching for <emphasis>all</emphasis> patterns in the rule, including but not limited to the trigger pattern. Default matching is case-sensitive.</para>
0558 </listitem>
0559 </itemizedlist>
0560 </para>
0561 
<para>Bracketed patterns are the shorthand notation, which is sufficient most of the time. There is also the more verbose notation <literal>*<replaceable>message-part</replaceable>/<replaceable>regex</replaceable>/<replaceable>modifiers</replaceable></literal>, where instead of <literal>/</literal> any other non-letter character can be used consistently as the separator. The verbose notation is needed when some part of the message other than the original or the translation should be matched, or when brackets would cause balancing issues (e.g. when a closing curly bracket without the opening bracket is part of the match for the original text). For all messages, <literal><replaceable>message-part</replaceable></literal> can be one of the following keywords:
0563 <itemizedlist>
0564 <listitem>
0565 <para><literal>msgid</literal>: match on original</para>
0566 </listitem>
0567 <listitem>
0568 <para><literal>msgstr</literal>: match on translation</para>
0569 </listitem>
0570 <listitem>
0571 <para><literal>msgctxt</literal>: match on disambiguating context</para>
0572 </listitem>
0573 </itemizedlist>
0574 For example, <literal>{foobar}i</literal> is equivalent to <literal>*msgid/foobar/i</literal>.</para>
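<para>The choice of separator matters when the pattern itself contains brackets. For example, to flag a stray closing curly bracket in the translation (a hypothetical rule, for illustration), the verbose notation with <literal>:</literal> as the separator avoids any bracket balancing issues:
<programlisting>
*msgstr:\}:
id="style-straybrace"
hint="Stray closing curly bracket in translation."
</programlisting>
</para>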
0575 
<para>For plural messages, <literal>msgid/.../</literal> (and equivalently <literal>{...}</literal>) tries to match either the <varname>msgid</varname> or the <varname>msgid_plural</varname> string, whereas <literal>msgstr/.../</literal> (and <literal>[...]</literal>) tries to match any <varname>msgstr</varname> string. If only a particular one of these strings should be matched, the following keywords can be used as well:
0577 <itemizedlist>
0578 <listitem>
0579 <para><literal>msgid_singular</literal>: match only the <varname>msgid</varname> string</para>
0580 </listitem>
0581 <listitem>
0582 <para><literal>msgid_plural</literal>: match only the <varname>msgid_plural</varname> string</para>
0583 </listitem>
0584 <listitem>
0585 <para><literal>msgstr_<replaceable>N</replaceable></literal>: match only the <varname>msgstr</varname> string with index <literal><replaceable>N</replaceable></literal></para>
0586 </listitem>
0587 </itemizedlist>
0588 </para>
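<para>For example, a hypothetical rule which checks only the first translation string of plural messages (the singular form, under the usual plural arrangement) could be written as:
<programlisting>
*msgstr_0/\b1\b/
id="plural-spellout"
hint="Spell out 'one' instead of using the digit in the singular form."
</programlisting>
</para>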
0589 
0590 <para id="p-triggerhooks">When regular expressions on message strings are not sufficient as rule triggers, a hook can be used instead. Hooks are described in <xref linkend="sec-cmhooks"/>. Since hooks are Python functions, in principle any kind of test can be performed by them. A rule with the hook trigger is defined as follows:
0591 <programlisting>
0592 *hook name="<replaceable>hookspec</replaceable>" on="<replaceable>part</replaceable>" casesens="[yes|no]"
0593 # Rule subdirectives follow...
0594 </programlisting>
0595 The <literal>name=</literal> field provides the hook specification. Only V* type (validation) hooks can be used in this context. The <literal>on=</literal> field defines on which part of the message the hook will operate, and needs to conform to the hook type. The following message parts can be specified, with associated hook types:
0596 <itemizedlist>
0597 
0598 <listitem>
0599 <para><literal>msg</literal>: the hook applies to the complete message; for type V4A hooks.</para>
0600 </listitem>
0601 
0602 <listitem>
0603 <para><literal>msgid</literal>: the hook applies to the original text (<varname>msgid</varname>, <varname>msgid_plural</varname>), but considering other parts of the message; for type V3A and V3B hooks.</para>
0604 </listitem>
0605 
0606 <listitem>
0607 <para><literal>msgstr</literal>: the hook applies to the translation text (all <varname>msgstr</varname> strings), but considering other parts of the message; for type V3A and V3C hooks.</para>
0608 </listitem>
0609 
0610 <listitem>
0611 <para><literal>pmsgid</literal>: the hook applies to the original text, without considering the rest of the message; for type V1A hooks.</para>
0612 </listitem>
0613 
0614 <listitem>
0615 <para><literal>pmsgstr</literal>: the hook applies to the translation, without considering the rest of the message; for type V1A hooks.</para>
0616 </listitem>
0617 
0618 </itemizedlist>
The <literal>casesens=</literal> field in the trigger hook specification controls whether the patterns in the rest of the rule (primarily in <literal>valid</literal> subdirectives) are case-sensitive or not. This field can be omitted, and then patterns are case-sensitive.</para>
0620 
<para>If the rule trigger pattern matches (or the trigger hook reports some problems), the message is by default considered "failed" by the rule. The message may still be passed by subdirectives that follow, which test whether some additional conditions hold.</para>
0622 
0623 </sect2>
0624 
0625 <sect2 id="sec-lgrlsubdirs">
0626 <title>Rule Subdirectives</title>
0627 
0628 <para>There are several types of rule subdirectives. The main subdirective is <literal>valid</literal>, which provides additional tests to pass the message failed by the trigger pattern. The tests are given by a list of <literal><replaceable>name</replaceable>="<replaceable>pattern</replaceable>"</literal> entries. For a <literal>valid</literal> directive to pass the message, all its tests must hold, and if any of the <literal>valid</literal> directives passes the message, then the rule as whole passes it. Effectively, this means the boolean AND relationship within a directive, and OR across directives.</para>
0629 
0630 <para>The following tests are currently available in <literal>valid</literal> subdirectives:
0631 <variablelist>
0632 
0633 <varlistentry>
0634 <term><literal>msgid="<replaceable>REGEX</replaceable>"</literal></term>
0635 <listitem>
0636 <para>The original text (<varname>msgid</varname> or <varname>msgid_plural</varname> string) must match the regular expression.</para>
0637 </listitem>
0638 </varlistentry>
0639 
0640 <varlistentry>
0641 <term><literal>msgstr="<replaceable>REGEX</replaceable>"</literal></term>
0642 <listitem>
0643 <para>The translation (any <varname>msgstr</varname> string) must match the regular expression.</para>
0644 </listitem>
0645 </varlistentry>
0646 
0647 <varlistentry>
0648 <term><literal>ctx="<replaceable>REGEX</replaceable>"</literal></term>
0649 <listitem>
0650 <para>The disambiguating context (<varname>msgctxt</varname> string) must match the regular expression.</para>
0651 </listitem>
0652 </varlistentry>
0653 
0654 <varlistentry>
0655 <term><literal>srcref="<replaceable>REGEX</replaceable>"</literal></term>
0656 <listitem>
<para>The file path of one of the source references (in the <literal>#: ...</literal> comment) must match the regular expression.</para>
0658 </listitem>
0659 </varlistentry>
0660 
0661 <varlistentry>
0662 <term><literal>comment="<replaceable>REGEX</replaceable>"</literal></term>
0663 <listitem>
0664 <para>One of the extracted or translator comments (<literal>#. ...</literal> or <literal># ...</literal>) must match the regular expression.</para>
0665 </listitem>
0666 </varlistentry>
0667 
0668 <varlistentry>
0669 <term><literal>span="<replaceable>REGEX</replaceable>"</literal></term>
0670 <listitem>
0671 <para>The text segment matched by the trigger pattern must match this regular expression as well.</para>
0672 </listitem>
0673 </varlistentry>
0674 
0675 <varlistentry>
0676 <term><literal>before="<replaceable>REGEX</replaceable>"</literal></term>
0677 <listitem>
0678 <para>The text segment matched by the trigger pattern must be placed exactly before one of the text segments matched by this regular expression.</para>
0679 </listitem>
0680 </varlistentry>
0681 
0682 <varlistentry>
0683 <term><literal>after="<replaceable>REGEX</replaceable>"</literal></term>
0684 <listitem>
0685 <para>The text segment matched by the trigger pattern must be placed exactly after one of the text segments matched by this regular expression.</para>
0686 </listitem>
0687 </varlistentry>
0688 
0689 <varlistentry>
0690 <term><literal>cat="<replaceable>DOMAIN1</replaceable>,<replaceable>DOMAIN2</replaceable>,..."</literal></term>
0691 <listitem>
0692 <para>The PO domain name (i.e. MO file name without <filename>.mo</filename> extension) must be contained in the given comma-separated list of domain names.</para>
0693 </listitem>
0694 </varlistentry>
0695 
0696 <varlistentry>
<term><literal>catrx="<replaceable>REGEX</replaceable>"</literal></term>
0698 <listitem>
0699 <para>The PO domain name must match the regular expression.</para>
0700 </listitem>
0701 </varlistentry>
0702 
0703 <varlistentry>
0704 <term><literal>env="<replaceable>ENV1</replaceable>,<replaceable>ENV2</replaceable>,..."</literal></term>
0705 <listitem>
0706 <para>The operating environment must be contained in the given comma-separated list of environment keywords.</para>
0707 </listitem>
0708 </varlistentry>
0709 
0710 <varlistentry>
0711 <term><literal>head="/<replaceable>FIELD-REGEX</replaceable>/<replaceable>VALUE-REGEX</replaceable>"</literal></term>
0712 <listitem>
0713 <para>The PO file header must contain the field and value combination, each specified by a regular expression pattern. Instead of <literal>/</literal>, any other character may be used consistently as delimiter for the field regular expression.</para>
0714 </listitem>
0715 </varlistentry>
0716 
0717 </variablelist>
0718 </para>
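<para>Several of these tests are often combined within one <literal>valid</literal> subdirective. For example (with made-up domain and term names), to accept a deviating translation only in one PO domain and only for messages extracted from DocBook sources:
<programlisting>
{widget}i
id="term-widget"
hint="Translate 'widget' as 'froob'."
valid msgstr="froob"
valid msgstr="gizmo" cat="gfoo" srcref="\.docbook$"
</programlisting>
Since tests within one subdirective are linked by boolean AND, the second <literal>valid</literal> subdirective passes a message only if all three of its tests hold.</para>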
0719 
<para>Each test can be negated by prefixing it with <literal>!</literal>. For example, <literal>!cat="foo,bar"</literal> will match if the PO domain name is neither <literal>foo</literal> nor <literal>bar</literal>. Tests are "short-circuiting", so it is good for performance to put simple direct matching tests (e.g. <literal>cat=</literal>, <literal>env=</literal>) before more expensive regular expression tests (<literal>msgid=</literal>, <literal>msgstr=</literal>, etc.).</para>
0721 
0722 <para>Subdirectives other than <literal>valid</literal> set states and properties of the rule. Property directives are written simply as <literal><replaceable>property</replaceable>="<replaceable>value</replaceable>"</literal>. These include:
0723 <variablelist>
0724 
0725 <varlistentry>
0726 <term><literal>hint="<replaceable>TEXT</replaceable>"</literal></term>
0727 <listitem>
0728 <para>A note to show to the user when the rule fails a message.</para>
0729 </listitem>
0730 </varlistentry>
0731 
0732 <varlistentry>
0733 <term><literal>id="<replaceable>IDENT</replaceable>"</literal></term>
0734 <listitem>
0735 <para>An "almost unique" identifier for the rule (see <xref linkend="sec-lgrlenvs"/>).</para>
0736 </listitem>
0737 </varlistentry>
0738 
0739 </variablelist>
0740 State directives are given by the directive name, possibly followed by
0741 keyword parameters: <literal><replaceable>directive</replaceable> <replaceable>arg1</replaceable> ...</literal>. These can be:
0742 <variablelist>
0743 
0744 <varlistentry>
0745 <term><literal>validGroup <replaceable>GROUPNAME</replaceable></literal></term>
0746 <listitem>
0747 <para>Includes a previously defined standalone group of <literal>valid</literal> subdirectives.</para>
0748 </listitem>
0749 </varlistentry>
0750 
0751 <varlistentry>
0752 <term><literal>environment <replaceable>ENVNAME</replaceable></literal></term>
0753 <listitem>
0754 <para>Sets the environment in which the rule is applied.</para>
0755 </listitem>
0756 </varlistentry>
0757 
0758 <varlistentry>
0759 <term><literal>disabled</literal></term>
0760 <listitem>
0761 <para>Disables the rule, so that it is no longer applied to messages. A disabled rule can still be applied by explicit request (e.g. using the <option>rule:</option> parameter of the <command>check-rules</command> sieve).</para>
0762 </listitem>
0763 </varlistentry>
0764 
0765 <varlistentry id="p-rulemanual">
0766 <term><literal>manual</literal></term>
0767 <listitem>
0768 <para>Makes it necessary to manually apply the rule to a message, by using one of <link linkend="p-rulemanappcmnt">special translator comments</link> (e.g. <literal>apply-rule:</literal>).</para>
0769 </listitem>
0770 </varlistentry>
0771 
0772 <varlistentry>
0773 <term><literal>addFilterRegex</literal>, <literal>addFilterHook</literal>, <literal>removeFilter</literal></term>
0774 <listitem>
0775 <para>A group of subdirectives to define filters which are applied to messages before the rule is applied to them. See <xref linkend="sec-lgrlfilter"/>.</para>
0776 </listitem>
0777 </varlistentry>
0778 
0779 </variablelist>
0780 </para>
0781 
0782 </sect2>
0783 
0784 <sect2 id="sec-lgrlglobdirs">
0785 <title>Global Directives in Rule Files</title>
0786 
0787 <para>Global directives are typically placed at the beginning of a rule file, before any rules. They define common elements for all rules to use, or set state for all rules below them. A global directive can also be placed in the middle of the rule file, between two rules, in which case it affects all the rules that follow it, but not those that precede it. The following global directives are defined:
0788 <variablelist>
0789 
0790 <varlistentry>
0791 <term><literal>validGroup</literal></term>
0792 <listitem>
0793 <para>Defines common groups of <literal>valid</literal> subdirectives, which can be included by any rule using the <literal>validGroup</literal> subdirective:
0794 <programlisting>
0795 # Global validity group.
0796 validGroup passIfQuoted
0797 valid after="“" before="”"
0798 valid after="‘" before="’"
0799 
0800 ....
0801 
0802 # Rule X.
0803 {...}
0804 validGroup passIfQuoted
0805 valid ...
0806 ...
0807 
0808 # Rule Y.
0809 {...}
0810 validGroup passIfQuoted
0811 valid ...
0812 ...
0813 </programlisting>
0814 </para>
0815 </listitem>
0816 </varlistentry>
0817 
0818 <varlistentry>
0819 <term><literal>environment</literal></term>
0820 <listitem>
0821 <para>Sets a specific environment for the rules that follow, unless overridden with the namesake rule subdirective:
0822 <programlisting>
0823 # Global environment.
0824 environment FOO
0825 
0826 ...
0827 
0828 # Rule X, belongs to FOO.
0829 {...}
0830 ...
0831 
0832 # Rule Y, overrides to BAR.
0833 {...}
0834 environment BAR
0835 ...
0836 </programlisting>
0837 See <xref linkend="sec-lgrlenvs"/> for details on use of environments.</para>
0838 </listitem>
0839 </varlistentry>
0840 
0841 <varlistentry>
0842 <term><literal>include</literal></term>
0843 <listitem>
0844 <para>Used to include files into rule files:
0845 <programlisting>
0846 include file="foo.something"
0847 </programlisting>
0848 If the file to include is specified by relative path, it is taken as relative to the file which includes it.</para>
0849 
0850 <para>The intent behind the <literal>include</literal> directive is not to include one rule file into another (files with the <filename>.rules</filename> extension), because normally all rule files in a directory are automatically included by the rule applicator (e.g. the <command>check-rules</command> sieve). Instead, included files should have an extension different from <filename>.rules</filename>, and contain a number of directives needed in several rule files; for example, a set of <link linkend="sec-lgrlfilter">filters</link>.</para>
0851 </listitem>
0852 </varlistentry>
0853 
0854 <varlistentry>
0855 <term><literal>addFilterRegex</literal>, <literal>addFilterHook</literal>, <literal>removeFilter</literal></term>
0856 <listitem>
0857 <para>A group of directives to define filters which are applied to messages before the rules are applied. See <xref linkend="sec-lgrlfilter"/>.</para>
0858 </listitem>
0859 </varlistentry>
0860 
0861 </variablelist>
0862 </para>
0863 
0864 </sect2>
0865 
0866 <sect2 id="sec-lgrlenvs">
0867 <title>Effect of Rule Environments</title>
0868 
0869 <para>When there are no <literal>environment</literal> directives in a rule file, either global or as rule subdirectives, all rules in that rule file are considered "environment-agnostic". When applying a rule set (e.g. with the <command>check-rules</command> sieve), the applicator may be put into one or more <emphasis>operating environments</emphasis>, either by specifying them as arguments (e.g. in the command line) or in PO file headers. If one or more operating environments are given and the rule is environment-agnostic, it will be applied to the message irrespective of the operating environments. However, if there are some <literal>environment</literal> directives in the rule file, some rules will be environment-specific. An environment-specific rule will be applied only if its environment matches one of the set operating environments.</para>
0870 
0871 <para>Rule environments are used to control application of rules across different translation environments (projects, teams, people). Some rules may be common to all environments, some may be somewhat common, and some not common at all. Common rules would then be made environment-agnostic (i.e. not covered by
0872 any <literal>environment</literal> directive), while entirely non-common rules would be provided in separate rule files per environment, with one global
0873 <literal>environment</literal> directive in each.</para>
0874 
0875 <para>How to handle "somewhat" common rules depends on circumstances. They could simply be defined as environment-specific, just like non-common rules, but this may reduce the amount of common rules too much for the sake of peculiar environments. Another way would be to define them as environment-agnostic, and then override them in certain environments. This is done by giving the environment-specific rule the same identifier (<literal>id</literal> subdirective) as that of the environment-agnostic rule. It may also happen that the bulk of a rule is environment-agnostic, except for a few tests in <literal>valid</literal> subdirectives which are not. In this case, <literal>env=</literal> and <literal>!env=</literal> tests can be used to differentiate between environments.</para>
0876 
0877 </sect2>
0878 
0879 <sect2 id="sec-lgrlfilter">
0880 <title>Filtering Messages</title>
0881 
0882 <para>It is frequently advantageous to apply a set of rules not on the message as it is, but on a suitably filtered variant. For example, if rules are used for terminology checks, it would be good to remove any markup from the text; otherwise, an <literal>&lt;email&gt;</literal> tag in the original could be understood as a real word, and a warning issued for missing the expected counterpart in the translation.</para>
0883 
0884 <para>Filter sets are created using <literal>addFilter*</literal> directives, global or within rules:
0885 <programlisting>
0886 # Remove XML-like tags.
0887 addFilterRegex match="&lt;.*?>" on="pmsgid,pmsgstr"
0888 # Remove long command-line options.
0889 addFilterRegex match="--[\w-]+" on="pmsgid,pmsgstr"
0890 
0891 # Rule A will act on a message filtered by previous two directives.
0892 {...}
0893 ...
0894 
0895 # Remove function calls like foo(x, y).
0896 addFilterRegex match="\w+\(.*?\)" on="pmsgid,pmsgstr"
0897 
0898 # Rule B will act on a message filtered by previous three directives.
0899 {...}
0900 ...
0901 </programlisting>
0902 Filters are added cumulatively to the filter set, and the current set
0903 affects all the rules below it.<footnote>
0904 <para>These filtering examples are only for illustrative purposes, as there are more precise methods to remove markup, or literals such as command line options.</para>
0905 </footnote> If an <literal>addFilter*</literal> directive appears within a rule, it adds a filter only to the filter set of that rule:
0906 <programlisting>
0907 # Rule C, with an additional filter just for itself.
0908 {...}
0909 addFilterRegex match="grep\(1\)" on="pmsgstr"
0910 ...
0911 
0912 # Rule D, sees only previous global filter additions.
0913 {...}
0914 ...
0915 </programlisting>
0916 These examples illustrate use of the <literal>addFilterRegex</literal> directive, which is described in more detail below, as well as other <literal>addFilter*</literal> directives.</para>
0917 
0918 <para>All <literal>addFilter*</literal> directives have the <literal>on=</literal> field. It specifies the message part on which the filter should operate, similar to the <literal>on=</literal> field in hook rule triggers. Unlike in triggers, in filters it is possible to state several parts to filter, as a comma-separated list. The following message parts are exposed for filtering:
0919 <itemizedlist>
0920 
0921 <listitem>
0922 <para><literal>msg</literal>: filter the "complete" message. What this means exactly depends on the particular filter directive.</para>
0923 </listitem>
0924 
0925 <listitem>
0926 <para><literal>msgid</literal>: filter the original text (<varname>msgid</varname>, <varname>msgid_plural</varname>), but possibly taking into account other parts of the message.</para>
0927 </listitem>
0928 
0929 <listitem>
0930 <para><literal>msgstr</literal>: filter the translation (all <varname>msgstr</varname> strings), but possibly taking into account other parts of the message.</para>
0931 </listitem>
0932 
0933 <listitem>
0934 <para><literal>pmsgid</literal>: filter the original text.</para>
0935 </listitem>
0936 
0937 <listitem>
0938 <para><literal>pmsgstr</literal>: filter the translation.</para>
0939 </listitem>
0940 
0941 <listitem>
0942 <para><literal>pattern</literal>: a quasi-part, to filter not the message, but all matching patterns (regular expressions, substring tests, equality tests) in the rules themselves.</para>
0943 </listitem>
0944 
0945 </itemizedlist>
0946 Not all filter directives can filter on all of these parts. Admissible parts are listed with each filter directive.</para>
0947 
0948 <para>To remove a filter from the current filter set, <literal>addFilter*</literal> directives can define the filter <emphasis>handle</emphasis>, which can then be given to a <literal>removeFilter</literal> directive:
0949 <programlisting>
0950 addFilterRegex match="&lt;.*?>" on="pmsgid,pmsgstr" handle="tags"
0951 
0952 # Rule A, "tags" filter applies to it.
0953 {...}
0954 ...
0955 
0956 # Rule B, removes "tags" filter only for itself.
0957 {...}
0958 removeFilter handle="tags"
0959 ...
0960 
0961 # Rule C, "tags" filter applies to it again.
0962 {...}
0963 ...
0964 
0965 removeFilter handle="tags"
0966 
0967 # Rule D, "tags" filter does not apply to it and any following rule.
0968 {...}
0969 ...
0970 </programlisting>
0971 Several filters may share the same handle, in which case the <literal>removeFilter</literal> directive removes all of them from the current filter set. One filter can have more than one handle, given as a comma-separated list in the <literal>handle=</literal> field, and then it can be removed from the filter set by any of those handles. Likewise, the <literal>handle=</literal> field in the <literal>removeFilter</literal> directive can state several handles by which to remove filters. A <literal>removeFilter</literal> subdirective within a rule influences the complete rule, regardless of its position among other subdirectives.</para>
0972 
0973 <para>The <literal>clearFilters</literal> directive is used to completely clear the filter set. It has no fields. Like <literal>removeFilter</literal>, it can be issued either globally or as a rule subdirective.</para>
0974 
0975 <para>A filter may be added or removed only in certain environments, specified by the <literal>env=</literal> field in <literal>addFilter*</literal> and <literal>removeFilter</literal> directives.</para>
0976 
0977 <sect3 id="sec-lgrlfiltdirs">
0978 <title>Filter Directives</title>
0979 
0980 <para>Currently the following directives for adding filters are available:
0981 <variablelist>
0982 
0983 <varlistentry>
0984 <term><literal>addFilterRegex</literal></term>
0985 <listitem>
0986 <para>Parts of the text to remove are determined by a regular expression match. The pattern is given by the <literal>match=</literal> field. If, instead of simple removal of the matched segment, a replacement is wanted, the <literal>repl=</literal> field is used to specify the replacement string (it can include backreferences to regex groups in the pattern):
0987 <programlisting>
0988 # Replace in translation the %&lt;number> format directives with a tilde.
0989 addFilterRegex match="%\d+" repl="~" on="pmsgstr"
0990 </programlisting>
0991 Case-sensitivity of matching can be changed by adding the <literal>casesens=[yes|no]</literal> field; default is case-sensitive matching.</para>
0992 
0993 <para>Applicable (<literal>on=</literal> field) to <literal>pmsgid</literal>, <literal>pmsgstr</literal>, and <literal>pattern</literal>.</para>
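Since rule and filter patterns are Python regular expressions, the effect of match=, repl= and casesens= can be previewed with Python's re module; the format-directive pattern follows the example above, and the sample strings are illustrative:

```python
import re

text = "Loaded %1 of %2 files"

# Plain replacement of the matched segment, as with repl="~":
print(re.sub(r"%\d+", "~", text))        # Loaded ~ of ~ files

# repl= may use backreferences to groups in match=:
print(re.sub(r"(%\d+)", r"[\1]", text))  # Loaded [%1] of [%2] files

# casesens="no" corresponds to case-insensitive matching:
print(re.sub(r"loaded", "Read", text, flags=re.I))  # Read %1 of %2 files
```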
0994 </listitem>
0995 </varlistentry>
0996 
0997 <varlistentry>
0998 <term><literal>addFilterHook</literal></term>
0999 <listitem>
1000 <para>Text is processed with a filtering hook (F* hook types). The hook specification is given by the <literal>name=</literal> field. For example, to remove accelerator markers from UI messages in a smart way, while checking various sources for the exact accelerator marker character (command line, PO file header), this filter can be set:
1001 <programlisting>
1002 addFilterHook name="remove/remove-accel-msg" on="msg"
1003 </programlisting>
1004 </para>
1005 
1006 <para>Applicable (<literal>on=</literal> field) to <literal>msg</literal> (for F4A hooks), <literal>msgid</literal> (F3A, F3B), <literal>msgstr</literal> (F3A, F3C), <literal>pmsgid</literal> (F1A), <literal>pmsgstr</literal> (F1A), and <literal>pattern</literal> (F1A).</para>
1007 </listitem>
1008 </varlistentry>
1009 
1010 </variablelist>
1011 </para>
1012 
1013 </sect3>
1014 
1015 <sect3 id="sec-lgrlfiltercost">
1016 <title>Cost of Filtering</title>
1017 
1018 <para>Filtering may be run-time expensive, and it normally is in practical uses. Therefore the rule applicator will try to create and apply as few unique filter sets as possible, by considering their signatures: a hash of the ordering, types, and fields of the filters in the set for a given rule. Each message will be filtered only as many times as there are different filter sets, rather than once for every rule. The appropriate filtered version of the message will be given to each rule according to its filter set.</para>
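The deduplication idea can be sketched in a few lines of Python; the tuple representation of filters here is hypothetical (not Pology's internal form), but the principle is the same: identical sequences of filter definitions hash to one signature, so each message is filtered once per unique set:

```python
# Hypothetical sketch: each filter reduced to a (type, pattern, parts) tuple.
def filter_signature(filter_set):
    # Order matters, so the signature covers the exact sequence.
    return hash(tuple(filter_set))

rule_filter_sets = {
    "rule-A": [("regex", "<.*?>", "pmsgid,pmsgstr")],
    "rule-B": [("regex", "<.*?>", "pmsgid,pmsgstr")],   # same set as A
    "rule-C": [("regex", "<.*?>", "pmsgid,pmsgstr"),
               ("regex", r"--[\w-]+", "pmsgid,pmsgstr")],
}

unique_sets = {filter_signature(fs) for fs in rule_filter_sets.values()}
print(len(unique_sets))  # 2 -- each message is filtered twice, not three times
```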
1019 
1020 <para>This means that you should be careful when adding and removing filters, in order to have as few filter sets as really necessary. For example, you may know that filters P and Q can be applied in any order, and in one rule file specify P followed by Q, but in another rule file Q followed by P. However, the rule applicator must assume that the order of filters is significant, so it will create two filter sets, PQ and QP, and spend twice as much time in filtering.</para>
1021 
1022 <para>For big filter sets which are needed in several rule files, the best is to split them out in a separate file and use the <literal>include</literal> global directive to include them at the beginning of rule files.</para>
1023 
1024 </sect3>
1025 
1026 </sect2>
1027 
1028 <sect2 id="sec-lgrlquotesc">
1029 <title>Quoting and Escaping</title>
1030 
1031 <para>In all the examples so far, ASCII double quotes were used as value delimiters (<literal>"..."</literal>). However, just as in the verbose notation for trigger patterns (<literal>*msgid/.../</literal>, etc.), all quoted values can in fact consistently use any other non-alphanumeric character (e.g. single quote, slash, etc.). Literal quotes inside a value can be escaped by prefixing them with <literal>\</literal> (backslash). Values which are regular expressions are sent to the regular expression engine without resolving any escapes other than for the quote character itself.</para>
1032 
1033 <para>The general statement terminator in a rule file is the newline, but if a line would be too long, it can be continued into the next line by putting <literal>\</literal> (backslash) in the last column.</para>
1034 
1035 </sect2>
1036 
1037 <sect2 id="sec-lgrlfalse">
1038 <title>Canceling False Positives</title>
1039 
1040 <para>As explained earlier, it is very important to have a thorough system for handling false positives in validation rules. There are several levels on which false positives can be canceled, and they are described in the following sections, ordered from the nearest to the furthest from the rule definition itself. Some guidelines on when to use which level are also provided, but keep in mind that this is far from a well-examined topic.</para>
1041 
1042 <sect3 id="sec-lgrlfpdisable">
1043 <title>Disabling a Rule</title>
1044 
1045 <para>The <literal>disabled</literal> subdirective can be added to a rule to disable its application. This may seem a quaint method of "handling false positives", but it is not outright ridiculous, because a disabled rule can still be applied by directly requesting it (e.g. with the <option>rule:</option> parameter of <command>check-rules</command>). This is useful for rules which produce too many false positives to be applied as part of a rule set, but which are still better than ad-hoc searches. In other words, such rules can be understood as codified special searches, which you would run only when you have enough time to wade through all the false positives in search of the few real problems.</para>
1046 
1047 </sect3>
1048 
1049 <sect3 id="sec-lgrlfptrigger">
1050 <title>Restricting the Rule Trigger</title>
1051 
1052 <para>The first real way of canceling false positives is by making the regular expression pattern for the rule trigger less greedy. For example, the trigger pattern for the terminology rule on "tool" could be written at first as:
1053 <programlisting>
1054 {\btool}i
1055 </programlisting>
1056 This will match any word that starts with <literal>tool</literal>, due to the <literal>\b</literal> word boundary token at the pattern start. The word boundary is not repeated at the end, with the intention to also catch the plural form of the word, "tools". However, this pattern will also match the word "toolbar", which may have its own rule. The pattern can then be restricted to really match only "tool" and "tools", in several ways, for example:
1057 <programlisting>
1058 {\btools?\b}i
1059 </programlisting>
1060 Now the word boundary is placed at the end as well, but also the optional letter 's' is inserted (<literal>?</literal> means "zero or one appearance of the preceding element"). Another way would be to write out both forms in full:
1061 <programlisting>
1062 {\b(tool|tools)\b}i
1063 </programlisting>
1064 The parentheses are needed because the OR-operator <literal>|</literal> has lower priority than the word boundary <literal>\b</literal>, so without them the meaning would be "a word which starts with 'tool' or ends with 'tools'".</para>
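Since rule patterns are Python regular expressions, the difference between the greedy and the restricted triggers can be checked directly with Python's re module; the sample sentence is illustrative:

```python
import re

text = "Use the Tools menu or the toolbar."

greedy = re.compile(r"\btool", re.I)        # no trailing boundary
strict = re.compile(r"\btools?\b", re.I)    # bounded, with optional 's'
spelled = re.compile(r"\b(tool|tools)\b", re.I)

print([m.group() for m in greedy.finditer(text)])   # ['Tool', 'tool'] -- hits "toolbar" too
print([m.group() for m in strict.finditer(text)])   # ['Tools']
print([m.group() for m in spelled.finditer(text)])  # ['Tools']
```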
1065 
1066 </sect3>
1067 
1068 <sect3 id="sec-lgrlfpvaldir">
1069 <title>Adding <literal>valid</literal> Subdirectives to the Rule</title>
1070 
1071 <para>Python's regular expressions, used in rule patterns, offer rich special features, but these are frequently better left unused in rules. For example, the trigger for the terminology rule on "line" (of text) could at first be written as:
1072 <programlisting>
1073 {\blines?\b}i
1074 </programlisting>
1075 But this would also catch the phrase "command line", which as a standalone concept, may have its own rule. To avoid this match, a proficient user of regular expressions may think of adding a <emphasis>negative lookbehind</emphasis> to the trigger pattern:
1076 <programlisting>
1077 {(?&lt;!command )\blines?\b}i
1078 </programlisting>
1079 However, it is much less cryptic and more extensible to add a <literal>valid</literal> subdirective instead:
1080 <programlisting>
1081 {\blines?\b}i
1082 valid after="command "
1083 </programlisting>
1084 This cancels the rule if the word "line" was matched just after the word "command", while clearly showing the special-case context.</para>
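That the lookbehind form really behaves like the cancelation described here can be verified with Python's re module; both variants match an ordinary "line" but only the guarded one ignores "command line":

```python
import re

plain = re.compile(r"\blines?\b", re.I)
guarded = re.compile(r"(?<!command )\blines?\b", re.I)  # negative lookbehind

assert plain.search("the command line") is not None     # would be a false positive
assert guarded.search("the command line") is None       # canceled, as valid after="command " does
assert guarded.search("a line of text") is not None     # still triggers normally
print("ok")
```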
1085 
1086 <para><literal>valid</literal> subdirectives are particularly useful for wider rule cancellations, such as by PO domain (catalog) name. For example, the word "wizard" could be translated differently when denoting a step-by-step dialog in a utilitarian program and a learned magic-wielding character in a computer game. Then the <literal>cat=</literal> test can be used to allow the other term in the game's PO file:
1087 <programlisting>
1088 {\bwizard}i
1089 valid msgstr="<replaceable>term-for-step-by-step-dialog</replaceable>"
1090 valid cat="foodungeon" msgstr="<replaceable>term-for-magician</replaceable>"
1091 </programlisting>
1092 This requires specifying the domain names of all games with wizard characters to which the rule set is applied, which may not be that comfortable. Another way could be to introduce the <literal>fantasy</literal> environment and use the <literal>env=</literal> test:
1093 <programlisting>
1094 {\bwizard}i
1095 valid msgstr="<replaceable>term-for-step-by-step-dialog</replaceable>"
1096 valid env="fantasy" msgstr="<replaceable>term-for-magician</replaceable>"
1097 </programlisting>
1098 and to add the <literal>fantasy</literal> environment <link linkend="hdr-x-environment">into the header</link> of the PO file that needs it.</para>
1099 
1100 </sect3>
1101 
1102 <sect3 id="sec-lgrlfpskip">
1103 <title>Skipping and Manually Applying The Rule on A Message</title>
1104 
1105 <para>Sometimes there is just a single strange message that falsely triggers the rule, such that there is nothing to generalize about the false positive. You could still cancel this false positive in the rule definition itself, by adding a <literal>valid</literal> subdirective with a <literal>cat=</literal> test for the PO domain name and a <literal>msgid=</literal> test to single out the troublesome message:
1106 <programlisting>
1107 {\bfroobaz}i
1108 id="term-froobaz"
1109 valid msgstr="..."
1110 valid cat="foo" msgid="the amount of froobaz-HX which led to"
1111 </programlisting>
1112 However, rules are supposed to be at least somewhat general, and singling out a particular message in the rule is as excessive non-generality as it gets. It is also a maintenance problem: the message may disappear in the future, leaving cruft in the rule file, or it may change slightly, but enough for the <literal>msgid=</literal> test not to match it any more.</para>
1113 
1114 <para>A much better way of skipping a rule on a particular message is by adding a special translator comment to that message, in the PO file:
1115 <programlisting language="po">
1116 # skip-rule: term-froobaz
1117 msgid "...the amount of froobaz-HX which led to..."
1118 msgstr "..."
1119 </programlisting>
1120 The comment starts with <literal>skip-rule:</literal>, and is followed by a comma-separated list of rules to skip, by their identifiers (defined by <literal>id=</literal> in the rule).</para>
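A minimal sketch of how such a comment could be interpreted may make the format clearer; the helper function below is hypothetical, not Pology's actual implementation:

```python
def rules_to_skip(translator_comments):
    """Collect rule identifiers from 'skip-rule:' translator comments."""
    skipped = []
    for comment in translator_comments:
        comment = comment.strip()
        if comment.startswith("skip-rule:"):
            # The rest of the comment is a comma-separated list of rule ids.
            names = comment[len("skip-rule:"):].split(",")
            skipped.extend(name.strip() for name in names)
    return skipped

comments = ["skip-rule: term-froobaz, term-bar", "some other comment"]
print(rules_to_skip(comments))  # ['term-froobaz', 'term-bar']
```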
1121 
1122 <para id="p-rulemanappcmnt">The other way around, a rule can be set for manual application only, by adding the <literal>manual</literal> subdirective to it. Then the <literal>apply-rule:</literal> translator comment must be added to apply that rule to a particular message:
1123 <programlisting language="po">
1124 # apply-rule: term-froobaz
1125 msgid "...the amount of froobaz-HX which led to..."
1126 msgstr "..."
1127 </programlisting>
1128 There is a pattern where an automatic rule and a manual rule are somehow closely related, so that on a particular message the automatic one should be skipped and the manual one applied. To make this pattern obvious and avoid adding two translator comments (both <literal>skip-rule:</literal> and <literal>apply-rule:</literal>), a single <literal>switch-rule:</literal> comment can be added instead:
1129 <programlisting language="po">
1130 # switch-rule: term-froobaz > term-froobaz-chem
1131 msgid "...the amount of froobaz-HX which led to..."
1132 msgstr "..."
1133 </programlisting>
1134 The rule before <literal>&gt;</literal> is skipped, and the rule after <literal>&gt;</literal> is applied. Several rules can be stated as a comma-separated list, on both sides of <literal>&gt;</literal>.</para>
1135 
1136 <para>There is a catch to the translator comment approach, though. When the message becomes fuzzy, it depends on the new text whether the rule application comment should be kept or removed. This means that on fuzzy messages translators have to observe and adapt translator comments just as they adapt the <varname>msgstr</varname> strings. Unfortunately, some translators do not pay sufficient attention to translator comments, which is further exacerbated by some PO editors not presenting translator comments conspicuously enough (or not even enabling their editing). However, from the point of view of the PO translation workflow, not giving full attention to translator comments is plainly an error: unwary translators should be told better, and deficient PO editors should be upgraded.<footnote>
1137 <para>Until that is sufficiently satisfied, one simple safety measure is to remove rule application comments from fuzzy messages just after the PO file is merged with the template. This will sometimes cause a false positive to reappear, but, after all, this is only a tertiary element in the translation workflow (after translation and review).</para>
1138 </footnote></para>
1139 
1140 </sect3>
1141 
1142 <sect3 id="sec-lgrlfpreworig">
1143 <title>Rewriting Original Text of a Message</title>
1144 
1145 <para>Sometimes it is possible to do better than plainly skipping a rule on a message. Consider the following message:
1146 <programlisting language="po">
1147 #: dialogs/ScriptManager.cpp:498
1148 msgid "Please refer to the console debug output for more information."
1149 msgstr "Pogledajte ispravljački izlaz u školjci za više podataka."
1150 </programlisting>
1151 An observant translator could conclude that "console" is not the best choice of term in the original text, that "shell" (or "terminal") would be more accurate, and translate the message as if the more accurate term were used in the original. However, this could cause the terminology rule for "console" (in its accurate meaning) to complain about the proper term missing in translation. Adding a <literal>skip-rule: term-console</literal> comment would indeed cancel this false positive, but what about the terminology rule on "shell"? There is nothing in the original text to trigger it and check for the proper term in translation.</para>
1152 
1153 <para>This example is an instance of the general case where the translator would formulate the original text somewhat differently, and make the translation based on that reformulation. Or, the mere style of the original may cause a rule to be falsely triggered, while a differently worded original would be just fine. In such cases, instead of adding a comment to crudely skip a rule, the translator can add a comment to <emphasis>rewrite</emphasis> the original text before applying rules to it:
1154 <programlisting language="po">
1155 # rewrite-msgid: /console/shell/
1156 #: dialogs/ScriptManager.cpp:498
1157 msgid "Please refer to the console debug output for more information."
1158 msgstr "Pogledajte ispravljački izlaz u školjci za više podataka."
1159 </programlisting>
1160 The rewrite directive comment starts with <literal>rewrite-msgid:</literal> and is followed by search regular expression and replacement strings, delimited with <literal>/</literal> or another non-alphanumeric character. With this rewrite, the wrong terminology rule, for "console", will not be triggered, while the correct rule, for "shell", will be.</para>
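The effect of the rewrite can be previewed with a plain Python substitution; the two terminology triggers here are illustrative word-boundary patterns, not actual rules:

```python
import re

msgid = "Please refer to the console debug output for more information."

# Apply the rewrite directive "rewrite-msgid: /console/shell/" first:
rewritten = re.sub("console", "shell", msgid)

# The "console" rule no longer triggers, while the "shell" rule now does:
assert re.search(r"\bconsole\b", rewritten) is None
assert re.search(r"\bshell\b", rewritten) is not None
print(rewritten)
```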
1161 
1162 <para>At the moment, unlike <literal>skip-rule:</literal>, <literal>rewrite-msgid:</literal> is not an integral part of the rule system. It is instead implemented as a filtering hook. So to use it, add this filter into rule files (or into the filter set file included by rule files):
1163 <programlisting>
1164 addFilterHook name="remove/rewrite-msgid" on="msg"
1165 </programlisting>
1166 </para>
1167 
1168 <para>Sometimes it is not quite clear whether to skip a rule or rewrite the original, that is, whether to use a <literal>skip-rule:</literal> or a <literal>rewrite-msgid:</literal> comment. A guideline could be as follows. If the concept covered by the falsely triggered rule is present but somewhat camouflaged in the original, or one concept is switched for another (such as "console" with "shell" in the example above), then <literal>rewrite-msgid:</literal> should be used to "normalize" the original text. If the original text has nothing to do with the concept covered by the triggered rule, then <literal>skip-rule:</literal> should be used. An example of the latter would be this message from a game:
1169 <programlisting language="po">
1170 # skip-rule: term-shell
1171 #: src/tanks_options.cpp:249
1172 msgid "Fire shells upward"
1173 </programlisting>
1174 Here the word "shell" denotes a cannon shell, which has nothing to do with <literal>term-shell</literal> rule for the operating system shell, and the rule is therefore skipped.</para>
1175 
1176 </sect3>
1177 
1178 </sect2>
1179 
1180 </sect1>
1181 
1182 <!-- ======================================== -->
1183 <sect1 id="sec-lgsynder">
1184 <title>Syntagma Derivation</title>
1185 
1186 <para>Consider a message extracted from a .desktop file, representing the name of a GUI utility:
1187 <programlisting language="po">
1188 #. field: Name
1189 #: data/froobaz.desktop:5
1190 msgid "Froobaz Image Examiner"
1191 msgstr ""
1192 </programlisting>
1193 Program names from .desktop files can be read and presented to the user by any other program. For example, when an image is right-clicked in a file browser, it could offer to open the file with the utility named with this message. In the PO file of that file browser, the message for the menu item could be:
1194 <programlisting language="po">
1195 #. TRANSLATORS: %s is a program name, to open a file with.
1196 #: src/contextmenu.c:5
1197 msgid "Open with %s"
1198 msgstr ""
1199 </programlisting>
1200 In languages featuring noun inflection, it is likely that the program name in this message should take a grammar case different from the nominative (basic) case. This means that simply inserting the name read from the .desktop file into the directly translated text will produce a grammatically incorrect phrase. The translator may try to adapt the message to the nominative form of the name (by shuffling words, adding "helper" words, or adding punctuation), but this will produce a stylistically suboptimal phrase. That is, style will be sacrificed for grammar. In order not to have to make such compromises, now and in the future certain <emphasis>translation scripting</emphasis> systems may be available atop the PO format<footnote>
1201 <para>As of this writing, one currently operative translation scripting system is <ulink url="http://techbase.kde.org/Localization/Concepts/Transcript">KDE's Transcript</ulink>. Another one being developed, albeit not with the PO format as its base, is <ulink url="http://wiki.mozilla.org/L20n">Mozilla's L20n</ulink>.</para>
1202 </footnote>, which would, in this example, enable the translator to specify which non-nominative form of the program name to fetch and insert.</para>
1203 
1204 <para>Whatever the shape the translation scripting system takes, different forms of phrases have to be derived somehow for use by that system. Given the nuances of spoken languages, fully automatic derivation is probably not going to be possible<footnote>
1205 <para>An exception would be <ulink url="http://en.wikipedia.org/wiki/Constructed_language">constructed languages</ulink> with regular grammar, such as Esperanto.</para>
1206 </footnote>. Pology therefore provides the <emphasis>syntagma<footnote>
1207 <para>A combination of words having a certain meaning, possibly greater than the sum of meanings of each word.</para>
1208 </footnote> derivator</emphasis> system (<emphasis>synder</emphasis> for short), which allows manual derivation of phrase forms and properties with minimal verbosity, using macro expansion based on partial regularities in the grammar.</para>
1209 
1210 <para>Syntagma derivations can be written and maintained in a standalone plain text file, although currently Pology provides no end-user functionality to convert such files (i.e. derive all forms defined by them) to formats which a target translation system could consume. Instead, one can make use of the <classname>Synder</classname> class from the <package>pology.synder</package> module to construct custom converters. Of course, in the future, such converters may become part of Pology. There are already syntax highlighting definitions for the synder file format, for some text editors, in the <filename>syntax/</filename> directory of the Pology distribution.</para>
1211 
1212 <para>What is provided right now in terms of end-user functionality is <link linkend="sv-collect-pmap">the <command>collect-pmap</command> sieve</link>. It enables translators to write syntagma derivations in translator comments in PO messages, and then extract them (deriving all forms) into a file in the appropriate format for the target translation system. The example message above from the .desktop file could be equipped with a synder entry like this:
1213 <programlisting language="po">
1214 # synder: Frubaz|ov ispitiv|ač slika
1215 #. field: Name
1216 #: data/froobaz.desktop:5
1217 msgid "Froobaz Image Examiner"
1218 msgstr "Frubazov ispitivač slika"
1219 </programlisting>
1220 The translator comment starts with the keyword <literal>synder:</literal>, and is followed by the synder entry which defines all the needed forms of the translated name. Note that the synder entry is quite compact, exactly two characters longer than the pure translated name, and yet it defines over a dozen forms and some properties (gender, number) of the name.</para>
1221 
1222 <para>The rest of this section describes the syntax of synder entries, and the layout and organization of synder files. As an example application, we consider a dictionary of proper names, where for each name in the source language we want to define the basic name and some of its forms and properties in the target language.</para>
1223 
1224 <sect2 id="sec-lgsdbasic">
1225 <title>Basic Derivations</title>
1226 
1227 <para>For the name in source language <emphasis>Venus</emphasis> and in target language <emphasis>Venera</emphasis>, we could write the following simplest derivation, which defines only the basic form in the target language:
1228 <programlisting>
1229 Venus: =Venera
1230 </programlisting>
1231 <literal>Venus</literal> is the <emphasis>key syntagma</emphasis> or the <emphasis>derivation key</emphasis>, and it is separated by the colon (<literal>:</literal>) from the properties of the syntagma. Properties are written as <literal><replaceable>key</replaceable>=<replaceable>value</replaceable></literal> pairs, and separated by commas; in <literal>=Venera</literal>, the <emphasis>property key</emphasis> is the empty string, and the <emphasis>property value</emphasis> is <literal>Venera</literal>.</para>
1232 
1233 <para>We would now like to define some grammar cases in the target language. <emphasis>Venera</emphasis> is the nominative (basic) case, so instead of the empty string we set <literal>nom</literal> as its property key. Other cases that we want to define are genitive (<literal>gen</literal>) <emphasis>Venere</emphasis>, dative (<literal>dat</literal>) <emphasis>Veneri</emphasis>, and accusative (<literal>acc</literal>) <emphasis>Veneru</emphasis>. Then we can write:
1234 <programlisting>
1235 Venus: nom=Venera, gen=Venere, dat=Veneri, acc=Veneru
1236 </programlisting>
1237 </para>
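<para>The structure of such an entry can be sketched with a few lines of Python. This is an illustrative toy parser only (it ignores expansions, escaping, and multiple key syntagmas), not Pology's actual implementation:</para>

```python
# Toy parser for a simple synder entry of the form "KEY: k1=v1, k2=v2"
# (no expansions, no escaping, single key syntagma).
def parse_entry(line):
    key_syntagma, _, props_part = line.partition(":")
    props = {}
    for item in props_part.split(","):
        pkey, _, pval = item.partition("=")
        props[pkey.strip()] = pval.strip()
    return key_syntagma.strip(), props

key, props = parse_entry("Venus: nom=Venera, gen=Venere, dat=Veneri, acc=Veneru")
# key is "Venus"; props maps "nom" to "Venera", "gen" to "Venere", etc.
```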
1238 
1239 <para>By this point, everything is written out manually, there are no "macro derivations" to speak of. But observe the difference between the grammar cases of <emphasis>Venera</emphasis> -- only the final letter changes. Therefore, we first write the following <emphasis>base derivation</emphasis> for
1240 this system of case endings alone, called <literal>declension-1</literal>:
1241 <programlisting>
1242 |declension-1: nom=a, gen=e, dat=i, acc=u
1243 </programlisting>
1244 A base derivation is normally also <emphasis>hidden</emphasis>, by prepending <literal>|</literal> (pipe) to its key syntagma. We make it hidden because it should be used only in other derivations, and does not represent a proper entry
1245 in our dictionary example. In the processing stage, derivations with hidden key syntagmas will not be offered on queries into the dictionary. We can now use this base derivation to shorten the derivation for <emphasis>Venus</emphasis>:
1246 <programlisting>
1247 Venus: Vener|declension-1
1248 </programlisting>
1249 Here <literal>Vener</literal> is the <emphasis>root</emphasis>, and <literal>|declension-1</literal> is the <emphasis>expansion</emphasis>, which references the previously defined base derivation. The final forms are derived by inserting the property values found in the expansion (<literal>a</literal> from <literal>nom=a</literal>, <literal>e</literal> from <literal>gen=e</literal>, etc.) at the position where the expansion occurs, for each of the property keys found in the expansion, thus obtaining the desired properties (<literal>nom=Venera</literal>, <literal>gen=Venere</literal>, etc.) for the current derivation.</para>
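<para>The mechanics of this resolution step can be sketched in Python. This is a toy model only, not Pology's implementation; the base derivation is represented as a plain dictionary of endings:</para>

```python
# Toy resolution of a single expansion: for each property in the base
# derivation, insert its value (the ending) at the expansion point,
# i.e. append it to the root.
def expand(root, base_props):
    return {pkey: root + ending for pkey, ending in base_props.items()}

declension_1 = {"nom": "a", "gen": "e", "dat": "i", "acc": "u"}
props = expand("Vener", declension_1)
# props: {"nom": "Venera", "gen": "Venere", "dat": "Veneri", "acc": "Veneru"}
```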
1250 
1251 <para>Note that <literal>declension-1</literal> may be too verbose a name for the base derivation. If the declension type can be identified by the stem of the nominative case (here <literal>a</literal>), we can write much more natural derivations:
1252 <programlisting>
1253 |a: nom=a, gen=e, dat=i, acc=u
1254 Venus: Vener|a
1255 </programlisting>
1256 Now the derivation looks just like the nominative case alone, only having the root and the stem separated by <literal>|</literal>.</para>
1257 
1258 <para>The big gain of this transformation is, of course, when there are many syntagmas having the same declension type. Other such source-target pairs could be <emphasis>Earth</emphasis> and <emphasis>Zemlja</emphasis>, <emphasis>Europe</emphasis> and <emphasis>Evropa</emphasis>, <emphasis>Rhea</emphasis> and <emphasis>Reja</emphasis>, so we can write:
1259 <programlisting>
1260 |a: nom=a, gen=e, dat=i, acc=u
1261 Venus: Vener|a
1262 Earth: Zemlj|a
1263 Europe: Evrop|a
1264 Rhea: Rej|a
1265 </programlisting>
1266 From this it can also be seen that derivations are terminated by a newline. If necessary, a single derivation can be split into several lines by putting a <literal>\</literal> character (backslash) at the end of each line but the last.</para>
1267 
1268 <para>Expansions are implicitly terminated by whitespace or a comma, or by another expansion. If these characters are part of the expansion itself (i.e. of the key syntagma of the derivation that the expansion refers to), or the text continues right after the expansion without whitespace, curly brackets can be used to explicitly delimit the expansion:
1269 <programlisting>
1270 Alpha Centauri: Alf|{a}-Kentaur
1271 </programlisting>
1272 </para>
1273 
1274 <para>Any character which is special in the current context may be escaped with a backslash. Only the second colon here is the separator:
1275 <programlisting>
1276 Destination\: Void: Odredišt|{e}: ništavilo
1277 </programlisting>
1278 because the first colon is escaped, and the third colon is not in the context where colon is a special character.</para>
1279 
1280 <para>A single derivation may state more than one key syntagma, comma-separated. For example, if the syntagma in source language has several spellings:
1281 <programlisting>
1282 Iapetus, Japetus: Japet|
1283 </programlisting>
1284 The key syntagma can also be an empty string. This is useful for base derivations when stem-naming is used and the stem happens to be null
1285 -- such as in the previous example. The derivation to which this empty expansion refers would be:
1286 <programlisting>
1287 |: nom=, gen=a, dat=u, acc=
1288 </programlisting>
1289 </para>
1290 
1291 <para>Same-valued properties do not have to be repeated, but instead
1292 several property keys can be linked to one value, separated with <literal>&amp;</literal> (ampersand). In the previous base derivation, <literal>nom=</literal> and <literal>acc=</literal> properties could be unified in this way, resulting in:
1293 <programlisting>
1294 |: nom&amp;acc=, gen=a, dat=u
1295 </programlisting>
1296 </para>
1297 
1298 <para>Synder files may contain comments, starting with <literal>#</literal> and continuing to the end of line:
1299 <programlisting>
1300 # A comment.
1301 Venus: Vener|a # another comment
1302 </programlisting>
1303 </para>
1304 
1305 </sect2>
1306 
1307 <sect2 id="sec-lgsdmulexp">
1308 <title>Multiple Expansions</title>
1309 
1310 <para>A single derivation may contain more than one expansion. There are two distinct types of multiple expansion, <emphasis>outer</emphasis> and <emphasis>inner</emphasis>.</para>
1311 
1312 <para>Outer multiple expansion is used when it is advantageous to split derivations by grammar classes. The examples so far were only deriving grammar cases of nouns, but we may also want to define possessive adjectives per noun. For <emphasis>Venera</emphasis>, the possessive adjective in the nominative case is <emphasis>Venerin</emphasis>. Using the stem-naming of base derivations, we could write:
1313 <programlisting>
1314 |a: …  # as above
1315 |in: …  # possessive adjective
1316 Venus: Vener|a, Vener|in
1317 </programlisting>
1318 Expansions are resolved from left to right, with the expected effect of derived properties accumulating along the way. The only question is what happens if two expansions produce properties with same keys but different values. In this case, the value produced by the last (rightmost) expansion overrides previous values.</para>
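<para>The accumulate-and-override behavior can be sketched as successive dictionary updates. This toy model (not Pology's implementation) deliberately uses overlapping property keys to show that the rightmost expansion wins:</para>

```python
# Outer expansions: resolve left to right, accumulating properties;
# on a key conflict the rightmost expansion's value wins.
def outer_expand(root, *bases):
    props = {}
    for base in bases:
        props.update({k: root + v for k, v in base.items()})
    return props

noun = {"nom": "a", "gen": "e"}      # noun case endings
adj = {"nom": "in", "gen": "inog"}   # hypothetical adjective endings
props = outer_expand("Vener", noun, adj)
# props["nom"] is "Venerin": the later expansion overrode "Venera"
```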
1319 
1320 <para>Inner multiple expansion is used on multi-word syntagmas, when more than one word needs expansion. For example, the source syntagma <emphasis>Orion Nebula</emphasis> has the target pair <emphasis>Orionova maglina</emphasis>, in which the first word is a possessive adjective, and the second word a noun. The derivation for this is:
1321 <programlisting>
1322 |a: …  # as above
1323 |ova>: …  # possessive adjective as noun, > is not special here
1324 Orion Nebula: Orion|ova> maglin|a
1325 </programlisting>
1326 Inner expansions are resolved from left to right, such that everything to the right of the expansion currently being resolved is treated as literal text. If all expansions define the same properties by key, then the total derivation will have all those properties, with values derived as expected. However, if there is some difference in property sets, then the total derivation will get their intersection, i.e. only those properties found in all expansions.</para>
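<para>The intersection behavior can be sketched as follows. Each expansion contributes its own segment per property key, and only keys defined by all expansions survive (a toy model, not Pology's implementation):</para>

```python
# Inner expansions: each (literal_prefix, base_props) pair contributes
# one segment per property key; only keys present in ALL bases survive.
def inner_expand(segments):
    common = set.intersection(*(set(base) for _, base in segments))
    return {k: "".join(prefix + base[k] for prefix, base in segments)
            for k in common}

adj = {"nom": "ova", "gen": "ove"}           # adjective-as-noun endings
noun = {"nom": "a", "gen": "e", "dat": "i"}  # noun endings
props = inner_expand([("Orion", adj), (" maglin", noun)])
# "dat" is dropped (missing from adj); props["gen"] is "Orionove magline"
```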
1327 
1328 <para>Both outer and inner expansion may be used in a single derivation.</para>
1329 
1330 </sect2>
1331 
1332 <sect2 id="sec-lgsdexpmask">
1333 <title>Expansion Masks</title>
1334 
1335 <para>An expansion can be made not to include all the properties defined in the referred-to derivation, but only a subset of them. It can also be made to modify the property keys from the referred-to derivation.</para>
1336 
1337 <para>Recall the example of <emphasis>Orion Nebula</emphasis> and <emphasis>Orionova maglina</emphasis>. Here the possessive adjective <emphasis>Orionova</emphasis> has to be matched in both case and gender to the noun <emphasis>maglina</emphasis>, which is of feminine gender. Earlier we defined a special adjective-as-noun derivation <literal>|ova></literal>, specialized for feminine gender nouns, but now we want to make use of the full possessive adjective derivation, which is not specialized to any gender. Let the property keys of this derivation be of the form <literal>nommas</literal>
1338 (nominative masculine), <literal>genmas</literal> (genitive masculine), …, <literal>nomfem</literal> (nominative feminine), <literal>genfem</literal> (genitive feminine), …. If we use the stem of the nominative masculine form, <emphasis>Orionov</emphasis>, to name the possessive adjective base derivation, we get:
1339 <programlisting>
1340 |ov: nommas=…, genmas=…, …, nomfem=…, genfem=…, …
1341 Orion Nebula: Orion|ov~...fem maglin|a
1342 </programlisting>
1343 <literal>|ov~...fem</literal> is a <emphasis>masked</emphasis> expansion. It specifies to include only those properties with keys starting with any three characters and ending in <literal>fem</literal>, as well as to drop <literal>fem</literal> (being constant) from the resulting property keys. This precisely selects only the feminine forms of the possessive adjective, and transforms their keys into the noun keys needed to match those of the <literal>|a</literal> expansion.</para>
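<para>The effect of such a mask can be sketched with a small Python function, where each dot in the mask matches one arbitrary character (kept in the resulting key) and the fixed characters must match and are dropped (a toy model, not Pology's implementation):</para>

```python
import re

# Apply a mask like "...fem": dots are variable positions, captured and
# kept in the resulting key; fixed characters must match and are dropped.
def mask_expand(props, mask):
    pattern = re.compile(
        "^" + "".join("(.)" if c == "." else re.escape(c) for c in mask) + "$")
    out = {}
    for key, val in props.items():
        m = pattern.match(key)
        if m:
            out["".join(m.groups())] = val
    return out

adj = {"nommas": "Orionov", "genmas": "Orionovog",
       "nomfem": "Orionova", "genfem": "Orionove"}
props = mask_expand(adj, "...fem")
# props: {"nom": "Orionova", "gen": "Orionove"} -- masculine forms dropped
```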
1344 
1345 <para>We could also use this same masked expansion as the middle step, to produce the feminine-specialized adjective-as-noun base derivation:
1346 <programlisting>
1347 |ov: nommas=…, genmas=…, …, nomfem=…, genfem=…, …
1348 |ova>: |ov~...fem
1349 Orion Nebula: Orion|ova> maglin|a
1350 </programlisting>
1351 </para>
1352 
1353 <para>A special case of masked expansion is when there are no variable characters in the mask (no dots). In the pair <emphasis>Constellation of Cassiopeia</emphasis> and <emphasis>Sazvežđe Kasiopeje</emphasis>, the <emphasis>of Cassiopeia</emphasis> part is translated as a single word in the genitive case, <emphasis>Kasiopeje</emphasis>, avoiding the need for a preposition. If standalone <emphasis>Cassiopeia</emphasis> has its own derivation, then we can use it like this:
1354 <programlisting>
1355 Cassiopeia: Kasiopej|a
1356 Constellation of Cassiopeia: Sazvežđ|e |Cassiopeia~gen
1357 </programlisting>
1358 <literal>|e</literal> is the usual nominative-stem expansion. The <literal>|Cassiopeia~gen</literal> expansion produces only the genitive form of <emphasis>Cassiopeia</emphasis>, but with the empty property key. If this expansion were treated as a normal inner expansion, it would cancel all properties produced by the <literal>|e</literal> expansion, since none of them has an empty key. Instead, when an expansion produces a single property with an empty key, its value is treated as literal text and concatenated to all property values produced up to that point. Just as if we had written:
1359 <programlisting>
1360 Constellation of Cassiopeia: Sazvežđ|e Kasiopeje
1361 </programlisting>
1362 </para>
1363 
1364 <para>Sometimes the default modification of property keys, removal of all fixed characters in the mask, is not what we want. This should be a rare case, but if it happens, the mask can also be given a <emphasis>key extender</emphasis>. For example, if we wanted to select only the feminine forms of the <literal>|ov</literal> expansion, but preserve the <literal>fem</literal> ending of the resulting keys, we would write:
1365 <programlisting>
1366 Foobar: Fubar|ov~...fem%*fem
1367 </programlisting>
1368 The key extender in this expansion is <literal>%*fem</literal>. For each resulting property, the final key is constructed by substituting every <literal>*</literal> with the key resulting from the <literal>~...fem</literal> mask. Thus, the <literal>fem</literal> ending is re-added to every key, as desired.</para>
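<para>The key extender step can be sketched as a simple substitution over already-masked property keys (an illustrative toy model, not Pology's implementation):</para>

```python
# Apply a key extender such as "*fem" (from "%*fem"): each "*" in the
# extender is replaced by the key produced by the mask.
def apply_extender(masked_props, extender):
    return {extender.replace("*", key): val
            for key, val in masked_props.items()}

masked = {"nom": "Fubarova", "gen": "Fubarove"}  # after the ~...fem mask
props = apply_extender(masked, "*fem")
# props: {"nomfem": "Fubarova", "genfem": "Fubarove"}
```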
1369 
1370 <para>Expanded values can have their capitalization changed. By prepending <literal>^</literal> (circumflex) or <literal>`</literal> (backtick) to the syntagma key of the expansion, the first letter in fetched values is uppercased or lowercased, respectively. We could derive the pair <emphasis>Distant Sun</emphasis> and <emphasis>Udaljeno sunce</emphasis> by using the pair <emphasis>Sun</emphasis> and <emphasis>Sunce</emphasis> (note the case difference in <emphasis>Sunce</emphasis>/<emphasis>sunce</emphasis>) like this:
1371 <programlisting>
1372 Sun: Sunc|e  # this defines uppercase first letter
1373 Distant Sun: Dalek|o> |`Sun  # this needs lowercase first letter
1374 </programlisting>
1375 </para>
1376 
1377 </sect2>
1378 
1379 <sect2 id="sec-lgsdspecprop">
1380 <title>Special Properties</title>
1381 
1382 <para>Property keys may be given several endings, to make these properties behave differently from what was described so far. These endings are not treated as part of the property key itself, so they should not be given when querying derivations by syntagma and property key.</para>
1383 
1384 <para><emphasis>Cutting</emphasis> properties are used to avoid the normal value concatenation on expansion. For example, if we want to define the gender of nouns through base expansions, we could come up with:
1385 <programlisting>
1386 |a: nom=a, gen=e, dat=i, acc=u, gender=fem
1387 Venus: Vener|a
1388 </programlisting>
1389 However, this will cause the <literal>gender</literal> property in the expansion to become <literal>Venerafem</literal>. For the <literal>gender</literal> property to be taken verbatim, without concatenating segments from the calling derivation, we make it a cutting property by appending <literal>!</literal> (exclamation mark) to its key:
1390 <programlisting>
1391 |a: nom=a, gen=e, dat=i, acc=u, gender!=fem
1392 </programlisting>
1393 Now when the dictionary is queried for the <literal>Venus</literal> syntagma and the <literal>gender</literal> property, we will get the expected <literal>fem</literal> value.</para>
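<para>The difference between normal and cutting properties can be sketched like this (a toy model, not Pology's implementation):</para>

```python
# Resolve an expansion where keys ending in "!" mark cutting properties:
# their values are taken verbatim instead of being concatenated to the
# root, and "!" is not part of the resulting key.
def expand_with_cutting(root, base_props):
    props = {}
    for key, val in base_props.items():
        if key.endswith("!"):
            props[key[:-1]] = val    # verbatim; "!" stripped from the key
        else:
            props[key] = root + val  # normal concatenation
    return props

base = {"nom": "a", "gen": "e", "gender!": "fem"}
props = expand_with_cutting("Vener", base)
# props["gender"] is "fem", not "Venerafem"
```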
1394 
1395 <para>Cutting properties also behave differently in multiple inner expansions. Instead of being canceled when not all inner expansions define them, simply the rightmost value is taken -- just as in outer expansions.</para>
1396 
1397 <para><emphasis>Terminal</emphasis> properties are those hidden with respect to expansion, i.e. they are not taken into the calling derivation. A property is made terminal by appending <literal>.</literal> (dot) to its key. For example, if some derivations have the short description property <literal>desc</literal>, we typically do not want it to propagate into calling derivations which happen not to override it by outer expansion:
1398 <programlisting>
1399 Mars: Mars|, desc.=planet
1400 Red Mars: Crven|i> |Mars  # a novel
1401 </programlisting>
1402 </para>
1403 
1404 <para><emphasis>Canceling</emphasis> properties will cause a previously defined property with the same key to be removed from the collection of properties. A canceling property is indicated by ending its key with <literal>^</literal> (circumflex). The value of a canceling property has no meaning, and can be anything. Canceling is useful in expansions and alternative derivations (more on those later), where some properties introduced by an expansion or alternative fallback should be removed from the final collection of properties.</para>
1405 
1406 </sect2>
1407 
1408 <sect2 id="sec-lgsdtags">
1409 <title>Text Tags</title>
1410 
1411 <para>Key syntagmas and property values can be equipped with arbitrary simple tags, which start with the tag name in the form <literal>~<replaceable>tag</replaceable></literal> and extend to the next tag
1412 or the end of the syntagma. For example, when deriving people's names, we may want to tag their first and last names, using the tags <literal>~fn</literal> and <literal>~ln</literal> respectively:
1413 <programlisting>
1414 ~fn Isaac ~ln Newton: ~fn Isak| ~ln Njutn|
1415 </programlisting>
1416 In default queries to the dictionary, tags are simply ignored, syntagmas and property values are reported as if there were no tags. However, custom derivators (based on the <classname>Synder</classname> class from <package>pology.synder</package>) can define transformation functions, to which tagged text segments will be passed, so that they can treat them specially when producing the final text.</para>
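<para>How a transformation function might see tagged text can be sketched as follows; the splitting code and the <literal>upcase_ln</literal> function are purely illustrative, not part of Pology's API:</para>

```python
import re

# Split tagged text like "~fn Isak ~ln Njutn" into (tag, segment)
# pairs and run a transformation function over each tagged segment.
def transform_tagged(text, transf):
    parts = re.split(r"~(\w+)\s", text)
    head, rest = parts[0], parts[1:]
    pieces = [head]
    for tag, seg in zip(rest[::2], rest[1::2]):
        pieces.append(transf(tag, seg))
    return "".join(pieces)

# Example transformation: uppercase segments tagged as last names.
def upcase_ln(tag, seg):
    return seg.upper() if tag == "ln" else seg

out = transform_tagged("~fn Isak ~ln Njutn", upcase_ln)
# out is "Isak NJUTN"
```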
1417 
1418 <para>A tag is implicitly terminated by whitespace or a comma (or a colon in key syntagmas), but if none of these characters can be put after the tag, the tag name can be explicitly delimited with curly brackets, as <literal>~{<replaceable>tag</replaceable>}</literal>.</para>
1419 
1420 </sect2>
1421 
1422 <sect2 id="sec-lgsdalts">
1423 <title>Alternative Derivations</title>
1424 
1425 <para>Sometimes there may be several alternative derivations for a given syntagma. The default derivation (in some suitable sense) is written as explained so far, and alternative derivations are written under named <emphasis>environments</emphasis>.</para>
1426 
1427 <para>For example, if deriving a transcribed person's name, there may be several versions of the transcription. <emphasis>Isaac Newton</emphasis>, as the name of the Renaissance scientist, may be normally used in its traditional transcription <emphasis>Isak Njutn</emphasis>, while a contemporary person of that name would be transcribed in the modern way, as <emphasis>Ajzak Njuton</emphasis>. Then, in the entry of Newton the scientist, we could also mention what the modern transcription would be, under the environment <literal>modern</literal>:
1428 <programlisting>
1429 Isaac Newton: Isak| Njutn|
1430     @modern: Ajzak| Njuton|
1431 </programlisting>
1432 Alternative derivations are put on their own lines after the default derivation, and instead of the key syntagma, they begin with the environment name. The environment name starts with <literal>@</literal> and ends with a colon, and then the usual derivation follows. It is conventional, but not mandatory, to indent the environment name. There can be any number of non-default environments.</para>
1433 
1434 <para>The immediate question that arises is how expansions are treated in non-default environments. In the previous example, what does the <literal>|</literal> expansion resolve to in the <literal>modern</literal> environment? This depends on how the synder file is processed. By default, it is required that derivations referenced by expansions have matching environments. If <literal>|</literal> were defined as:
1435 <programlisting>
1436 |: nom=, gen=a, dat=u, acc=
1437 </programlisting>
1438 then the expansion of <emphasis>Isaac Newton</emphasis> in the <literal>modern</literal> environment would fail. Instead, it would be necessary to define the base derivations as:
1439 <programlisting>
1440 |: nom=, gen=a, dat=u, acc=
1441     @modern: nom=, gen=a, dat=u, acc=
1442 </programlisting>
1443 However, this may not be a very useful requirement. As can be seen in this example already, in many cases base derivations are likely to be the same for all environments, so they would be needlessly duplicated. It is therefore possible to define an <emphasis>environment fallback</emphasis> chain in processing, such that when a derivation in a certain environment is requested but not available, environments in the fallback chain are tried in order. In this example, if the chain were given as <literal>("modern", "")</literal> (the empty string is the name of the default environment), then we could write:
1444 <programlisting>
1445 |: nom=, gen=a, dat=u, acc=
1446 Isaac Newton: Isak| Njutn|
1447     @modern: Ajzak| Njuton|
1448 Charles Messier: Šarl| Mesje|
1449 </programlisting>
1450 When the derivation of <emphasis>Isaac Newton</emphasis> in the <literal>modern</literal> environment is requested, the default expansion for <literal>|</literal> will be used, and the derivation will succeed. Derivation of <emphasis>Charles Messier</emphasis> in the <literal>modern</literal> environment will succeed too, because the environment fallback chain is applied throughout;
1451 if <emphasis>Charles Messier</emphasis> had a different modern transcription, we would have provided it explicitly.</para>
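<para>The fallback lookup itself can be sketched in Python (a toy model; the dictionary layout and function names are illustrative, not Pology's API):</para>

```python
# Each derivation maps environment names to values; "" names the
# default environment. A lookup tries the requested environment first,
# then the environments in the fallback chain, in order.
DERIVATIONS = {
    "Isaac Newton": {"": "Isak Njutn", "modern": "Ajzak Njuton"},
    "Charles Messier": {"": "Šarl Mesje"},
}

def lookup(key, env, chain=("modern", "")):
    envs = DERIVATIONS[key]
    for e in (env,) + tuple(chain):
        if e in envs:
            return envs[e]
    raise KeyError("no derivation of %r in environment %r" % (key, env))

# lookup("Isaac Newton", "modern") gives the modern transcription;
# lookup("Charles Messier", "modern") falls back to the default one.
```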
1452 
1453 </sect2>
1454 
1455 <sect2 id="sec-lgsdwspace">
1456 <title>Treatment of Whitespace</title>
1457 
1458 <para>ASCII whitespace in derivations, namely space, tab, and newline,
1459 is not preserved as-is, but is by default <emphasis>simplified</emphasis> in final property values. The simplification consists of removing all leading and trailing ASCII whitespace, and replacing all inner sequences of ASCII whitespace with a single space. Thus, these two derivations are equivalent:
1460 <programlisting>
1461 Venus: nom=Venera
1462 Venus  :  nom =  Venera
1463 </programlisting>
1464 but these two are not:
1465 <programlisting>
1466 Venus: Vener|a
1467 Venus: Vener  |a
1468 </programlisting>
1469 because the two spaces between the root <literal>Vener</literal> and expansion <literal>|a</literal> become inner spaces in resulting values, so they get converted into a single space.</para>
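<para>The simplification rule can be sketched in Python. Note that an explicit character class of ASCII whitespace is used, rather than generic Unicode-aware splitting, so that non-ASCII whitespace survives (a toy model, not Pology's implementation):</para>

```python
import re

# Simplify ASCII whitespace only: trim it at both ends and collapse
# inner runs to a single space; non-ASCII whitespace (e.g. no-break
# space U+00A0) is preserved as-is.
def simplify(text):
    text = re.sub(r"[ \t\n\r]+", " ", text)
    return text.strip(" \t\n\r")

out = simplify("  Vener \t a  ")
# out is "Vener a"; simplify("Vener\u00a0a") keeps the no-break space
```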
1470 
1471 <para>Non-ASCII whitespace, on the other hand, is preserved as-is. This means that significant whitespace, like non-breaking space, zero width space, word joiners, etc. can be used normally.</para>
1472 
1473 <para>It is possible to have different treatment of whitespace, through an optional parameter to the derivator object (<classname>Synder</classname> class). This parameter is a transformation function to which text segments with raw whitespace are passed, so that anything can be done with them.</para>
1474 
1475 <para>Due to the simplification of whitespace, indentation of key syntagmas and environment names is not significant, but it is nevertheless enforced to be consistent. This will not be accepted as valid syntax:
1476 <programlisting>
1477 Isaac Newton: Isak| Njutn|
1478     @modern: Ajzak| Njuton|
1479  George Washington: Džordž| Vašington|  # inconsistent indent
1480   @modern: Džordž| Vošington|  # inconsistent indent
1481 </programlisting>
1482 Consistent indenting is enforced both for stylistic reasons when several people are working on the same synder file, and to discourage indentation styles unfriendly to version control systems, such as:
1483 <programlisting>
1484 Isaac Newton: Isak| Njutn|
1485      @modern: Ajzak| Njuton|
1486 George Washington: Džordž| Vašington|
1487           @modern: Džordž| Vošington|  # inconsistent indent
1488 </programlisting>
1489 Unfriendliness to version control comes from the need to reindent lines which are otherwise unchanged, merely in order to keep them aligned to lines which were actually changed.</para>
1490 
1491 </sect2>
1492 
1493 <sect2 id="sec-lgsdorg">
1494 <title>Uniqueness, Ordering, and Inclusions</title>
1495 
1496 <para>Within a single synder file, each derivation must have at least one unique key syntagma, because key syntagmas are used as keys in dictionary lookups. These two derivations are in conflict:
1497 <programlisting>
1498 Mars: Mars|  # the planet
1499 Mars: mars|  # the chocolate bar
1500 </programlisting>
1501 </para>
1502 
1503 <para>There are several possibilities to resolve key conflicts. The simplest possibility is to use keyword-like key syntagmas, if key syntagmas themselves do not need to be human readable:
1504 <programlisting>
1505 marsplanet: Mars|
1506 marsbar: mars|
1507 </programlisting>
1508 If key syntagmas have to be human-readable, then one option is to extend them in a human-readable way as well:
1509 <programlisting>
1510 Mars (planet): Mars|
1511 Mars (chocolate bar): mars|
1512 </programlisting>
1513 This method, too, is not acceptable if key syntagmas are intended to be of equal weight to derived syntagmas, as in a dictionary application. In that case, the solution is to add a hidden keyword-like syntagma to both derivations:
1514 <programlisting>
1515 Mars, |marsplanet: Mars|
1516 Mars, |marsbar: mars|
1517 </programlisting>
1518 Processing will now silently eliminate <literal>Mars</literal> as the key to either derivation, because it is conflicted, and leave only <literal>marsplanet</literal> as the key for the first and <literal>marsbar</literal> as the key for the second derivation. These remaining keys must also be used in expansions, to reference the appropriate derivation. However, when querying the dictionary for key syntagmas by the key <literal>marsplanet</literal>, only <emphasis>Mars</emphasis> will be returned, because <literal>marsplanet</literal> is hidden; likewise for <literal>marsbar</literal>.</para>
1519 
1520 <para>Ordering of derivations is not important. The following order is valid,
1521 although the expansion <literal>|Venus~gen</literal> is seen before the derivation of <emphasis>Venus</emphasis>:
1522 <programlisting>
1523 Merchants of Venus: Trgovc|i> s |Venus~gen
1524 Venus: Vener|a
1525 </programlisting>
1526 This enables derivations to be ordered naturally, e.g. alphabetically,
1527 instead of the order being imposed by dependencies.</para>
1528 
1529 <para>It is possible to include one synder file into another. A typical use case would be to split out base derivations into a separate file, and include it into other synder files. If base derivations are defined in <filename>base.sd</filename>:
1530 <programlisting>
1531 |: nom=, gen=a, dat=u, acc=, gender!=mas
1532 |a: nom=a, gen=e, dat=i, acc=u, gender!=fem
1533 </programlisting>
1535 then the file <filename>solarsys.sd</filename>, placed in the same directory, can include <filename>base.sd</filename> and use its derivations in expansions like this:
1536 <programlisting>
1537 >base.sd
1538 Mercury: Merkur|
1539 Venus: Vener|a
1540 Earth: Zemlj|a
1541 </programlisting>
1543 <literal>></literal> is the inclusion directive, followed by the absolute or relative path to the file to be included. If the path is relative, it is considered relative to the including file, and not to some externally defined set of
1544 inclusion paths.</para>
1545 
1546 <para>If the including and the included file contain a derivation with the same key syntagmas, these two derivations are <emphasis>not</emphasis> in conflict. On expansion, first the derivations from the current file are checked, and if the referenced derivation is not there, then the included files are checked in reverse of the inclusion order. In this way, it is possible to override some of the base derivations in one or a few including files.</para>
1547 
1548 <para>Inclusions are "shallow": only the derivations in the included file itself are visible (available for use in expansions) in the including file. In other words, if file A includes file B, and file B includes file C, then derivations from C are not automatically visible in A; to use them, A must explicitly include C.</para>

<para>Shallow inclusion and ordering-independent resolution of expansions, taken together, enable mutual inclusions: A can include B, while B can include A. This is an important capability when building derivations of taxonomies. While the derivation of X naturally belongs to file A and that of Y to file B, X may nevertheless be used in an expansion in another derivation in B, and Y in another derivation in A.</para>

<para>To make derivations from several synder files available for queries, these files are imported into the derivator object one by one. Derivations from imported files (but not from files included by them, according to the shallow inclusion principle) all share a single namespace. This means that key syntagmas across imported files can conflict, and such conflicts must be resolved by one of the outlined methods.</para>

<para>The design rationale for the inclusion mechanism was that in each collection of derivations, each <emphasis>visible</emphasis> derivation, i.e. one which is available to queries by the user of the collection, must be accessible by at least one unique key, which does not depend on the underlying file hierarchy.</para>

</sect2>

<sect2 id="sec-lgsderrors">
<title>Error Handling</title>

<para>There are three levels of errors which may happen in syntagma derivations.</para>

<para>The first level are syntax errors, such as a synder entry missing the colon which separates the key syntagma from the rest of the entry, an unclosed curly bracket in an expansion, etc. These errors are reported as soon as the synder file is imported into the derivator object or included by another synder file.</para>

<para>The second level of errors are expansion errors, such as an expansion referencing an undefined derivation, or an expansion mask discarding all properties. These errors are reported lazily, when the problematic derivation is actually looked up for the first time.</para>

<para>The third level is occupied by semantic errors, such as when every derivation is required to have a certain property, or the <literal>gender</literal> property is restricted to the values <literal>mas</literal>, <literal>fem</literal>, and <literal>neu</literal>, and a derivation violates some of these requirements. At the moment, there is no prepared way to catch semantic errors.</para>

<para>In the future, a mechanism (perhaps in the form of file-level directives) may be introduced to immediately report reference errors on request, and to constrain property keys and values so as to avoid semantic errors. Until then, the way to validate a collection of derivations is to write a piece of Python code which imports all files into a derivator object, iterates through the derivations (this alone will catch expansion errors), and checks for semantic errors.</para>
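
<para>Such a script might look like the following sketch. For brevity it assumes the derivations have already been pulled out of the derivator into a plain dictionary of property dictionaries; the exact Pology calls for importing files and iterating over derivations are omitted:
<programlisting>
ALLOWED_GENDERS = set(["mas", "fem", "neu"])

def check_semantics(derivations):
    # Check a mapping of key syntagma to property dictionary
    # against hand-written semantic constraints.
    problems = []
    for key, props in derivations.items():
        # Constraint: every derivation must carry a gender.
        if "gender" not in props:
            problems.append("%s: missing 'gender' property" % key)
        # Constraint: gender may take only the known values.
        elif props["gender"] not in ALLOWED_GENDERS:
            problems.append("%s: bad gender '%s'" % (key, props["gender"]))
    return problems

derivs = {
    "Venus": {"nom": "Venera", "gender": "fem"},
    "Mercury": {"nom": "Merkur"},                 # missing gender
    "Earth": {"nom": "Zemlja", "gender": "xyz"},  # invalid value
}
for problem in check_semantics(derivs):
    print(problem)
</programlisting>
</para>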

</sect2>

</sect1>

</chapter>