<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
 "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"
[
    <!ENTITY apibase "../../api/en_US">
    <!ENTITY ap "&apibase;/pology.">
    <!ENTITY amm "-module.html">
    <!ENTITY am "&amm;#">
    <!ENTITY acc "-class.html">
    <!ENTITY ac "&acc;#">
]>

<chapter id="ch-prog">
<title>Programming with Pology</title>

<para>You may find it odd that the user manual contains a section on programming, as that is normally the matter for a separate, programmer-oriented document. On the other hand, while reading the "pure user" sections of this manual, you may have noticed that in Pology the distinction between a user and a programmer is more blurry than one would expect of a translation-related tool. Indeed, even before you get to writing standalone Python programs which use the Pology library, there are many places in Pology itself where you can plug in some Python code to adapt the behavior to your language and translation environment. This section exists to support and stimulate such interaction with Pology.</para>

<para>The Pology library is quite simple conceptually and organizationally. It consists of a small core abstraction of the PO format, and a lot of mutually unrelated functionality that may come in handy in particular translation processing scenarios. Everything is covered by <ulink url="&apibase;">the Pology API documentation</ulink>, but since API documentation tends to be non-linear and full of details obstructing the bigger picture, the following sections provide a synthesis and rationale of the salient points.</para>

<!-- ======================================== -->
<sect1 id="sec-prfile">
<title>PO Format Abstraction</title>

<para>The PO format abstraction in Pology is a quite direct and fine-grained reflection of PO format elements and conventions. This was a design goal from the start; no attempt was made at a more general abstraction that would tentatively support various translation file formats.</para>

<para>There is, however, one glaring but intentional omission: multi-domain PO files (those which contain <literal>domain "..."</literal> directives) are not supported. We had never observed a multi-domain PO file in the wild, nor thought of a significant advantage it could have today over multiple single-domain PO files. Supporting multi-domain PO files would mean not only always needing two nested loops to iterate through messages in a PO file, but it would also interfere with higher levels in Pology which assume equivalence between PO files and domains. Pology will simply report an error when trying to read a multi-domain PO file.</para>

<sect2 id="sec-prflmon">
<title>Monitored Objects</title>

<para>Because the PO abstraction is intended to be robust against programming errors when quickly writing custom scripts, and frugal on file modifications, by default some of the abstracted objects are "monitored". This means that they are checked for expected data types and have modification counters. The main monitored objects are PO files, PO headers, and PO messages, but also their attributes which are not plain data types (strings or numbers). For the moment, these secondary monitored types include <ulink url="&ap;monitored.Monlist&acc;"><classname>Monlist</classname></ulink> (the monitored counterpart to built-in <type>list</type>), <ulink url="&ap;monitored.Monset&acc;"><classname>Monset</classname></ulink> (counterpart to <type>set</type>), and <ulink url="&ap;monitored.Monpair&acc;"><classname>Monpair</classname></ulink> (like a two-element <type>tuple</type>). Monitored types do not in general provide the full scope of functionality of their built-in counterparts, so sometimes it may be easier (and faster) to work with built-in types and convert them to monitored types at the moment of adding them to PO objects.</para>

<para>To take a <classname>Monlist</classname> instance as an example, here is how it behaves on its own:
<programlisting language="python">
>>> from pology.monitored import Monlist
>>> l = Monlist(["a", "b", "c"])
>>> l.modcount
0
>>> l.append(10)
>>> l
Monlist(["a", "b", "c", 10])
>>> l.modcount
1
>>>
</programlisting>
Appending an element has caused the modification counter to increase, but, as expected, it was possible to add an integer in spite of previous elements being strings. However, if the monitored list comes from a PO message:
<programlisting language="python">
>>> from pology.message import Message
>>> msg = Message()
>>> msg.msgstr
Monlist([])
>>> msg.msgstr.append(10)
Traceback (most recent call last):
...
pology.PologyError: Expected &lt;class 'str'&gt; for sequence element, got &lt;class 'int'&gt;.
>>> msg.msgstr.append("bar")
>>> msg.msgstr.modcount
1
>>> msg.modcount
1
</programlisting>
The <classname>Message</classname> class has type constraints added to its attributes, and therefore the addition of an integer to the <varname>.msgstr</varname> list was rejected: only <type>str</type> values are allowed, to prevent carelessness with encodings. Once a proper string was added to the <varname>.msgstr</varname> list, its modification counter increased, and so did the modification counter of the parent object.</para>

<para>A few more notes on modification counters. Consider this example:
<programlisting language="python">
>>> msg = Message()
>>> msg.msgstr = Monlist(["foo"])
>>> msg.msgstr.modcount
0
>>> msg.msgstr_modcount
1
>>> msg.modcount
1
>>> msg.msgstr[0] = "foo"
>>> msg.msgstr.modcount
0
>>> msg.msgstr = Monlist(["foo"])
>>> msg.msgstr_modcount
1
>>> msg.modcount
1
</programlisting>
<literal>Monlist(["foo"])</literal> itself is a fresh list with modification counter at 0, so after it was assigned to <varname>msg.msgstr</varname>, its modification counter is still 0. However, every attribute of a parent monitored object also has an associated <emphasis>attribute</emphasis> modification counter, denoted with trailing <literal>_modcount</literal>; therefore <varname>msg.msgstr_modcount</varname> did increase on assignment, and so did the parent <varname>msg.modcount</varname>. Modification tracking actually checks for equality of values, so when same-valued objects are repeatedly assigned (starting from <literal>msg.msgstr[0] = "foo"</literal> above), modification counters do not increase.</para>
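
<para>The counting and constraining behavior shown above can be mimicked in a few lines of plain Python. The following is only a sketch under stated assumptions, not how <classname>Monlist</classname> is implemented: the <classname>CountingList</classname> class and its <varname>elemtype</varname> parameter are hypothetical, and only appending and item assignment are covered:
<programlisting language="python">
class CountingList (object):
    """Hypothetical stand-in for a monitored list (not Pology code)."""

    def __init__ (self, items=(), elemtype=None):

        self._items = list(items)
        self._elemtype = elemtype  # None means any type is accepted
        self.modcount = 0

    def _check (self, value):

        if self._elemtype is None:
            return
        if not isinstance(value, self._elemtype):
            raise TypeError("Expected %s for sequence element, got %s."
                            % (self._elemtype, type(value)))

    def append (self, value):

        self._check(value)
        self._items.append(value)
        self.modcount += 1

    def __setitem__ (self, index, value):

        self._check(value)
        # Count the assignment only if the value really changes.
        if self._items[index] != value:
            self._items[index] = value
            self.modcount += 1

    def __getitem__ (self, index):

        return self._items[index]


texts = CountingList(["foo"], elemtype=str)
texts[0] = "foo"     # same value, counter stays at 0
texts[0] = "bar"     # different value, counter goes to 1
texts.append("baz")  # counter goes to 2
</programlisting>
Comparing by value before counting is what makes it possible to tell, at the end of processing, whether anything really changed.</para>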

<para>Compound monitored objects may also have the attributes themselves constrained, to prevent typos and other brain glitches from causing mysterious wrong behavior when processing PO files. For example:
<programlisting language="python">
>>> msg = Message()
>>> msg.msgtsr = Monlist(["foo"])
Traceback (most recent call last):
...
pology.PologyError: Attribute 'msgtsr' is not among specified.
>>>
</programlisting>
</para>
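
<para>In plain Python, this kind of attribute constraining is usually done by overriding <function>__setattr__()</function>. The following sketch illustrates the idea only; the <classname>Record</classname> class and its <varname>_allowed</varname> tuple are hypothetical and do not reflect how Pology declares its attribute specifications:
<programlisting language="python">
class Record (object):
    """Hypothetical object that accepts only a fixed set of attributes."""

    _allowed = ("msgid", "msgstr")

    def __init__ (self):

        # Bypass __setattr__ while setting up bookkeeping.
        object.__setattr__(self, "modcount", 0)

    def __setattr__ (self, name, value):

        if name not in self._allowed:
            raise AttributeError("Attribute '%s' is not among specified."
                                 % name)
        # A not yet assigned attribute differs from any value.
        missing = object()
        if getattr(self, name, missing) != value:
            object.__setattr__(self, name, value)
            object.__setattr__(self, "modcount", self.modcount + 1)


rec = Record()
rec.msgid = "foo"
try:
    rec.msgdi = "bar"  # typo in the attribute name
    typo_caught = False
except AttributeError:
    typo_caught = True
</programlisting>
Without the check, the misspelled assignment would silently create a new attribute and the intended one would keep its old value.</para>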

<para>You may conclude that modification tracking and type and attribute constraining would slow down processing, and you would be right. Since PO messages are by far the most processed objects, a non-monitored counterpart to <classname>Message</classname> is provided as well, for occasions where the code is only reading PO files, or has been sufficiently tested, and speed is of importance. See <xref linkend="sec-prflmsg"/> for details.</para>

</sect2>

<sect2 id="sec-prflmsg">
<title>Message</title>

<para>PO messages are by default represented with the <ulink url="&ap;message.Message&acc;"><classname>Message</classname></ulink> class. It is monitored for modifications, and constrained on attributes and attribute types. It provides direct attribute access to parts of a PO message:
<programlisting language="python">
>>> from pology.monitored import Monpair
>>> from pology.message import Message
>>> msg = Message()
>>> msg.msgid = "Foo %s"
>>> msg.msgstr.append("Bar %s")
>>> msg.flag.add("c-format")
>>> msg.fuzzy = True
>>> print(msg.to_string(), end="")
#, fuzzy, c-format
msgid "Foo %s"
msgstr "Bar %s"

>>>
</programlisting>
Attribute access provides the least hassle, while being guarded by monitoring, and makes clear the semantics of particular message parts. For example, the <varname>.flag</varname> attribute is a set, to indicate that the order of flags should be of no importance to either a human translator or a PO processor, and the <varname>.msgstr</varname> attribute is always a list in order to prevent the programmer from not taking into account plural messages. While the fuzzy state is formally indicated by a flag, it is considered special enough to have a separate attribute.</para>

<para>Some message parts may or may not be present in a message, and when they are not present, the corresponding attributes are either empty if sequences (e.g. <varname>.manual_comment</varname> list for translator comments), or set to <literal>None</literal> if strings<footnote>
<para>The canonical way to check if a message is a plural message is <literal>msg.msgid_plural is not None</literal>.</para>
</footnote> (e.g. <varname>.msgctxt</varname>).</para>

<para>There are also several derived, read-only attributes for special purposes. For example, if in some context the messages are to be tracked in a dictionary by their keys, there is the <varname>.key</varname> attribute available, which is an undefined but unique combination of <varname>.msgctxt</varname> and <varname>.msgid</varname> attributes. Or, there is the <varname>.active</varname> attribute which is <literal>True</literal> if the message is neither fuzzy nor obsolete, i.e. its translation (if there is one) would be used by the consumer of the PO file that the message is part of.</para>
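
<para>The checks described in the last two paragraphs can be tried out without a PO file at hand. In the following sketch the message is a plain stand-in object built with <classname>SimpleNamespace</classname>, and the helper functions <function>is_plural()</function> and <function>is_active()</function> are hypothetical; with a real message you would read <varname>.active</varname> directly:
<programlisting language="python">
from types import SimpleNamespace


def is_plural (msg):

    # A missing string part is None, so test identity rather than truth;
    # an empty string would be falsy as well.
    return msg.msgid_plural is not None


def is_active (msg):

    # Mirrors the description of the .active attribute.
    return not msg.fuzzy and not msg.obsolete


# Stand-in object; a real message would come from a catalog.
msg = SimpleNamespace(msgctxt=None, msgid="%d file", msgid_plural="%d files",
                      msgstr=["", ""], fuzzy=True, obsolete=False)

plural = is_plural(msg)
active = is_active(msg)
</programlisting>
</para>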

<para><classname>Message</classname> has a number of methods for frequent operations that need to read or modify more than one attribute. For example, to thoroughly unfuzzy a message, it is not sufficient to just remove its fuzzy flag (by setting <varname>.fuzzy</varname> to <literal>False</literal> or removing <literal>"fuzzy"</literal> from the <varname>.flag</varname> set); previous field comments (<literal>#| ...</literal>) should be removed as well, and this is what the <function>.unfuzzy()</function> method does:
<programlisting language="python">
>>> print(msg.to_string(), end="")
#| msgid "Foubar"
#, fuzzy
msgid "Foobar"
msgstr "Fubar"

>>> msg.unfuzzy()
>>> print(msg.to_string(), end="")
msgid "Foobar"
msgstr "Fubar"

</programlisting>
Other methods include those to copy over a subset of parts from another message, to revert the message to pristine untranslated state, and so on.</para>
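
<para>To make the two steps of unfuzzying explicit, here is a rough sketch that operates on a stand-in object rather than on a real <classname>Message</classname>. The function name <function>unfuzzy_sketch()</function> is hypothetical, and the attribute names used for the previous fields (<varname>.msgctxt_previous</varname>, <varname>.msgid_previous</varname>, <varname>.msgid_plural_previous</varname>) are assumptions that should be checked against the API documentation:
<programlisting language="python">
from types import SimpleNamespace


def unfuzzy_sketch (msg):
    """Clear the fuzzy state together with previous-field data.

    Returns True if anything was changed.
    """
    changed = False
    if msg.fuzzy:
        msg.fuzzy = False
        changed = True
    # Attribute names for the "#|" fields are assumed here.
    for name in ("msgctxt_previous", "msgid_previous",
                 "msgid_plural_previous"):
        if getattr(msg, name, None) is not None:
            setattr(msg, name, None)
            changed = True
    return changed


msg = SimpleNamespace(msgid="Foobar", msgstr=["Fubar"], fuzzy=True,
                      msgctxt_previous=None, msgid_previous="Foubar",
                      msgid_plural_previous=None)
changed = unfuzzy_sketch(msg)
</programlisting>
</para>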

<para>There exists a non-monitored counterpart to <classname>Message</classname>, the <ulink url="&ap;message.MessageUnsafe&acc;"><classname>MessageUnsafe</classname></ulink> class. Its attributes are of built-in types, e.g. <varname>.msgstr</varname> is plain <classname>list</classname>, and there is neither type nor attribute checking. By using <classname>MessageUnsafe</classname>, a speedup of 50% to 100% has been observed in practical applications, so it makes for a good trade-off when you know what you are doing (e.g. you are certain that no modifications will be made). A PO file is opened with non-monitored messages by passing the <literal>monitored=False</literal> argument to the <classname>Catalog</classname> constructor.</para>

<para>Read-only code should work with <classname>Message</classname> and <classname>MessageUnsafe</classname> objects without any type-based specialization. Code that writes may need some care to achieve the same, for example:
<programlisting language="python">
def translate_moo_as_mu (msg):

    if msg.msgid == "Moo!":  # works for both
        msg.msgstr = ["Mu!"]  # raises exception if Message
        msg.msgstr[:] = ["Mu!"]  # works for both
        msg.msgstr[0] = "Mu!"  # works for both (when not empty)
</programlisting>
If you need to create an empty message of the same type as another message, or make a same-type copy of the message, you can use the <function>type</function> built-in:
<programlisting language="python">
newmsg1 = type(msg)()  # create empty
newmsg2 = type(msg)(msg)  # copy
</programlisting>
<classname>Message</classname> and <classname>MessageUnsafe</classname> share the virtual base class <classname>Message_base</classname>, so you can use <literal>isinstance(obj, Message_base)</literal> to check if an object is a PO message of either type.</para>

</sect2>

<sect2 id="sec-prflhead">
<title>Header</title>

<para>The PO header could be treated as just another message, but that would both be inconvenient for operating on it, and disruptive in iteration over a catalog. Instead, the <ulink url="&ap;header.Header&acc;"><classname>Header</classname></ulink> class is introduced. Similar to <classname>Message</classname>, it provides both direct attribute access to parts of the header (like the <varname>.field</varname> list of name-value pairs), and methods for usual manipulations which would need a sequence of basic data manipulations (like <function>.set_field()</function> to either modify an existing or add a new header field with the given value).</para>

<para>In particular, header comments are represented by a number of attributes (<varname>.title</varname>, <varname>.author</varname>, etc.), some of which are strings and some lists, depending on semantics. Unfortunately, the PO format does not define this separation formally, so when the PO file is parsed, comments are split heuristically (<varname>.title</varname> will be the first comment line, <varname>.author</varname> will get every line which looks like it has an email address and a year in it, etc.).</para>
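
<para>To give a feeling for what such a heuristic may look like, the following sketch splits a list of comment lines along the two rules just mentioned. It is not the parser used by Pology: the function <function>split_header_comments()</function>, the regular expressions, and the sample lines are all made up for illustration, and the real heuristics recognize more categories:
<programlisting language="python">
import re

_email_rx = re.compile(r"\S+@\S+")
_year_rx = re.compile(r"\b(19|20)\d\d\b")


def split_header_comments (lines):
    """Split comment lines into (title, authors, other)."""

    title = None
    authors = []
    other = []
    for i, line in enumerate(lines):
        if i == 0:
            # The first comment line is taken as the title.
            title = line
        elif _email_rx.search(line) and _year_rx.search(line):
            # Something like an email address plus a year: an author.
            authors.append(line)
        else:
            other.append(line)
    return title, authors, other


title, authors, other = split_header_comments([
    "Translation of foobar into Ruritanian.",
    "Copyright (C) 2009 The Foobar Project.",
    "Jane Doe (jane@example.org), 2009, 2010.",
])
</programlisting>
Here the copyright line contains a year but no email address, so it ends up in the remainder rather than among authors.</para>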

<para><classname>Header</classname> is a monitored class just like <classname>Message</classname>, but unlike <classname>Message</classname> it has no non-monitored counterpart. This is because in practice the header operations make a small part of total processing, so there is no real advantage in having non-monitored headers.</para>

</sect2>

<sect2 id="sec-prflcat">
<title>Catalog</title>

<para>PO files are read and written through <ulink url="&ap;catalog.Catalog&acc;"><classname>Catalog</classname></ulink> objects. A small script to open a PO file on disk (given as the first argument), find all messages that contain a certain substring in the original text (given as the second argument), and write those messages to standard output, would look like this:
<programlisting language="python">
import sys
from pology.catalog import Catalog
from pology.msgreport import report_msg_content

popath = sys.argv[1]
substr = sys.argv[2]

cat = Catalog(popath)
for msg in cat:
    if substr in msg.msgid:
        report_msg_content(msg, cat)
</programlisting>
Note the minimalistic code, both by raw length and access interface. Instead of using something like <literal>print(msg.to_string())</literal> to output the message, already in this example we introduce the <ulink url="&ap;msgreport&amm;"><literal>msgreport</literal></ulink> module, which contains various functions for reporting on PO messages;<footnote>
<para>There is also the <ulink url="&ap;report&amm;"><literal>report</literal></ulink> module for reporting general strings. In fact, all code in the Pology distribution is expected to use functions from these modules for writing to output streams, and there should not be a <function>print</function> in sight.</para>
</footnote> <function>report_msg_content()</function> will first output the PO file name and location of the message (line and entry number) within the file, and then the message content itself, with some highlighting (for field keywords, fuzzy state, etc.) if the output destination permits it. Since no modifications are done to messages, this example would be just as safe but run significantly faster if the PO file were opened in non-monitored mode. This is done by adding the <literal>monitored=False</literal> argument to the <classname>Catalog</classname> constructor:
<programlisting language="python">
cat = Catalog(popath, monitored=False)
</programlisting>
and no other modification is required.</para>

<para>When some messages are modified in a catalog created by opening a PO file on disk, the modifications will not be written back to disk until the <function>.sync()</function> method is called, not even if the program exits. If the catalog is monitored and there were no modifications to it up to the moment <function>.sync()</function> is called, the file on disk will not be touched, and <function>.sync()</function> will return <literal>False</literal> (it returns <literal>True</literal> if the file is written).<footnote>
<para>This holds only for catalogs created with monitoring, i.e. without the <literal>monitored=False</literal> constructor argument. For non-monitored catalogs, <function>.sync()</function> will always touch the file and return <literal>True</literal>.</para>
</footnote> In a scenario where a bunch of PO files are processed, this allows you to report only those which were actually modified. Take as an example a simplistic<footnote>
<para>As opposed to <link linkend="sv-find-messages">the <command>find-messages</command> sieve</link>.</para>
</footnote> script to search and replace in translation:
<programlisting language="python">
import sys
from pology.catalog import Catalog
from pology.fsops import collect_catalogs
from pology.report import report

searchstr = sys.argv[1]
replacestr = sys.argv[2]
popaths = sys.argv[3:]

popaths = collect_catalogs(popaths)
for popath in popaths:
    cat = Catalog(popath)
    for msg in cat:
        for i, text in enumerate(msg.msgstr):
            msg.msgstr[i] = text.replace(searchstr, replacestr)
    if cat.sync():
        report("%s (%d)" % (cat.filename, cat.modcount))
</programlisting>
This script takes the search and replace strings as the first two arguments, followed by any number of PO paths. The paths do not have to be only file paths, but can also be directory paths, in which case the <function>collect_catalogs()</function> function from the <ulink url="&ap;fsops&amm;"><literal>fsops</literal></ulink> module will recursively collect any PO files in them. After the search and replace iteration through a catalog is done (<varname>msgstr</varname> being properly handled on plain and plural messages alike), its <function>.sync()</function> method is called, and if it reports that the file was modified, the file's path and the number of modified texts are output. The latter is obtained simply as the modification counter state of the catalog, since it was bumped up by one on each text that actually got modified. Note the use of the <varname>.filename</varname> attribute for illustration, although in this particular case we had the path available in the <varname>popath</varname> variable.</para>

<para>Syncing to disk is an atomic operation. This means that if you or something else aborts the program in the middle of execution, none of the processed PO files will become corrupted; they will either be in their original state, or in the expected modified state.</para>
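
<para>A common way to obtain this kind of atomicity, in any program, is to write the new content to a temporary file in the same directory and then rename it over the original. The sketch below shows that general technique; it is an assumption that Pology works along these lines, and the function <function>write_atomically()</function> is hypothetical:
<programlisting language="python">
import os
import tempfile


def write_atomically (path, text):
    """Replace the file at path with text in a single step."""

    dirpath = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same file system as the target,
    # so it is created in the same directory.
    fd, tmppath = tempfile.mkstemp(dir=dirpath, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="UTF-8") as tmpfile:
            tmpfile.write(text)
        # The rename either happens completely or not at all.
        os.replace(tmppath, path)
    except BaseException:
        if os.path.exists(tmppath):
            os.remove(tmppath)
        raise
</programlisting>
If the program is interrupted while the temporary file is being written, the original file has not been touched yet; at worst a stray temporary file is left behind.</para>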

<para>As can be seen, at its base the <classname>Catalog</classname> class is an iterable container of messages. However, the precise nature of this container is less obvious. To the consumer (a program or converter) the PO file is a dictionary of messages by keys (<varname>msgctxt</varname> and <varname>msgid</varname> fields); there can be no two messages with the same key, and the order of messages is of no importance. For the human translator, however, the order of messages in the PO file is of great importance, because it is one of <link linkend="sec-pocontext">context indicators</link>. Message keys are parts of the messages themselves, which means that a message is both its own dictionary key and the value. Taking these constraints together, in Pology the PO file is treated as an <emphasis>ordered set</emphasis>, and the <classname>Catalog</classname> class interface is made to reflect this.</para>

<para>The ordered set nature of catalogs comes into play when the composition of messages, rather than just the messages themselves, is modified. For example, to remove all obsolete messages from the catalog, the <function>.remove()</function> method <emphasis>could</emphasis> be used:
<programlisting language="python">
for msg in list(cat):
    if msg.obsolete:
        cat.remove(msg)
cat.sync()
</programlisting>
Note that the message sequence was first copied into a list, since the removal would otherwise clobber the iteration. Unfortunately, this code will be very slow, since each removal takes linear time with respect to catalog size: when a message is removed, internal indexing has to be updated to maintain both the order and quick lookups. Instead, the better way to remove messages is the <function>.remove_on_sync()</function> method, which marks the message for removal on syncing. This runs fast (constant time with respect to catalog size) and requires no copying into a list prior to iteration:
<programlisting language="python">
for msg in cat:
    if msg.obsolete:
        cat.remove_on_sync(msg)
cat.sync()
</programlisting>
</para>
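
<para>The difference in cost between the two ways of removing is easier to see on a toy model. The following sketch of an ordered set, kept as a list plus a key-to-position dictionary, is an assumption made for illustration; the <classname>OrderedKeySet</classname> class is hypothetical and much simpler than <classname>Catalog</classname>, with plain strings standing in for messages:
<programlisting language="python">
class OrderedKeySet (object):
    """Hypothetical ordered set: a list plus a key-to-position index."""

    def __init__ (self, items=()):

        self._items = list(items)
        self._marked = set()
        self._reindex()

    def _reindex (self):

        # Linear in the number of items.
        self._pos = dict((x, i) for i, x in enumerate(self._items))

    def __contains__ (self, item):

        return item in self._pos  # constant time

    def __iter__ (self):

        return iter(self._items)

    def remove (self, item):

        # Every call shifts the tail of the list and rebuilds the index.
        del self._items[self._pos[item]]
        self._reindex()

    def remove_on_sync (self, item):

        self._marked.add(item)  # constant time, nothing moves yet

    def sync (self):

        # One linear pass removes all marked items at once.
        self._items = [x for x in self._items if x not in self._marked]
        self._marked.clear()
        self._reindex()


keys = OrderedKeySet(["a", "b", "c", "d"])
for key in keys:
    if key in ("b", "d"):
        keys.remove_on_sync(key)
keys.sync()
remaining = list(keys)
</programlisting>
Removing many items one at a time repeats the linear work for every item, while marking them and sweeping once does the linear work a single time.</para>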

<para>A message is added to the catalog using the <function>.add()</function> method. If <function>.add()</function> is given only the message itself, it will overwrite the message with the same key if there is one such, or else insert it according to source references, or append it to the end. If <function>.add()</function> is also given the insertion position, it will insert the message at that position only if the message with the same key does not exist in the catalog; if it does, it will ignore the given position and overwrite the existing message. When the message is inserted, <function>.add()</function> suffers the same performance problem as <function>.remove()</function>: it runs in linear time. However, the common case when an empty catalog is created and messages added one by one to the end can run in constant time, and this is what the <function>.add_last()</function> method does.<footnote>
<para>In fact, <function>.add_last()</function> does a bit more: if both non-obsolete and obsolete messages are added in mixed order, in the catalog they will be separated such that all non-obsolete come before all obsolete, but otherwise maintaining the order of addition.</para>
</footnote></para>

<para>The basic way to check if a message with the same key exists in the catalog is to use the <literal>in</literal> operator. Since the catalog is ordered, if the position of the message is wanted, the <function>.find()</function> method can be used instead. Both lookups are fast, running in constant time. There is a series of <function>.select_*()</function> methods for looking up messages by other than the key, which run in linear time, and return lists of messages since the result may not be unique any more.</para>

<para>Since it is ordered, the catalog can be indexed, either by a position or by a message (whose key is used for lookup). To replace a message in the catalog with a message which has the same key but is otherwise different, you can either first fetch its position and then use it as the index, or use the message itself as the index:
<programlisting language="python">
# Indexing by position.
pos = cat.find(msg)
cat[pos] = msg

# Indexing by message key.
cat[msg] = msg
</programlisting>
This leads to the following question: what happens if you modify the key of a message (its <varname>.msgctxt</varname> or <varname>.msgid</varname> attributes) in the catalog? In that case the internal index goes out of sync, rather than being automatically updated. This is a necessary performance measure. If you need to change message keys, while doing that you should treat the catalog as a pure list, using only iteration (<literal>for msg in cat</literal>) and positional indexing. Afterwards you should either call <function>.sync()</function> if you are done with the catalog, or <function>.sync_map()</function> to only update indexing (and remove messages marked with <function>.remove_on_sync()</function>) without writing out the PO file.</para>
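
<para>Why the index goes stale can be shown with nothing more than a list and a dictionary. In this sketch the <classname>Entry</classname> class and the sample data are hypothetical, and rebuilding the dictionary plays the role that <function>.sync_map()</function> has for a catalog:
<programlisting language="python">
class Entry (object):
    """Hypothetical mutable item whose key is part of the item itself."""

    def __init__ (self, key, text):

        self.key = key
        self.text = text


entries = [Entry("open", "Otvori"), Entry("close", "Zatvori")]
index = dict((e.key, i) for i, e in enumerate(entries))

# Changing a key behind the back of the index...
entries[0].key = "open-file"

# ...leaves the index knowing only the old key.
stale_hit = "open" in index         # True, although no such entry remains
fresh_miss = "open-file" in index   # False, although the entry exists

# Rebuilding the index restores consistency.
index = dict((e.key, i) for i, e in enumerate(entries))
synced_hit = "open-file" in index   # True
</programlisting>
Keeping the index correct automatically would require every key assignment to notify the container, which is the cost that the catalog avoids.</para>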

<para>The <classname>Catalog</classname> class provides a number of convenience methods which report things about the catalog based on the header information, rather than having to manually examine the header. These include the number of plural forms, the <varname>msgstr</varname> index for the given plural number, as well as information important in some Pology contexts, like language code, accelerator markers, markup types, etc. Each of these methods has a counterpart which sets the appropriate value, but this value is not written to disk when the catalog is synced. This is because there are frequently several ways in which the value can be determined from the header, so it is ambiguous how to write it out. Instead, these methods are used to set or override values provided by the catalog (e.g. based on command line options) for the duration of processing only.</para>
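
<para>As an illustration of what "based on the header information" means, the number of plural forms is conventionally stated in the <literal>Plural-Forms</literal> header field. The following sketch extracts it from a list of name-value pairs; the function <function>nplurals_from_fields()</function> and its fallback value are hypothetical, and the corresponding catalog method may well behave differently when the field is missing:
<programlisting language="python">
import re


def nplurals_from_fields (fields, default=1):
    """Read the number of plural forms from header name-value pairs."""

    for name, value in fields:
        if name == "Plural-Forms":
            m = re.search(r"nplurals\s*=\s*(\d+)", value)
            if m:
                return int(m.group(1))
    # The fallback is an arbitrary choice made for this sketch.
    return default


fields = [
    ("Content-Type", "text/plain; charset=UTF-8"),
    ("Plural-Forms", "nplurals=2; plural=(n != 1);"),
]
nplurals = nplurals_from_fields(fields)
</programlisting>
</para>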

<para>To create an empty catalog if it does not exist on disk, the <literal>create=True</literal> argument can be added to the constructor. If the catalog does exist, it will be opened as usual; if it did not exist, the new PO file will be written to disk on sync. To unconditionally create an empty catalog, whether the PO file exists or not at the given path, the <literal>truncate=True</literal> parameter should be added as well. In this case, if the PO file did exist, it will be overwritten with the new content only when the catalog is synced. The catalog can also be created with an empty string for path, in which case it is guaranteed to be empty even without setting <literal>truncate=True</literal>. If a catalog with empty path should later be synced (as opposed to being transient during processing), its <varname>.filename</varname> attribute can simply be assigned a valid path before calling <function>.sync()</function>.</para>

<para>In summary, it can be said that the <classname>Catalog</classname> class is biased, in terms of performance and ease of use, towards processing existing PO files rather than creating PO files from scratch, and towards processing existing messages in the PO file rather than shuffling them around.</para>

</sect2>

</sect1>

<!-- ======================================== -->
<sect1 id="sec-prcodconv">
<title>Coding Conventions</title>

<para>This section describes the style and conventions that code intended for inclusion in the Pology distribution should adhere to. The general coding style is expected to follow the Python style guide described in <ulink url="http://www.python.org/dev/peps/pep-0008/">PEP 8</ulink>.</para>

<para>Lines should be up to 80 characters long. Class names should be written in camel case, and all other names in lower case with underscores:
<programlisting language="python">
class SomeThingy (object):
    ...

    def some_method (self, ...):

        ...
        longer_variable = ...


def some_function (...):
    ...
</programlisting>
Long expressions with operators should be enclosed in parentheses and broken before a binary operator, with the first line indented so that its first operand lines up with the operands on the following lines:
<programlisting language="python">
some_quantity = (  a_number_of_thingies * quantity_of_that_per_unit
                 + the_base_offset)
</programlisting>
In particular, long conditions in <literal>if</literal> and <literal>while</literal> statements should be written like this:
<programlisting language="python">
if (    something and something_else and yet_something
    and somewhere_in_between and who_knows_what_else
):
    do_something_appropriate()
</programlisting>
</para>

<para>All messages, warnings, and errors should be issued through the <ulink url="&ap;report&amm;"><literal>report</literal></ulink> and <ulink url="&ap;msgreport&amm;"><literal>msgreport</literal></ulink> modules. There should be no <function>print</function> statements or raw writes to <literal>sys.stdout</literal>/<literal>sys.stderr</literal>.</para>

<para>For the code in Pology library, it is always preferable to raise an exception instead of aborting execution. On the other hand, it is fine to add optional parameters by which the client can select if the function should abort rather than raise an exception. All topical problems should raise <classname>pology.PologyError</classname> or a subclass of it, and built-in exceptions only for simple general problems (e.g. <classname>IndexError</classname> for indexing past the end of something).</para>
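
<para>A function following this convention might be shaped as in the sketch below. Everything in it is made up for illustration: <classname>SketchError</classname> stands in for <classname>pology.PologyError</classname> so that the example runs on its own, <function>parse_ratio()</function> and its <varname>abort</varname> parameter are hypothetical, and real Pology code would issue the message through the <literal>report</literal> module before aborting rather than rely on <function>sys.exit()</function> to print it:
<programlisting language="python">
import sys


class SketchError (Exception):
    """Stand-in for pology.PologyError in this self-contained sketch."""


def parse_ratio (text, abort=False):
    """Parse 'N/M' into a float, raising or aborting on bad input."""

    try:
        num, den = text.split("/")
        return float(num) / float(den)
    except (ValueError, ZeroDivisionError):
        message = "Malformed ratio '%s'." % text
        if abort:
            # Given a string, sys.exit() prints it to standard error
            # and exits with status 1.
            sys.exit(message)
        raise SketchError(message)
</programlisting>
Library callers get an exception they can handle, while a command line script can pass <literal>abort=True</literal> and skip the error handling of its own.</para>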

<sect2 id="sec-prcsi18n">
<title>User-Visible Text and Internationalization</title>

<para>All user-visible text, be it reports, warnings, or errors (including exception messages), should be wrapped for internationalization through Gettext. The top <ulink url="&ap;pology&amm;"><literal>pology</literal></ulink> module provides several wrappers for Gettext functions, which have the following special traits: context is mandatory on every wrapped text, all format directives must be named, and arguments are specified as keyword-value pairs just after the text argument (unless deferred translation is used). Some examples:
<programlisting language="python">
# Simple message with context marker.
_("@info",
  "Trying to sync unnamed catalog.")

# Simple message with extended context.
_("@info command description",
  "Keep track of who, when, and how, has translated, modified, "
  "or reviewed messages in a collection of PO files.")

# Another context marker and extended context.
_("@title:column words per message in original",
  "w/msg-or")

# Parameter substitution.
_("@info",
  "Review tag '%(tag)s' not defined in '%(file)s'.",
  tag=rev_tag, file=config_path)

# Plural message
n_("@item:inlist",
   "written %(num)d word", "written %(num)d words",
   num=nwords)

# Deferred translation, when arguments are known later.
tmsg = t_("@info:progress",
          "Examining state: %(file)s")
...
msg = tmsg.with_args(file=some_path).to_string()
</programlisting>
Every context starts with the "context marker" in the form of <literal>@<replaceable>keyword</replaceable></literal>, drawn from a predefined set (see the <ulink url="http://techbase.kde.org/Development/Tutorials/Localization/i18n_Semantics#Context_Markers">article on i18n semantics</ulink> at KDE Techbase); it is most often <literal>@info</literal> in Pology code. The context marker may be, and should be, followed by a free-form extended context whenever it can help the translator to understand how and where the message is used. It is usual to have the context, text and arguments in different lines, though not necessary if they are short enough to fit one line.</para>
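
<para>The three traits of the wrappers (mandatory context, named format directives, keyword arguments) can be imitated on top of the standard <literal>gettext</literal> module. The following is a sketch of the calling convention only, not the code of the Pology wrappers: the names <function>tr()</function> and <classname>DeferredText</classname> are hypothetical, the result is a plain string rather than a <classname>ColorString</classname>, and <function>gettext.pgettext()</function> requires Python 3.8 or later:
<programlisting language="python">
import gettext


def tr (ctxt, text, **kwargs):
    """Translate text within context, then substitute named arguments."""

    if not ctxt:
        raise ValueError("Context is mandatory.")
    # Without an installed catalog, pgettext() returns text unchanged.
    translated = gettext.pgettext(ctxt, text)
    if kwargs:
        translated = translated % kwargs
    return translated


class DeferredText (object):
    """Remember context and text now, translate when arguments arrive."""

    def __init__ (self, ctxt, text):

        self._ctxt = ctxt
        self._text = text
        self._kwargs = {}

    def with_args (self, **kwargs):

        self._kwargs = kwargs
        return self

    def to_string (self):

        return tr(self._ctxt, self._text, **self._kwargs)


line = tr("@info",
          "Review tag '%(tag)s' not defined in '%(file)s'.",
          tag="done", file="ascription-config")
later = DeferredText("@info:progress", "Examining state: %(file)s")
progress = later.with_args(file="foo.po").to_string()
</programlisting>
Named directives matter to translators, who may need to reorder the arguments in the translated text.</para>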
0345 
0346 <para>Pology defines lightweight XML markup for coloring text in the <ulink url="&ap;colors&amm;"><literal>colors</literal></ulink> module. In fact, Gettext wrappers do not return ordinary strings, but <ulink url="&ap;colors.ColorString&acc;"><classname>ColorString</classname></ulink> objects, and functions from the <literal>report</literal> and <literal>msgreport</literal> modules know how to convert them to raw strings for the given output destination (file, terminal, web page...). Therefore you can use colors in any wrapped string:
0347 <programlisting language="python">
0348 _("@info:progress",
0349   "&lt;green&gt;History follows:&lt;/green&gt;")
0350 
0351 _("@info",
0352   "&lt;bold&gt;Context:&lt;/bold&gt; %(snippet)s",
0353   snippet=some_text)
0354 </programlisting>
0355 Coloring should be used sparingly, only when it helps to draw the user's eye to significant elements of the output.</para>
0356 
0357 <para>There are two consequences of having text markup available throughout. The first is that every message must be well-formed XML, which means that it must contain no unbalanced tags, and that literal <literal>&lt;</literal> characters must be escaped (and then also <literal>&gt;</literal> for good style):
0358 <programlisting language="python">
0359 _("@item automatic name for anonymous input stream",
0360   "&amp;lt;stream-%(num)s&amp;gt;",
0361   num=strno)
0362 </programlisting>
0363 The other consequence is that <classname>ColorString</classname> instances must be joined and interpolated with dedicated functions; see the <function>cjoin()</function> and <function>cinterp()</function> functions in the <literal>colors</literal> module.</para>
0364 
0365 <para>Unless the text of the message is specifically intended to be a title or an insert (i.e. <literal>@title</literal> or <literal>@item</literal> context markers), it should be a proper sentence, starting with a capital letter and ending with a dot.</para>
0366 
0367 </sect2>
0368 
0369 </sect1>
0370 
0371 <!-- ======================================== -->
0372 <sect1 id="sec-prsieves">
0373 <title>Writing Sieves</title>
0374 
0375 <para><link linkend="ch-sieve">Pology sieves</link> are filtering-like processing elements applied by the <command>posieve</command> script to collections of PO files. A sieve can examine as well as modify the PO entries passed through it. Each sieve is written in a separate file. If the sieve file is put into the <filename>sieve/</filename> directory of the Pology distribution (or installation), the sieve can be referenced on the <command>posieve</command> command line by the shorthand notation; otherwise the path to the sieve file is given. The former is called an internal sieve, and the latter an external sieve, but the sieve file layout and the sieve definition are the same in both cases.</para>
0376 
0377 <para>In the following, <command>posieve</command> will be referred to as "the client". This is because tools other than <command>posieve</command> may start to use sieves in the future, so this section also describes what such clients must adhere to when using sieves.</para>
0378 
0379 <sect2 id="sec-prsvlayout">
0380 <title>Sieve Layout</title>
0381 
0382 <para>The sieve file must define the <classname>Sieve</classname> class, with some mandatory and some optional interface methods and instance variables. There are no restrictions on what you can put into the sieve file besides this class; only keep in mind that <command>posieve</command> will load the sieve file as a Python module, exactly once during a single run.</para>
0383 
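0384 <para>Here is a simple sieve which just counts the number of translated messages; apart from the import of the <function>report</function> function from the <literal>pology.report</literal> module, which is omitted for brevity, this is also the complete sieve file: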
0385 <programlisting language="python">
0386 class Sieve (object):
0387 
0388     def __init__ (self, params):
0389 
0390         self.ntranslated = 0
0391 
0392     def process (self, msg, cat):
0393 
0394         if msg.translated:
0395             self.ntranslated += 1
0396 
0397     def finalize (self):
0398 
0399         report("Total translated: %d" % self.ntranslated)
0400 </programlisting>
0401 The constructor takes as argument an object specifying any sieve parameters (more on that soon). The <methodname>process</methodname> method gets called for each message in each PO file processed by the client, and must take as parameters the message (instance of <ulink url="&ap;message.Message_base&acc;"><classname>Message_base</classname></ulink>) and the catalog which contains it (<ulink url="&ap;catalog.Catalog&acc;"><classname>Catalog</classname></ulink>). The client calls the <methodname>finalize</methodname> method once no more messages will be fed to the sieve, but this method does not need to be defined (the client should check whether it exists before placing the call).</para>
0402 
0403 <para>Another optional method is <methodname>process_header</methodname>, which the client calls on the PO header:
0404 <programlisting language="python">
0405 def process_header (self, hdr, cat):
0406     # ...
0407 </programlisting>
0408 <literal>hdr</literal> is an instance of <ulink url="&ap;header.Header&acc;"><classname>Header</classname></ulink>, and <literal>cat</literal> is the containing catalog. The client will check for the presence of this method, and if it is defined, it will call it prior to any <methodname>process</methodname> call on the messages from the given catalog. In other words, the client is not allowed to switch catalogs between two calls to <methodname>process</methodname> without calling <methodname>process_header</methodname> in between.</para>
0409 
0410 <para>There is also the optional <methodname>process_header_last</methodname> method, for which everything holds just like for <methodname>process_header</methodname>, except that, when present, the client must call it <emphasis>after</emphasis> all consecutive <methodname>process</methodname> calls on messages from the same catalog:
0411 <programlisting language="python">
0412 def process_header_last (self, hdr, cat):
0413     # ...
0414 </programlisting>
0415 </para>
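<para>The calling order described above can be sketched as a small client loop. This is only an illustration under simplifying assumptions, not the actual <command>posieve</command> code: the <function>run_sieve</function> function is hypothetical, and the catalog is a stand-in object with <literal>name</literal>, <literal>header</literal> and <literal>messages</literal> attributes rather than a real <classname>Catalog</classname>.</para>

```python
from types import SimpleNamespace


class Sieve (object):
    """A sieve which only records the order in which it gets called."""

    def __init__ (self, params):
        self.calls = []

    def process_header (self, hdr, cat):
        self.calls.append("header:" + cat.name)

    def process (self, msg, cat):
        self.calls.append("msg:" + msg.msgid)

    def process_header_last (self, hdr, cat):
        self.calls.append("header_last:" + cat.name)

    def finalize (self):
        self.calls.append("finalize")


def run_sieve (sieve, catalogs):
    """Hypothetical client loop; optional methods are checked for first."""
    for cat in catalogs:
        if hasattr(sieve, "process_header"):
            sieve.process_header(cat.header, cat)
        for msg in cat.messages:
            sieve.process(msg, cat)
        if hasattr(sieve, "process_header_last"):
            sieve.process_header_last(cat.header, cat)
    if hasattr(sieve, "finalize"):
        sieve.finalize()


# Stand-in catalog, not a real pology.catalog.Catalog.
cat = SimpleNamespace(
    name="foo.po", header=SimpleNamespace(),
    messages=[SimpleNamespace(msgid="a"), SimpleNamespace(msgid="b")])

sieve = Sieve(SimpleNamespace())
run_sieve(sieve, [cat])
print(sieve.calls)
```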
0416 
0417 <para>Sieve methods should not abort program execution in case of errors; instead, they should raise an exception. In particular, if the <methodname>process</methodname> method raises <ulink url="&ap;sieve.SieveMessageError&acc;"><classname>SieveMessageError</classname></ulink>, it means that the sieve can still process other messages in the same catalog; if it raises <ulink url="&ap;sieve.SieveCatalogError&acc;"><classname>SieveCatalogError</classname></ulink>, then any following messages from the same catalog must be skipped, but other catalogs may be processed. Similarly, if <methodname>process_header</methodname> raises <classname>SieveCatalogError</classname>, other catalogs may still be processed. Any other type of exception tells the client that the sieve should no longer be used.</para>
0418 
0419 <para>The <methodname>process</methodname> and <methodname>process_header</methodname> methods should either return <literal>None</literal> or an integer exit code. A return value which is neither <literal>None</literal> nor <literal>0</literal> indicates that while the evaluation was successful (no exception was raised), the processed entry (message or header) should not be passed further along the <link linkend="sec-svchains">sieve chain</link>.</para>
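<para>How a client might act on these exceptions and return values is sketched below. The sketch makes several assumptions: the two exception classes are local stand-ins for those in the <literal>sieve</literal> module so that the code runs on its own, the <function>process_catalog</function> function and the two demonstration sieves are hypothetical, and messages are reduced to plain strings.</para>

```python
# Local stand-ins for the exception classes of the same names
# in the pology.sieve module, so that the sketch runs on its own.
class SieveMessageError (Exception):
    pass

class SieveCatalogError (Exception):
    pass


def process_catalog (sieves, messages, cat):
    """Hypothetical client code feeding messages through a sieve chain.

    Returns (message, sieve index) pairs where SieveMessageError was raised.
    """
    skipped = []
    for msg in messages:
        for i, sieve in enumerate(sieves):
            try:
                ret = sieve.process(msg, cat)
            except SieveMessageError:
                # This message is abandoned, the next one may proceed.
                skipped.append((msg, i))
                break
            except SieveCatalogError:
                # The rest of this catalog is abandoned.
                return skipped
            # Any other exception propagates: the sieve is unusable.
            if ret is not None and ret != 0:
                # Success, but the message goes no further in the chain.
                break
    return skipped


class StopOnEmpty (object):
    def process (self, msg, cat):
        if msg == "":
            return 1  # keep empty messages away from later sieves

class Recorder (object):
    def __init__ (self):
        self.seen = []
    def process (self, msg, cat):
        if msg == "bad":
            raise SieveMessageError("cannot handle this message")
        self.seen.append(msg)


recorder = Recorder()
skipped = process_catalog([StopOnEmpty(), recorder],
                          ["a", "", "bad", "b"], None)
print(recorder.seen, skipped)
```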
0420 
0421 </sect2>
0422 
0423 <sect2 id="sec-prsvparams">
0424 <title>Sieve Parameter Handling</title>
0425 
0426 <para>The <literal>params</literal> parameter of the sieve constructor is an object whose data attributes are the <link linkend="p-svparam">parameters which may influence</link> the sieve operation. The sieve file can define the <function>setup_sieve</function> function, which the client will call with
0427 a <ulink url="&ap;subcmd.SubcmdView&acc;"><classname>SubcmdView</classname></ulink> object as the single argument, to fill in the sieve description and define all mandatory and optional parameters. For example, if the sieve takes an optional parameter named <literal>checklevel</literal>, which controls the level (an integer) at which to perform some checks, here is what <function>setup_sieve</function> could look like:
0428 <programlisting language="python">
0429 def setup_sieve (p):
0430 
0431     p.set_desc("An example sieve.")
0432     p.add_param("checklevel", int, defval=0,
0433                 desc="Validity checking level.")
0434 
0435 
0436 class Sieve (object):
0437 
0438     def __init__ (self, params):
0439 
0440         if params.checklevel >= 1:
0441             pass  # ...set up some level 1 validity checks...
0442         if params.checklevel >= 2:
0443             pass  # ...set up some level 2 validity checks...
0444         #...
0445 
0446     ...
0447 </programlisting>
0448 See the <ulink url="&ap;subcmd.SubcmdView&ac;add_param"><methodname>add_param</methodname></ulink> method for details on defining sieve parameters.</para>
0449 
0450 <para>The client is not obliged to call <function>setup_sieve</function>, but it must make sure that the object it sends to the sieve as <literal>params</literal> has all the instance variables corresponding to the defined parameters.</para>
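<para>One way a client could build such a <literal>params</literal> object is sketched below. The <classname>ParamRecorder</classname> class is a hypothetical, much reduced stand-in for <classname>SubcmdView</classname>: it supports only the calls used in the example above and makes no claim about the real class's full interface. Raw parameter values are assumed to arrive as strings.</para>

```python
from types import SimpleNamespace


class ParamRecorder (object):
    """Hypothetical, much reduced stand-in for SubcmdView."""

    def __init__ (self):
        self.desc = None
        self.params = {}

    def set_desc (self, desc):
        self.desc = desc

    def add_param (self, name, ptype, defval=None, desc=None):
        self.params[name] = (ptype, defval)

    def build (self, rawvals):
        """Convert raw string values and fill in the defaults."""
        vals = {}
        for name, (ptype, defval) in self.params.items():
            if name in rawvals:
                vals[name] = ptype(rawvals[name])
            else:
                vals[name] = defval
        return SimpleNamespace(**vals)


def setup_sieve (p):
    p.set_desc("An example sieve.")
    p.add_param("checklevel", int, defval=0,
                desc="Validity checking level.")


recorder = ParamRecorder()
setup_sieve(recorder)

# Every defined parameter becomes an attribute, given or not.
params_default = recorder.build({})
params_given = recorder.build({"checklevel": "2"})
print(params_default.checklevel, params_given.checklevel)
```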
0451 
0452 </sect2>
0453 
0454 <sect2 id="sec-prsvregime">
0455 <title>Catalog Regime Indicators</title>
0456 
0457 <para>There are two boolean instance variables that the sieve may define, and
0458 which the client may check for to decide on the regime in which the
0459 catalogs are opened and closed:
0460 <programlisting language="python">
0461 class Sieve (object):
0462 
0463     def __init__ (self, params):
0464 
0465         # These are the defaults:
0466         self.caller_sync = True
0467         self.caller_monitored = True
0468 
0469     ...
0470 </programlisting>
0471 The variables are:
0472 <itemizedlist>
0473 
0474 <listitem>
0475 <para><varname>caller_sync</varname> instructs the client whether catalogs processed by the sieve should be synced to disk at the end. If the sieve does not define this variable, the client should assume <literal>True</literal> and sync catalogs. This variable is typically set to <literal>False</literal> in sieves which do not modify anything, because syncing catalogs takes time.</para>
0476 </listitem>
0477 
0478 <listitem>
0479 <para><varname>caller_monitored</varname> tells the client whether it should open catalogs in monitored mode. If this variable is not set, the client should assume it to be <literal>True</literal>. This is another way of reducing processing time for sieves which do not modify PO entries.</para>
0480 </listitem>
0481 
0482 </itemizedlist>
0483 </para>
0484 
0485 <para>Usually a modifying sieve will set neither of these variables, i.e. catalogs will be monitored and synced by default, while a checker sieve will set both to <literal>False</literal>. For a modifying sieve that unconditionally modifies all entries sent to it, only <varname>caller_monitored</varname> may be set to <literal>False</literal> and <varname>caller_sync</varname> left undefined (i.e. <literal>True</literal>).</para>
0486 
0487 <para>If a sieve requests no monitoring or no syncing, the client is not obliged to satisfy these requests. On the other hand, if a sieve does request monitoring or syncing (either explicitly or by not defining the corresponding variables), the client must provide catalogs in that regime. This is because there may be several sieves operating at the same time (a sieve chain), and monitoring and syncing are usually necessary for proper operation of those sieves that request them.</para>
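<para>The rule for combining the indicators of several sieves can be sketched as follows. The <function>catalog_regime</function> function and the two sieve classes are hypothetical; a real client may of course decide to monitor and sync regardless.</para>

```python
def catalog_regime (sieves):
    """Return the (monitored, sync) regime needed by a chain of sieves.

    A missing indicator counts as True, and a single sieve which
    needs monitoring or syncing decides for the whole chain.
    """
    monitored = any(getattr(s, "caller_monitored", True) for s in sieves)
    sync = any(getattr(s, "caller_sync", True) for s in sieves)
    return monitored, sync


class Checker (object):
    """A sieve which modifies nothing and says so."""

    def __init__ (self):
        self.caller_sync = False
        self.caller_monitored = False

class Modifier (object):
    """A sieve which defines neither indicator (both default to True)."""


only_checkers = catalog_regime([Checker(), Checker()])
mixed_chain = catalog_regime([Checker(), Modifier()])
print(only_checkers, mixed_chain)
```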
0488 
0489 </sect2>
0490 
0491 <sect2 id="sec-prsvnotes">
0492 <title>Further Notes on Sieves</title>
0493 
0494 <para>Since monitored catalogs have modification counters, the sieve may use them within its <methodname>process*</methodname> methods to find out if any modification really took place. The proper way to do this is to record the counter at the start, and check for an increase at the end:
0495 <programlisting language="python">
0496 def process (self, msg, cat):
0497 
0498     startcount = msg.modcount
0499 
0500     # ...
0501     # ... do some stuff
0502     # ...
0503 
0504     if msg.modcount > startcount:
0505         self.nmodified += 1
0506 </programlisting>
0507 The <emphasis>wrong</emphasis> way to do it would be to merely check if <literal>msg.modcount > 0</literal>, because several modifying sieves may be operating at the same time, each increasing the counters.</para>
0508 
0509 <para>If the sieve wants to remove the message from the catalog, if at all possible it should use the catalog's <methodname>remove_on_sync</methodname> method instead of <methodname>remove</methodname>, to defer actual removal to sync time. This is because <methodname>remove</methodname> will probably ruin the client's iteration over the catalog, so if it must be used, the sieve documentation should state it clearly. <methodname>remove</methodname> also has linear execution time, while <methodname>remove_on_sync</methodname> runs in constant time.</para>
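<para>The danger of immediate removal during iteration can be illustrated with a plain Python list. This is only an analogy: the catalog class has its own internals, and the entry names below are invented.</para>

```python
# Removing from a list while iterating over it skips elements.
entries = ["a", "obsolete-1", "obsolete-2", "b"]
for entry in entries:
    if entry.startswith("obsolete"):
        entries.remove(entry)  # linear time, and shifts later elements

# Deferred removal: only mark during iteration, remove afterwards.
deferred = ["a", "obsolete-1", "obsolete-2", "b"]
marked = set()
for entry in deferred:
    if entry.startswith("obsolete"):
        marked.add(entry)  # constant time, iteration is undisturbed
deferred = [e for e in deferred if e not in marked]

print(entries)
print(deferred)
```

<para>In the first loop one of the obsolete entries survives, because the removal shifts it into the position which the iteration has already passed.</para>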
0510 
0511 <para>If the sieve is to become part of the Pology distribution, it should be properly documented. This means a fully equipped <function>setup_sieve</function> function in the sieve file, and a piece of user manual documentation.
0512 The <classname>Sieve</classname> class itself should not be documented in general. Only when the <methodname>process*</methodname> methods return an exit code should this be stated in their own comments (and in the user manual).</para>
0513 
0514 </sect2>
0515 
0516 </sect1>
0517 
0518 <!-- ======================================== -->
0519 <sect1 id="sec-prhooks">
0520 <title>Writing Hooks</title>
0521 
0522 <para>Hooks are functions with specified sets of input parameters, return values, processing intent, and behavioral constraints. They can be used as modification and validation plugins in many processing contexts in Pology. There are three broad categories of hooks: filtering, validation and side-effect hooks.</para>
0523 
0524 <para>Filtering hooks modify some of their inputs. Modifications are done in-place whenever the input is mutable (like a PO message), otherwise the modified input is provided in a return value (like a PO message text field).</para>
0525 
0526 <para>Validation hooks perform certain checks on their inputs, and return a list of <emphasis>annotated spans</emphasis> or <emphasis>annotated parts</emphasis>, which record all the encountered errors:
0527 <itemizedlist>
0528 
0529 <listitem>
0530 <para id="p-annspans">Annotated spans are reported when the object of validation is a piece of text. Each span is a tuple of start and end index of the problematic segment in the text, and a note which explains the problem. The return value of a text-validation hook will thus be a list:
0531 <programlisting language="python">
0532 [(start1, end1, "note1"), (start2, end2, "note2"), ...]
0533 </programlisting>
0534 The note can also be <literal>None</literal>, if there is nothing to say about the problem.</para>
0535 </listitem>
0536 
0537 <listitem>
0538 <para id="p-annparts">Annotated parts are reported for an object which has more than one distinct piece of text, such as a PO message. Each annotated part is a tuple stating the name of the problematic part of the object (e.g. <literal>"msgid"</literal>, <literal>"msgstr"</literal>), the item index for array-like parts (e.g. for <literal>msgstr</literal>), and the list of problems in appropriate form (for a PO message this is a list of annotated spans).
0539 The return value of a PO message-validation hook will look like this:
0540 <programlisting language="python">
0541 [("part1", item1, [(start11, end11, "note11"), ...]),
0542  ("part2", item2, [(start21, end21, "note21"), ...]),
0543  ...]
0544 </programlisting>
0545 </para>
0546 </listitem>
0547 
0548 </itemizedlist>
0549 </para>
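<para>As an illustration of annotated spans, here is a minimal sketch of a text-validation hook which reports runs of repeated spaces. The hook name and the note text are invented for this example.</para>

```python
import re


def check_double_space (text):
    """Report runs of two or more spaces in the text [type V1A hook].

    @return: spans
    """
    spans = []
    for m in re.finditer(r" {2,}", text):
        # Start index, end index, and a note explaining the problem.
        spans.append((m.start(), m.end(), "Repeated space."))
    return spans


spans = check_double_space("One  two three   four.")
clean = check_double_space("One two three four.")
print(spans)
```

<para>A text without problems yields an empty list, so the client can test the return value for falsity.</para>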
0550 
0551 <para>Side-effect hooks neither modify their inputs nor report validation information, but can be used for any purpose that is independent of the processing chain into which the hook is inserted. For example, a validation hook can be implemented like this as well, when it is enough that it reports problems to standard output, or where the hook client does not know how to use structured validation data (annotated spans or parts). The return value of a side-effect hook is the number of errors encountered internally by the hook (an integer). Clients may use this number to decide upon further behavior. For example, if a side-effect hook modified a temporary copy of a file, the client may decide to abandon the result and use the original file if there were some errors.</para>
0552 
0553 <sect2 id="sec-prhktypes">
0554 <title>Hook Taxonomy</title>
0555 
0556 <para>In this section a number of hook types are described and assigned a formal
0557 type keyword, so that they can be conveniently referred to elsewhere in Pology documentation.</para>
0558 
0559 <para>Each type keyword has the form <emphasis>&lt;letter1&gt;&lt;number&gt;&lt;letter2&gt;</emphasis>, e.g. F1A. The first letter represents the hook category: <emphasis>F</emphasis> for filtering hooks, <emphasis>V</emphasis> for validation hooks, and <emphasis>S</emphasis> for side-effect hooks. The number enumerates the input signature by parameter types, and the final letter denotes the difference in semantics of input parameters for equal input signatures. As a handy mnemonic, each type is also given an informal signature in the form of <literal>(param1, param2, ...) -> result</literal>; in them, <literal>spans</literal> stand for <link linkend="p-annspans">annotated spans</link>, <literal>parts</literal> for <link linkend="p-annparts">annotated parts</link>, and <literal>numerr</literal> for number of errors.</para>
0560 
0561 <para>Hooks on pure text:
0562 <itemizedlist>
0563 <listitem>
0564 <para>F1A (<literal>(text) -> text</literal>): filters the text</para>
0565 </listitem>
0566 <listitem>
0567 <para>V1A (<literal>(text) -> spans</literal>): validates the text</para>
0568 </listitem>
0569 <listitem>
0570 <para>S1A (<literal>(text) -> numerr</literal>): side-effects on text</para>
0571 </listitem>
0572 </itemizedlist>
0573 </para>
0574 
0575 <para>Hooks on text fields in a PO message in a catalog:
0576 <itemizedlist>
0577 <listitem>
0578 <para>F3A (<literal>(text, msg, cat) -> text</literal>): filters any text field</para>
0579 </listitem>
0580 <listitem>
0581 <para>V3A (<literal>(text, msg, cat) -> spans</literal>): validates any text field</para>
0582 </listitem>
0583 <listitem>
0584 <para>S3A (<literal>(text, msg, cat) -> numerr</literal>): side-effects on any text field</para>
0585 </listitem>
0586 <listitem>
0587 <para>F3B (<literal>(msgid, msg, cat) -> msgid</literal>): filters an original text field; original fields are either <literal>msgid</literal> or <literal>msgid_plural</literal></para>
0588 </listitem>
0589 <listitem>
0590 <para>V3B (<literal>(msgid, msg, cat) -> spans</literal>): validates an original text field</para>
0591 </listitem>
0592 <listitem>
0593 <para>S3B (<literal>(msgid, msg, cat) -> numerr</literal>): side-effects on an original text field</para>
0594 </listitem>
0595 <listitem>
0596 <para>F3C (<literal>(msgstr, msg, cat) -> msgstr</literal>): filters a translation text field; translation fields are the <literal>msgstr</literal> array</para>
0597 </listitem>
0598 <listitem>
0599 <para>V3C (<literal>(msgstr, msg, cat) -> spans</literal>): validates a translation text field</para>
0600 </listitem>
0601 <listitem>
0602 <para>S3C (<literal>(msgstr, msg, cat) -> numerr</literal>): side-effects on a translation text field</para>
0603 </listitem>
0604 </itemizedlist>
0605 </para>
0606 
0607 <para>The *3B and *3C hook series are introduced next to *3A for cases when it only makes sense for the text field to be one of the original fields, or one of the translation fields. For example, to process the translation, sometimes the original (obtained through the <literal>msg</literal> parameter) must be consulted. If a *3B or *3C hook is applied to an inappropriate text field, the results are undefined.</para>
0608 
0609 <para>Hooks on PO entries in a catalog:
0610 <itemizedlist>
0611 <listitem>
0612 <para>F4A (<literal>(msg, cat) -> numerr</literal>): filters a message, modifying it</para>
0613 </listitem>
0614 <listitem>
0615 <para>V4A (<literal>(msg, cat) -> parts</literal>): validates a message</para>
0616 </listitem>
0617 <listitem>
0618 <para>S4A (<literal>(msg, cat) -> numerr</literal>): side-effects on a message (no modification)</para>
0619 </listitem>
0620 <listitem>
0621 <para>F4B (<literal>(hdr, cat) -> numerr</literal>): filters a header, modifying it</para>
0622 </listitem>
0623 <listitem>
0624 <para>V4B (<literal>(hdr, cat) -> parts</literal>): validates a header</para>
0625 </listitem>
0626 <listitem>
0627 <para>S4B (<literal>(hdr, cat) -> numerr</literal>): side-effects on a header (no modification)</para>
0628 </listitem>
0629 </itemizedlist>
0630 </para>
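<para>A message-validation hook returning annotated parts could look like the following sketch. The hook itself is hypothetical, and the message is a simple stand-in object with <literal>msgid</literal> and <literal>msgstr</literal> attributes rather than a real <classname>Message_base</classname> instance.</para>

```python
from types import SimpleNamespace


def check_trailing_space (msg, cat):
    """Report trailing spaces which the original lacks [type V4A hook].

    @return: parts
    """
    parts = []
    if msg.msgid.endswith(" "):
        return parts
    for i, text in enumerate(msg.msgstr):
        stripped = text.rstrip(" ")
        if len(text) > len(stripped):
            spans = [(len(stripped), len(text),
                      "Superfluous trailing space.")]
            # Part name, item index in the msgstr array, spans.
            parts.append(("msgstr", i, spans))
    return parts


# Stand-in message with three plural forms of translation.
msg = SimpleNamespace(
    msgid="%d file", msgid_plural="%d files",
    msgstr=["%d fajl ", "%d fajla", "%d fajlova  "])

parts = check_trailing_space(msg, None)
print(parts)
```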
0631 
0632 <para>Hooks on PO catalogs:
0633 <itemizedlist>
0634 <listitem>
0635 <para>F5A (<literal>(cat) -> numerr</literal>): filters a catalog, modifying it in any way</para>
0636 </listitem>
0637 <listitem>
0638 <para>S5A (<literal>(cat) -> numerr</literal>): side-effects on a catalog (no modification)</para>
0639 </listitem>
0640 </itemizedlist>
0641 </para>
0642 
0643 <para>Hooks on file paths:
0644 <itemizedlist>
0645 <listitem>
0646 <para>F6A (<literal>(filepath) -> numerr</literal>): filters a file, modifying it in any way</para>
0647 </listitem>
0648 <listitem>
0649 <para>S6A (<literal>(filepath) -> numerr</literal>): side-effects on a file, no modification</para>
0650 </listitem>
0651 </itemizedlist>
0652 </para>
0653 
0654 <para>The *2* hook series (with signatures <literal>(text, msg) -> ...</literal>) has been skipped because no need for it has been observed so far alongside the *3* hooks.</para>
0655 
0656 </sect2>
0657 
0658 <sect2 id="sec-prhkfact">
0659 <title>Hook Factories</title>
0660 
0661 <para>Since hooks have fixed input signatures by type, the way to customize
0662 a given hook behavior is to produce its function by another function.
0663 The hook-producing function is called a <emphasis>hook factory</emphasis>. It works by
0664 preparing anything needed for the hook, and then defining the hook proper
0665 and returning it, thereby creating a lexical closure around it:
0666 <programlisting language="python">
0667 def hook_factory (param1, param2, ...):
0668 
0669     # Use param1, param2, ... to prepare for hook definition.
0670 
0671     def hook (...):
0672 
0673         # Perhaps use param1, param2, ... in the hook definition too.
0674 
0675     return hook
0676 </programlisting>
0677 </para>
0678 
0679 <para>In fact, most internal Pology hooks are defined by factories.</para>
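<para>To make the pattern concrete, here is a minimal sketch of a factory which produces an F1A hook. The factory name and its parameters are invented for this example and do not refer to any existing Pology hook.</para>

```python
def replace_text (old, new):
    """Produce a hook which replaces one substring by another [hook factory].

    @return: type F1A hook
    @rtype: C{(text) -> text}
    """
    # Preparation and validation happen once, in the factory.
    if not old:
        raise ValueError("The text to replace must not be empty.")

    def hook (text):
        # The hook proper sees old and new through the closure.
        return text.replace(old, new)

    return hook


fix_name = replace_text("KDE4", "KDE 4")
result = fix_name("Welcome to KDE4.")
print(result)
```

<para>The hook keeps the fixed <literal>(text) -> text</literal> signature of its type, while all customization is carried by the arguments to the factory.</para>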
0680 
0681 </sect2>
0682 
0683 <sect2 id="sec-prhknotes">
0684 <title>Further Notes on Hooks</title>
0685 
0686 <para>General hooks should be defined in top level modules, language-dependent hooks in <literal>lang.<replaceable>code</replaceable>.<replaceable>module</replaceable></literal>,
0687 project-dependent hooks in <literal>proj.<replaceable>name</replaceable>.<replaceable>module</replaceable></literal>,
0688 and hooks that are both language- and project-dependent in <literal>lang.<replaceable>code</replaceable>.proj.<replaceable>name</replaceable>.<replaceable>module</replaceable></literal>. Hooks placed like this can be fetched by <ulink url="&ap;getfunc&am;get_hook_ireq"><function>getfunc.get_hook_ireq</function></ulink> in various non-code contexts, in particular from Pology utilities which allow users to insert hooks into processing through command line options or configurations. If the complete module is dedicated to a single hook, the hook function (or factory) should be named the same as the module, so that users can select it by giving only the hook module name.</para>
0689 
0690 <para><link linkend="p-annparts">Annotated parts</link> for PO messages returned by hooks are a reduced but valid instance of highlight specifications used by reporting functions, e.g. <ulink url="&ap;msgreport&am;report_msg_content"><function>msgreport.report_msg_content</function></ulink>. Annotated parts do not have the optional fourth element of a tuple in highlight specification, which is used to provide the filtered text against which spans were constructed, instead of the original text. If a validation hook constructs the list of problematic spans against the filtered text, just before returning it can apply <ulink url="&ap;diff&am;adapt_spans"><function>diff.adapt_spans</function></ulink>
0691 to reconstruct the spans against the original text.</para>
0692 
0693 <para>The documentation to a hook function should state the hook type within the short description, in square brackets at the end as <literal>[type ... hook]</literal>. Input parameters should be named like in the informal signatures in the taxonomy above, and should not be omitted in <literal>@param:</literal> Epydoc entries; but the return should be given under <literal>@return:</literal>, also using one of the listed return names, in order to complete the hook signature.</para>
0694 
0695 <para>The documentation to a hook factory should have <literal>[hook factory]</literal> at the end of the short description. It should normally list all the input parameters, while the return value should be given as <literal>@return: type ... hook</literal>, and
0696 the hook signature as the <literal>@rtype:</literal> Epydoc field.</para>
0697 
0698 </sect2>
0699 
0700 </sect1>
0701 
0702 <!-- ======================================== -->
0703 <sect1 id="sec-prascsel">
0704 <title>Writing Ascription Selectors</title>
0705 
0706 <para>Ascription selectors are functions used by <command>poascribe</command> in the translation review workflow as described in <xref linkend="ch-ascript"/>. This section describes how you can write your own ascription selector, which you can then put to use by following the instructions in <xref linkend="sec-asccustsels"/>.</para>
0707 
0708 <para>In terms of code, an ascription selector is a function factory, which constructs the actual selector function based on the supplied selector arguments. It has the following form:
0709 <programlisting language="python">
0710 # Selector factory.
0711 def selector_foo (args):
0712 
0713     # Validate input arguments.
0714     if (...):
0715         raise PologyError(...)
0716 
0717     # Prepare selector definition.
0718     ...
0719 
0720     # The selector function itself.
0721     def selector (msg, cat, ahist, aconf):
0722 
0723         # Prepare selection process.
0724         ...
0725 
0726         # Iterate through ascription history looking for something.
0727         for i, asc in enumerate(ahist):
0728             ...
0729 
0730         # Return False or True if a shallow selector,
0731         # and 0 or 1-based history index if history selector.
0732         return ...
0733 
0734     return selector
0735 </programlisting>
0736 It is customary to name the selector factory <function>selector_<replaceable>something</replaceable></function>, where <replaceable>something</replaceable> will also be used as the selector name (on the command line, etc.). The input <varname>args</varname> parameter is always a list of strings. It should first be validated, insofar as possible without having in hand the particular message, catalog, ascription history or ascription configuration. Whatever does not depend on any of these can also be precomputed for later use in the selector function.</para>
0737 
0738 <para>The selector function takes as arguments the message (an instance of <ulink url="&ap;message.Message_base&acc;"><classname>Message_base</classname></ulink>), the catalog (<ulink url="&ap;catalog.Catalog&acc;"><classname>Catalog</classname></ulink>) it comes from, the ascription history (list of <ulink url="&ap;ascript.AscPoint&acc;"><classname>AscPoint</classname></ulink> objects), and the ascription configuration (<ulink url="&ap;ascript.AscConfig&acc;"><classname>AscConfig</classname></ulink>). For the most part, <classname>AscPoint</classname> and <classname>AscConfig</classname> are simple attribute objects; check their API documentation for the list and description of attributes. Some of the attributes of <classname>AscPoint</classname> objects that you will usually inspect are <varname>.msg</varname> (the historical version of the message), <varname>.user</varname> (the user to whom the ascription was made), or <varname>.type</varname> (the type of the ascription, one of <varname>AscPoint.ATYPE_*</varname> constants). The ascription history is sorted from the latest to the earliest ascription. If the <varname>.user</varname> of the first entry in the history is <literal>None</literal>, that means that the current version of the message has not been ascribed yet (e.g. if its translation has been modified compared to the latest ascribed version). If you are writing a shallow selector, it should return <literal>True</literal> to select the message, or <literal>False</literal> otherwise. In a history selector, the return value should be a 1-based index of an entry in the ascription history which caused the message to be selected, or <literal>0</literal> if the message was not selected.<footnote>
0739 <para>In this way the history selector can automatically behave as shallow selector as well, because simply testing for falsity on the return value will show whether the message has been selected or not.</para>
0740 </footnote></para>
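<para>A complete, if simplistic, history selector following this description might look like the sketch below. The selector is hypothetical, the history entries are stand-in objects carrying only the <varname>.user</varname> attribute, and <classname>ValueError</classname> is raised instead of <classname>PologyError</classname> only so that the sketch runs without Pology.</para>

```python
from types import SimpleNamespace


def selector_byuser (args):
    """Hypothetical history selector: ascriptions made to a given user."""

    # Validate input arguments.
    if len(args) != 1:
        raise ValueError("Exactly one argument (user name) is expected.")
    user = args[0]

    def selector (msg, cat, ahist, aconf):
        # The history runs from the latest to the earliest ascription.
        for i, asc in enumerate(ahist):
            if asc.user == user:
                return i + 1  # 1-based index of the matching entry
        return 0  # message not selected

    return selector


# Stand-in history; the first entry is a not yet ascribed version.
ahist = [SimpleNamespace(user=None),
         SimpleNamespace(user="alice"),
         SimpleNamespace(user="bob")]

index_bob = selector_byuser(["bob"])(None, None, ahist, None)
index_carol = selector_byuser(["carol"])(None, None, ahist, None)
print(index_bob, index_carol)
```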
0741 
0742 <para>The entry index returned by history selectors is used to compute the embedded difference from a historical to the current version of the message, e.g. on <literal>poascribe diff</literal>. Note that <command>poascribe</command> will actually take as base for differencing the first non-fuzzy historical message <emphasis>after</emphasis> the indexed one, because it is assumed that the historical message which triggered the selection already contains some changes to be inspected. (When this behavior is not sufficient, <command>poascribe</command> lets the user specify a second history selector, which directly selects the historical message to base the difference on.)</para>
0743 
0744 <para>Most of the time the selector will operate on messages covered by a single ascription configuration, which means that the ascription configuration argument sent to it will always be the same. On the other hand, the resolution of some of the arguments to the selector factory will depend only on the ascription configuration (e.g. a list of users). In this scenario, it would be a waste of performance if such arguments were resolved anew in each call to the selector. You could instead write a small caching (memoizing) resolver function, which when called for the second and subsequent times with the same configuration object, returns the previously resolved argument value from the cache. A few such caching resolvers for some common arguments have been provided in the <ulink url="&ap;ascript&amm;"><literal>ascript</literal></ulink> module, functions named <function>cached_*()</function> (e.g. <ulink url="&ap;ascript&am;cached_users"><function>cached_users()</function></ulink>).</para>
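<para>A caching resolver of this kind can be sketched in a few lines. Everything here is hypothetical: the <function>cached_user_set</function> function does not reproduce the signature of any function in the <literal>ascript</literal> module, and <classname>Config</classname> is a stand-in for the ascription configuration, assumed to carry a list of known users.</para>

```python
class Config (object):
    """Stand-in for an ascription configuration object."""

    def __init__ (self, users):
        self.users = users


_resolved = {}
resolve_count = [0]  # only to demonstrate that the cache is used


def cached_user_set (aconf, userspec):
    """Hypothetical caching resolver of a comma-separated user list.

    The cache key includes the configuration object itself, so the
    same specification is resolved only once per configuration.
    """
    key = (aconf, userspec)
    if key not in _resolved:
        resolve_count[0] += 1
        names = [x.strip() for x in userspec.split(",")]
        unknown = [x for x in names if x not in aconf.users]
        if unknown:
            raise ValueError("Unknown users: %s" % ", ".join(unknown))
        _resolved[key] = frozenset(names)
    return _resolved[key]


aconf = Config(users=["alice", "bob", "carol"])
first = cached_user_set(aconf, "alice, bob")
second = cached_user_set(aconf, "alice, bob")
print(sorted(first), resolve_count[0])
```

<para>Inside a selector function, such a resolver would be called with the <varname>aconf</varname> argument on every invocation, at the cost of a single dictionary lookup after the first call.</para>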
0745 
0746 </sect1>
0747 
0748 </chapter>