0001 <?xml version="1.0" encoding="UTF-8"?>
0002 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
0003  "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
0004 
0005 <chapter id="ch-misctools">
0006 <title>Miscellaneous Tools</title>
0007 
0008 <para>This chapter describes various smaller standalone tools in Pology, which do not introduce any major PO processing concepts nor can be grouped under a common topic.</para>
0009 
0010 <!-- ======================================== -->
0011 <sect1 id="sec-mirewrap">
0012 <title>Rewrapping PO Files with <command>porewrap</command></title>
0013 
<para>The <command>porewrap</command> script does one simple thing: it rewraps message strings (<varname>msgid</varname>, <varname>msgstr</varname>, etc.) in PO files. Gettext's tools, e.g. <command>msgcat</command>, can be used for rewrapping as well, so why does <command>porewrap</command> exist? The lesser reason is convenience: an arbitrary number of PO file paths can be given to it as arguments, as well as directory paths, which will be recursively searched for PO files. The more important reason is that Pology can also perform "fine" wrapping, as described in <xref linkend="sec-cmwrap"/>. Thus, running:
0015 <programlisting language="bash">
0016 $ porewrap --no-wrap --fine-wrap somedir/
0017 </programlisting>
0018 will rewrap all PO files found in <filename>somedir/</filename> and below, such that basic wrapping (on column) is disabled (<literal>--no-wrap</literal>), while fine wrapping (on logical breaks) is enabled (<literal>--fine-wrap</literal>).</para>
0019 
<para>Besides command line options, <command>porewrap</command> also consults the PO file header and the user configuration to determine the wrapping mode. Command line options have the highest priority, followed by the PO header, with the user configuration last. For details on how to set the wrapping mode in PO headers, see the description of the <literal>X-Wrapping</literal> header field in <xref linkend="sec-cmheader"/>. If none of these sources specifies the wrapping mode, <command>porewrap</command> will apply basic wrapping.</para>
0021 
0022 <sect2 id="sec-mirwopts">
0023 <title>Command Line Options</title>
0024 
0025 <para>Options specific to <command>porewrap</command>:
0026 <variablelist>
0027 
0028 <varlistentry>
0029 <term><option>-v</option>, <option>--verbose</option></term>
0030 <listitem>
<para>Since <command>porewrap</command> just opens and writes back all the PO files given to it, it normally does not report anything. With this option, it reports the path of each PO file as it is written out.</para>
0032 </listitem>
0033 </varlistentry>
0034 
0035 </variablelist>
0036 </para>
0037 
0038 <para>
0039 Options common with other Pology tools:
0040 <variablelist>
0041 
0042 <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
0043             href="stdopt-wrapping.docbook"/>
0044 
0045 <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
0046             href="stdopt-filesfrom.docbook"/>
0047 
0048 </variablelist>
0049 </para>
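
<para>As a hedged illustration that uses only options mentioned in this section, the following run would request fine wrapping for one file and one directory tree, and report each file as it is written out; <filename>foo.po</filename> and <filename>somedir/</filename> are placeholder paths:
<programlisting language="bash">
$ porewrap -v --fine-wrap foo.po somedir/
</programlisting>
Since <literal>--no-wrap</literal> is not given here, basic wrapping is not explicitly disabled on the command line.</para>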
0050 
0051 </sect2>
0052 
0053 <sect2 id="sec-mirwcfg">
0054 <title>User Configuration</title>
0055 
0056 <para><command>porewrap</command> reads the wrapping mode fields as described in <xref linkend="sec-cmwrcfg"/>, from its <literal>[porewrap]</literal> section.</para>
0057 
0058 </sect2>
0059 
0060 </sect1>
0061 
0062 <!-- ======================================== -->
0063 <sect1 id="sec-miselfmerge">
0064 <title>Self-Merging PO Files with <command>poselfmerge</command></title>
0065 
0066 <para>Normally, PO files are periodically merged with latest PO templates, to introduce changes from the source material while preserving as much of the existing translation as possible. <command>poselfmerge</command>, on the other hand, will merge the PO file with "itself". More precisely, it will derive the temporary template version of the PO file (by cleaning it from translations and other details), and then merge the original PO file with the derived template, by calling <command>msgmerge</command> internally. This can have several uses:
0067 <itemizedlist>
0068 
0069 <listitem>
<para>The fuzzy matching algorithm of <command>msgmerge</command> is extremely fast and robust, but treats all messages the same and in isolation, without trying out more complicated (and necessarily much slower) heuristic criteria. This can cause the translator to spend more time updating a fuzzy message than it would take to translate it from scratch. <command>poselfmerge</command> can therefore be instructed to go over all fuzzy messages created by merging, and apply additional heuristics to determine whether to leave the message fuzzy or to clean it up and make it fully untranslated.</para>
0071 </listitem>
0072 
0073 <listitem>
<para>Sometimes the PO file can contain a number of quite similar longer messages (this is especially the case when <link linkend="ch-summit">translating in summit</link>). A capable PO editor should automatically offer the previous translation on the next similar message (by using internal translation memory), and show what the small differences in the original text are, thus greatly speeding up the translation of that message. If, however, the PO editor is not that capable, or you use a plain text editor, while translating you can simply skip every long message that looks familiar, and afterwards run <command>poselfmerge</command> on the PO file to introduce fuzzy matches on those messages.</para>
0075 </listitem>
0076 
0077 <listitem>
0078 <para>More generally, if your PO editor does not have (a good enough) translation memory feature, or you edit PO files with a plain text editor, you can instruct <command>poselfmerge</command> to use one or more <emphasis>PO compendia</emphasis> to provide additional exact and fuzzy matches. This is essentially the batch application of translation memory. <xref linkend="sec-cbcompend"/> provides some hints on how to create and maintain PO compendia.</para>
0079 </listitem>
0080 
0081 </itemizedlist>
0082 </para>
0083 
0084 <para>Arguments to <command>poselfmerge</command> are any number of PO file paths or directories to search for PO files, which will be modified in place:
0085 <programlisting language="bash">
0086 $ poselfmerge foo.po bar.po somedir/
0087 </programlisting>
However, this run will do almost nothing (except possibly rewrap files), just as <command>msgmerge</command> would do nothing if the same template were used twice. Instead, all special processing must be requested by command line options, or activated through the <link linkend="sec-cmconfig">user configuration</link> to avoid issuing some options with the same values all the time.</para>
0089 
0090 <sect2 id="sec-mismopts">
0091 <title>Command Line Options</title>
0092 
0093 <para>Options specific to <command>poselfmerge</command>:
0094 <variablelist>
0095 
0096 <varlistentry>
0097 <term><option>-A <replaceable>RATIO</replaceable></option>, <option>--min-adjsim-fuzzy=<replaceable>RATIO</replaceable></option></term>
0098 <listitem>
<para>The minimum required "adjusted similarity" between the old and the new original text in a fuzzy message, in order to accept it and not clean it to untranslated state. The similarity is expressed as a ratio in the range 0.0-1.0, with 0.0 meaning no similarity and 1.0 no difference. A practical range is 0.6-0.8. If this option is not issued, fuzzy messages are kept as they are (as if 0.0 were given).</para>
0100 
0101 <para>The requirement for computation of adjusted similarity is that fuzzy messages contain previous strings, i.e. that the PO file was originally merged with <option>--previous</option> to <command>msgmerge</command>.</para>
0102 </listitem>
0103 </varlistentry>
0104 
0105 <varlistentry>
0106 <term><option>-b</option>, <option>--rebase-fuzzies</option></term>
0107 <listitem>
<para>Normally, when merging with a template, the untranslated and fuzzy messages already present in the PO file are not checked again for approximate matches. This is on the one hand a performance measure (why fuzzy match again something that was already matched before?), and on the other hand a safety measure (higher trust in an old fuzzy match based on the PO file itself than e.g. a new match from an arbitrary compendium). By issuing this option, prior to merging all untranslated messages are removed from the PO file, and all fuzzy messages which have previous strings are converted to obsolete previous messages. This activates fuzzy matching on untranslated messages (e.g. if a new compendium is given, or for similar messages skipped during translation), and puts possibly better previous strings on fuzzy messages (unless an exact match is found in the compendium).</para>
0109 </listitem>
0110 </varlistentry>
0111 
0112 <varlistentry>
0113 <term><option>-C <replaceable>POFILE</replaceable></option>, <option>--compendium=<replaceable>POFILE</replaceable></option></term>
0114 <listitem>
0115 <para>The PO file to use as compendium on merging, to produce more exact and fuzzy matches. This option can be repeated to add several compendia.</para>
0116 </listitem>
0117 </varlistentry>
0118 
0119 <varlistentry>
0120 <term><option>-v</option>, <option>--verbose</option></term>
0121 <listitem>
<para><command>poselfmerge</command> normally operates silently, and this option requests some progress information. This is quite useful when processing a large collection of PO files, because merging and post-merge processing can take <emphasis>a lot</emphasis> of time (especially in the presence of a compendium).</para>
0123 </listitem>
0124 </varlistentry>
0125 
0126 <varlistentry>
0127 <term><option>-W <replaceable>NUMBER</replaceable></option>, <option>--min-words-exact=<replaceable>NUMBER</replaceable></option></term>
0128 <listitem>
<para>When an exact match for an untranslated message is produced from the compendium, it is not always safe to silently accept it, because the compendium may contain translations from contexts totally unrelated to the current PO file. The shorter the message, the higher the chance that the translation will not be suitable in the current context. This option provides the minimum number of words (in the original) to accept an exact match from the compendium, or else the message is made fuzzy. The reasonable value depends on the relation between the source and the target language, with 5 to 10 probably being on the safe side.</para>
0130 
0131 <para>Note that afterwards you can see when an exact match has been demoted into a fuzzy one, by that message not having previous strings (<literal>#| msgid "..."</literal>, etc.).</para>
0132 </listitem>
0133 </varlistentry>
0134 
0135 <varlistentry>
0136 <term><option>-x</option>, <option>--fuzzy-exact</option></term>
0137 <listitem>
<para>This option is used to unconditionally demote exact matches from the compendium into fuzzy messages (i.e. regardless of the length of the text, which is what <option>-W</option>/<option>--min-words-exact</option> takes into account). This may be needed, for example, when there is a strict review procedure in place, and the compendium is built from unreviewed translations.</para>
0139 </listitem>
0140 </varlistentry>
0141 
0142 </variablelist>
0143 </para>
0144 
0145 <para>
0146 Options common with other Pology tools:
0147 <variablelist>
0148 
0149 <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
0150             href="stdopt-wrapping.docbook"/>
0151 
0152 <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
0153             href="stdopt-filesfrom.docbook"/>
0154 
0155 </variablelist>
0156 </para>
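
<para>A few invocations combining the options above may make their interplay clearer. This is only a sketch: <filename>compendium.po</filename> and <filename>somedir/</filename> are placeholder paths, and the numeric values are taken from the ranges suggested above rather than being recommended settings:
<programlisting language="bash">
# Clean fuzzy messages with adjusted similarity below 0.7 to untranslated
# (requires previous strings in the fuzzy messages).
$ poselfmerge -A 0.7 somedir/

# Add matches from a compendium, report progress, and make fuzzy any
# exact compendium match whose original has fewer than 5 words.
$ poselfmerge -v -C compendium.po -W 5 somedir/

# Re-run fuzzy matching on untranslated and fuzzy messages,
# now also against the compendium.
$ poselfmerge -b -C compendium.po somedir/
</programlisting>
</para>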
0157 
0158 </sect2>
0159 
0160 <sect2 id="sec-mismcfg">
0161 <title>User Configuration</title>
0162 
<para>It is likely that the translator will have personal preferences for the various match acceptance criteria provided by command line options. Instead of issuing those options all the time, the following user configuration fields may be set:
0164 <variablelist>
0165 
0166 <varlistentry>
0167 <term><literal>[poselfmerge]/fuzzy-exact=[yes|*no]</literal></term>
0168 <listitem>
0169 <para>Counterpart to the <option>-x</option>/<option>--fuzzy-exact</option> option.</para>
0170 </listitem>
0171 </varlistentry>
0172 
0173 <varlistentry>
0174 <term><literal>[poselfmerge]/min-adjsim-fuzzy</literal></term>
0175 <listitem>
0176 <para>Counterpart to the <option>-A</option>/<option>--min-adjsim-fuzzy</option> option.</para>
0177 </listitem>
0178 </varlistentry>
0179 
0180 <varlistentry>
0181 <term><literal>[poselfmerge]/min-words-exact</literal></term>
0182 <listitem>
0183 <para>Counterpart to the <option>-W</option>/<option>--min-words-exact</option> option.</para>
0184 </listitem>
0185 </varlistentry>
0186 
0187 <varlistentry>
0188 <term><literal>[poselfmerge]/rebase-fuzzies=[yes|*no]</literal></term>
0189 <listitem>
0190 <para>Counterpart to the <option>-b</option>/<option>--rebase-fuzzies</option> option.</para>
0191 </listitem>
0192 </varlistentry>
0193 
0194 </variablelist>
0195 Of course, command line options can be issued to override the user configuration fields when necessary.</para>
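
<para>For example, the fields above could be set as in the following sketch. It assumes the INI-like section and field layout implied by the <literal>[section]/field</literal> notation, and the values are illustrative only:
<programlisting>
[poselfmerge]
min-adjsim-fuzzy = 0.7
min-words-exact = 5
fuzzy-exact = no
rebase-fuzzies = no
</programlisting>
With such a configuration, a plain <command>poselfmerge</command> run should behave as if <option>-A 0.7</option> and <option>-W 5</option> had been given on the command line.</para>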
0196 
0197 <para><command>poselfmerge</command> also reads the wrapping mode fields as described in <xref linkend="sec-cmwrcfg"/>, from its <literal>[poselfmerge]</literal> section.</para>
0198 
0199 </sect2>
0200 
0201 </sect1>
0202 
0203 <!-- ======================================== -->
0204 <sect1 id="sec-mimtrans">
0205 <title>Machine Translation with <command>pomtrans</command></title>
0206 
<para><emphasis>Machine translation</emphasis> is the process where a computer program is used to produce translation of more than a trivial piece of text, starting from single sentences, over paragraphs, to full documents. There are debates on how useful machine translation is right now and how much better it could become in the future, and there is a steady line of research in that direction. Limiting to widely available examples of machine translation software today, it is safe to say that, on the one hand, machine translation can preserve a lot of the meaning of the original and thus be very useful to the reader who needs to grasp the main points of the text, but on the other hand, it is not at all passable for producing translations of the quality expected of human translators who are native speakers of the target language.</para>
0208 
<para>As far as Pology is concerned, the question of machine translation reduces to this: would it increase the efficiency of translation if PO files were first machine-translated, and then manually corrected by a human translator? There is no general answer to this question, as it depends strongly on all elements in the chain: the quality of machine translation software, the source language, the target language, and the human translator. Be that as it may, Pology provides the <command>pomtrans</command> script, which can fill in untranslated messages in PO files by passing original text through various machine translation services.</para>
0210 
0211 <para><command>pomtrans</command> has two principal modes of operation. The more straightforward is the direct mode, where original texts are simply <varname>msgid</varname> strings in the given PO file. In this mode, PO files can be machine-translated with:
0212 <programlisting language="bash">
0213 $ pomtrans <replaceable>transerv</replaceable> -t <replaceable>lang</replaceable> <replaceable>paths...</replaceable>
0214 </programlisting>
The first argument is the translation service keyword, chosen from those known to <command>pomtrans</command>. The <option>-t</option> option specifies the target language; it may not be necessary if processed PO files have the <literal>Language:</literal> header field properly set. The source language is assumed to be English, but there is an option to specify another source language. Afterwards an arbitrary number of paths follow, which may be either single PO files or directories which will be recursively searched for PO files.</para>
0216 
0217 <para><command>pomtrans</command> will try to translate only untranslated messages, and not fuzzy messages. When it translates a message, by default it will make it fuzzy as well, meaning that a human should go through all machine-translated messages. These defaults are based on the perceived current quality of most machine translation services. There are several command line options to change this behavior.</para>
0218 
0219 <para>The other mode of operation is the parallel mode. Here <command>pomtrans</command> takes the original text to be the translation into another language, i.e. <varname>msgstr</varname> strings from a PO file translated into another language. For example, if a PO file should be translated into Spanish (i.e. from English to Spanish), and that same PO file is available fully translated into French (i.e. from English to French), then <command>pomtrans</command> could be used to translate from French to Spanish. This is done in the following way:
0220 <programlisting language="bash">
0221 $ pomtrans <replaceable>transerv</replaceable> -s <replaceable>lang1</replaceable> -t <replaceable>lang2</replaceable> -p <replaceable>search</replaceable>:<replaceable>replace</replaceable> <replaceable>paths...</replaceable>
0222 </programlisting>
As in direct mode, the first argument is the translation service. Then both the source (<option>-s</option>) and the target language (<option>-t</option>) are specified; again, if PO files have their <literal>Language:</literal> header fields set, these options are not necessary. The peculiarity here is the <option>-p</option> option, which specifies two strings, separated by a colon. These are used to construct paths to source language PO files, by replacing the first string in paths of target language PO files with the second string. For example, if the file tree is:
0224 <programlisting>
0225 foo/
0226     po/
0227         alpha/
0228             alpha.pot
0229             fr.po
0230             es.po
0231         bravo/
0232             bravo.pot
0233             fr.po
0234             es.po
0235 </programlisting>
0236 then the invocation could be:
0237 <programlisting language="bash">
0238 $ cd .../foo/
0239 $ pomtrans <replaceable>transerv</replaceable> -s fr -t es -p es.:fr. po/*/es.po
0240 </programlisting>
0241 In case a PO file in target language does not have a counterpart in source language, it is simply skipped.</para>
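
<para>The path construction can be mimicked in the shell to check which source language file a given <option>-p</option> value would select. This is only an illustration using Bash pattern substitution on a path from the example tree, not the code that <command>pomtrans</command> itself runs:
<programlisting language="bash">
$ target=po/alpha/es.po
$ echo "${target/es./fr.}"
po/alpha/fr.po
</programlisting>
</para>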
0242 
0243 <para>There is another variation of the parallel mode, where source language texts are drawn not from counterpart PO files, but from a single, compendium PO file in source language. This mode is engaged by giving the path to that compendium with the <option>-c</option> option, instead of the <option>-p</option> option for path replacement.</para>
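
<para>A sketch of such an invocation for the French to Spanish example, where <filename>fr-compendium.po</filename> is a hypothetical name for the French compendium:
<programlisting language="bash">
$ pomtrans <replaceable>transerv</replaceable> -s fr -t es -c fr-compendium.po po/*/es.po
</programlisting>
</para>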
0244 
0245 <sect2 id="sec-mimtopts">
0246 <title>Command Line Options</title>
0247 
0248 <para>Options specific to <command>pomtrans</command>:
0249 <variablelist>
0250 
0251 <varlistentry>
0252 <term><option>-a <replaceable>CHARS</replaceable></option>, <option>--accelerator=<replaceable>CHARS</replaceable></option></term>
0253 <listitem>
0254 <para>Characters used as <link linkend="sec-poaccel">accelerator markers</link> in user interface messages. They should be removed from the source language text before translation, in order not to confuse the translation service.<footnote>
0255 <para>This also means that, at the moment, machine-translated text has no accelerator when the original text did have one. Some heuristics may be implemented in the future to add the accelerator to translated text as well.</para>
0256 </footnote></para>
0257 </listitem>
0258 </varlistentry>
0259 
0260 <varlistentry>
0261 <term><option>-c <replaceable>FILE</replaceable></option>, <option>--parallel-compendium=<replaceable>FILE</replaceable></option></term>
0262 <listitem>
0263 <para>The path to source language compendium, in parallel translation mode.</para>
0264 </listitem>
0265 </varlistentry>
0266 
0267 <varlistentry>
0268 <term><option>-l</option>, <option>--list-transervs</option></term>
0269 <listitem>
0270 <para>Lists known translation services (the keywords which can be the first argument to <command>pomtrans</command>).</para>
0271 </listitem>
0272 </varlistentry>
0273 
0274 <varlistentry>
0275 <term><option>-m</option>, <option>--flag-mtrans</option></term>
0276 <listitem>
0277 <para>Adds the <literal>mtrans</literal> flag to each machine-translated message. This may be useful to positively identify machine-translated messages in the resulting PO file, as otherwise they are simply fuzzy.</para>
0278 </listitem>
0279 </varlistentry>
0280 
0281 <varlistentry>
0282 <term><option>-M <replaceable>MODE</replaceable></option>, <option>--translation-mode=<replaceable>MODE</replaceable></option></term>
0283 <listitem>
<para>Translation services need as input the mode in which to operate, usually the source and target language at minimum. By default the translation mode is constructed based on source and target languages, but this is sometimes not precise enough. This option can be used to issue a custom mode string for the chosen translation service, overriding the default construction. The format of the mode string depends on the translation service; check the documentation of the respective translation service for details.</para>
0285 </listitem>
0286 </varlistentry>
0287 
0288 <varlistentry>
0289 <term><option>-n</option>, <option>--no-fuzzy-flag</option></term>
0290 <listitem>
<para>By default machine-translated messages are made fuzzy, which is prevented by this option. It goes without saying that this is dangerous at the current state of the art in machine translation, and should be used only in very specific scenarios (e.g. high quality machine translation between two dialects of the same language).</para>
0292 </listitem>
0293 </varlistentry>
0294 
0295 <varlistentry>
0296 <term><option>-p <replaceable>SEARCH</replaceable>:<replaceable>REPLACE</replaceable></option>, <option>--parallel-catalogs=<replaceable>SEARCH</replaceable>:<replaceable>REPLACE</replaceable></option></term>
0297 <listitem>
0298 <para>The string to search for in paths of target language PO files, and the string to replace them with to construct paths of source language PO files, in parallel translation mode.</para>
0299 </listitem>
0300 </varlistentry>
0301 
0302 <varlistentry>
0303 <term><option>-s <replaceable>LANG</replaceable></option>, <option>--source-lang=<replaceable>LANG</replaceable></option></term>
0304 <listitem>
0305 <para>The source language code, i.e. the language which is being translated from.</para>
0306 </listitem>
0307 </varlistentry>
0308 
0309 <varlistentry>
0310 <term><option>-t <replaceable>LANG</replaceable></option>, <option>--target-lang=<replaceable>LANG</replaceable></option></term>
0311 <listitem>
0312 <para>The target language code, i.e. the language which is being translated into.</para>
0313 </listitem>
0314 </varlistentry>
0315 
0316 <varlistentry>
0317 <term><option>-T <replaceable>PATH</replaceable></option>, <option>--transerv-bin=<replaceable>PATH</replaceable></option></term>
0318 <listitem>
0319 <para>If the selected translation service is (or can be) a program on the local computer, this option can be used to specify the path to its executable file, if it is not in the <envar>PATH</envar>.</para>
0320 </listitem>
0321 </varlistentry>
0322 
0323 <varlistentry>
0324 <term><option>-d <replaceable>DIRECTORY</replaceable></option>, <option>--data-directory=<replaceable>DIRECTORY</replaceable></option></term>
0325 <listitem>
0326 <para>If the selected translation service can use a local directory of translation data, this option can be used to specify the path to that directory. It is equivalent to Apertium’s <option>-d</option> parameter.</para>
0327 </listitem>
0328 </varlistentry>
0329 
0330 </variablelist>
0331 </para>
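
<para>As an illustration of some of these options, the following sketch lists the known service keywords, and then machine-translates into Spanish in direct mode while adding the <literal>mtrans</literal> flag to each translated message. It assumes that Apertium is installed locally with a suitable language pair; <filename>somedir/</filename> and the executable path in the last command are hypothetical:
<programlisting language="bash">
$ pomtrans -l
$ pomtrans apertium -t es -m somedir/
$ pomtrans apertium -t es -m -T /opt/apertium/bin/apertium somedir/
</programlisting>
</para>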
0332 
0333 </sect2>
0334 
0335 <sect2 id="sec-mimtservs">
0336 <title>Supported Machine Translation Services</title>
0337 
<para>Currently supported translation services are as follows (with keyword in parentheses):
0339 <variablelist>
0340 
0341 <varlistentry>
0342 <term>Apertium (<literal>apertium</literal>)</term>
0343 <listitem>
<para><ulink url="http://www.apertium.org/">Apertium</ulink> is a free machine translation platform, developed by the TRANSDUCENS research group of the University of Alicante. There is a basic web service, but the software can be locally installed and that is how <command>pomtrans</command> uses it (some distributions provide packages).</para>
0345 </listitem>
0346 </varlistentry>
0347 
0348 <varlistentry>
0349 <term>Google Translate (<literal>google</literal>)</term>
0350 <listitem>
0351 <para><ulink url="http://translate.google.com/">Google Translate</ulink> is Google's proprietary web machine-translation service. The user must obtain an API key from Google, and set it in <link linkend="sec-cmconfig">Pology configuration</link> under <literal>[pomtrans]/google-api-key</literal>. At the moment, <command>pomtrans</command> makes one query to the service per message, which can take quite some time on long PO files.</para>
0352 </listitem>
0353 </varlistentry>
0354 
0355 </variablelist>
0356 </para>
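
<para>A minimal sketch of the configuration entry for the Google Translate API key, assuming the INI-like layout implied by the <literal>[pomtrans]/google-api-key</literal> notation; <replaceable>YOUR-API-KEY</replaceable> stands for the key obtained from Google:
<programlisting>
[pomtrans]
google-api-key = <replaceable>YOUR-API-KEY</replaceable>
</programlisting>
</para>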
0357 
0358 </sect2>
0359 
0360 </sect1>
0361 
0362 </chapter>