doc/katepart/regular-expressions.docbook

<appendix id="regular-expressions">
<appendixinfo>
<authorgroup>
<author>&Anders.Lund; &Anders.Lund.mail;</author>
<!-- TRANS:ROLES_OF_TRANSLATORS -->
</authorgroup>
</appendixinfo>

<title>Regular Expressions</title>

<synopsis>This Appendix contains a brief but hopefully sufficient
introduction to the world of <emphasis>regular
expressions</emphasis>. It documents regular expressions in the form
available within &kappname;, which is not compatible with the regular
expressions of Perl, nor with those of, for example,
<command>grep</command>.</synopsis>

<sect1>

<title>Introduction</title>

<para><emphasis>Regular Expressions</emphasis> provide us with a way
to describe some possible contents of a text string in a way
understood by a small piece of software, so that it can investigate
whether a text matches and, in the case of advanced applications, save
pieces of the matching text.</para>

<para>An example: Say you want to search a text for paragraphs that
start with either of the names <quote>Henrik</quote> or
<quote>Pernille</quote> followed by some form of the verb
<quote>say</quote>.</para>

<para>With a normal search, you would start out searching for the
first name, <quote>Henrik</quote>, maybe followed by <quote>sa</quote>
like this: <userinput>Henrik sa</userinput>, and while looking for
matches, you would have to discard those not at the beginning of a
paragraph, as well as those in which the word starting with the
letters <quote>sa</quote> was not <quote>says</quote>,
<quote>said</quote> or similar. And then of course repeat all of that
with the next name...</para>

<para>With Regular Expressions, that task could be accomplished with a
single search, and with a greater degree of precision.</para>

<para>To achieve this, Regular Expressions define rules for
expressing in detail a generalization of a string to match. Our
example, which we might literally express like this: <quote>A line
starting with either <quote>Henrik</quote> or <quote>Pernille</quote>
(possibly following up to 4 blanks or tab characters) followed by a
whitespace followed by <quote>sa</quote> and then either
<quote>ys</quote> or <quote>id</quote></quote> could be expressed with
the following regular expression:</para>

<para><userinput>^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)</userinput></para>

<para>The above example demonstrates all four major concepts of modern
Regular Expressions, namely:</para>

<itemizedlist>
<listitem><para>Patterns</para></listitem>
<listitem><para>Assertions</para></listitem>
<listitem><para>Quantifiers</para></listitem>
<listitem><para>Back references</para></listitem>
</itemizedlist>

<para>The caret (<literal>^</literal>) starting the expression is an
assertion, being true only if the following matching string is at the
start of a line.</para>

<para>The strings <literal>[ \t]</literal> and
<literal>(Henrik|Pernille) sa(ys|id)</literal> are patterns. The first
one is a <emphasis>character class</emphasis> that matches either a
blank or a (horizontal) tab character; the other pattern contains
first a subpattern matching either <literal>Henrik</literal>
<emphasis>or</emphasis> <literal>Pernille</literal>, then a piece
matching the exact string <literal> sa</literal> and finally a
subpattern matching either <literal>ys</literal>
<emphasis>or</emphasis> <literal>id</literal>.</para>

<para>The string <literal>{0,4}</literal> is a quantifier saying
<quote>anywhere from 0 up to 4 of the previous</quote>.</para>

<para>Because regular expression software supporting the concept of
<emphasis>back references</emphasis> saves the entire matching part of
the string as well as sub-patterns enclosed in parentheses, given some
means of access to those references, we could get our hands on
the whole match (when searching a text document in an editor with a
regular expression, that is often marked as selected), the
name found, or the last part of the verb.</para>

<para>All together, the expression will match where we wanted it to,
and only there.</para>
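
<para>As a rough illustration, the expression above can be tried out in
a few lines of Python. This is only a sketch: it assumes Python's
<literal>re</literal> module as a stand-in for the engine described
here, and the sample text and variable names are invented for the
example. The <literal>re.MULTILINE</literal> flag is what makes the
caret match at the start of every line rather than only at the start of
the whole string.</para>

<programlisting language="python"><![CDATA[
```python
import re

# The expression from the introduction. MULTILINE lets ^ match at each line start.
pattern = re.compile(r"^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)", re.MULTILINE)

# Invented sample text: two lines should match, two should not.
text = (
    "Henrik says hello.\n"
    "  Pernille said no.\n"
    "Then Henrik said yes.\n"
    "Henrik sang a song.\n"
)

# Group 0 is the whole match, group 1 the name, group 2 the verb ending.
matches = [(m.group(0), m.group(1), m.group(2)) for m in pattern.finditer(text)]
for whole, name, verb_end in matches:
    print(repr(whole), name, verb_end)
```
]]></programlisting>

<para>In this sketch the third line fails the caret assertion and the
fourth fails the last subpattern, so only the first two lines
match.</para>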

<para>The following sections will describe in detail how to construct
and use patterns, character classes, assertions, quantifiers and
back references, and the final section will give a few useful
examples.</para>

</sect1>

<sect1 id="regex-patterns">

<title>Patterns</title>

<para>Patterns consist of literal strings and character
classes. Patterns may contain sub-patterns, which are patterns enclosed
in parentheses.</para>

<sect2>
<title>Escaping characters</title>

<para>In patterns as well as in character classes, some characters
have a special meaning.  To literally match any of those characters,
they must be marked or <emphasis>escaped</emphasis> to let the regular
expression software know that it should interpret such characters in
their literal meaning.</para>

<para>This is done by prepending a backslash
(<literal>\</literal>) to the character.</para>

<para>The regular expression software will silently ignore escaping a
character that does not have any special meaning in the context, so
escaping for example a <quote>j</quote> (<userinput>\j</userinput>) is
safe. If you are in doubt whether a character could have a special
meaning, you can therefore escape it safely.</para>

<para>Escaping of course includes the backslash character itself: to
literally match a backslash, you would write
<userinput>\\</userinput>.</para>
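
<para>The effect of escaping can be sketched with Python's
<literal>re</literal> module, used here only as an assumed stand-in for
the engine described in this appendix. The sample strings and variable
names are invented. Note that Python is stricter than described above
about unknown letter escapes, so the sketch sticks to punctuation and
the backslash itself.</para>

<programlisting language="python"><![CDATA[
```python
import re

# An unescaped dot matches any character, so "3.14" also matches "3x14".
unescaped_hits_other = re.search(r"3.14", "3x14") is not None

# Escaping the dot restricts the match to a literal ".".
escaped_hits_other = re.search(r"3\.14", "3x14") is not None
escaped_hits_literal = re.search(r"3\.14", "pi is 3.14") is not None

# A literal backslash is written as \\ in the pattern.
path = "C:\\temp"  # the string C:\temp, containing one backslash
backslash_count = len(re.findall(r"\\", path))

# re.escape() adds the backslashes when a whole string should be taken literally.
escaped_pattern = re.escape("a.b")

print(unescaped_hits_other, escaped_hits_other, escaped_hits_literal)
print(backslash_count, escaped_pattern)
```
]]></programlisting>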

</sect2>

<sect2>
<title>Character Classes and abbreviations</title>

<para>A <emphasis>character class</emphasis> is an expression that
matches one of a defined set of characters. In Regular Expressions,
character classes are defined by putting the legal characters for the
class in square brackets, <literal>[]</literal>, or by using one of
the abbreviated classes described below.</para>

<para>Simple character classes just contain one or more literal
characters, for example <userinput>[abc]</userinput> (matching either
of the letters <quote>a</quote>, <quote>b</quote> or <quote>c</quote>)
or <userinput>[0123456789]</userinput> (matching any digit).</para>

<para>Because letters and digits have a logical order, you can
abbreviate those by specifying ranges of them:
<userinput>[a-c]</userinput> is equal to <userinput>[abc]</userinput>
and <userinput>[0-9]</userinput> is equal to
<userinput>[0123456789]</userinput>.  Combining these constructs, for
example <userinput>[a-fynot1-38]</userinput>, is completely legal (the
last one would match, of course, either of
<quote>a</quote>, <quote>b</quote>, <quote>c</quote>, <quote>d</quote>,
<quote>e</quote>, <quote>f</quote>, <quote>y</quote>, <quote>n</quote>,
<quote>o</quote>, <quote>t</quote>,
<quote>1</quote>, <quote>2</quote>, <quote>3</quote> or
<quote>8</quote>).</para>

<para>As capital letters are different characters from their
non-capital equivalents, to create a caseless character class matching
<quote>a</quote> or <quote>b</quote>, in any case, you need to write it
<userinput>[aAbB]</userinput>.</para>

<para>It is of course possible to create a <quote>negative</quote>
class matching <quote>anything but</quote>. To do so, put a caret
(<literal>^</literal>) at the beginning of the class:</para>

<para><userinput>[^abc]</userinput> will match any character
<emphasis>but</emphasis> <quote>a</quote>, <quote>b</quote> or
<quote>c</quote>.</para>
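
<para>The three kinds of class described so far can be checked with a
short Python sketch. It assumes Python's <literal>re</literal> module
behaves like the engine documented here for these simple classes; the
input strings and variable names are made up for the example.</para>

<programlisting language="python"><![CDATA[
```python
import re

# A class mixing ranges and single characters, as in the text above.
mixed = re.findall(r"[a-fynot1-38]", "a1b2C3-y8")

# Upper and lower case are distinct characters inside a class.
caseless = re.findall(r"[aAbB]", "aAbBcC")

# A leading caret negates the class.
negated = re.findall(r"[^abc]", "abcdef")

print(mixed)     # ['a', '1', 'b', '2', '3', 'y', '8']
print(caseless)  # ['a', 'A', 'b', 'B']
print(negated)   # ['d', 'e', 'f']
```
]]></programlisting>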

<para>In addition to literal characters, some abbreviations are
defined, making life still a bit easier:

<variablelist>

<varlistentry>
<term><userinput>\a</userinput></term>
<listitem><para>This matches the &ASCII; bell character (BEL, 0x07).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\f</userinput></term>
<listitem><para>This matches the &ASCII; form feed character (FF, 0x0C).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\n</userinput></term>
<listitem><para>This matches the &ASCII; line feed character (LF, 0x0A, Unix newline).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\r</userinput></term>
<listitem><para>This matches the &ASCII; carriage return character (CR, 0x0D).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\t</userinput></term>
<listitem><para>This matches the &ASCII; horizontal tab character (HT, 0x09).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\v</userinput></term>
<listitem><para>This matches the &ASCII; vertical tab character (VT, 0x0B).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\xhhhh</userinput></term>

<listitem><para>This matches the Unicode character corresponding to
the hexadecimal number hhhh (between 0x0000 and 0xFFFF).
<userinput>\0ooo</userinput> (&ie;, \zero ooo) matches the
&ASCII;/Latin-1 character corresponding to the octal number ooo
(between 0 and 0377).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>.</userinput> (dot)</term>
<listitem><para>This matches any character (including newline).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\d</userinput></term>
<listitem><para>This matches a digit. Equal to <literal>[0-9]</literal>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\D</userinput></term>
<listitem><para>This matches a non-digit. Equal to <literal>[^0-9]</literal> or <literal>[^\d]</literal>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\s</userinput></term>
<listitem><para>This matches a whitespace character. Practically equal to <literal>[ \t\n\r]</literal>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\S</userinput></term>
<listitem><para>This matches a non-whitespace character. Practically equal to <literal>[^ \t\r\n]</literal>, and equal to <literal>[^\s]</literal>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\w</userinput></term>
<listitem><para>This matches any <quote>word character</quote>, in this case any letter, digit or underscore.
Equal to <literal>[a-zA-Z0-9_]</literal>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\W</userinput></term>
<listitem><para>This matches any non-word character, that is anything but letters, digits or underscore.
Equal to <literal>[^a-zA-Z0-9_]</literal> or <literal>[^\w]</literal>.</para></listitem>
</varlistentry>

</variablelist>

</para>

<para>The <emphasis>POSIX notation of classes</emphasis>,
<userinput>[:&lt;class name&gt;:]</userinput>, is also supported.
For example, <userinput>[:digit:]</userinput> is equivalent to <userinput>\d</userinput>,
and <userinput>[:space:]</userinput> to <userinput>\s</userinput>.
See the <ulink url="https://www.regular-expressions.info/posixbrackets.html">full
list of POSIX character classes</ulink>.</para>

<para>The abbreviated classes can be put inside a custom class. For
example, to match a word character, a blank or a dot, you could write
<userinput>[\w \.]</userinput>.</para>

<sect3>
<title>Characters with special meanings inside character classes</title>

<para>The following characters have a special meaning inside the
<quote>[]</quote> character class construct, and must be escaped to be
literally included in a class:</para>

<variablelist>
<varlistentry>
<term><userinput>]</userinput></term>
<listitem><para>Ends the character class. Must be escaped unless it is the very first character in the
class (may follow an unescaped caret).</para></listitem>
</varlistentry>
<varlistentry>
<term><userinput>^</userinput> (caret)</term>
<listitem><para>Denotes a negative class, if it is the first character. Must be escaped to match literally if it is the first character in the class.</para></listitem>
</varlistentry>
<varlistentry>
<term><userinput>-</userinput> (dash)</term>
<listitem><para>Denotes a logical range. Must always be escaped within a character class.</para></listitem>
</varlistentry>
<varlistentry>
<term><userinput>\</userinput> (backslash)</term>
<listitem><para>The escape character. Must always be escaped.</para></listitem>
</varlistentry>

</variablelist>

</sect3>

</sect2>

<sect2>

<title>Alternatives: matching <quote>one of</quote></title>

<para>If you want to match one of a set of alternative patterns, you
can separate those with <literal>|</literal> (vertical bar character).</para>

<para>For example, to find either <quote>John</quote> or <quote>Harry</quote> you would use the expression <userinput>John|Harry</userinput>.</para>

</sect2>

<sect2>

<title>Sub Patterns</title>

<para><emphasis>Sub patterns</emphasis> are patterns enclosed in
parentheses, and they have several uses in the world of regular
expressions.</para>

<sect3>

<title>Specifying alternatives</title>

<para>You may use a sub pattern to group a set of alternatives within
a larger pattern. The alternatives are separated by the character
<quote>|</quote> (vertical bar).</para>

<para>For example, to match either of the words <quote>int</quote>,
<quote>float</quote> or <quote>double</quote>, you could use the
pattern <userinput>int|float|double</userinput>. If you only want to
find one if it is followed by some whitespace and then some letters,
put the alternatives inside a subpattern:
<userinput>(int|float|double)\s+\w+</userinput>.</para>
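
<para>The difference between the bare and the grouped alternatives can
be sketched in Python. The <literal>re</literal> module is assumed here
as a stand-in engine, and the one-line <literal>source</literal> string
is invented for the example.</para>

<programlisting language="python"><![CDATA[
```python
import re

source = "int count; float ratio; return double;"

# Bare alternatives: any of the three words, wherever it occurs.
bare = re.findall(r"int|float|double", source)

# Grouped alternatives that must be followed by whitespace and word characters.
declarations = [m.group(0)
                for m in re.finditer(r"(int|float|double)\s+\w+", source)]

print(bare)          # ['int', 'float', 'double']
print(declarations)  # ['int count', 'float ratio']
```
]]></programlisting>

<para>The <quote>double</quote> before the semicolon is found by the
bare alternatives but not by the grouped expression, because no
whitespace and word follow it.</para>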

</sect3>

<sect3 id="regex-capturing">

<title>Capturing matching text (back references)</title>

<para>If you want to use a back reference, use a sub pattern <userinput>(PATTERN)</userinput>
to have the desired part of the pattern remembered.
To prevent the sub pattern from being remembered, use a non-capturing group
<userinput>(?:PATTERN)</userinput>.</para>

<para>For example, if you want to find two occurrences of the same
word separated by a comma and possibly some whitespace, you could
write <userinput>(\w+),\s*\1</userinput>. The sub pattern
<literal>\w+</literal> would find a chunk of word characters, and the
entire expression would match if those were followed by a comma, 0 or
more whitespace characters and then an equal chunk of word characters.  (The
string <literal>\1</literal> references <emphasis>the first sub pattern
enclosed in parentheses</emphasis>.)</para>

<note>
<para>To avoid ambiguities when <userinput>\1</userinput> is followed by further digits (&eg; <userinput>\12</userinput> could be the 12th subpattern or just the first subpattern followed by <userinput>2</userinput>), we use <userinput>\{12}</userinput> as the syntax for multi-digit subpatterns.</para>
<para>Examples:</para>
<itemizedlist>
<listitem><para><userinput>\{12}1</userinput> is <quote>use subpattern 12, then 1 as normal text</quote></para></listitem>
<listitem><para><userinput>\123</userinput> is <quote>use capture 1, then 23 as normal text</quote></para></listitem>
</itemizedlist>

</note>
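
<para>A back reference inside a pattern can be sketched with Python's
<literal>re</literal> module, which is assumed here as a stand-in
engine. Only the single-digit form <literal>\1</literal> is shown; the
<literal>\{12}</literal> syntax from the note above is not Python
syntax and is left out. The sample strings and variable names are
invented.</para>

<programlisting language="python"><![CDATA[
```python
import re

# A word, a comma, optional whitespace, then the same word again.
repeated = re.compile(r"(\w+),\s*\1")

hit = repeated.search("again, again and again")
miss = repeated.search("one, two")

whole = hit.group(0)  # the entire match
word = hit.group(1)   # the text captured by the first sub pattern

print(whole)
print(word)
print(miss)
```
]]></programlisting>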

<!-- <para>See also <link linkend="backreferences">Back references</link>.</para> -->

</sect3>

<sect3 id="lookahead-assertions">
<title>Lookahead Assertions</title>

<para>A lookahead assertion is a sub pattern, starting with either
<literal>?=</literal> or <literal>?!</literal>.</para>

<para>For example, to match the literal string <quote>Bill</quote> but
only if not followed by <quote> Gates</quote>, you could use this
expression: <userinput>Bill(?! Gates)</userinput>.  (This would find
the <quote>Bill</quote> in <quote>Bill Clinton</quote> as well as in
<quote>Billy the kid</quote>, but silently ignore any
<quote>Bill</quote> followed by <quote> Gates</quote>.)</para>

<para>Sub patterns used for assertions are not captured.</para>

<para>See also <link linkend="assertions">Assertions</link>.</para>
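
<para>The negative lookahead from this section can be tried in a
minimal Python sketch. It assumes Python's <literal>re</literal> module
treats <literal>(?!...)</literal> the same way as the engine described
here; the sample sentence and variable names are invented.</para>

<programlisting language="python"><![CDATA[
```python
import re

text = "Bill Gates, Bill Clinton and Billy the kid"

found = list(re.finditer(r"Bill(?! Gates)", text))

# Start offsets of the matches; the first "Bill" (offset 0) is skipped.
hits = [m.start() for m in found]

# The text examined by the lookahead is not part of the match itself.
matched_texts = [m.group(0) for m in found]

print(hits)           # [12, 29]
print(matched_texts)  # ['Bill', 'Bill']
```
]]></programlisting>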

</sect3>

<sect3 id="lookbehind-assertions">
<title>Lookbehind Assertions</title>

<para>A lookbehind assertion is a sub pattern, starting with either
<literal>?&lt;=</literal> or <literal>?&lt;!</literal>.</para>

<para>Lookbehind has the same effect as lookahead, but works backwards.
For example, to match the literal string <quote>fruit</quote> but
only if not preceded by <quote>grape</quote>, you could use this
expression: <userinput>(?&lt;!grape)fruit</userinput>.</para>

<para>Sub patterns used for assertions are not captured.</para>

<para>See also <link linkend="assertions">Assertions</link>.</para>
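
<para>The negative lookbehind can be sketched in Python in the same
way. Python's <literal>re</literal> module is assumed as a stand-in
engine (it only accepts fixed-width lookbehind patterns, which
<literal>grape</literal> is); the sample text is invented.</para>

<programlisting language="python"><![CDATA[
```python
import re

text = "grapefruit, passionfruit and fruit salad"

# Start offsets of every "fruit" that is not directly preceded by "grape".
starts = [m.start() for m in re.finditer(r"(?<!grape)fruit", text)]

print(starts)  # [19, 29]
```
]]></programlisting>

<para>The <quote>fruit</quote> in <quote>grapefruit</quote> (offset 5)
is rejected by the assertion, while the other two occurrences are
found.</para>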

</sect3>

</sect2>

<sect2 id="special-characters-in-patterns">
<title>Characters with a special meaning inside patterns</title>

<para>The following characters have a special meaning inside a pattern, and
must be escaped if you want to literally match them:

<variablelist>

<varlistentry>
<term><userinput>\</userinput> (backslash)</term>
<listitem><para>The escape character.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>^</userinput> (caret)</term>
<listitem><para>Asserts the beginning of the string.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>$</userinput></term>
<listitem><para>Asserts the end of the string.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>()</userinput> (left and right parentheses)</term>
<listitem><para>Denote sub patterns.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>{}</userinput> (left and right curly braces)</term>
<listitem><para>Denote numeric quantifiers.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>[]</userinput> (left and right square brackets)</term>
<listitem><para>Denote character classes.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>|</userinput> (vertical bar)</term>
<listitem><para>Logical OR. Separates alternatives.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>+</userinput> (plus sign)</term>
<listitem><para>Quantifier, 1 or more.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>*</userinput> (asterisk)</term>
<listitem><para>Quantifier, 0 or more.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>?</userinput> (question mark)</term>
<listitem><para>An optional character. Can be interpreted as a quantifier, 0 or 1.</para></listitem>
</varlistentry>

</variablelist>

</para>

</sect2>

</sect1>

<sect1 id="quantifiers">
<title>Quantifiers</title>

<para><emphasis>Quantifiers</emphasis> allow a regular expression to
match a specified number or range of numbers of either a character,
character class or sub pattern.</para>

<para>Quantifiers are enclosed in curly brackets (<literal>{</literal>
and <literal>}</literal>) and have the general form
<literal>{[minimum-occurrences][,[maximum-occurrences]]}</literal>.
</para>

<para>The usage is best explained by example:

<variablelist>

<varlistentry>
<term><userinput>{1}</userinput></term>
<listitem><para>Exactly 1 occurrence.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>{0,1}</userinput></term>
<listitem><para>Zero or 1 occurrence.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>{,1}</userinput></term>
<listitem><para>The same, with less work ;)</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>{5,10}</userinput></term>
<listitem><para>At least 5 but at most 10 occurrences.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>{5,}</userinput></term>
<listitem><para>At least 5 occurrences, no maximum.</para></listitem>
</varlistentry>

</variablelist>

</para>

<para>Additionally, there are some abbreviations:

<variablelist>

<varlistentry>
<term><userinput>*</userinput> (asterisk)</term>
<listitem><para>Similar to <literal>{0,}</literal>, find any number of occurrences.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>+</userinput> (plus sign)</term>
<listitem><para>Similar to <literal>{1,}</literal>, at least 1 occurrence.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>?</userinput> (question mark)</term>
<listitem><para>Similar to <literal>{0,1}</literal>, zero or 1 occurrence.</para></listitem>
</varlistentry>

</variablelist>

</para>

<sect2>

<title>Greed</title>

<para>When using quantifiers with no maximum, regular expressions
default to matching as much of the searched string as possible, commonly
known as <emphasis>greedy</emphasis> behavior.</para>

<para>Modern regular expression software provides the means of
<quote>turning off greediness</quote>, though in a graphical
environment it is up to the interface to provide you with access to
this feature. For example, a search dialog providing a regular
expression search could have a check box labeled <quote>Minimal
matching</quote>, and it ought to indicate whether greediness is the
default behavior.</para>
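
<para>Greedy and minimal matching can be compared in a short Python
sketch. It rests on two assumptions that are not part of this appendix:
Python's <literal>re</literal> module is used as a stand-in engine, and
minimal matching is requested with the Perl-style <literal>?</literal>
suffix on the quantifier (<literal>+?</literal>), which is Python's way
of turning off greediness and may not be how a given application
exposes it. The sample markup is invented.</para>

<programlisting language="python"><![CDATA[
```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ takes as much as possible, so the match runs to the last ">".
greedy = re.search(r"<.+>", html).group(0)

# Minimal (assumed Perl-style syntax): .+? stops at the first possible ">".
minimal = re.search(r"<.+?>", html).group(0)

print(greedy)   # <b>bold</b> and <i>italic</i>
print(minimal)  # <b>
```
]]></programlisting>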

</sect2>

<sect2>
<title>In context examples</title>

<para>Here are a few examples of using quantifiers:</para>

<variablelist>

<varlistentry>
<term><userinput>^\d{4,5}\s</userinput></term>
<listitem><para>Matches the digits in <quote>1234 go</quote> and <quote>12345 now</quote>, but neither in <quote>567 eleven</quote>
nor in <quote>223459 somewhere</quote>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\s+</userinput></term>
<listitem><para>Matches one or more whitespace characters.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>(bla){1,}</userinput></term>
<listitem><para>Matches all of <quote>blablabla</quote> and the <quote>bla</quote> in <quote>blackbird</quote> or <quote>tabla</quote>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>/?&gt;</userinput></term>
<listitem><para>Matches <quote>/&gt;</quote> in <quote>&lt;closeditem/&gt;</quote> as well as
<quote>&gt;</quote> in <quote>&lt;openitem&gt;</quote>.</para></listitem>
</varlistentry>

</variablelist>
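
<para>The examples in the list above can be checked with a small Python
sketch. Python's <literal>re</literal> module is assumed as a stand-in
engine, and <literal>first_match</literal> is a hypothetical helper
written for this example. Note that the first expression also consumes
the whitespace character after the digits.</para>

<programlisting language="python"><![CDATA[
```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    found = re.search(pattern, text)
    return found.group(0) if found else None

# Four or five digits at the start, followed by one whitespace character.
digits = [first_match(r"^\d{4,5}\s", s)
          for s in ("1234 go", "12345 now", "567 eleven", "223459 somewhere")]

# One or more repetitions of the sub pattern "bla".
bla = [first_match(r"(bla){1,}", s) for s in ("blablabla", "blackbird", "tabla")]

# An optional slash followed by ">".
tag_end = [first_match(r"/?>", s) for s in ("<closeditem/>", "<openitem>")]

print(digits)   # ['1234 ', '12345 ', None, None]
print(bla)      # ['blablabla', 'bla', 'bla']
print(tag_end)  # ['/>', '>']
```
]]></programlisting>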

</sect2>

</sect1>

<sect1 id="assertions">
<title>Assertions</title>

<para><emphasis>Assertions</emphasis> allow a regular expression to
match only under certain controlled conditions.</para>

<para>An assertion does not need a character to match; rather, it
investigates the surroundings of a possible match before acknowledging
it. For example, the <emphasis>word boundary</emphasis> assertion does
not try to find a non-word character opposite a word character at its
position; instead it makes sure that there is not a word
character. This means that the assertion can match where there is no
character, &ie; at the ends of a searched string.</para>

<para>Some assertions actually do have a pattern to match, but the
part of the string matching that pattern will not be a part of the
result of the match of the full expression.</para>

<para>Regular Expressions as documented here support the following
assertions:

<variablelist>

<varlistentry>
<term><userinput>^</userinput> (caret: beginning of
string)</term>
<listitem><para>Matches the beginning of the searched
string.</para> <para>The expression <userinput>^Peter</userinput> will
match at <quote>Peter</quote> in the string <quote>Peter, hey!</quote>
but not in <quote>Hey, Peter!</quote>.</para> </listitem>
</varlistentry>

<varlistentry>
<term><userinput>$</userinput> (end of string)</term>
<listitem><para>Matches the end of the searched string.</para>

<para>The expression <userinput>you\?$</userinput> will match at the
last <quote>you</quote> in the string <quote>You didn't do that, did you?</quote> but
nowhere in <quote>You didn't do that, right?</quote>.</para>

</listitem>
</varlistentry>

<varlistentry>
<term><userinput>\b</userinput> (word boundary)</term>
<listitem><para>Matches if there is a word character at one side and not a word character at the
other.</para>
<para>This is useful for finding word ends, for example both ends in
order to find a whole word. The expression <userinput>\bin\b</userinput> will match
at the separate <quote>in</quote> in the string <quote>He came in
through the window</quote>, but not at the <quote>in</quote> in
<quote>window</quote>.</para></listitem>

</varlistentry>

<varlistentry>
<term><userinput>\B</userinput> (non word boundary)</term>
<listitem><para>Matches wherever <quote>\b</quote> does not.</para>
<para>That means that it will match, for example, within words: the expression
<userinput>\Bin\B</userinput> will match the <quote>in</quote> in <quote>window</quote> but not in <quote>integer</quote> or <quote>I'm in love</quote>.</para>
</listitem>
</varlistentry>

<varlistentry>
<term><userinput>(?=PATTERN)</userinput> (Positive lookahead)</term>
<listitem><para>A lookahead assertion looks at the part of the string following a possible match.
The positive lookahead will prevent the string from matching if the text following the possible match
does not match the <emphasis>PATTERN</emphasis> of the assertion, but the text matched by that will
not be included in the result.</para>
<para>The expression <userinput>handy(?=\w)</userinput> will match at <quote>handy</quote> in
<quote>handyman</quote> but not in <quote>That came in handy!</quote>.</para>
</listitem>
</varlistentry>

<varlistentry>
<term><userinput>(?!PATTERN)</userinput> (Negative lookahead)</term>

<listitem><para>The negative lookahead prevents a possible match from being
acknowledged if the following part of the searched string does match
its <emphasis>PATTERN</emphasis>.</para>
<para>The expression <userinput>const \w+\b(?!\s*&amp;)</userinput>
will match at <quote>const char</quote> in the string <quote>const
char* foo</quote> while it cannot match <quote>const QString</quote>
in <quote>const QString&amp; bar</quote> because the
<quote>&amp;</quote> matches the negative lookahead assertion
pattern.</para>
</listitem>
</varlistentry>

<varlistentry>
<term><userinput>(?&lt;=PATTERN)</userinput> (Positive lookbehind)</term>
<listitem><para>Lookbehind has the same effect as lookahead, but works backwards.
A lookbehind looks at the part of the string preceding a possible match. The positive
lookbehind will match a string only if it is preceded by the <emphasis>PATTERN</emphasis>
of the assertion, but the text matched by that will not be included in the result.</para>
<para>The expression <userinput>(?&lt;=cup)cake</userinput> will match at <quote>cake</quote>
if it is preceded by <quote>cup</quote> (in <quote>cupcake</quote> but not in
<quote>cheesecake</quote> or in <quote>cake</quote> alone).</para>
</listitem>
</varlistentry>

<varlistentry>
<term><userinput>(?&lt;!PATTERN)</userinput> (Negative lookbehind)</term>
<listitem><para>The negative lookbehind prevents a possible match from being acknowledged if
the preceding part of the searched string does match its <emphasis>PATTERN</emphasis>.</para>
<para>The expression <userinput>(?&lt;![\w\.])[0-9]+</userinput> will match at <quote>123</quote>
in the strings <quote>=123</quote> and <quote>-123</quote> while it cannot match <quote>123</quote>
in <quote>.123</quote> or <quote>word123</quote>.</para>
</listitem>
</varlistentry>

<varlistentry>
<term><userinput>(PATTERN)</userinput> (Capturing group)</term>

<listitem><para>The sub pattern within the parentheses is captured and remembered,
so that it can be used in back references. For example, the expression
<userinput>(&quot;+)[^&quot;]*\1</userinput> matches
<userinput>&quot;&quot;&quot;&quot;text&quot;&quot;&quot;&quot;</userinput> and
<userinput>&quot;text&quot;</userinput>.</para>
<para>See the section <link linkend="regex-capturing">Capturing matching text (back references)</link>
for more information.</para>
</listitem>
</varlistentry>

<varlistentry>
<term><userinput>(?:PATTERN)</userinput> (Non-capturing group)</term>

<listitem><para>The sub pattern within the parentheses is not captured and
is not remembered. It is preferable to always use non-capturing groups if
the captures will not be used.</para>
</listitem>
</varlistentry>

</variablelist>

</para>
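
<para>Several of the assertion examples from the list above can be
checked together in a Python sketch. As before, Python's
<literal>re</literal> module is only an assumed stand-in for the engine
documented here, and <literal>starts</literal> is a hypothetical helper
that reports where matches begin. The last input string is invented;
the others are taken from the list.</para>

<programlisting language="python"><![CDATA[
```python
import re

def starts(pattern, text):
    """Return the start offsets of all matches of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

# Word boundary: only the separate "in" matches, not the one in "window".
word_in = starts(r"\bin\b", "He came in through the window")

# Non word boundary: "in" inside a word matches, a separate "in" does not.
inner_in = starts(r"\Bin\B", "window")
separate_in = starts(r"\Bin\B", "I'm in love")

# Positive lookahead: "handy" must be followed by a word character.
handy_hit = starts(r"handy(?=\w)", "handyman")
handy_miss = starts(r"handy(?=\w)", "That came in handy!")

# Negative lookahead: reject the match if "&" follows.
const_hit = starts(r"const \w+\b(?!\s*&)", "const char* foo")
const_miss = starts(r"const \w+\b(?!\s*&)", "const QString& bar")

# Positive lookbehind: "cake" must be preceded by "cup".
cake_hit = starts(r"(?<=cup)cake", "cupcake")
cake_miss = starts(r"(?<=cup)cake", "cheesecake")

# Negative lookbehind: digits not preceded by a word character or a dot.
numbers = starts(r"(?<![\w\.])[0-9]+", "=123 .456 word789")

print(word_in, inner_in, separate_in)
print(handy_hit, handy_miss, const_hit, const_miss)
print(cake_hit, cake_miss, numbers)
```
]]></programlisting>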

</sect1>

<!-- TODO sect1 id="backreferences">

<title>Back References</title>

<para></para>

</sect1 -->

</appendix>