<appendix id="regular-expressions">
<appendixinfo>
<authorgroup>
<author>&Anders.Lund; &Anders.Lund.mail;</author>
<!-- TRANS:ROLES_OF_TRANSLATORS -->
</authorgroup>
</appendixinfo>

<title>Regular Expressions</title>

<synopsis>This Appendix contains a brief but hopefully sufficient
introduction to the world of <emphasis>regular
expressions</emphasis>. It documents regular expressions in the form
available within &kappname;, which is compatible neither with the regular
expressions of Perl nor with those of, for example,
<command>grep</command>.</synopsis>

<sect1>

<title>Introduction</title>

<para><emphasis>Regular expressions</emphasis> provide a way
to describe the possible contents of a text string in a form
understood by a small piece of software, so that it can check whether
a text matches and, in more advanced applications, save pieces of
the matching text.</para>

<para>An example: Say you want to search a text for paragraphs that
start with either of the names <quote>Henrik</quote> or
<quote>Pernille</quote> followed by some form of the verb
<quote>say</quote>.</para>

<para>With a normal search, you would start out searching for the
first name, <quote>Henrik</quote>, maybe followed by <quote>sa</quote>,
like this: <userinput>Henrik sa</userinput>. While looking through the
matches, you would have to discard those that are not at the beginning of a
paragraph, as well as those in which the word starting with the
letters <quote>sa</quote> was not <quote>says</quote>,
<quote>said</quote> or a similar form.
And then, of course, you would have to repeat all of that with
the next name.</para>

<para>With regular expressions, that task can be accomplished with a
single search, and with greater precision.</para>

<para>To achieve this, regular expressions define rules for
expressing in detail a generalization of a string to match. Our
example, which we might express in words like this: <quote>A line
starting with either <quote>Henrik</quote> or <quote>Pernille</quote>
(possibly preceded by up to 4 blanks or tab characters) followed by a
whitespace followed by <quote>sa</quote> and then either
<quote>ys</quote> or <quote>id</quote></quote> could be expressed with
the following regular expression:</para>
<para><userinput>^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)</userinput></para>

<para>The above example demonstrates all four major concepts of modern
regular expressions, namely:</para>

<itemizedlist>
<listitem><para>Patterns</para></listitem>
<listitem><para>Assertions</para></listitem>
<listitem><para>Quantifiers</para></listitem>
<listitem><para>Back references</para></listitem>
</itemizedlist>

<para>The caret (<literal>^</literal>) starting the expression is an
assertion, being true only if the following matching string is at the
start of a line.</para>

<para>The strings <literal>[ \t]</literal> and
<literal>(Henrik|Pernille) sa(ys|id)</literal> are patterns.
The first
one is a <emphasis>character class</emphasis> that matches either a
blank or a (horizontal) tab character; the other pattern contains
first a subpattern matching either <literal>Henrik</literal>
<emphasis>or</emphasis> <literal>Pernille</literal>, then a piece
matching the exact string <literal> sa</literal> and finally a
subpattern matching either <literal>ys</literal>
<emphasis>or</emphasis> <literal>id</literal>.</para>

<para>The string <literal>{0,4}</literal> is a quantifier saying
<quote>anywhere from 0 up to 4 of the previous</quote>.</para>

<para>Because regular expression software supporting the concept of
<emphasis>back references</emphasis> saves the entire matching part of
the string as well as sub-patterns enclosed in parentheses, given some
means of access to those references, we could get our hands on
the whole match (when searching a text document in an editor with a
regular expression, that is often marked as selected), the
name found, or the last part of the verb.</para>

<para>All together, the expression will match where we wanted it to,
and only there.</para>

<para>The following sections will describe in detail how to construct
and use patterns, character classes, assertions, quantifiers and
back references, and the final section will give a few useful
examples.</para>

</sect1>

<sect1 id="regex-patterns">

<title>Patterns</title>

<para>Patterns consist of literal strings and character
classes. Patterns may contain sub-patterns, which are patterns enclosed
in parentheses.</para>

<sect2>
<title>Escaping characters</title>

<para>In patterns as well as in character classes, some characters
have a special meaning.
To literally match any of those characters,
you must mark or <emphasis>escape</emphasis> them to let the regular
expression software know that it should interpret such characters in
their literal meaning.</para>

<para>This is done by prepending a backslash
(<literal>\</literal>) to the character.</para>


<para>The regular expression software will silently ignore escaping a
character that does not have any special meaning in the context, so
escaping for example a <quote>j</quote> (<userinput>\j</userinput>) is
safe. If you are in doubt whether a character could have a special
meaning, you can therefore escape it safely.</para>

<para>Escaping of course includes the backslash character itself: to
literally match a backslash, you would write
<userinput>\\</userinput>.</para>

</sect2>

<sect2>
<title>Character Classes and abbreviations</title>

<para>A <emphasis>character class</emphasis> is an expression that
matches one of a defined set of characters. In regular expressions,
character classes are defined by putting the legal characters for the
class in square brackets, <literal>[]</literal>, or by using one of
the abbreviated classes described below.</para>

<para>Simple character classes just contain one or more literal
characters, for example <userinput>[abc]</userinput> (matching either
of the letters <quote>a</quote>, <quote>b</quote> or <quote>c</quote>)
or <userinput>[0123456789]</userinput> (matching any digit).</para>

<para>Because letters and digits have a logical order, you can
abbreviate those by specifying ranges of them:
<userinput>[a-c]</userinput> is equal to <userinput>[abc]</userinput>
and <userinput>[0-9]</userinput> is equal to
<userinput>[0123456789]</userinput>.
Combining these constructs, for
example <userinput>[a-fynot1-38]</userinput>, is completely legal (this
class would match any of
<quote>a</quote>, <quote>b</quote>, <quote>c</quote>, <quote>d</quote>,
<quote>e</quote>, <quote>f</quote>, <quote>y</quote>, <quote>n</quote>, <quote>o</quote>, <quote>t</quote>,
<quote>1</quote>, <quote>2</quote>, <quote>3</quote> or
<quote>8</quote>).</para>

<para>As capital letters are different characters from their
non-capital equivalents, to create a caseless character class matching
<quote>a</quote> or <quote>b</quote>, in any case, you need to write it
<userinput>[aAbB]</userinput>.</para>

<para>It is of course possible to create a <quote>negative</quote>
class that matches <quote>anything but</quote> the listed characters. To do so, put a caret
(<literal>^</literal>) at the beginning of the class: </para>

<para><userinput>[^abc]</userinput> will match any character
<emphasis>but</emphasis> <quote>a</quote>, <quote>b</quote> or
<quote>c</quote>.</para>

<para>In addition to literal characters, some abbreviations are
defined, making life still a bit easier:

<variablelist>

<varlistentry>
<term><userinput>\a</userinput></term>
<listitem><para> This matches the &ASCII; bell character (BEL, 0x07).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\f</userinput></term>
<listitem><para> This matches the &ASCII; form feed character (FF, 0x0C).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\n</userinput></term>
<listitem><para> This matches the &ASCII; line feed character (LF, 0x0A, Unix newline).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\r</userinput></term>
<listitem><para> This matches the &ASCII; carriage return character (CR, 0x0D).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\t</userinput></term>
<listitem><para> This matches the &ASCII; horizontal tab character (HT, 0x09).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\v</userinput></term>
<listitem><para> This matches the &ASCII; vertical tab character (VT, 0x0B).</para></listitem>
</varlistentry>
<varlistentry>
<term><userinput>\xhhhh</userinput></term>

<listitem><para> This matches the Unicode character corresponding to
the hexadecimal number hhhh (between 0x0000 and 0xFFFF). \0ooo (&ie;,
\zero ooo) matches the &ASCII;/Latin-1 character
corresponding to the octal number ooo (between 0 and
0377).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>.</userinput> (dot)</term>
<listitem><para> This matches any character (including newline).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\d</userinput></term>
<listitem><para> This matches a digit. Equal to <literal>[0-9]</literal></para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\D</userinput></term>
<listitem><para> This matches a non-digit. Equal to <literal>[^0-9]</literal> or <literal>[^\d]</literal></para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\s</userinput></term>
<listitem><para> This matches a whitespace character. Practically equal to <literal>[ \t\n\r]</literal></para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\S</userinput></term>
<listitem><para> This matches a non-whitespace. Practically equal to <literal>[^ \t\r\n]</literal>, and equal to <literal>[^\s]</literal></para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\w</userinput></term>
<listitem><para>Matches any <quote>word character</quote> - in this case any letter, digit or underscore.
Equal to <literal>[a-zA-Z0-9_]</literal></para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\W</userinput></term>
<listitem><para>Matches any non-word character - anything but letters, numbers or underscore.
Equal to <literal>[^a-zA-Z0-9_]</literal> or <literal>[^\w]</literal></para></listitem>
</varlistentry>


</variablelist>

</para>

<para>The <emphasis>POSIX notation of classes</emphasis>,
<userinput>[:&lt;class name&gt;:]</userinput>, is also supported.
For example, <userinput>[:digit:]</userinput> is equivalent to <userinput>\d</userinput>,
and <userinput>[:space:]</userinput> to <userinput>\s</userinput>.
See the full list of POSIX character classes
<ulink url="https://www.regular-expressions.info/posixbrackets.html">here</ulink>.</para>

<para>The abbreviated classes can be put inside a custom class. For
example, to match a word character, a blank or a dot, you could write
<userinput>[\w \.]</userinput>.</para>

<sect3>
<title>Characters with special meanings inside character classes</title>

<para>The following characters have a special meaning inside the
<quote>[]</quote> character class construct, and must be escaped to be
literally included in a class:</para>

<variablelist>
<varlistentry>
<term><userinput>]</userinput></term>
<listitem><para>Ends the character class. Must be escaped unless it is the very first character in the
class (may follow an unescaped caret).</para></listitem>
</varlistentry>
<varlistentry>
<term><userinput>^</userinput> (caret)</term>
<listitem><para>Denotes a negative class, if it is the first character.
Must be escaped to match literally if it is the first character in the class.</para></listitem>
</varlistentry>
<varlistentry>
<term><userinput>-</userinput> (dash)</term>
<listitem><para>Denotes a logical range. Must always be escaped within a character class.</para></listitem>
</varlistentry>
<varlistentry>
<term><userinput>\</userinput> (backslash)</term>
<listitem><para>The escape character. Must always be escaped.</para></listitem>
</varlistentry>

</variablelist>

</sect3>

</sect2>

<sect2>

<title>Alternatives: matching <quote>one of</quote></title>

<para>If you want to match one of a set of alternative patterns, you
can separate those with <literal>|</literal> (vertical bar character).</para>

<para>For example to find either <quote>John</quote> or <quote>Harry</quote> you would use an expression <userinput>John|Harry</userinput>.</para>

</sect2>

<sect2>

<title>Sub Patterns</title>

<para><emphasis>Sub patterns</emphasis> are patterns enclosed in
parentheses, and they have several uses in the world of regular
expressions.</para>

<sect3>

<title>Specifying alternatives</title>

<para>You may use a sub pattern to group a set of alternatives within
a larger pattern. The alternatives are separated by the character
<quote>|</quote> (vertical bar).</para>

<para>For example to match either of the words <quote>int</quote>,
<quote>float</quote> or <quote>double</quote>, you could use the
pattern <userinput>int|float|double</userinput>.
If you only want to
find one of them when it is followed by some whitespace and then some word characters,
put the alternatives inside a subpattern:
<userinput>(int|float|double)\s+\w+</userinput>.</para>

</sect3>

<sect3 id="regex-capturing">

<title>Capturing matching text (back references)</title>

<para>If you want to use a back reference, use a sub pattern <userinput>(PATTERN)</userinput>
to have the desired part of the pattern remembered.
To prevent the sub pattern from being remembered, use a non-capturing group
<userinput>(?:PATTERN)</userinput>.</para>

<para>For example, if you want to find two occurrences of the same
word separated by a comma and possibly some whitespace, you could
write <userinput>(\w+),\s*\1</userinput>. The sub pattern
<literal>\w+</literal> would find a chunk of word characters, and the
entire expression would match if those were followed by a comma, 0 or
more whitespace characters and then an identical chunk of word characters.
(The
string <literal>\1</literal> references <emphasis>the first sub pattern
enclosed in parentheses</emphasis>.)</para>

<note>
<para>To avoid ambiguity when <userinput>\1</userinput> is followed by more digits (&eg; <userinput>\12</userinput> could be the 12th subpattern or the first subpattern followed by <userinput>2</userinput>), we use <userinput>\{12}</userinput> as the syntax for multi-digit subpatterns.</para>
<para>Examples:</para>
<itemizedlist>
<listitem><para><userinput>\{12}1</userinput> is <quote>use subpattern 12, then 1 as normal text</quote></para></listitem>
<listitem><para><userinput>\123</userinput> is <quote>use capture 1, then 23 as normal text</quote></para></listitem>
</itemizedlist>

</note>

<!-- <para>See also <link linkend="backreferences">Back references</link>.</para> -->

</sect3>

<sect3 id="lookahead-assertions">
<title>Lookahead Assertions</title>

<para>A lookahead assertion is a sub pattern, starting with either
<literal>?=</literal> or <literal>?!</literal>.</para>

<para>For example, to match the literal string <quote>Bill</quote> but
only if not followed by <quote> Gates</quote>, you could use this
expression: <userinput>Bill(?! Gates)</userinput>. (This would find
<quote>Bill Clinton</quote> as well as <quote>Billy the kid</quote>,
but silently ignore the other matches.)</para>

<para>Sub patterns used for assertions are not captured.</para>

<para>See also <link linkend="assertions">Assertions</link>.</para>

</sect3>

<sect3 id="lookbehind-assertions">
<title>Lookbehind Assertions</title>

<para>A lookbehind assertion is a sub pattern, starting with either
<literal>?&lt;=</literal> or <literal>?&lt;!</literal>.</para>

<para>Lookbehind has the same effect as the lookahead, but works backwards.
For example to match the literal string <quote>fruit</quote> but
only if not preceded by <quote>grape</quote>, you could use this
expression: <userinput>(?&lt;!grape)fruit</userinput>.</para>

<para>Sub patterns used for assertions are not captured.</para>

<para>See also <link linkend="assertions">Assertions</link></para>

</sect3>

</sect2>

<sect2 id="special-characters-in-patterns">
<title>Characters with a special meaning inside patterns</title>

<para>The following characters have meaning inside a pattern, and
must be escaped if you want to literally match them:

<variablelist>

<varlistentry>
<term><userinput>\</userinput> (backslash)</term>
<listitem><para>The escape character.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>^</userinput> (caret)</term>
<listitem><para>Asserts the beginning of the string.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>$</userinput></term>
<listitem><para>Asserts the end of string.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>()</userinput> (left and right parentheses)</term>
<listitem><para>Denotes sub patterns.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>{}</userinput> (left and right curly braces)</term>
<listitem><para>Denotes numeric quantifiers.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>[]</userinput> (left and right square brackets)</term>
<listitem><para>Denotes character classes.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>|</userinput> (vertical bar)</term>
<listitem><para>logical OR.
Separates alternatives.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>+</userinput> (plus sign)</term>
<listitem><para>Quantifier, 1 or more.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>*</userinput> (asterisk)</term>
<listitem><para>Quantifier, 0 or more.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>?</userinput> (question mark)</term>
<listitem><para>An optional character. Can be interpreted as a quantifier, 0 or 1.</para></listitem>
</varlistentry>

</variablelist>

</para>

</sect2>

</sect1>

<sect1 id="quantifiers">
<title>Quantifiers</title>

<para><emphasis>Quantifiers</emphasis> allow a regular expression to
match a specified number or range of numbers of either a character,
character class or sub pattern.</para>

<para>Quantifiers are enclosed in curly brackets (<literal>{</literal>
and <literal>}</literal>) and have the general form
<literal>{[minimum-occurrences][,[maximum-occurrences]]}</literal>
</para>

<para>The usage is best explained by example:

<variablelist>

<varlistentry>
<term><userinput>{1}</userinput></term>
<listitem><para>Exactly 1 occurrence</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>{0,1}</userinput></term>
<listitem><para>Zero or 1 occurrences</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>{,1}</userinput></term>
<listitem><para>The same, with less work;)</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>{5,10}</userinput></term>
<listitem><para>At least 5 but at most 10 occurrences.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>{5,}</userinput></term>
<listitem><para>At least 5 occurrences,
no maximum.</para></listitem>
</varlistentry>

</variablelist>

</para>

<para>Additionally, there are some abbreviations:

<variablelist>

<varlistentry>
<term><userinput>*</userinput> (asterisk)</term>
<listitem><para>similar to <literal>{0,}</literal>, find any number of occurrences.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>+</userinput> (plus sign)</term>
<listitem><para>similar to <literal>{1,}</literal>, at least 1 occurrence.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>?</userinput> (question mark)</term>
<listitem><para>similar to <literal>{0,1}</literal>, zero or 1 occurrence.</para></listitem>
</varlistentry>

</variablelist>

</para>

<sect2>

<title>Greed</title>

<para>When using quantifiers with no maximum, regular expressions
default to matching as much of the searched string as possible, commonly
known as <emphasis>greedy</emphasis> behavior.</para>

<para>Modern regular expression software provides the means of
<quote>turning off greediness</quote>, though in a graphical
environment it is up to the interface to provide you with access to
this feature.
For example, a search dialog providing a regular
expression search could have a check box labeled <quote>Minimal
matching</quote>, and it ought to indicate whether greediness is the
default behavior.</para>

</sect2>

<sect2>
<title>In context examples</title>

<para>Here are a few examples of using quantifiers:</para>

<variablelist>

<varlistentry>
<term><userinput>^\d{4,5}\s</userinput></term>
<listitem><para>Matches the digits in <quote>1234 go</quote> and <quote>12345 now</quote>, but neither in <quote>567 eleven</quote>
nor in <quote>223459 somewhere</quote>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\s+</userinput></term>
<listitem><para>Matches one or more whitespace characters.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>(bla){1,}</userinput></term>
<listitem><para>Matches all of <quote>blablabla</quote> and the <quote>bla</quote> in <quote>blackbird</quote> or <quote>tabla</quote>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>/?&gt;</userinput></term>
<listitem><para>Matches <quote>/&gt;</quote> in <quote>&lt;closeditem/&gt;</quote> as well as
<quote>&gt;</quote> in <quote>&lt;openitem&gt;</quote>.</para></listitem>
</varlistentry>

</variablelist>

</sect2>

</sect1>

<sect1 id="assertions">
<title>Assertions</title>

<para><emphasis>Assertions</emphasis> allow a regular expression to
match only under certain controlled conditions.</para>

<para>An assertion does not need a character to match; rather, it
investigates the surroundings of a possible match before acknowledging
it.
For example, the <emphasis>word boundary</emphasis> assertion does
not try to find a non-word character opposite a word character at its
position; instead, it makes sure that there is not a word
character there. This means that the assertion can match where there is no
character, &ie; at the ends of a searched string.</para>

<para>Some assertions actually do have a pattern to match, but the
part of the string matching that pattern will not be a part of the result of
the match of the full expression.</para>

<para>Regular expressions as documented here support the following
assertions:

<variablelist>

<varlistentry>
<term><userinput>^</userinput> (caret: beginning of
string)</term>
<listitem><para>Matches the beginning of the searched
string.</para> <para>The expression <userinput>^Peter</userinput> will
match at <quote>Peter</quote> in the string <quote>Peter, hey!</quote>
but not in <quote>Hey, Peter!</quote> </para> </listitem>
</varlistentry>

<varlistentry>
<term><userinput>$</userinput> (end of string)</term>
<listitem><para>Matches the end of the searched string.</para>

<para>The expression <userinput>you\?$</userinput> will match at the
last <quote>you</quote> in the string <quote>You didn't do that, did you?</quote> but
nowhere in <quote>You didn't do that, right?</quote></para>

</listitem>
</varlistentry>

<varlistentry>
<term><userinput>\b</userinput> (word boundary)</term>
<listitem><para>Matches if there is a word character at one side and not a word character at the
other.</para>
<para>This is useful to find word ends, for example both ends to find
a whole word.
The expression <userinput>\bin\b</userinput> will match
at the separate <quote>in</quote> in the string <quote>He came in
through the window</quote>, but not at the <quote>in</quote> in
<quote>window</quote>.</para></listitem>

</varlistentry>

<varlistentry>
<term><userinput>\B</userinput> (non word boundary)</term>
<listitem><para>Matches wherever <quote>\b</quote> does not.</para>
<para>That means that it will match, for example, within words: the expression
<userinput>\Bin\B</userinput> will match the <quote>in</quote> in <quote>window</quote> but not in <quote>integer</quote> or <quote>I'm in love</quote>.</para>
</listitem>
</varlistentry>

<varlistentry>
<term><userinput>(?=PATTERN)</userinput> (Positive lookahead)</term>
<listitem><para>A lookahead assertion looks at the part of the string following a possible match.
The positive lookahead will prevent the string from matching if the text following the possible match
does not match the <emphasis>PATTERN</emphasis> of the assertion, but the text matched by the assertion will
not be included in the result.</para>
<para>The expression <userinput>handy(?=\w)</userinput> will match at <quote>handy</quote> in
<quote>handyman</quote> but not in <quote>That came in handy!</quote></para>
</listitem>
</varlistentry>

<varlistentry>
<term><userinput>(?!PATTERN)</userinput> (Negative lookahead)</term>

<listitem><para>The negative lookahead prevents a possible match from being
acknowledged if the following part of the searched string matches
its <emphasis>PATTERN</emphasis>.</para>
<para>The expression <userinput>const \w+\b(?!\s*&amp;)</userinput>
will match at <quote>const char</quote> in the string <quote>const
char* foo</quote> while it cannot match <quote>const QString</quote>
in <quote>const QString&amp; bar</quote> because the
<quote>&amp;</quote> matches the negative lookahead
assertion
pattern.</para>
</listitem>
</varlistentry>

<varlistentry>
<term><userinput>(?&lt;=PATTERN)</userinput> (Positive lookbehind)</term>
<listitem><para>Lookbehind has the same effect as the lookahead, but works backwards.
A lookbehind looks at the part of the string preceding a possible match. The positive
lookbehind will match a string only if it is preceded by the <emphasis>PATTERN</emphasis>
of the assertion, but the text matched by the assertion will not be included in the result.</para>
<para>The expression <userinput>(?&lt;=cup)cake</userinput> will match at <quote>cake</quote>
if it is preceded by <quote>cup</quote> (in <quote>cupcake</quote> but not in
<quote>cheesecake</quote> or in <quote>cake</quote> alone).</para>
</listitem>
</varlistentry>

<varlistentry>
<term><userinput>(?&lt;!PATTERN)</userinput> (Negative lookbehind)</term>
<listitem><para>The negative lookbehind prevents a possible match from being acknowledged if
the preceding part of the searched string matches its <emphasis>PATTERN</emphasis>.</para>
<para>The expression <userinput>(?&lt;![\w\.])[0-9]+</userinput> will match at <quote>123</quote>
in the strings <quote>=123</quote> and <quote>-123</quote> while it cannot match <quote>123</quote>
in <quote>.123</quote> or <quote>word123</quote>.</para>
</listitem>
</varlistentry>

<varlistentry>
<term><userinput>(PATTERN)</userinput> (Capturing group)</term>

<listitem><para>The sub pattern within the parentheses is captured and remembered,
so that it can be used in back references.
For example, the expression
<userinput>(&quot;+)[^&quot;]*\1</userinput> matches
<userinput>""""text""""</userinput> and
<userinput>"text"</userinput>.</para>
<para>See the section <link linkend="regex-capturing">Capturing matching text (back references)</link>
for more information.</para>
</listitem>
</varlistentry>

<varlistentry>
<term><userinput>(?:PATTERN)</userinput> (Non-capturing group)</term>

<listitem><para>The sub pattern within the parentheses is not captured and
is not remembered. It is preferable to always use non-capturing groups if
the captures will not be used.</para>
</listitem>
</varlistentry>

</variablelist>

</para>

</sect1>

<!-- TODO sect1 id="backreferences">

<title>Back References</title>

<para></para>

</sect1 -->

</appendix>