<!-- ##### SECTION Title ##### -->
Perl-compatible regular expressions

<!-- ##### SECTION Short_Description ##### -->
matches strings against regular expressions.

<!-- ##### SECTION Long_Description ##### -->
<para>
The <function>g_regex_*()</function> functions implement regular
expression pattern matching using syntax and semantics similar to
Perl regular expressions.
</para>
<para>
Some functions accept a <parameter>start_position</parameter> argument.
Setting it differs from simply passing a shortened string and setting
#G_REGEX_MATCH_NOTBOL when the pattern begins with any kind of
lookbehind assertion.
For example, consider the pattern "\Biss\B", which finds occurrences of "iss"
in the middle of words. ("\B" matches only if the current position in the
subject is not a word boundary.) When applied to the remainder of the string
"Mississippi" starting at byte offset 4, namely "issippi", it does not match,
because "\B" is always false at the start of the subject, which is deemed to
be a word boundary. However, if the entire string is passed with
<parameter>start_position</parameter> set to 4, it finds the second
occurrence of "iss" because it is able to look behind the starting point
to discover that it is preceded by a letter.
</para>
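<para>
The effect of looking behind the starting point can be illustrated with a
small sketch in plain C. It does not use GRegex at all: the helper names
<function>is_word_char()</function>, <function>not_word_boundary()</function>
and <function>find_inner_iss()</function> are hypothetical, only ASCII word
characters are considered, and the search is a hand-written stand-in for the
pattern "\Biss\B" rather than the way PCRE actually works.
</para>

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: ASCII letters, digits and '_' are word characters. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Returns nonzero if pos is NOT a word boundary in subject[0..len).
 * Positions outside the subject count as non-word characters, which is
 * why offset 0 of a shortened string always looks like a boundary. */
static int not_word_boundary(const char *subject, size_t len, size_t pos)
{
    int before = pos > 0 && is_word_char((unsigned char) subject[pos - 1]);
    int after = pos < len && is_word_char((unsigned char) subject[pos]);
    return before == after;
}

/* Hypothetical stand-in for matching "\Biss\B" from byte offset start.
 * Returns the byte offset of the first match, or -1 if there is none. */
static long find_inner_iss(const char *subject, size_t len, size_t start)
{
    size_t pos;

    for (pos = start; pos + 3 <= len; pos++) {
        if (memcmp(subject + pos, "iss", 3) == 0
            && not_word_boundary(subject, len, pos)
            && not_word_boundary(subject, len, pos + 3))
            return (long) pos;
    }
    return -1;
}
```

<para>
With this sketch, searching the shortened string "issippi" from offset 0
returns -1, while searching the whole string "Mississippi" from offset 4
returns 4, because the byte before the starting offset is visible to the
boundary check.
</para>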
<para>
Note that, unless you set the #G_REGEX_RAW flag, all the strings passed
to these functions must be encoded in UTF-8. The lengths and the positions
inside the strings are in bytes and not in characters, so, for instance,
"\xc3\xa0" (i.e. "&agrave;") is two bytes long but it is treated as a single
character. If you set #G_REGEX_RAW, the strings need not be valid UTF-8
and each byte is treated as a character, so "\xc3\xa0" is two bytes
and two characters long.
</para>
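<para>
The difference between byte lengths and character counts can be sketched in
plain C as follows. The function <function>utf8_char_count()</function> is a
hypothetical helper, not part of GLib, and it assumes its input is already
valid UTF-8.
</para>

```c
#include <stddef.h>
#include <string.h>

/* Counts the characters in a NUL-terminated string that is assumed to be
 * valid UTF-8, by counting every byte that is not a continuation byte
 * (continuation bytes have the bit pattern 10xxxxxx). */
static size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char) *s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

<para>
For the string "\xc3\xa0", <function>strlen()</function> reports 2 bytes
while <function>utf8_char_count()</function> reports 1 character. Lengths
and positions passed to the regular expression functions follow the byte
view.
</para>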
<para>
When matching a pattern, "\n" matches only against a "\n" character in the
string, and "\r" matches only a "\r" character. To match any newline sequence
use "\R". This particular group matches either the two-character sequence
CR + LF ("\r\n"), or one of the single characters LF (linefeed, U+000A, "\n"), VT
(vertical tab, U+000B, "\v"), FF (formfeed, U+000C, "\f"), CR (carriage return,
U+000D, "\r"), NEL (next line, U+0085), LS (line separator, U+2028), or PS
(paragraph separator, U+2029).
</para>
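<para>
As an illustration of the sequences listed above, the following plain C
sketch recognizes them in a UTF-8 buffer. The function
<function>newline_seq_len()</function> is a hypothetical helper, not part of
GLib or PCRE, and it only mirrors the list given for "\R" under the
assumption that the buffer is UTF-8 encoded.
</para>

```c
#include <stddef.h>

/* Returns the length in bytes of the newline sequence that starts at
 * s[pos] in a UTF-8 buffer of len bytes, or 0 if none starts there. */
static size_t newline_seq_len(const char *s, size_t len, size_t pos)
{
    const unsigned char *p;
    size_t left;

    if (pos >= len)
        return 0;
    p = (const unsigned char *) s + pos;
    left = len - pos;

    /* CR + LF must be tested before the single-character cases. */
    if (left >= 2 && p[0] == 0x0D && p[1] == 0x0A)
        return 2;
    /* LF (U+000A), VT (U+000B), FF (U+000C), CR (U+000D). */
    if (p[0] >= 0x0A && p[0] <= 0x0D)
        return 1;
    /* NEL (U+0085) is encoded as C2 85 in UTF-8. */
    if (left >= 2 && p[0] == 0xC2 && p[1] == 0x85)
        return 2;
    /* LS (U+2028) and PS (U+2029) are encoded as E2 80 A8 and E2 80 A9. */
    if (left >= 3 && p[0] == 0xE2 && p[1] == 0x80
        && (p[2] == 0xA8 || p[2] == 0xA9))
        return 3;
    return 0;
}
```

<para>
Testing for CR + LF first means the pair is treated as one two-byte sequence
rather than as two separate newlines.
</para>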
<para>
The behaviour of the dot, circumflex, and dollar metacharacters is affected by
newline characters. The default is to recognize any newline character (the same
characters recognized by "\R"). This can be changed with #G_REGEX_NEWLINE_CR,
#G_REGEX_NEWLINE_LF and #G_REGEX_NEWLINE_CRLF compile options,
and with #G_REGEX_MATCH_NEWLINE_ANY, #G_REGEX_MATCH_NEWLINE_CR,
#G_REGEX_MATCH_NEWLINE_LF and #G_REGEX_MATCH_NEWLINE_CRLF match options.
These settings are also relevant when compiling a pattern if
#G_REGEX_EXTENDED is set, and an unescaped "#" outside a character class is
encountered. This indicates a comment that lasts until after the next
newline.
</para>
<para>
Creating and manipulating the same #GRegex structure from different
threads is not a problem, as #GRegex does not modify its internal
state between creation and destruction. #GMatchInfo, on the other hand,
is not threadsafe.
</para>
<para>
The low-level regular expression functionality is provided by
the excellent <ulink url="http://www.pcre.org/">PCRE</ulink> library
written by Philip Hazel.
</para>

<!-- ##### SECTION See_Also ##### -->
<para>

</para>

<!-- ##### SECTION Stability_Level ##### -->


<!-- ##### ENUM GRegexError ##### -->
<para>
Error codes returned by regular expression functions.
</para>

@G_REGEX_ERROR_COMPILE: Compilation of the regular expression in g_regex_new() failed.
@G_REGEX_ERROR_OPTIMIZE: Optimization of the regular expression failed.
@G_REGEX_ERROR_REPLACE: Replacement failed due to an ill-formed replacement string.
@G_REGEX_ERROR_MATCH: The match process failed.
@Since: 2.14

<!-- ##### MACRO G_REGEX_ERROR ##### -->
<para>
Error domain for regular expressions. Errors in this domain will be from the #GRegexError enumeration. See #GError for information on error domains.
</para>

@Since: 2.14


<!-- ##### ENUM GRegexCompileFlags ##### -->
<para>
Flags specifying compile-time options.
</para>

@G_REGEX_CASELESS: Letters in the pattern match both upper and lower case
letters. It can be changed within a pattern by a "(?i)" option setting.
@G_REGEX_MULTILINE: By default, GRegex treats each string as consisting
of a single line of characters (even if it actually contains newlines).
The "start of line" metacharacter ("^") matches only at the start of the
string, while the "end of line" metacharacter ("$") matches only at the
end of the string, or before a terminating newline (unless
#G_REGEX_DOLLAR_ENDONLY is set). When #G_REGEX_MULTILINE is set,
the "start of line" and "end of line" constructs match immediately following
or immediately before any newline in the string, respectively, as well
as at the very start and end. This can be changed within a pattern by a
"(?m)" option setting.
@G_REGEX_DOTALL: A dot metacharacter (".") in the pattern matches all
characters, including newlines. Without it, newlines are excluded. This
option can be changed within a pattern by a "(?s)" option setting.
@G_REGEX_EXTENDED: Whitespace data characters in the pattern are
totally ignored except when escaped or inside a character class.
Whitespace does not include the VT character (code 11). In addition,
characters between an unescaped "#" outside a character class and
the next newline character, inclusive, are also ignored. This can be
changed within a pattern by a "(?x)" option setting.
@G_REGEX_ANCHORED: The pattern is forced to be "anchored", that is,
it is constrained to match only at the first matching point in the string
that is being searched. This effect can also be achieved by appropriate
constructs in the pattern itself such as the "^" metacharacter.
@G_REGEX_DOLLAR_ENDONLY: A dollar metacharacter ("$") in the pattern
matches only at the end of the string. Without this option, a dollar also
matches immediately before the final character if it is a newline (but
not before any other newlines). This option is ignored if
#G_REGEX_MULTILINE is set.
@G_REGEX_UNGREEDY: Inverts the "greediness" of the
quantifiers so that they are not greedy by default, but become greedy
if followed by "?". It can also be set by a "(?U)" option setting within
the pattern.
@G_REGEX_RAW: Usually strings must be valid UTF-8 strings. When this
flag is set, they are treated as a raw sequence of bytes.
@G_REGEX_NO_AUTO_CAPTURE: Disables the use of numbered capturing
parentheses in the pattern. Any opening parenthesis that is not followed
by "?" behaves as if it were followed by "?:" but named parentheses can
still be used for capturing (and they acquire numbers in the usual way).
@G_REGEX_OPTIMIZE: Optimize the regular expression. If the pattern will
be used many times, then it may be worth the effort to optimize it to
improve the speed of matches.
@G_REGEX_DUPNAMES: Names used to identify capturing subpatterns need not
be unique. This can be helpful for certain types of pattern when it is known
that only one instance of the named subpattern can ever be matched.
@G_REGEX_NEWLINE_CR: Usually any newline character is recognized. If this
option is set, the only recognized newline character is '\r'.
@G_REGEX_NEWLINE_LF: Usually any newline character is recognized. If this
option is set, the only recognized newline character is '\n'.
@G_REGEX_NEWLINE_CRLF: Usually any newline character is recognized. If this
option is set, the only recognized newline character sequence is '\r\n'.
@Since: 2.14

<!-- ##### ENUM GRegexMatchFlags ##### -->
<para>
Flags specifying match-time options.
</para>

@G_REGEX_MATCH_ANCHORED: The pattern is forced to be "anchored", that is,
it is constrained to match only at the first matching point in the string
that is being searched. This effect can also be achieved by appropriate
constructs in the pattern itself such as the "^" metacharacter.
@G_REGEX_MATCH_NOTBOL: Specifies that the first character of the string is
not the beginning of a line, so the circumflex metacharacter should not
match before it. Setting this without #G_REGEX_MULTILINE (at compile time)
causes circumflex never to match. This option affects only the behaviour of
the circumflex metacharacter; it does not affect "\A".
@G_REGEX_MATCH_NOTEOL: Specifies that the end of the subject string is
not the end of a line, so the dollar metacharacter should not match it nor
(except in multiline mode) a newline immediately before it. Setting this
without #G_REGEX_MULTILINE (at compile time) causes dollar never to match.
This option affects only the behaviour of the dollar metacharacter; it does
not affect "\Z" or "\z".
@G_REGEX_MATCH_NOTEMPTY: An empty string is not considered to be a valid
match if this option is set. If there are alternatives in the pattern, they
are tried. If all the alternatives match the empty string, the entire match
fails. For example, if the pattern "a?b?" is applied to a string not beginning
with "a" or "b", it matches the empty string at the start of the string.
With this flag set, this match is not valid, so GRegex searches further
into the string for occurrences of "a" or "b".
@G_REGEX_MATCH_PARTIAL: Turns on the partial matching feature. For more
documentation on partial matching see g_match_info_is_partial_match().
@G_REGEX_MATCH_NEWLINE_CR: Overrides the newline definition set when creating
a new #GRegex, setting the '\r' character as line terminator.
@G_REGEX_MATCH_NEWLINE_LF: Overrides the newline definition set when creating
a new #GRegex, setting the '\n' character as line terminator.
@G_REGEX_MATCH_NEWLINE_CRLF: Overrides the newline definition set when creating
a new #GRegex, setting the '\r\n' character sequence as line terminator.
@G_REGEX_MATCH_NEWLINE_ANY: Overrides the newline definition set when creating
a new #GRegex, so that any newline character or character sequence is recognized.
@Since: 2.14

<!-- ##### STRUCT GRegex ##### -->
<para>
A GRegex is the "compiled" form of a regular expression pattern. This
structure is opaque and its fields cannot be accessed directly.
</para>

@Since: 2.14

<!-- ##### USER_FUNCTION GRegexEvalCallback ##### -->
<para>
Specifies the type of the function passed to g_regex_replace_eval().
It is called for each occurrence of the pattern @regex in @string, and it
should append the replacement to @result.
</para>

@Param1: the #GRegex passed to g_regex_replace_eval()
@Param2: the #GMatchInfo generated by the match
@Param3: the string used to perform matches against
@Param4: a #GString containing the new string
@Param5: user data passed to g_regex_replace_eval()
@Returns: %FALSE to continue the replacement process, %TRUE to stop it
@Since: 2.14


<!-- ##### FUNCTION g_regex_new ##### -->
<para>

</para>

@pattern: 
@compile_options: 
@match_options: 
@error: 
@Returns: 


<!-- ##### FUNCTION g_regex_free ##### -->
<para>

</para>

@regex: 


<!-- ##### FUNCTION g_regex_get_pattern ##### -->
<para>

</para>

@regex: 
@Returns: 


<!-- ##### FUNCTION g_regex_get_max_backref ##### -->
<para>

</para>

@regex: 
@Returns: 


<!-- ##### FUNCTION g_regex_get_capture_count ##### -->
<para>

</para>

@regex: 
@Returns: 


<!-- ##### FUNCTION g_regex_get_string_number ##### -->
<para>

</para>

@regex: 
@name: 
@Returns: 


<!-- ##### FUNCTION g_regex_escape_string ##### -->
<para>

</para>

@string: 
@length: 
@Returns: 


<!-- ##### FUNCTION g_regex_match_simple ##### -->
<para>

</para>

@pattern: 
@string: 
@compile_options: 
@match_options: 
@Returns: 


<!-- ##### FUNCTION g_regex_match ##### -->
<para>

</para>

@regex: 
@string: 
@match_options: 
@match_info: 
@Returns: 


<!-- ##### FUNCTION g_regex_match_full ##### -->
<para>

</para>

@regex: 
@string: 
@string_len: 
@start_position: 
@match_options: 
@match_info: 
@error: 
@Returns: 


<!-- ##### FUNCTION g_regex_match_all ##### -->
<para>

</para>

@regex: 
@string: 
@match_options: 
@match_info: 
@Returns: 


<!-- ##### FUNCTION g_regex_match_all_full ##### -->
<para>

</para>

@regex: 
@string: 
@string_len: 
@start_position: 
@match_options: 
@match_info: 
@error: 
@Returns: 


<!-- ##### FUNCTION g_regex_split_simple ##### -->
<para>

</para>

@pattern: 
@string: 
@compile_options: 
@match_options: 
@Returns: 


<!-- ##### FUNCTION g_regex_split ##### -->
<para>

</para>

@regex: 
@string: 
@match_options: 
@Returns: 


<!-- ##### FUNCTION g_regex_split_full ##### -->
<para>

</para>

@regex: 
@string: 
@string_len: 
@start_position: 
@match_options: 
@max_tokens: 
@error: 
@Returns: 


<!-- ##### FUNCTION g_regex_replace ##### -->
<para>

</para>

@regex: 
@string: 
@string_len: 
@start_position: 
@replacement: 
@match_options: 
@error: 
@Returns: 


<!-- ##### FUNCTION g_regex_replace_literal ##### -->
<para>

</para>

@regex: 
@string: 
@string_len: 
@start_position: 
@replacement: 
@match_options: 
@error: 
@Returns: 


<!-- ##### FUNCTION g_regex_replace_eval ##### -->
<para>

</para>

@regex: 
@string: 
@string_len: 
@start_position: 
@match_options: 
@eval: 
@user_data: 
@error: 
@Returns: 


<!-- ##### STRUCT GMatchInfo ##### -->
<para>
#GMatchInfo is used to retrieve information about the regular expression match
which created it.
This structure is opaque and its fields cannot be accessed directly.
</para>

@Since: 2.14

<!-- ##### FUNCTION g_match_info_free ##### -->
<para>

</para>

@match_info: 


<!-- ##### FUNCTION g_match_info_next ##### -->
<para>

</para>

@match_info: 
@error: 
@Returns: 


<!-- ##### FUNCTION g_match_info_get_match_count ##### -->
<para>

</para>

@match_info: 
@Returns: 


<!-- ##### FUNCTION g_match_info_is_partial_match ##### -->
<para>

</para>

@match_info: 
@Returns: 


<!-- ##### FUNCTION g_match_info_expand_references ##### -->
<para>

</para>

@match_info: 
@string_to_expand: 
@error: 
@Returns: 


<!-- ##### FUNCTION g_match_info_fetch ##### -->
<para>

</para>

@match_info: 
@match_num: 
@Returns: 


<!-- ##### FUNCTION g_match_info_fetch_pos ##### -->
<para>

</para>

@match_info: 
@match_num: 
@start_pos: 
@end_pos: 
@Returns: 


<!-- ##### FUNCTION g_match_info_fetch_named ##### -->
<para>

</para>

@match_info: 
@name: 
@Returns: 


<!-- ##### FUNCTION g_match_info_fetch_named_pos ##### -->
<para>

</para>

@match_info: 
@name: 
@start_pos: 
@end_pos: 
@Returns: 


<!-- ##### FUNCTION g_match_info_fetch_all ##### -->
<para>

</para>

@match_info: 
@Returns: