Boost.Regex: Standards Conformance

Boost.Regex

Standards Conformance

C++

Boost.regex is intended to conform to the regular expression standardization proposal, which will appear in a future C++ standard technical report (and hopefully in a future version of the standard).

ECMAScript / JavaScript

All of the ECMAScript regular expression syntax features are supported, except that:

Negated class escapes (\S, \D and \W) are not permitted inside character class definitions ( [...] ).

The escape sequence \u matches any upper case character (the same as [[:upper:]]) rather than a Unicode escape sequence; use \x{DDDD} for Unicode escape sequences.

Perl

Almost all Perl features are supported, except for:

(?{code}) Not implementable in a compiled strongly typed language.

(??{code}) Not implementable in a compiled strongly typed language.

POSIX

All the POSIX basic and extended regular expression features are supported, except that:

No character collating names are recognized except those specified in the POSIX standard for the C locale, unless they are explicitly registered with the traits class.

Character equivalence classes ( [[=a=]] etc) are probably buggy except on Win32. Implementing this feature requires knowledge of the format of the string sort keys produced by the system; if you need this, and the default implementation doesn't work on your platform, then you will need to supply a custom traits class.

Unicode

The following comments refer to Unicode Technical Standard #18: Unicode Regular Expressions version 9.

# Feature Support

1.1 Hex Notation Yes: use \x{DDDD} to refer to code point UDDDD.

1.2 Character Properties All the names listed under the General Category Property are supported. Script names and Other Names are not currently supported.

1.3 Subtraction and Intersection
Indirectly support by forward-lookahead:

(?=[[:X:]])[[:Y:]]

Gives the intersection of character properties X and Y.

(?![[:X:]])[[:Y:]]

Gives everything in Y that is not in X (subtraction).

1.4 Simple Word Boundaries Conforming: non-spacing marks are included in the set of word characters.

1.5 Caseless Matching Supported, note that at this level, case transformations are 1:1, many to many case folding operations are not supported (for example "ß" to "SS").

1.6 Line Boundaries Supported, except that "." matches only one character of "\r\n". Other than that word boundaries match correctly; including not matching in the middle of a "\r\n" sequence.

1.7 Code Points Supported: provided you use the u32* algorithms, then UTF-8, UTF-16 and UTF-32 are all treated as sequences of 32-bit code points.

2.1 Canonical Equivalence Not supported: it is up to the user of the library to convert all text into the same canonical form as the regular expression.

2.2 Default Grapheme Clusters Not supported.

2.3
Default Word Boundaries
Not supported.

2.4
Default Loose Matches
Not Supported.

2.5 Name Properties Supported: the expression "[[:name:]]" or \N{name} matches the named character "name".

2.6 Wildcard properties Not Supported.

3.1 Tailored Punctuation. Not Supported.

3.2 Tailored Grapheme Clusters Not Supported.

3.3 Tailored Word Boundaries. Not Supported.

3.4 Tailored Loose Matches Partial support: [[=c=]] matches characters with the same primary equivalence class as "c".

3.5 Tailored Ranges Supported: [a-b] matches any character that collates in the range a to b, when the expression is constructed with the collate flag set.

3.6 Context Matches Not Supported.

3.7 Incremental Matches Supported: pass the flag match_partial to the regex algorithms.

3.8 Unicode Set Sharing Not Supported.

3.9 Possible Match Sets Not supported, however this information is used internally to optimise the matching of regular expressions, and return quickly if no match is possible.

3.10 Folded Matching Partial Support: It is possible to achieve a similar effect by using a custom regular expression traits class.

3.11 Custom Submatch Evaluation Not Supported.

Revised 28 June 2004

Use, modification and distribution are subject to the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

(?{code})	Not implementable in a compiled strongly typed language.
(??{code})	Not implementable in a compiled strongly typed language.

#	Feature	Support
1.1	Hex Notation	Yes: use \x{DDDD} to refer to code point UDDDD.
1.2	Character Properties	All the names listed under the General Category Property are supported. Script names and Other Names are not currently supported.
1.3	Subtraction and Intersection	Indirectly support by forward-lookahead: (?=[[:X:]])[[:Y:]] Gives the intersection of character properties X and Y. (?![[:X:]])[[:Y:]] Gives everything in Y that is not in X (subtraction).
1.4	Simple Word Boundaries	Conforming: non-spacing marks are included in the set of word characters.
1.5	Caseless Matching	Supported, note that at this level, case transformations are 1:1, many to many case folding operations are not supported (for example "ß" to "SS").
1.6	Line Boundaries	Supported, except that "." matches only one character of "\r\n". Other than that word boundaries match correctly; including not matching in the middle of a "\r\n" sequence.
1.7	Code Points	Supported: provided you use the u32* algorithms, then UTF-8, UTF-16 and UTF-32 are all treated as sequences of 32-bit code points.
2.1	Canonical Equivalence	Not supported: it is up to the user of the library to convert all text into the same canonical form as the regular expression.
2.2	Default Grapheme Clusters	Not supported.
2.3	Default Word Boundaries	Not supported.
2.4	Default Loose Matches	Not Supported.
2.5	Name Properties	Supported: the expression "[[:name:]]" or \N{name} matches the named character "name".
2.6	Wildcard properties	Not Supported.
3.1	Tailored Punctuation.	Not Supported.
3.2	Tailored Grapheme Clusters	Not Supported.
3.3	Tailored Word Boundaries.	Not Supported.
3.4	Tailored Loose Matches	Partial support: [[=c=]] matches characters with the same primary equivalence class as "c".
3.5	Tailored Ranges	Supported: [a-b] matches any character that collates in the range a to b, when the expression is constructed with the collate flag set.
3.6	Context Matches	Not Supported.
3.7	Incremental Matches	Supported: pass the flag match_partial to the regex algorithms.
3.8	Unicode Set Sharing	Not Supported.
3.9	Possible Match Sets	Not supported, however this information is used internally to optimise the matching of regular expressions, and return quickly if no match is possible.
3.10	Folded Matching	Partial Support: It is possible to achieve a similar effect by using a custom regular expression traits class.
3.11	Custom Submatch Evaluation	Not Supported.