MediaWiki
REL1_20
Unicode normalization routines for working with UTF-8 strings.
Static Public Member Functions
static cleanUp ($string)
The ultimate convenience function! Clean up invalid UTF-8 sequences, and convert to normal form C, canonical composition.
static loadData ()
Load the basic composition data if necessary.
static placebo ($string)
This is just used for the benchmark, comparing how long it takes to iterate through a string without really doing anything of substance.
static quickIsNFC ($string)
Returns true if the string is _definitely_ in NFC.
static quickIsNFCVerify (&$string)
Returns true if the string is _definitely_ in NFC.
static toNFC ($string)
Convert a UTF-8 string to normal form C, canonical composition.
static toNFD ($string)
Convert a UTF-8 string to normal form D, canonical decomposition.
static toNFKC ($string)
Convert a UTF-8 string to normal form KC, compatibility composition.
static toNFKD ($string)
Convert a UTF-8 string to normal form KD, compatibility decomposition.
Public Attributes
const UNORM_DEFAULT = self::UNORM_NFC
const UNORM_FCD = 6
const UNORM_NFC = 4
const UNORM_NFD = 2
const UNORM_NFKC = 5
const UNORM_NFKD = 3
const UNORM_NONE = 1
For using the ICU wrapper.
Static Public Attributes
static $utfCanonicalComp = null
static $utfCanonicalDecomp = null
static $utfCheckNFC
static $utfCombiningClass = null
static $utfCompatibilityDecomp = null
Static Private Member Functions
static fastCombiningSort ($string)
Sorts combining characters into canonical order.
static fastCompose ($string)
Produces canonically composed sequences, i.e. normal form C or KC.
static fastDecompose ($string, $map)
Perform decomposition of a UTF-8 string into either D or KD form (depending on which decomposition map is passed to us).
static NFC ($string)
static NFD ($string)
static NFKC ($string)
static NFKD ($string)
static replaceForNativeNormalize ($string)
Function to replace some characters that we don't want but most of the native normalize functions keep.
Unicode normalization routines for working with UTF-8 strings.
Currently assumes that input strings are valid UTF-8!
Not as fast as I'd like, but should be usable for most purposes. UtfNormal::toNFC() will bail early if given ASCII text or text it can quickly determine is already normalized.
All functions can be called statically.
See description of forms at http://www.unicode.org/reports/tr15/
Definition at line 48 of file UtfNormal.php.
static UtfNormal::cleanUp ( $string ) [static]
The ultimate convenience function! Clean up invalid UTF-8 sequences, and convert to normal form C, canonical composition.
Fast return for pure ASCII strings; some lesser optimizations for strings containing only known-good characters. Not as fast as toNFC().
$string (String): a UTF-8 string
Definition at line 79 of file UtfNormal.php.
References NFC(), quickIsNFCVerify(), and replaceForNativeNormalize().
Referenced by CleanUpTest\doTestBytes(), CleanUpTest\doTestDoubleBytes(), CleanUpTest\doTestTripleBytes(), FeedUtils\formatDiffRow(), DjVuImage\pageTextCallback(), CleanUpTest\testAscii(), CleanUpTest\testBomRegression(), CleanUpTest\testChunkRegression(), CleanUpTest\testForbiddenRegression(), CleanUpTest\testHangulRegression(), CleanUpTest\testInterposeRegression(), CleanUpTest\testLatin(), CleanUpTest\testLatinNormal(), CleanUpTest\testNull(), CleanUpTest\testOverlongRegression(), CleanUpTest\testSurrogateRegression(), and CleanUpTest\XtestAllChars().
static UtfNormal::fastCombiningSort ( $string ) [static, private]
Sorts combining characters into canonical order.
This is the final step in creating decomposed normal forms D and KD.
$string (String): a valid, decomposed UTF-8 string. Input is not validated.
Definition at line 570 of file UtfNormal.php.
References $n, $out, and loadData().
Referenced by NFD().
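As a rough illustration of what canonical ordering means, the sketch below reorders each run of combining marks by ascending canonical combining class, keeping the relative order of marks that share a class. It is not the code in UtfNormal.php: the function name sketchCombiningSort and the three-entry class table are hypothetical stand-ins for the full Unicode data, and the syntax assumes PHP 7 or later.

```php
<?php
// Illustrative only: three canonical combining classes out of the full
// Unicode table. The name $combiningClass is an assumption of this sketch.
$combiningClass = [
    "\u{0327}" => 202, // COMBINING CEDILLA
    "\u{0323}" => 220, // COMBINING DOT BELOW
    "\u{0301}" => 230, // COMBINING ACUTE ACCENT
];

/**
 * Reorder each run of combining marks by ascending combining class.
 * Marks with equal classes keep their original order (stable sort).
 * Input is assumed to be valid UTF-8 and is not validated.
 */
function sketchCombiningSort(string $s, array $classes): string {
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    $out = '';
    $run = [];
    foreach ($chars as $c) {
        if (isset($classes[$c])) {
            // Stable insertion: step left only past strictly higher classes.
            $i = count($run);
            while ($i > 0 && $classes[$run[$i - 1]] > $classes[$c]) {
                $i--;
            }
            array_splice($run, $i, 0, [$c]);
        } else {
            // Any character missing from the table is treated as a starter
            // (class 0), which ends the current run.
            $out .= implode('', $run) . $c;
            $run = [];
        }
    }
    return $out . implode('', $run);
}

// "a" + acute (230) + dot below (220) becomes "a" + dot below + acute.
$sorted = sketchCombiningSort("a\u{0301}\u{0323}", $combiningClass);
```

Text without combining marks passes through unchanged, so the sort only costs anything on decomposed sequences.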
static UtfNormal::fastCompose ( $string ) [static, private]
Produces canonically composed sequences, i.e. normal form C or KC.
$string (String): a valid UTF-8 string in sorted normal form D or KD. Input is not validated.
Definition at line 622 of file UtfNormal.php.
References $n, $out, and loadData().
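To make the idea of composition concrete, here is a deliberately simplified sketch that merges a character with the one before it whenever the pair has a precomposed form. It is not the algorithm in UtfNormal.php: sketchCompose and its two-entry pair table are hypothetical, and the sketch ignores marks separated from their starter by other marks, Hangul syllables, and composition exclusions, all of which a full implementation must handle.

```php
<?php
// Illustrative only: two canonical composition pairs. Each key is a
// starter followed by one combining mark.
$compose = [
    "e\u{0301}" => "\u{00E9}", // e + COMBINING ACUTE ACCENT -> U+00E9
    "a\u{0308}" => "\u{00E4}", // a + COMBINING DIAERESIS    -> U+00E4
];

/**
 * Simplified composition over adjacent pairs only.
 * Input is assumed to be valid UTF-8 in sorted decomposed form.
 */
function sketchCompose(string $s, array $pairs): string {
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    $out = [];
    foreach ($chars as $c) {
        $last = count($out) - 1;
        if ($last >= 0 && isset($pairs[$out[$last] . $c])) {
            // Replace the previous character with the precomposed form.
            $out[$last] = $pairs[$out[$last] . $c];
        } else {
            $out[] = $c;
        }
    }
    return implode('', $out);
}

// "cafe" + combining acute becomes "caf" + U+00E9.
$composed = sketchCompose("cafe\u{0301}", $compose);
```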
static UtfNormal::fastDecompose ( $string, $map ) [static, private]
Perform decomposition of a UTF-8 string into either D or KD form (depending on which decomposition map is passed to us).
Input is assumed to be *valid* UTF-8; invalid input will break this function.
$string (String): valid UTF-8 string
$map (Array): hash of expanded decomposition map
Definition at line 510 of file UtfNormal.php.
References $n, $out, $t, and loadData().
Referenced by NFD().
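The role of the map argument can be sketched as follows: the same loop yields form D or form KD depending only on which table it is given. The function sketchDecompose and both small tables are hypothetical, and the sketch assumes that "expanded" means each map value is already fully decomposed, so one lookup per character suffices. Algorithmic cases such as Hangul are left out.

```php
<?php
// Illustrative maps only; real tables are generated from Unicode data.
$canonical = [
    "\u{00E9}" => "e\u{0301}", // U+00E9 -> e + COMBINING ACUTE ACCENT
];
// The compatibility map contains the canonical entries plus mappings
// that discard formatting distinctions.
$compatibility = $canonical + [
    "\u{FB01}" => "fi",        // LATIN SMALL LIGATURE FI -> f, i
    "\u{00B2}" => "2",         // SUPERSCRIPT TWO -> 2
];

/**
 * Replace every character that has an entry in $map with its mapping.
 * Input is assumed to be valid UTF-8 and is not validated.
 */
function sketchDecompose(string $s, array $map): string {
    $out = '';
    foreach (preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) as $c) {
        $out .= $map[$c] ?? $c;
    }
    return $out;
}

$input = "\u{FB01}anc\u{00E9}";
$d  = sketchDecompose($input, $canonical);     // ligature kept
$kd = sketchDecompose($input, $compatibility); // ligature becomes "fi"
```

The KD result cannot be mapped back to the ligature, which is the information loss that the compatibility forms carry.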
static UtfNormal::loadData ( ) [static]
Load the basic composition data if necessary.
Definition at line 191 of file UtfNormal.php.
Referenced by fastCombiningSort(), fastCompose(), fastDecompose(), NFD(), quickIsNFC(), and quickIsNFCVerify().
static UtfNormal::NFC ( $string ) [static, private]
$string (string)
Definition at line 462 of file UtfNormal.php.
References fastCompose(), and NFD().
Referenced by cleanUp(), CleanUpTest\doTestDoubleBytes(), CleanUpTest\doTestTripleBytes(), toNFC(), and CleanUpTest\XtestAllChars().
static UtfNormal::NFD ( $string ) [static, private]
$string (string)
Definition at line 471 of file UtfNormal.php.
References fastCombiningSort(), fastDecompose(), and loadData().
static UtfNormal::NFKC ( $string ) [static, private]
$string (string)
Definition at line 483 of file UtfNormal.php.
References fastCompose(), and NFKD().
Referenced by toNFKC().
static UtfNormal::NFKD ( $string ) [static, private]
static UtfNormal::placebo ( $string ) [static]
This is just used for the benchmark, comparing how long it takes to iterate through a string without really doing anything of substance.
$string (string)
Definition at line 753 of file UtfNormal.php.
References $out.
static UtfNormal::quickIsNFC ( $string ) [static]
Returns true if the string is _definitely_ in NFC.
Returns false if the string is not in NFC, or if that cannot be determined quickly.
$string (String): a valid UTF-8 string. Input is not validated.
Definition at line 203 of file UtfNormal.php.
References $n, and loadData().
Referenced by toNFC().
static UtfNormal::quickIsNFCVerify ( &$string ) [static]
Returns true if the string is _definitely_ in NFC.
Returns false if the string is not in NFC, or if that cannot be determined quickly.
$string (String): a UTF-8 string, altered on output to be valid UTF-8 safe for XML.
Definition at line 243 of file UtfNormal.php.
References $matches, $n, and loadData().
Referenced by Exif\charCodeString(), cleanUp(), IPTC\convIPTCHelper(), GIFMetadataExtractor\getMetadata(), and JpegMetadataExtractor\segmentSplitter().
static UtfNormal::replaceForNativeNormalize ( $string ) [static, private]
Function to replace some characters that we don't want but most of the native normalize functions keep.
$string (String): the string
Definition at line 768 of file UtfNormal.php.
Referenced by cleanUp().
static UtfNormal::toNFC ( $string ) [static]
Convert a UTF-8 string to normal form C, canonical composition.
Fast return for pure ASCII strings; some lesser optimizations for strings containing only known-good characters.
$string (String): a valid UTF-8 string. Input is not validated.
Definition at line 120 of file UtfNormal.php.
References NFC(), and quickIsNFC().
Referenced by normalize_form_c(), and normalize_form_c_php().
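The early-exit order described for toNFC() can be sketched as below. This is an outline under stated assumptions rather than the code in UtfNormal.php: sketchToNfc is a hypothetical name, and the two callables stand in for the quick check and the full NFC pass so that the sketch stays self-contained.

```php
<?php
/**
 * Sketch of the early-exit order. $quickCheck and $fullNormalize are
 * hypothetical stand-ins supplied by the caller.
 */
function sketchToNfc(string $s, callable $quickCheck, callable $fullNormalize): string {
    // Pure ASCII is unchanged by every normalization form.
    if (!preg_match('/[\x80-\xff]/', $s)) {
        return $s;
    }
    // If the quick check answers "definitely NFC", skip the slow path.
    if ($quickCheck($s)) {
        return $s;
    }
    return $fullNormalize($s);
}

// Count how often the slow path runs.
$calls = 0;
$slow = function (string $s) use (&$calls): string {
    $calls++;
    return $s; // placeholder: a real implementation would normalize here
};
// A quick check that is always uncertain, to force the slow path.
$never = function (string $s): bool {
    return false;
};

$ascii = sketchToNfc("plain ASCII", $never, $slow);
$other = sketchToNfc("caf\u{00E9}", $never, $slow);
// Only the non-ASCII string reaches the slow path, so $calls is 1.
```

Ordering the checks from cheapest to most expensive is what lets mostly-ASCII input avoid the table lookups entirely.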
static UtfNormal::toNFD ( $string ) [static]
Convert a UTF-8 string to normal form D, canonical decomposition.
Fast return for pure ASCII strings.
$string (String): a valid UTF-8 string. Input is not validated.
Definition at line 138 of file UtfNormal.php.
References NFD().
Referenced by normalize_form_d(), and normalize_form_d_php().
static UtfNormal::toNFKC ( $string ) [static]
Convert a UTF-8 string to normal form KC, compatibility composition.
This may cause irreversible information loss; use judiciously. Fast return for pure ASCII strings.
$string (String): a valid UTF-8 string. Input is not validated.
Definition at line 157 of file UtfNormal.php.
References NFKC().
Referenced by normalize_form_kc(), and normalize_form_kc_php().
static UtfNormal::toNFKD ( $string ) [static]
Convert a UTF-8 string to normal form KD, compatibility decomposition.
This may cause irreversible information loss; use judiciously. Fast return for pure ASCII strings.
$string (String): a valid UTF-8 string. Input is not validated.
Definition at line 176 of file UtfNormal.php.
References NFKD().
Referenced by normalize_form_kd(), and normalize_form_kd_php().
UtfNormal::$utfCanonicalComp = null [static]
Definition at line 61 of file UtfNormal.php.
Referenced by CleanUpTest\XtestAllChars().
UtfNormal::$utfCanonicalDecomp = null [static]
Definition at line 62 of file UtfNormal.php.
Referenced by benchmarkForm(), and CleanUpTest\XtestAllChars().
UtfNormal::$utfCheckNFC [static]
Definition at line 67 of file UtfNormal.php.
UtfNormal::$utfCombiningClass = null [static]
Definition at line 60 of file UtfNormal.php.
UtfNormal::$utfCompatibilityDecomp = null [static]
Definition at line 65 of file UtfNormal.php.
const UtfNormal::UNORM_DEFAULT = self::UNORM_NFC
Definition at line 58 of file UtfNormal.php.
const UtfNormal::UNORM_FCD = 6
Definition at line 57 of file UtfNormal.php.
const UtfNormal::UNORM_NFC = 4
Definition at line 55 of file UtfNormal.php.
Referenced by donorm().
const UtfNormal::UNORM_NFD = 2
Definition at line 53 of file UtfNormal.php.
const UtfNormal::UNORM_NFKC = 5
Definition at line 56 of file UtfNormal.php.
const UtfNormal::UNORM_NFKD = 3
Definition at line 54 of file UtfNormal.php.
const UtfNormal::UNORM_NONE = 1
For using the ICU wrapper.
Definition at line 52 of file UtfNormal.php.