MediaWiki REL1_19
Unicode normalization routines for working with UTF-8 strings.
Static Public Member Functions
static cleanUp ($string)
The ultimate convenience function! Clean up invalid UTF-8 sequences, and convert to normal form C, canonical composition.
static loadData ()
Load the basic composition data if necessary.
static placebo ($string)
This is just used for the benchmark, comparing how long it takes to iterate through a string without really doing anything of substance.
static quickIsNFC ($string)
Returns true if the string is _definitely_ in NFC.
static quickIsNFCVerify (&$string)
Returns true if the string is _definitely_ in NFC.
static toNFC ($string)
Convert a UTF-8 string to normal form C, canonical composition.
static toNFD ($string)
Convert a UTF-8 string to normal form D, canonical decomposition.
static toNFKC ($string)
Convert a UTF-8 string to normal form KC, compatibility composition.
static toNFKD ($string)
Convert a UTF-8 string to normal form KD, compatibility decomposition.
Public Attributes
const UNORM_DEFAULT = self::UNORM_NFC
const UNORM_FCD = 6
const UNORM_NFC = 4
const UNORM_NFD = 2
const UNORM_NFKC = 5
const UNORM_NFKD = 3
const UNORM_NONE = 1
For using the ICU wrapper.
Static Public Attributes
static $utfCanonicalComp = null
static $utfCanonicalDecomp = null
static $utfCheckNFC
static $utfCombiningClass = null
static $utfCompatibilityDecomp = null
Static Private Member Functions
static fastCombiningSort ($string)
Sorts combining characters into canonical order.
static fastCompose ($string)
Produces canonically composed sequences, i.e. normal form C or KC.
static fastDecompose ($string, $map)
Perform decomposition of a UTF-8 string into either D or KD form (depending on which decomposition map is passed to us).
static NFC ($string)
static NFD ($string)
static NFKC ($string)
static NFKD ($string)
static replaceForNativeNormalize ($string)
Function to replace some characters that we don't want but most of the native normalize functions keep.
Unicode normalization routines for working with UTF-8 strings.
Currently assumes that input strings are valid UTF-8!
Not as fast as I'd like, but should be usable for most purposes. UtfNormal::toNFC() will bail early if given ASCII text or text it can quickly determine is already normalized.
All functions can be called statically.
See description of forms at http://www.unicode.org/reports/tr15/
Definition at line 48 of file UtfNormal.php.
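To make the difference between the composed and decomposed forms concrete, here is a minimal PHP sketch. It does not call UtfNormal; the one-entry table and the function name tinyCompose() are hypothetical, and plain string replacement stands in for the real composition step, which also requires decomposition and canonical ordering first.

```php
<?php
// Two UTF-8 spellings of the same visible character.
$composed   = "\xC3\xA9";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE (form C)
$decomposed = "e\xCC\x81"; // U+0065 followed by U+0301 COMBINING ACUTE ACCENT (form D)

// Hypothetical one-entry composition table, for illustration only.
// The real class loads complete tables through loadData().
$tinyCompositionTable = array( $decomposed => $composed );

// Illustrative stand-in for composition: replace known decomposed
// sequences with their composed equivalents.
function tinyCompose( $string, $table ) {
	return strtr( $string, $table );
}

// The two spellings differ byte for byte until they are normalized.
$before = ( "caf" . $decomposed === "caf" . $composed );
$after  = ( tinyCompose( "caf" . $decomposed, $tinyCompositionTable ) === "caf" . $composed );
```

Here $before is false and $after is true, which is why text is normalized to a single form before it is compared or stored.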
static UtfNormal::cleanUp ( $string ) [static]
The ultimate convenience function! Clean up invalid UTF-8 sequences, and convert to normal form C, canonical composition.
Fast return for pure ASCII strings; some lesser optimizations for strings containing only known-good characters. Not as fast as toNFC().
$string (String): a UTF-8 string
Definition at line 79 of file UtfNormal.php.
References NFC(), quickIsNFCVerify(), and replaceForNativeNormalize().
Referenced by CleanUpTest\doTestBytes(), CleanUpTest\doTestDoubleBytes(), CleanUpTest\doTestTripleBytes(), FeedUtils\formatDiffRow(), Language\normalize(), WebRequest\normalizeUnicode(), DjVuImage\pageTextCallback(), Preprocessor_DOM\preprocessToObj(), CleanUpTest\testAscii(), CleanUpTest\testBomRegression(), CleanUpTest\testChunkRegression(), CleanUpTest\testForbiddenRegression(), CleanUpTest\testHangulRegression(), CleanUpTest\testInterposeRegression(), CleanUpTest\testLatin(), CleanUpTest\testLatinNormal(), CleanUpTest\testNull(), CleanUpTest\testOverlongRegression(), CleanUpTest\testSurrogateRegression(), xmlsafe(), and CleanUpTest\XtestAllChars().
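The "fast return for pure ASCII strings" can be sketched as below. This is an assumption about the general shape of such a shortcut, not a copy of cleanUp(); the names isPureAscii() and cleanUpSketch() and the $slowPath callback are hypothetical.

```php
<?php
// Hypothetical helper: true when no byte has its high bit set.
function isPureAscii( $string ) {
	return !preg_match( '/[\x80-\xff]/', $string );
}

// Illustrative wrapper: ASCII text is always valid UTF-8 and already in
// normal form C, so it can be returned untouched. Anything else is handed
// to a caller-supplied slow path (validation plus normalization).
function cleanUpSketch( $string, $slowPath ) {
	if ( isPureAscii( $string ) ) {
		return $string;
	}
	return call_user_func( $slowPath, $string );
}
```

The design point is that the cheap byte scan is paid on every call, while the expensive path runs only for strings that contain non-ASCII bytes.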
static UtfNormal::fastCombiningSort ( $string ) [static, private]
Sorts combining characters into canonical order.
This is the final step in creating decomposed normal forms D and KD.
$string (String): a valid, decomposed UTF-8 string. Input is not validated.
Definition at line 569 of file UtfNormal.php.
References $n, $out, and loadData().
Referenced by NFD().
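Canonical ordering means that each run of combining marks following a base character is sorted by combining class, with marks of equal class kept in their original order. The sketch below illustrates the idea with a three-entry class table; the table, the function name combiningSortSketch() and the insertion-sort approach are assumptions for illustration and are not taken from fastCombiningSort() itself.

```php
<?php
// Hypothetical miniature combining-class table keyed by UTF-8 bytes.
$tinyCombiningClasses = array(
	"\xCC\x81" => 230, // U+0301 COMBINING ACUTE ACCENT
	"\xCC\xA3" => 220, // U+0323 COMBINING DOT BELOW
	"\xCC\xA7" => 202, // U+0327 COMBINING CEDILLA
);

// Sort every run of combining marks by class. Input must be valid UTF-8.
function combiningSortSketch( $string, $classes ) {
	$chars = preg_split( '//u', $string, -1, PREG_SPLIT_NO_EMPTY );
	$out = '';
	$run = array(); // pending marks, each stored as array( class, bytes )
	foreach ( $chars as $c ) {
		$class = isset( $classes[$c] ) ? $classes[$c] : 0;
		if ( $class === 0 ) {
			// A starter ends the current run: flush it, then emit the starter.
			foreach ( $run as $mark ) {
				$out .= $mark[1];
			}
			$run = array();
			$out .= $c;
			continue;
		}
		// Stable insertion: shift only marks with a strictly higher class.
		$i = count( $run );
		while ( $i > 0 && $run[$i - 1][0] > $class ) {
			$run[$i] = $run[$i - 1];
			$i--;
		}
		$run[$i] = array( $class, $c );
	}
	foreach ( $run as $mark ) {
		$out .= $mark[1];
	}
	return $out;
}
```

For example, "a" followed by U+0301 and then U+0323 is reordered so that U+0323 (class 220) precedes U+0301 (class 230).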
static UtfNormal::fastCompose ( $string ) [static, private]
Produces canonically composed sequences, i.e. normal form C or KC.
$string (String): a valid UTF-8 string in sorted normal form D or KD. Input is not validated.
Definition at line 621 of file UtfNormal.php.
References $n, $out, and loadData().
Referenced by NFC(), and NFKC().
static UtfNormal::fastDecompose ( $string, $map ) [static, private]
Perform decomposition of a UTF-8 string into either D or KD form (depending on which decomposition map is passed to us).
Input is assumed to be *valid* UTF-8. Invalid input will break it.
$string (String): valid UTF-8 string
$map (Array): hash of expanded decomposition map
Definition at line 509 of file UtfNormal.php.
References $n, $out, $t, and loadData().
Referenced by NFD().
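The role of the $map parameter can be illustrated with a small sketch in which the same routine yields either the canonical or the compatibility decomposition, depending only on the table it receives. The two tables and the function name decomposeSketch() are hypothetical; because the map is described as already expanded, a single lookup per character is assumed to be enough.

```php
<?php
// Hypothetical miniature decomposition maps keyed by UTF-8 bytes.
$tinyCanonicalMap = array(
	"\xC3\xA9" => "e\xCC\x81", // U+00E9 -> U+0065 U+0301
);
$tinyCompatibilityMap = $tinyCanonicalMap + array(
	"\xEF\xAC\x81" => "fi",    // U+FB01 LATIN SMALL LIGATURE FI -> "fi"
);

// Replace each character that has an entry in $map. Input must be valid UTF-8.
function decomposeSketch( $string, $map ) {
	$out = '';
	foreach ( preg_split( '//u', $string, -1, PREG_SPLIT_NO_EMPTY ) as $c ) {
		$out .= isset( $map[$c] ) ? $map[$c] : $c;
	}
	return $out;
}

$input = "caf\xC3\xA9 \xEF\xAC\x81n";
$formD  = decomposeSketch( $input, $tinyCanonicalMap );     // ligature kept
$formKD = decomposeSketch( $input, $tinyCompatibilityMap ); // ligature expanded
```

A complete decomposition would still need the combining marks sorted into canonical order afterwards.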
static UtfNormal::loadData ( ) [static]
Load the basic composition data if necessary.
Definition at line 191 of file UtfNormal.php.
Referenced by fastCombiningSort(), fastCompose(), fastDecompose(), NFD(), quickIsNFC(), and quickIsNFCVerify().
static UtfNormal::NFC ( $string ) [static, private]
$string (string)
Definition at line 461 of file UtfNormal.php.
References fastCompose(), and NFD().
Referenced by cleanUp(), CleanUpTest\doTestDoubleBytes(), CleanUpTest\doTestTripleBytes(), toNFC(), and CleanUpTest\XtestAllChars().
static UtfNormal::NFD ( $string ) [static, private]
$string (string)
Definition at line 470 of file UtfNormal.php.
References fastCombiningSort(), fastDecompose(), and loadData().
Referenced by NFC(), and toNFD().
static UtfNormal::NFKC ( $string ) [static, private]
$string (string)
Definition at line 482 of file UtfNormal.php.
References fastCompose(), and NFKD().
Referenced by toNFKC().
static UtfNormal::NFKD ( $string ) [static, private]
$string (string)
Definition at line 491 of file UtfNormal.php.
Referenced by NFKC(), and toNFKD().
static UtfNormal::placebo ( $string ) [static]
This is just used for the benchmark, comparing how long it takes to iterate through a string without really doing anything of substance.
$string (string)
Definition at line 752 of file UtfNormal.php.
References $out.
static UtfNormal::quickIsNFC ( $string ) [static]
Returns true if the string is _definitely_ in NFC.
Returns false if not or uncertain.
$string (String): a valid UTF-8 string. Input is not validated.
Definition at line 203 of file UtfNormal.php.
References $n, and loadData().
Referenced by toNFC().
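The "definitely, or else uncertain" contract can be sketched as a conservative scan: return true only when no character could possibly need work. The two miniature tables and the name quickCheckSketch() are hypothetical, and the sketch only approximates the idea; it is not the code of quickIsNFC().

```php
<?php
// Hypothetical miniature tables keyed by UTF-8 bytes.
$tinyCheckNFC = array(
	"\xE2\x84\xAB" => 'N', // U+212B ANGSTROM SIGN is never allowed in NFC
);
$tinyCombiningClass = array(
	"\xCC\x81" => 230,     // U+0301 COMBINING ACUTE ACCENT
);

// True means "certainly NFC". False means "not NFC, or not sure".
// Input must be valid UTF-8.
function quickCheckSketch( $string, $checkNFC, $combiningClass ) {
	if ( !preg_match( '/[\x80-\xff]/', $string ) ) {
		return true; // pure ASCII is always in NFC
	}
	foreach ( preg_split( '//u', $string, -1, PREG_SPLIT_NO_EMPTY ) as $c ) {
		if ( isset( $checkNFC[$c] ) || isset( $combiningClass[$c] ) ) {
			return false; // may need composing or reordering
		}
	}
	return true;
}
```

A caller should treat false as "run the full normalizer", not as "the string is wrong": a string containing a combining mark may already be in NFC, and only the slow path can tell.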
static UtfNormal::quickIsNFCVerify ( &$string ) [static]
Returns true if the string is _definitely_ in NFC.
Returns false if not or uncertain.
$string (String): a UTF-8 string, altered on output to be valid UTF-8 safe for XML.
Definition at line 242 of file UtfNormal.php.
References $matches, $n, and loadData().
Referenced by Exif\charCodeString(), cleanUp(), IPTC\convIPTCHelper(), GIFMetadataExtractor\getMetadata(), and JpegMetadataExtractor\segmentSplitter().
static UtfNormal::replaceForNativeNormalize ( $string ) [static, private]
Function to replace some characters that we don't want but most of the native normalize functions keep.
$string (String): the string
Definition at line 767 of file UtfNormal.php.
Referenced by cleanUp().
static UtfNormal::toNFC ( $string ) [static]
Convert a UTF-8 string to normal form C, canonical composition.
Fast return for pure ASCII strings; some lesser optimizations for strings containing only known-good characters.
$string (String): a valid UTF-8 string. Input is not validated.
Definition at line 120 of file UtfNormal.php.
References NFC(), and quickIsNFC().
Referenced by normalize_form_c(), and normalize_form_c_php().
static UtfNormal::toNFD ( $string ) [static]
Convert a UTF-8 string to normal form D, canonical decomposition.
Fast return for pure ASCII strings.
$string (String): a valid UTF-8 string. Input is not validated.
Definition at line 138 of file UtfNormal.php.
References NFD().
Referenced by normalize_form_d(), and normalize_form_d_php().
static UtfNormal::toNFKC ( $string ) [static]
Convert a UTF-8 string to normal form KC, compatibility composition.
This may cause irreversible information loss; use judiciously. Fast return for pure ASCII strings.
$string (String): a valid UTF-8 string. Input is not validated.
Definition at line 157 of file UtfNormal.php.
References NFKC().
Referenced by normalize_form_kc(), and normalize_form_kc_php().
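The information loss in the compatibility forms comes from distinct characters being folded onto the same plain equivalent. A minimal sketch, using a hypothetical two-entry table and the made-up name tinyCompatFold() rather than the real compatibility data:

```php
<?php
// Hypothetical two-entry compatibility table keyed by UTF-8 bytes.
$tinyCompatTable = array(
	"\xC2\xB2"     => "2",  // U+00B2 SUPERSCRIPT TWO
	"\xEF\xAC\x81" => "fi", // U+FB01 LATIN SMALL LIGATURE FI
);

// Illustrative stand-in for compatibility folding.
function tinyCompatFold( $string, $table ) {
	return strtr( $string, $table );
}

// "x squared" and the plain text "x2" become indistinguishable.
$squared = tinyCompatFold( "x\xC2\xB2", $tinyCompatTable );
$plain   = tinyCompatFold( "x2", $tinyCompatTable );
$collide = ( $squared === $plain );
```

Since both inputs produce the same bytes, nothing in the output records which one was supplied, so the original cannot be recovered.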
static UtfNormal::toNFKD ( $string ) [static]
Convert a UTF-8 string to normal form KD, compatibility decomposition.
This may cause irreversible information loss; use judiciously. Fast return for pure ASCII strings.
$string (String): a valid UTF-8 string. Input is not validated.
Definition at line 176 of file UtfNormal.php.
References NFKD().
Referenced by normalize_form_kd(), and normalize_form_kd_php().
UtfNormal::$utfCanonicalComp = null [static]
Definition at line 61 of file UtfNormal.php.
Referenced by CleanUpTest\XtestAllChars().
UtfNormal::$utfCanonicalDecomp = null [static]
Definition at line 62 of file UtfNormal.php.
Referenced by benchmarkForm(), and CleanUpTest\XtestAllChars().
UtfNormal::$utfCheckNFC [static]
Definition at line 67 of file UtfNormal.php.
UtfNormal::$utfCombiningClass = null [static]
Definition at line 60 of file UtfNormal.php.
UtfNormal::$utfCompatibilityDecomp = null [static]
Definition at line 65 of file UtfNormal.php.
const UtfNormal::UNORM_DEFAULT = self::UNORM_NFC
Definition at line 58 of file UtfNormal.php.
const UtfNormal::UNORM_FCD = 6
Definition at line 57 of file UtfNormal.php.
const UtfNormal::UNORM_NFC = 4
Definition at line 55 of file UtfNormal.php.
Referenced by donorm(), and Installer\envCheckLibicu().
const UtfNormal::UNORM_NFD = 2
Definition at line 53 of file UtfNormal.php.
const UtfNormal::UNORM_NFKC = 5
Definition at line 56 of file UtfNormal.php.
const UtfNormal::UNORM_NFKD = 3
Definition at line 54 of file UtfNormal.php.
const UtfNormal::UNORM_NONE = 1
For using the ICU wrapper.
Definition at line 52 of file UtfNormal.php.