MediaWiki
REL1_24
Unicode normalization routines for working with UTF-8 strings.
Static Public Member Functions | |
static | cleanUp ($string) |
The ultimate convenience function! Clean up invalid UTF-8 sequences, and convert to normal form C, canonical composition. | |
static | loadData () |
Load the basic composition data if necessary. | |
static | placebo ($string) |
This is used only for the benchmark, to measure how long it takes to iterate through a string without doing any substantive work. | |
static | quickIsNFC ($string) |
Returns true if the string is _definitely_ in NFC. | |
static | quickIsNFCVerify (&$string) |
Returns true if the string is _definitely_ in NFC. | |
static | toNFC ($string) |
Convert a UTF-8 string to normal form C, canonical composition. | |
static | toNFD ($string) |
Convert a UTF-8 string to normal form D, canonical decomposition. | |
static | toNFKC ($string) |
Convert a UTF-8 string to normal form KC, compatibility composition. | |
static | toNFKD ($string) |
Convert a UTF-8 string to normal form KD, compatibility decomposition. | |
Public Attributes | |
const | UNORM_DEFAULT = self::UNORM_NFC |
const | UNORM_FCD = 6 |
const | UNORM_NFC = 4 |
const | UNORM_NFD = 2 |
const | UNORM_NFKC = 5 |
const | UNORM_NFKD = 3 |
const | UNORM_NONE = 1 |
For using the ICU wrapper. | |
Static Public Attributes | |
static | $utfCanonicalComp = null |
static | $utfCanonicalDecomp = null |
static | $utfCheckNFC |
static | $utfCombiningClass = null |
static | $utfCompatibilityDecomp = null |
Static Private Member Functions | |
static | fastCombiningSort ($string) |
Sorts combining characters into canonical order. | |
static | fastCompose ($string) |
Produces canonically composed sequences, i.e. normal form C or KC. | |
static | fastDecompose ($string, $map) |
Perform decomposition of a UTF-8 string into either D or KD form (depending on which decomposition map is passed to us). | |
static | NFC ($string) |
static | NFD ($string) |
static | NFKC ($string) |
static | NFKD ($string) |
static | replaceForNativeNormalize ($string) |
Function to replace some characters that we don't want but most of the native normalize functions keep. |
Unicode normalization routines for working with UTF-8 strings.
Currently assumes that input strings are valid UTF-8!
Not as fast as I'd like, but should be usable for most purposes. UtfNormal::toNFC() will bail early if given ASCII text or text it can quickly determine is already normalized.
All functions can be called statically.
See description of forms at http://www.unicode.org/reports/tr15/
Definition at line 48 of file UtfNormal.php.
static UtfNormal::cleanUp | ( | $ | string | ) | [static] |
The ultimate convenience function! Clean up invalid UTF-8 sequences, and convert to normal form C, canonical composition.
Fast return for pure ASCII strings; some lesser optimizations for strings containing only known-good characters. Not as fast as toNFC().
string | $string | a UTF-8 string |
Definition at line 78 of file UtfNormal.php.
References NFC(), quickIsNFCVerify(), and replaceForNativeNormalize().
Referenced by MWDebug\debugMsg(), FeedUtils\formatDiffRow(), MediaWikiSite\normalizePageName(), DjVuImage\pageTextCallback(), CleanUpTest\testAscii(), CleanUpTest\testBomRegression(), CleanUpTest\testBytes(), CleanUpTest\testChunkRegression(), CleanUpTest\testDoubleBytes(), CleanUpTest\testForbiddenRegression(), CleanUpTest\testHangulRegression(), CleanUpTest\testInterposeRegression(), CleanUpTest\testLatin(), CleanUpTest\testLatinNormal(), CleanUpTest\testNull(), CleanUpTest\testOverlongRegression(), CleanUpTest\testSurrogateRegression(), CleanUpTest\testTripleBytes(), and CleanUpTest\XtestAllChars().
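The two stages described for cleanUp() (repair invalid UTF-8, then convert to normal form C) can be illustrated outside PHP. The following is a minimal Python sketch, not the code in UtfNormal.php: the function name `clean_up` is hypothetical, and replacing each invalid sequence with U+FFFD is an assumption made for this example.

```python
import unicodedata


def clean_up(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing invalid sequences, then normalize to NFC."""
    # Fast path: pure ASCII is always valid UTF-8 and already in NFC.
    if raw.isascii():
        return raw.decode("ascii")
    # Assumption for this sketch: undecodable sequences become U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    return unicodedata.normalize("NFC", text)
```

The ASCII check mirrors the "fast return for pure ASCII strings" mentioned above; everything else goes through the slower decode-and-normalize path.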
static UtfNormal::fastCombiningSort | ( | $ | string | ) | [static, private] |
Sorts combining characters into canonical order.
This is the final step in creating decomposed normal forms D and KD.
string | $string | a valid, decomposed UTF-8 string. Input is not validated. |
Definition at line 573 of file UtfNormal.php.
References $n, $out, array(), and loadData().
Referenced by NFD().
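Canonical ordering, the step that fastCombiningSort() performs, can be sketched in a few lines of Python. This is an illustration of the general technique under the assumption that the input is already decomposed; `canonical_order` is a hypothetical name, and the combining classes come from Python's `unicodedata` module rather than from UtfNormal's data tables.

```python
import unicodedata


def canonical_order(text: str) -> str:
    """Stable-sort each run of combining marks by canonical combining class."""
    out = []
    run = []  # pending combining marks (combining class > 0)
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A starter ends the current run: flush it in sorted order.
            run.sort(key=unicodedata.combining)  # list.sort is stable
            out.extend(run)
            run = []
            out.append(ch)
        else:
            run.append(ch)
    # Flush any marks left at the end of the string.
    run.sort(key=unicodedata.combining)
    out.extend(run)
    return "".join(out)
```

The sort must be stable so that marks with the same combining class keep their relative order, which is significant for rendering.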
static UtfNormal::fastCompose | ( | $ | string | ) | [static, private] |
Produces canonically composed sequences, i.e. normal form C or KC.
string | $string | a valid UTF-8 string in sorted normal form D or KD. Input is not validated. |
Definition at line 628 of file UtfNormal.php.
References $n, $out, empty, and loadData().
static UtfNormal::fastDecompose | ( | $ | string, |
$ | map | ||
) | [static, private] |
Perform decomposition of a UTF-8 string into either D or KD form (depending on which decomposition map is passed to us).
Input is assumed to be *valid* UTF-8; invalid sequences will break the decomposition.
string | $string | valid UTF-8 string |
array | $map | hash of expanded decomposition map |
Definition at line 512 of file UtfNormal.php.
References $n, $out, $t, and loadData().
Referenced by NFD().
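Map-driven decomposition of the kind fastDecompose() is described as doing can be sketched as a per-character lookup. The Python below is an illustration only: `DECOMP_MAP` is a hypothetical, hand-picked excerpt of a fully expanded canonical map, `decompose` is a hypothetical name, and algorithmic Hangul decomposition is deliberately left out.

```python
# Hypothetical excerpt of a fully expanded canonical decomposition map.
# "Expanded" means each entry is already decomposed all the way down,
# so a single lookup per character is enough.
DECOMP_MAP = {
    "\u00e9": "e\u0301",        # e with acute -> e + combining acute
    "\u00c5": "A\u030a",        # A with ring above -> A + combining ring
    "\u01d6": "u\u0308\u0304",  # u with diaeresis and macron -> u + two marks
}


def decompose(text: str, mapping: dict) -> str:
    """Replace each mapped character with its expansion; keep the rest as-is."""
    return "".join(mapping.get(ch, ch) for ch in text)
```

Passing a compatibility map instead of a canonical one would yield the KD form with the same loop, which is why the map is a parameter. The result still needs canonical ordering of combining marks before it is a valid normal form.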
static UtfNormal::loadData | ( | ) | [static] |
Load the basic composition data if necessary.
Definition at line 190 of file UtfNormal.php.
Referenced by fastCombiningSort(), fastCompose(), fastDecompose(), NFD(), quickIsNFC(), and quickIsNFCVerify().
static UtfNormal::NFC | ( | $ | string | ) | [static, private] |
string | $string | |
Definition at line 464 of file UtfNormal.php.
References fastCompose(), and NFD().
Referenced by cleanUp(), CleanUpTest\testDoubleBytes(), CleanUpTest\testTripleBytes(), toNFC(), and CleanUpTest\XtestAllChars().
static UtfNormal::NFD | ( | $ | string | ) | [static, private] |
string | $string | |
Definition at line 473 of file UtfNormal.php.
References fastCombiningSort(), fastDecompose(), and loadData().
static UtfNormal::NFKC | ( | $ | string | ) | [static, private] |
string | $string | |
Definition at line 485 of file UtfNormal.php.
References fastCompose(), and NFKD().
Referenced by toNFKC().
static UtfNormal::NFKD | ( | $ | string | ) | [static, private] |
static UtfNormal::placebo | ( | $ | string | ) | [static] |
This is used only for the benchmark, to measure how long it takes to iterate through a string without doing any substantive work.
string | $string | |
Definition at line 763 of file UtfNormal.php.
References $out.
static UtfNormal::quickIsNFC | ( | $ | string | ) | [static] |
Returns true if the string is _definitely_ in NFC.
Returns false if not or uncertain.
string | $string | a valid UTF-8 string. Input is not validated. |
Definition at line 202 of file UtfNormal.php.
References $n, and loadData().
Referenced by toNFC().
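The "true means definitely, false means no or uncertain" contract can be shown with a deliberately conservative check. The Python sketch below is not the table-driven test in UtfNormal.php; it relies only on the fact that every code point below U+0300 is a starter that NFC leaves unchanged. The names `quick_is_nfc` and `to_nfc` are hypothetical.

```python
import unicodedata


def quick_is_nfc(text: str) -> bool:
    """Return True only when text is certainly in NFC; False means no or unsure."""
    # Combining marks and characters that NFC rewrites all start at U+0300
    # or above, so a string below that threshold is already in NFC.
    return all(ord(ch) < 0x0300 for ch in text)


def to_nfc(text: str) -> str:
    """Normalize to NFC, skipping the full pass when the quick check succeeds."""
    if quick_is_nfc(text):
        return text
    return unicodedata.normalize("NFC", text)
```

A False result is not a verdict: a precomposed Hangul syllable is in NFC but fails this check, so the caller simply falls back to full normalization.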
static UtfNormal::quickIsNFCVerify | ( | &$ | string | ) | [static] |
Returns true if the string is _definitely_ in NFC.
Returns false if not or uncertain.
string | $string | a UTF-8 string, altered on output to be valid UTF-8 safe for XML. |
Definition at line 243 of file UtfNormal.php.
References $matches, $n, array(), and loadData().
Referenced by Exif\charCodeString(), cleanUp(), IPTC\convIPTCHelper(), GIFMetadataExtractor\getMetadata(), and JpegMetadataExtractor\segmentSplitter().
static UtfNormal::replaceForNativeNormalize | ( | $ | string | ) | [static, private] |
Function to replace some characters that we don't want but most of the native normalize functions keep.
string | $string | The string |
Definition at line 780 of file UtfNormal.php.
Referenced by cleanUp().
static UtfNormal::toNFC | ( | $ | string | ) | [static] |
Convert a UTF-8 string to normal form C, canonical composition.
Fast return for pure ASCII strings; some lesser optimizations for strings containing only known-good characters.
string | $string | a valid UTF-8 string. Input is not validated. |
Definition at line 119 of file UtfNormal.php.
References NFC(), and quickIsNFC().
Referenced by normalize_form_c(), and normalize_form_c_php().
static UtfNormal::toNFD | ( | $ | string | ) | [static] |
Convert a UTF-8 string to normal form D, canonical decomposition.
Fast return for pure ASCII strings.
string | $string | a valid UTF-8 string. Input is not validated. |
Definition at line 137 of file UtfNormal.php.
References NFD().
Referenced by normalize_form_d(), and normalize_form_d_php().
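One common reason to normalize is to compare strings that are canonically equivalent but encoded differently. A minimal Python sketch, assuming normal form D as the comparison form (form C would work equally well); `canonically_equal` is a hypothetical helper name.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

Under this comparison, a precomposed "é" (U+00E9) matches "e" followed by U+0301, while the "fi" ligature (U+FB01) does not match the two letters "fi", because that relationship is a compatibility mapping rather than a canonical one.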
static UtfNormal::toNFKC | ( | $ | string | ) | [static] |
Convert a UTF-8 string to normal form KC, compatibility composition.
This may cause irreversible information loss; use judiciously. Fast return for pure ASCII strings.
string | $string | a valid UTF-8 string. Input is not validated. |
Definition at line 156 of file UtfNormal.php.
References NFKC().
Referenced by normalize_form_kc(), and normalize_form_kc_php().
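The information loss mentioned above is easy to observe. The Python sketch below uses the standard `unicodedata` module rather than UtfNormal; `compare_forms` is a hypothetical helper that returns the NFC and NFKC results side by side.

```python
import unicodedata


def compare_forms(text: str) -> tuple:
    """Return (NFC, NFKC) for the same input so the difference is visible."""
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFKC", text),
    )


# Characters with compatibility mappings: a ligature, a superscript digit,
# and a circled digit. NFC keeps them; NFKC folds them to plain characters.
SAMPLES = ["\ufb01", "x\u00b2", "\u2460"]
RESULTS = {s: compare_forms(s) for s in SAMPLES}
```

After NFKC, "x²" and "x2" are the same string and the original distinction cannot be recovered, which is why compatibility forms suit search keys and identifiers better than stored text.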
static UtfNormal::toNFKD | ( | $ | string | ) | [static] |
Convert a UTF-8 string to normal form KD, compatibility decomposition.
This may cause irreversible information loss; use judiciously. Fast return for pure ASCII strings.
string | $string | a valid UTF-8 string. Input is not validated. |
Definition at line 175 of file UtfNormal.php.
References NFKD().
Referenced by normalize_form_kd(), and normalize_form_kd_php().
UtfNormal::$utfCanonicalComp = null [static] |
Definition at line 61 of file UtfNormal.php.
Referenced by CleanUpTest\XtestAllChars().
UtfNormal::$utfCanonicalDecomp = null [static] |
Definition at line 62 of file UtfNormal.php.
Referenced by benchmarkForm(), and CleanUpTest\XtestAllChars().
UtfNormal::$utfCheckNFC [static] |
Definition at line 66 of file UtfNormal.php.
UtfNormal::$utfCombiningClass = null [static] |
Definition at line 60 of file UtfNormal.php.
UtfNormal::$utfCompatibilityDecomp = null [static] |
Definition at line 65 of file UtfNormal.php.
const UtfNormal::UNORM_DEFAULT = self::UNORM_NFC |
Definition at line 58 of file UtfNormal.php.
const UtfNormal::UNORM_FCD = 6 |
Definition at line 57 of file UtfNormal.php.
const UtfNormal::UNORM_NFC = 4 |
Definition at line 55 of file UtfNormal.php.
Referenced by donorm().
const UtfNormal::UNORM_NFD = 2 |
Definition at line 53 of file UtfNormal.php.
const UtfNormal::UNORM_NFKC = 5 |
Definition at line 56 of file UtfNormal.php.
const UtfNormal::UNORM_NFKD = 3 |
Definition at line 54 of file UtfNormal.php.
const UtfNormal::UNORM_NONE = 1 |
For using the ICU wrapper.
Definition at line 52 of file UtfNormal.php.