MediaWiki REL1_19
UtfNormal Class Reference

Unicode normalization routines for working with UTF-8 strings.

Static Public Member Functions

static cleanUp ($string)
 The ultimate convenience function! Clean up invalid UTF-8 sequences, and convert to normal form C, canonical composition.
static loadData ()
 Load the basic composition data if necessary.
static placebo ($string)
 This is just used for the benchmark, comparing how long it takes to iterate through a string without really doing anything of substance.
static quickIsNFC ($string)
 Returns true if the string is _definitely_ in NFC.
static quickIsNFCVerify (&$string)
 Returns true if the string is _definitely_ in NFC.
static toNFC ($string)
 Convert a UTF-8 string to normal form C, canonical composition.
static toNFD ($string)
 Convert a UTF-8 string to normal form D, canonical decomposition.
static toNFKC ($string)
 Convert a UTF-8 string to normal form KC, compatibility composition.
static toNFKD ($string)
 Convert a UTF-8 string to normal form KD, compatibility decomposition.

Public Attributes

const UNORM_DEFAULT = self::UNORM_NFC
const UNORM_FCD = 6
const UNORM_NFC = 4
const UNORM_NFD = 2
const UNORM_NFKC = 5
const UNORM_NFKD = 3
const UNORM_NONE = 1
 For using the ICU wrapper.

Static Public Attributes

static $utfCanonicalComp = null
static $utfCanonicalDecomp = null
static $utfCheckNFC
static $utfCombiningClass = null
static $utfCompatibilityDecomp = null

Static Private Member Functions

static fastCombiningSort ($string)
 Sorts combining characters into canonical order.
static fastCompose ($string)
 Produces canonically composed sequences, i.e. normal form C or KC.
static fastDecompose ($string, $map)
 Perform decomposition of a UTF-8 string into either D or KD form (depending on which decomposition map is passed to us).
static NFC ($string)
static NFD ($string)
static NFKC ($string)
static NFKD ($string)
static replaceForNativeNormalize ($string)
 Function to replace some characters that we don't want but most of the native normalize functions keep.

Detailed Description

Unicode normalization routines for working with UTF-8 strings.

Currently assumes that input strings are valid UTF-8!

Not as fast as I'd like, but should be usable for most purposes. UtfNormal::toNFC() will bail early if given ASCII text or text it can quickly determine is already normalized.

All functions can be called statically.

See description of forms at http://www.unicode.org/reports/tr15/

Definition at line 48 of file UtfNormal.php.


Member Function Documentation

static UtfNormal::cleanUp ($string) [static]

The ultimate convenience function! Clean up invalid UTF-8 sequences, and convert to normal form C, canonical composition.

Fast return for pure ASCII strings; some lesser optimizations for strings containing only known-good characters. Not as fast as toNFC().
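The fast paths mentioned above can be pictured with a short sketch. It is not the UtfNormal implementation: isPureAscii(), isValidUtf8() and classifyInput() are hypothetical helpers, assumed here only to show how pure ASCII, well-formed UTF-8 and malformed input might be told apart before any real normalization work is done.

```php
<?php
// Hypothetical helpers for illustration only; they are not part of UtfNormal.

// True when the string contains only 7-bit ASCII bytes.
function isPureAscii( $s ) {
    // Any byte in the range 0x80-0xFF means non-ASCII data is present.
    return preg_match( '/[\x80-\xff]/', $s ) === 0;
}

// True when the string is well-formed UTF-8.
// In /u mode PCRE refuses malformed input and preg_match() returns false.
function isValidUtf8( $s ) {
    return preg_match( '//u', $s ) === 1;
}

// Decide which code path a string would need.
function classifyInput( $s ) {
    if ( isPureAscii( $s ) ) {
        return 'ascii';   // already normalized, nothing to do
    }
    return isValidUtf8( $s ) ? 'utf8' : 'invalid';
}
```

A caller could return ASCII input unchanged, hand 'utf8' input to the normalizer, and send only 'invalid' input through the slower clean-up path.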

Parameters:
$string  String: a UTF-8 string
Returns:
string a clean, shiny, normalized UTF-8 string

Definition at line 79 of file UtfNormal.php.

References NFC(), quickIsNFCVerify(), and replaceForNativeNormalize().

Referenced by CleanUpTest\doTestBytes(), CleanUpTest\doTestDoubleBytes(), CleanUpTest\doTestTripleBytes(), FeedUtils\formatDiffRow(), Language\normalize(), WebRequest\normalizeUnicode(), DjVuImage\pageTextCallback(), Preprocessor_DOM\preprocessToObj(), CleanUpTest\testAscii(), CleanUpTest\testBomRegression(), CleanUpTest\testChunkRegression(), CleanUpTest\testForbiddenRegression(), CleanUpTest\testHangulRegression(), CleanUpTest\testInterposeRegression(), CleanUpTest\testLatin(), CleanUpTest\testLatinNormal(), CleanUpTest\testNull(), CleanUpTest\testOverlongRegression(), CleanUpTest\testSurrogateRegression(), xmlsafe(), and CleanUpTest\XtestAllChars().

static UtfNormal::fastCombiningSort ($string) [static, private]

Sorts combining characters into canonical order.

This is the final step in creating decomposed normal forms D and KD.
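As a rough illustration of canonical ordering, the sketch below sorts each run of combining marks by combining class while leaving base characters in place. The three-entry $toyClasses table and the functions sketchCombiningSort() and sortRun() are assumptions made for this example; the real class data covers all of Unicode and is loaded by loadData().

```php
<?php
// Toy combining-class table (assumption): real data is far larger.
$toyClasses = array(
    "\xCC\x81" => 230, // U+0301 COMBINING ACUTE ACCENT
    "\xCC\xA3" => 220, // U+0323 COMBINING DOT BELOW
    "\xCC\xA7" => 202, // U+0327 COMBINING CEDILLA
);

// Hypothetical helper: stable insertion sort of a run of array( class, char ) pairs.
function sortRun( $run ) {
    for ( $i = 1; $i < count( $run ); $i++ ) {
        $item = $run[$i];
        $j = $i - 1;
        // Strict comparison keeps marks of equal class in their original order.
        while ( $j >= 0 && $run[$j][0] > $item[0] ) {
            $run[$j + 1] = $run[$j];
            $j--;
        }
        $run[$j + 1] = $item;
    }
    $sorted = '';
    foreach ( $run as $pair ) {
        $sorted .= $pair[1];
    }
    return $sorted;
}

// Hypothetical sketch: input must be valid, already decomposed UTF-8.
function sketchCombiningSort( $string, $classes ) {
    $out = '';
    $run = array(); // combining marks seen since the last base character
    foreach ( preg_split( '//u', $string, -1, PREG_SPLIT_NO_EMPTY ) as $c ) {
        if ( isset( $classes[$c] ) ) {
            $run[] = array( $classes[$c], $c );
            continue;
        }
        // A base character ends the current run of marks.
        $out .= sortRun( $run ) . $c;
        $run = array();
    }
    return $out . sortRun( $run );
}
```

With this table, "a" followed by U+0301 and then U+0323 comes back as "a", U+0323, U+0301, because class 220 sorts before class 230.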

Parameters:
$string  String: a valid, decomposed UTF-8 string. Input is not validated.
Returns:
string a UTF-8 string with combining characters sorted in canonical order

Definition at line 569 of file UtfNormal.php.

References $n, $out, and loadData().

Referenced by NFD().

static UtfNormal::fastCompose ($string) [static, private]

Produces canonically composed sequences, i.e. normal form C or KC.
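The idea of canonical composition can be sketched as follows: walk the sorted, decomposed string and fold a combining mark into the most recent starter whenever a precomposed form exists and no earlier mark blocks it. The tables $toyClasses and $toyCompose and the function sketchCompose() are hypothetical, hold only a few entries, and leave out special cases such as Hangul, so this is an illustration rather than the method used by UtfNormal.

```php
<?php
// Toy tables (assumption): real composition data covers all of Unicode.
$toyClasses = array(
    "\xCC\x81" => 230, // U+0301 COMBINING ACUTE ACCENT
    "\xCC\xA3" => 220, // U+0323 COMBINING DOT BELOW
);
$toyCompose = array(
    "e\xCC\x81" => "\xC3\xA9",     // e + U+0301 -> U+00E9
    "e\xCC\xA3" => "\xE1\xBA\xB9", // e + U+0323 -> U+1EB9
);

// Hypothetical sketch of pairwise canonical composition.
// Input must be valid UTF-8 in sorted, decomposed form.
function sketchCompose( $string, $classes, $compose ) {
    $out = array();
    $starter = -1;  // index in $out of the most recent starter
    $lastClass = 0; // combining class of the last character appended
    foreach ( preg_split( '//u', $string, -1, PREG_SPLIT_NO_EMPTY ) as $c ) {
        $class = isset( $classes[$c] ) ? $classes[$c] : 0;
        if ( $starter >= 0 ) {
            $pair = $out[$starter] . $c;
            // Blocked when an uncomposed mark of the same or a higher class sits in between.
            $blocked = $lastClass !== 0 && $lastClass >= $class;
            if ( !$blocked && isset( $compose[$pair] ) ) {
                $out[$starter] = $compose[$pair];
                continue;
            }
        }
        if ( $class === 0 ) {
            $starter = count( $out ); // this character becomes the new starter
        }
        $lastClass = $class;
        $out[] = $c;
    }
    return implode( '', $out );
}
```

Given "e", U+0323, U+0301, the sketch composes the first pair into U+1EB9 and leaves U+0301 in place, since the toy table has no entry for that second pair.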

Parameters:
$string  String: a valid UTF-8 string in sorted normal form D or KD. Input is not validated.
Returns:
string a UTF-8 string with canonical precomposed characters used where possible

Definition at line 621 of file UtfNormal.php.

References $n, $out, and loadData().

Referenced by NFC(), and NFKC().

static UtfNormal::fastDecompose ($string, $map) [static, private]

Perform decomposition of a UTF-8 string into either D or KD form (depending on which decomposition map is passed to us).

Input is assumed to be *valid* UTF-8. Invalid code will break.
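To make the role of the map argument concrete, here is a minimal sketch in which the same routine yields D-style or KD-style output depending on the table it receives. The two tiny tables and the function sketchDecompose() are assumptions for illustration. The single strtr() pass works only because each entry is already fully expanded, and it is not necessarily how fastDecompose() is written.

```php
<?php
// Toy maps (assumption): the real tables are far larger.
$toyCanonical = array(
    "\xC3\xA9" => "e\xCC\x81", // U+00E9 -> e + U+0301
);
$toyCompatibility = $toyCanonical + array(
    "\xEF\xAC\x81" => "fi",    // U+FB01 LATIN SMALL LIGATURE FI -> "fi"
);

// Hypothetical sketch: replace every mapped character with its expansion.
// Valid UTF-8 is self-synchronizing, so byte-wise matching cannot split a character.
function sketchDecompose( $string, $map ) {
    return strtr( $string, $map );
}
```

The canonical map leaves the ligature U+FB01 untouched, while the compatibility map turns it into the plain letters "fi". The result still needs its combining marks sorted before it is in normal form.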

Parameters:
$string  String: valid UTF-8 string
$map  Array: hash of expanded decomposition map
Returns:
string a UTF-8 string decomposed, not yet normalized (needs sorting)

Definition at line 509 of file UtfNormal.php.

References $n, $out, $t, and loadData().

Referenced by NFD().

static UtfNormal::loadData () [static]

Load the basic composition data if necessary.

Access:
private

Definition at line 191 of file UtfNormal.php.

Referenced by fastCombiningSort(), fastCompose(), fastDecompose(), NFD(), quickIsNFC(), and quickIsNFCVerify().

static UtfNormal::NFC ($string) [static, private]

Parameters:
$string  string
Returns:
string

Definition at line 461 of file UtfNormal.php.

References fastCompose(), and NFD().

Referenced by cleanUp(), CleanUpTest\doTestDoubleBytes(), CleanUpTest\doTestTripleBytes(), toNFC(), and CleanUpTest\XtestAllChars().

static UtfNormal::NFD ($string) [static, private]

Parameters:
$string  string
Returns:
string

Definition at line 470 of file UtfNormal.php.

References fastCombiningSort(), fastDecompose(), and loadData().

Referenced by NFC(), and toNFD().

static UtfNormal::NFKC ($string) [static, private]

Parameters:
$string  string
Returns:
string

Definition at line 482 of file UtfNormal.php.

References fastCompose(), and NFKD().

Referenced by toNFKC().

static UtfNormal::NFKD ($string) [static, private]

Parameters:
$string  string
Returns:
string

Definition at line 491 of file UtfNormal.php.

Referenced by NFKC(), and toNFKD().

static UtfNormal::placebo ($string) [static]

This is just used for the benchmark, comparing how long it takes to iterate through a string without really doing anything of substance.

Parameters:
$string  string
Returns:
string

Definition at line 752 of file UtfNormal.php.

References $out.

static UtfNormal::quickIsNFC ($string) [static]

Returns true if the string is _definitely_ in NFC.

Returns false if not or uncertain.
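The "definitely yes, otherwise no" contract can be illustrated with a small sketch. It assumes a hypothetical two-entry table, $toyCheckNfc, of characters that either never occur in NFC or may need recomposition; sketchQuickIsNfc() is likewise hypothetical. The verdict is meaningful only for characters the toy table knows about, and a full check would also look at combining-class order.

```php
<?php
// Toy quick-check table (assumption): only two entries, real data is far larger.
$toyCheckNfc = array(
    "\xCC\x81" => 'MAYBE', // U+0301 may compose with a preceding character
    "\xCD\x80" => 'NO',    // U+0340 never appears in NFC text
);

// Hypothetical sketch: true means "definitely NFC", false means "not, or unsure".
function sketchQuickIsNfc( $string, $check ) {
    // Pure ASCII is always in NFC.
    if ( preg_match( '/[\x80-\xff]/', $string ) === 0 ) {
        return true;
    }
    foreach ( preg_split( '//u', $string, -1, PREG_SPLIT_NO_EMPTY ) as $c ) {
        // Any listed character makes the answer uncertain at best.
        if ( isset( $check[$c] ) ) {
            return false;
        }
    }
    return true;
}
```

A false result does not mean the string is unnormalized; it only means the caller has to run the full normalization to find out.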

Parameters:
$string  String: a valid UTF-8 string. Input is not validated.
Returns:
bool

Definition at line 203 of file UtfNormal.php.

References $n, and loadData().

Referenced by toNFC().

static UtfNormal::quickIsNFCVerify (&$string) [static]

Returns true if the string is _definitely_ in NFC.

Returns false if not or uncertain.

Parameters:
$string  String: a UTF-8 string, altered on output to be valid UTF-8 safe for XML.

Definition at line 242 of file UtfNormal.php.

References $matches, $n, and loadData().

Referenced by Exif\charCodeString(), cleanUp(), IPTC\convIPTCHelper(), GIFMetadataExtractor\getMetadata(), and JpegMetadataExtractor\segmentSplitter().

static UtfNormal::replaceForNativeNormalize ($string) [static, private]

Function to replace some characters that we don't want but most of the native normalize functions keep.

Parameters:
$string  String: the string
Returns:
String String with the character codes replaced.

Definition at line 767 of file UtfNormal.php.

Referenced by cleanUp().

static UtfNormal::toNFC ($string) [static]

Convert a UTF-8 string to normal form C, canonical composition.

Fast return for pure ASCII strings; some lesser optimizations for strings containing only known-good characters.

Parameters:
$string  String: a valid UTF-8 string. Input is not validated.
Returns:
string a UTF-8 string in normal form C

Definition at line 120 of file UtfNormal.php.

References NFC(), and quickIsNFC().

Referenced by normalize_form_c(), and normalize_form_c_php().

static UtfNormal::toNFD ($string) [static]

Convert a UTF-8 string to normal form D, canonical decomposition.

Fast return for pure ASCII strings.

Parameters:
$string  String: a valid UTF-8 string. Input is not validated.
Returns:
string a UTF-8 string in normal form D

Definition at line 138 of file UtfNormal.php.

References NFD().

Referenced by normalize_form_d(), and normalize_form_d_php().

static UtfNormal::toNFKC ($string) [static]

Convert a UTF-8 string to normal form KC, compatibility composition.

This may cause irreversible information loss; use judiciously. Fast return for pure ASCII strings.

Parameters:
$string  String: a valid UTF-8 string. Input is not validated.
Returns:
string a UTF-8 string in normal form KC

Definition at line 157 of file UtfNormal.php.

References NFKC().

Referenced by normalize_form_kc(), and normalize_form_kc_php().

static UtfNormal::toNFKD ($string) [static]

Convert a UTF-8 string to normal form KD, compatibility decomposition.

This may cause irreversible information loss; use judiciously. Fast return for pure ASCII strings.

Parameters:
$string  String: a valid UTF-8 string. Input is not validated.
Returns:
string a UTF-8 string in normal form KD

Definition at line 176 of file UtfNormal.php.

References NFKD().

Referenced by normalize_form_kd(), and normalize_form_kd_php().


Member Data Documentation

UtfNormal::$utfCanonicalComp = null [static]

Definition at line 61 of file UtfNormal.php.

Referenced by CleanUpTest\XtestAllChars().

UtfNormal::$utfCanonicalDecomp = null [static]

Definition at line 62 of file UtfNormal.php.

Referenced by benchmarkForm(), and CleanUpTest\XtestAllChars().

UtfNormal::$utfCheckNFC [static]

Definition at line 67 of file UtfNormal.php.

UtfNormal::$utfCombiningClass = null [static]

Definition at line 60 of file UtfNormal.php.

UtfNormal::$utfCompatibilityDecomp = null [static]

Definition at line 65 of file UtfNormal.php.

const UtfNormal::UNORM_DEFAULT = self::UNORM_NFC

Definition at line 58 of file UtfNormal.php.

const UtfNormal::UNORM_FCD = 6

Definition at line 57 of file UtfNormal.php.

const UtfNormal::UNORM_NFC = 4

Definition at line 55 of file UtfNormal.php.

Referenced by donorm(), and Installer\envCheckLibicu().

const UtfNormal::UNORM_NFD = 2

Definition at line 53 of file UtfNormal.php.

const UtfNormal::UNORM_NFKC = 5

Definition at line 56 of file UtfNormal.php.

const UtfNormal::UNORM_NFKD = 3

Definition at line 54 of file UtfNormal.php.

const UtfNormal::UNORM_NONE = 1

For using the ICU wrapper.

Definition at line 52 of file UtfNormal.php.


The documentation for this class was generated from the following file: